{{Sidebar}}
 
 
 
'''Text Analytics / Mining'''<br />
 
Gartner’s IT Glossary defines Text Analytics as the process of deriving information from text sources for purposes that include summarization, classification, investigation, sentiment analysis (the nature of commentary on a topic), and explication (what drives that commentary). https://www.sas.com/content/dam/SAS/en_us/doc/research2/iia-text-analytics-unlocking-value-unstructured-data-108443.pdf
 
 
 
It’s essential to understand how this definition translates into action and the value you can generate through the various capabilities of Text Analytics. These include search and information retrieval, information extraction through techniques such as natural language processing (NLP), tagging or annotation, lexical analysis to study word frequency and distribution, singular value decomposition (SVD), pattern recognition, data mining techniques including link and association analysis, predictive analysis, segmentation, and visualization. https://www.sas.com/content/dam/SAS/en_us/doc/research2/iia-text-analytics-unlocking-value-unstructured-data-108443.pdf
 
 
 
 
 
<br />
 
'''What is text analytics?''' [ChatGPT]<br />
 
Text analytics, also known as text mining or computational linguistics, is the process of deriving insights and meaningful information from unstructured text data. This involves using various computational techniques and algorithms to analyze, categorize, and extract relevant information from textual data sources such as documents, emails, social media posts, customer feedback, and more.
 
 
 
Text analytics involves several stages, including data preprocessing, text classification, entity extraction, sentiment analysis, and topic modeling. These techniques can help businesses and organizations gain valuable insights from large volumes of text data, such as customer preferences, market trends, and emerging issues.
 
 
 
Text analytics is used in a wide range of applications, such as customer experience management, fraud detection, brand reputation management, and predictive analytics. It has become an essential tool for businesses and organizations seeking to derive insights from the vast amounts of textual data generated every day.
 
 
 
 
 
<br />
 
'''What is NLP?''' [ChatGPT]<br />
 
NLP, or Natural Language Processing, is a branch of artificial intelligence (AI) that deals with the interaction between computers and humans through natural language. It involves developing algorithms and computational models that can process and understand human language, and generate human-like responses. NLP technologies are used in a wide range of applications, such as language translation, '''sentiment analysis''', speech recognition, and chatbots. NLP is a rapidly growing field, with ongoing research and development aimed at improving the accuracy and effectiveness of language-based applications.
 
 
 
 
 
<br />
 
'''Does text analytics encompass NLP or vice versa?''' [ChatGPT]<br />
 
Text analytics and NLP are closely related fields and often used interchangeably, but there are some subtle differences between them.
 
 
 
Text analytics generally refers to the process of analyzing and extracting information from textual data, regardless of the language used. Text analytics may use NLP techniques, but it is not limited to them.
 
 
 
NLP, on the other hand, specifically focuses on the interaction between computers and natural language, with the goal of enabling machines to understand, interpret, and generate human language. NLP techniques are used in text analytics to perform tasks such as sentiment analysis, entity recognition, and summarization.
 
 
 
So, to answer your question, text analytics encompasses NLP in the sense that it can use NLP techniques to analyze textual data. However, NLP is a narrower field that focuses specifically on language processing, while text analytics is a broader field that encompasses a range of techniques for analyzing and deriving insights from text data.
 
 
 
 
 
<br />
 
==Motivation==
 
Somewhere between 80-90% of all potentially usable business information may originate in unstructured form. However, only a tiny fraction of those data have been mined or used in predictive analytics tasks.
 
 
 
The quality of the outcomes of predictive analytics is limited by the data that is used in the analysis. Most predictive analytics tasks use data that is stored in a simple, structured format. One of the reasons why text analytics is not yet considered mainstream is that handling unstructured data is difficult and requires additional expertise and approaches.
 
 
 
 
 
<br />
 
==What is Text Mining==
 
"... a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." (Hearst 1999)
 
 
 
 
 
"... a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools." (Feldman and Sanger 2006)
 
 
 
 
 
"... the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text." (Kao and Poteet 2006)
 
 
 
 
 
<br />
 
==Human language is difficult - Ambiguity Everywhere==
 
"I made her duck"
 
* I cooked waterfowl for her.
 
* I cooked waterfowl belonging to her.
 
* I created the (plaster?) duck she owns.
 
* I caused her to quickly lower her head.
 
* I magically converted her into roast fowl.
 
(Jurafsky and Martin 2009)
 
 
 
 
 
<br />
 
==Components of Text Mining==
 
* '''Information retrieval (IR)''' is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images, or sounds. https://en.wikipedia.org/wiki/Information_retrieval
 
 
 
 
 
* '''Natural language processing (NLP)''' is used to analyze the text using structures and rules based on human language.
 
 
 
 
 
* '''Information extraction (IE)''' involves structuring the data that the NLP system generates.
 
 
 
 
 
* '''Data Mining (DM)''' is the process of identifying patterns in large sets of data, to find new knowledge.
 
 
 
 
 
<br />
 
==Information Retrieval==
 
 
 
 
 
<br />
 
===Tokenisation===
 
* Tokenisation is a lexical process of breaking the text up into usable units.
 
* Depending on the context, tokens can be words, phrases, or symbols.
 
* What constitutes a token can depend on the language, corpus, and even context.
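As a minimal illustration, a naive tokeniser can be sketched in a few lines of Python. This is only a sketch: real tokenisers must handle hyphens, clitics, and language-specific rules.

```python
import re

def tokenise(text):
    """Naive tokeniser: lower-case the text and split on anything
    that is not a word character or an apostrophe."""
    return [tok for tok in re.split(r"[^\w']+", text.lower()) if tok]

tokens = tokenise("The quick brown fox, they said, wasn't quick enough.")
# → ['the', 'quick', 'brown', 'fox', 'they', 'said', "wasn't", 'quick', 'enough']
```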
 
 
 
 
 
<br />
 
===Normalisation===
 
The goal of normalisation is to convert different forms of a word to a single normalised form. For example, E.U. -> EU and Grafton St. -> Grafton Street.
 
 
 
This can be based on hard-coded rules, such as replacing "St." with "Street", or deleting periods and hyphens. This can be a little hit and miss. Also, the process has to deal with ambiguity. In the "St." example, how can we know if it refers to "Street" or "Saint"?
 
 
 
The most commonplace method of normalization is to create equivalence classes, usually named after one member of the set. For instance, if the tokens «car» and «automobile» are both mapped onto the term automobile, then searches for either term will retrieve documents that contain one or both.
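A hypothetical sketch of the equivalence-class idea in Python (the class table below is invented for illustration):

```python
# Hypothetical equivalence classes: each surface form is mapped onto
# a single canonical term before indexing, so a search for either
# form retrieves documents containing any member of the class.
EQUIVALENCE_CLASSES = {
    "car": "automobile",
    "automobile": "automobile",
    "e.u.": "eu",
}

def normalise(token):
    token = token.lower()
    return EQUIVALENCE_CLASSES.get(token, token)

print(normalise("Car"))  # → automobile
print(normalise("dog"))  # → dog (no class defined, left unchanged)
```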
 
 
 
<br />
 
 
 
===Stop Words===
 
 
 
 
 
<br />
 
===Stemming===
 
 
 
 
 
<br />
 
===Lemmatisation===
 
 
 
 
 
<br />
 
===Vectorization===
 
 
 
 
 
<br />
 
====The Bag of Words approach====
 
The bag-of-words approach is a method of representing text based on the frequency of occurrence of words within a document, considering each word count as a feature.
 
 
 
The bag of words approach is based on some naïve assumptions:
 
* Words are independent of each other
 
* Grammar and word order are not important
 
* Documents are exchangeable, so document ordering can be neglected
 
 
 
It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. https://machinelearningmastery.com/gentle-introduction-bag-words-model/
 
 
 
While this might be computationally simple, the assumptions do not hold for any real-life document. The method ignores the context in which a word occurs, and in the process loses the meaning.
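A minimal bag-of-words vectoriser, written from scratch to show the idea (libraries such as scikit-learn provide production implementations):

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Represent a document as a vector of raw term counts,
    discarding word order entirely."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

docs = ["the cat sat on the mat", "the dog sat"]
vocab = sorted({w for d in docs for w in d.split()})
# vocab = ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print([bag_of_words(d, vocab) for d in docs])
# → [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```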
 
 
 
 
 
<br />
 
====N-Gram Representations====
 
I have also seen the term «bag of n-grams».
 
 
 
 
 
What is an N-Gram? https://en.wikipedia.org/wiki/N-gram
 
 
 
 
 
The n-gram method uses contiguous sequences of 1 or more words to ensure more of the meaning in a document is retained and word dependency is captured. Each word or token is called a "gram".
 
 
 
 
 
"An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like 'please turn', 'turn your', or 'your homework', and a 3-gram (more commonly called a trigram) is a three-word sequence of words like 'please turn your', or 'turn your homework'." (Jurafsky and Martin 2009)
 
 
 
 
 
The primary disadvantage of the n-gram approach is that the size of the vocabulary is increased to <math> O(V^N)</math>.
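A short sketch of n-gram extraction, reproducing the bigram example quoted above:

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "please turn your homework".split()
print(ngrams(words, 2))
# → [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
```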
 
 
 
 
 
<br />
 
 
 
====N-Grams and language identification====
 
* The top 300 or so N-grams are almost always highly correlated with the language and can be used for language identification. (Cavnar and Trenkle 1994)
 
 
 
* The highest-ranking N-grams are mostly uni-grams (N=1), and simply reflect the distribution of the letters of the alphabet in the document's language.
 
 
 
* After this come the function words and very frequent prefixes and suffixes (morphemes).
 
 
 
* Starting at around 300, an N-gram frequency profile begins to show N-grams that are more specific to the subject of the document.
 
 
 
 
 
<br />
 
====Binary Term-document matrix====
 
The binary term-document incidence matrix
 
 
 
 
 
[[File:Binary_TermDocumentMatrix.png|700px|thumb|center|]]
 
 
 
 
 
* Consider a corpus of <math>N = 10^6</math> documents, each with approximately 1000 tokens.

* This gives a total of <math>10^9</math> tokens.

* With an average of 6 bytes per token, the size of the document collection would be about <math>6 \times 10^9</math> bytes = 6 GB!

* Assume there are <math>M = 500{,}000</math> distinct terms in the collection.

* The matrix then contains <math>M \times N = 500{,}000 \times 10^6 = </math> half a trillion 0s and 1s!
 
* But the matrix will have no more than one billion 1s - it is extremely sparse.
 
* A better representation would be to record only the 1s.
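A sketch of that idea in Python: for each term, store only the set of documents it occurs in (the toy corpus below is invented for illustration):

```python
# Instead of a dense documents × vocabulary matrix of 0s and 1s,
# record only the 1s: for each term, the set of documents containing it.
docs = {
    1: "brutus killed caesar",
    2: "caesar was ambitious",
    3: "brutus was honourable",
}

incidence = {}
for doc_id, text in docs.items():
    for term in set(text.split()):
        incidence.setdefault(term, set()).add(doc_id)

print(sorted(incidence["brutus"]))  # → [1, 3]
print(sorted(incidence["caesar"]))  # → [1, 2]
```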
 
 
 
 
 
<br />
 
====The inverted index====
 
The inverted index consists of a dictionary, containing the tokens, and a postings list for each token: the list of documents in which that token occurs.
 
 
 
: BRUTUS → <math> \{1,2,4,11,31,45,173,174\} </math>

: CAESAR → <math> \{1,2,4,5,6,16,57,132\} </math>

: CALPURNIA → <math> \{2,31,54,101\} </math>
 
 
 
For large corpora, the space saved using this approach can be considerable.
 
 
 
 
 
Creating the inverted index:
 
# Collect the documents to be indexed
 
# Tokenize the text, turning each document into a list of tokens.
 
# Do linguistic pre-processing, producing a list of normalized tokens, which are the indexing terms.
 
# Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
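The four steps above can be sketched as follows (with a deliberately naive tokeniser and normaliser):

```python
def build_inverted_index(documents):
    """Build an inverted index (dictionary + postings lists)
    from a mapping of document id → text."""
    index = {}
    for doc_id, text in documents.items():          # 1. collect documents
        tokens = text.split()                        # 2. tokenise (naively)
        terms = {tok.lower().strip(".,") for tok in tokens}  # 3. normalise
        for term in terms:                           # 4. index into postings
            index.setdefault(term, []).append(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

corpus = {1: "Caesar was killed.", 2: "Brutus killed Caesar,", 4: "Brutus spoke."}
index = build_inverted_index(corpus)
print(index["caesar"])  # → [1, 2]
print(index["brutus"])  # → [2, 4]
```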
 
 
 
 
 
<br />
 
====Term-document matrix====
 
The term-document incidence matrix is similar to its binary equivalent, except each document is represented as a count vector
 
 
 
 
 
[[File:TermDocumentMatrix.png|700px|thumb|center|]]
 
 
 
 
 
<br />
 
====Zipf's Law====

Zipf's law states that the frequency of occurrence of any word is inversely proportional to its rank in the frequency table:
 
 
 
 
 
<math>
 
f(k;s,N) = \frac{1/k^s}{\sum_{n=1}^N 1/n^s}
 
</math>
 
 
 
 
 
Where <math>N</math> is the size of the vocabulary, <math>k</math> is the rank, and <math>s</math> is a language-specific exponent.
 
 
 
In plain English, this means that (for <math>s = 1</math>) the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
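The formula can be checked numerically with a short sketch, using exponent s = 1:

```python
def zipf_frequency(k, s, N):
    """Expected relative frequency of the word of rank k under Zipf's
    law with exponent s, over a vocabulary of N words."""
    return (1 / k ** s) / sum(1 / n ** s for n in range(1, N + 1))

# With s = 1, rank 1 is twice as frequent as rank 2 and three times rank 3:
f = [zipf_frequency(k, 1, 1000) for k in (1, 2, 3)]
print(round(f[0] / f[1], 2), round(f[0] / f[2], 2))  # → 2.0 3.0
```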
 
 
 
 
 
Those words that account for the largest number of occurrences frequently carry little semantic weight, for instance, the articles "the" and "a", conjunctions such as "and", and common verbs such as "be".

Words that rarely occur in documents frequently carry a lot of semantic weight, for example, "transducer" or "terephthalate".

In between the two extremes are the most representative words, those that should be considered for inclusion in a controlled vocabulary.
 
 
 
 
 
<br />
 
====Weighting Word Importance====
 
'''Problem:'''
 
* Across a corpus of documents, some tokens will carry more information than others about the content of a given document.
 
 
 
* Within a single document, not all tokens are equally informative.
 
 
 
 
 
'''Solution:'''
 
: Use two rules-of-thumb:
 
 
 
:* '''The Term Frequency (TF):''' The frequency of occurrence within a document.
 
:* '''The Inverse Document Frequency (IDF):''' The rarity of occurrence across a corpus of documents.
 
 
 
 
 
<br />
 
=====The Term Frequency - TF=====
 
The term frequency <math>tf_{t,d}</math> of term <math>t</math> in document <math>d</math> is defined as the number of occurrences of <math>t</math> in <math>d</math>.
 
 
 
 
 
The term frequency is used when computing query-document match scores. Raw <math>tf</math> is not what we want because:
 
 
 
* A document with <math>tf = 10</math> is more relevant than a document with <math>tf = 1</math>, <span style="color:green">BUT NOT 10 TIMES!</span>
 
 
 
* Relevance does not increase proportionally with term frequency.
 
 
 
* In our example, the word 'mercy' appears five times in Hamlet, and only once in Macbeth:
 
:* Is Hamlet more relevant than Macbeth to the query 'mercy'?: <span style="color:green">'''YES'''</span>
 
:* Is Hamlet 5 times more relevant than Macbeth to the query 'mercy'?: <span style="color:green">'''NO'''</span>
 
 
 
* <span style="color:green">'''So, we need some method of weighting more frequently occurring terms'''</span>
 
 
 
 
 
<br />
 
======Log Term Frequency======
 
The log frequency weight of term <math>t</math> in <math>d</math> is given as:
 
 
 
 
 
<math>
 
W_{t,d} =
 
\begin{cases}
 
    1 + log_{10}tf_{t,d}, & \text{if }  tf_{t,d} > 0 \\
 
    0, & \text{otherwise}
 
\end{cases}
 
</math>
 
 
 
 
 
So:
 
 
 
<math>
 
\begin{array}{lcl}
 
tf_{t,d} = 0  & \rightarrow &  W_{t,d} = 0 \\
 
1              & \rightarrow &  1          \\
 
2              & \rightarrow &  1.3        \\
 
10            & \rightarrow &  2          \\
 
1000          & \rightarrow &  4          \\
 
etc ...
 
\end{array}
 
</math>
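The weighting above can be sketched directly in Python:

```python
import math

def log_tf_weight(tf):
    """Log term-frequency weighting: 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf_weight(tf), 1))
# → 0 0, 1 1.0, 2 1.3, 10 2.0, 1000 4.0
```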
 
 
 
 
 
<br />
 
 
 
=====Document Frequency=====
 
The intuition behind '''Document Frequency''' is that terms that rarely occur are more likely to be informative than those that frequently occur. This is also the rationale underlying the concept of "stop words".
 
 
 
 
 
The document frequency <math>df_t</math> is the number of documents in a corpus that the term <math>t</math> occurs in (the number of documents that contain <math>t</math>).
 
 
 
 
 
<br />
 
======The Inverse Document Frequency - IDF======
 
We define the <math>idf</math> (inverse document frequency) of <math>t</math> by:
 
 
 
<math>
 
idf_t = log_{10}\frac{N}{df_t}
 
</math>
 
 
 
 
 
The <math>idf</math> is a measure of the relative informativeness of the term <math>t</math>. As we did with <math>tf</math>, we use the log to dampen the effect of <math>idf</math>.
 
 
 
 
 
{| class="wikitable"
!Term
!<math>df_t</math>
!<math>idf_t</math>
|-
|mendacity
|1
|6
|-
|animal
|100
|4
|-
|sunday
|1000
|3
|-
|fly
|10000
|2
|-
|under
|100000
|1
|-
|the
|1000000
|0
|}
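The idf values in the table can be reproduced with a short sketch, assuming a corpus of N = 1,000,000 documents:

```python
import math

N = 1_000_000  # corpus size assumed in the table above

def idf(df, n_docs=N):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)

for term, df in [("mendacity", 1), ("animal", 100), ("the", 1_000_000)]:
    print(term, idf(df))
# → mendacity 6.0, animal 4.0, the 0.0
```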
 
 
 
 
 
<blockquote>
 
'''The effects of IDF on ranking'''
 
 
 
Does <math>idf</math> have an effect on ranking for one-term queries, like "iPhone"?:
 
 
 
* <math>idf</math> has no effect on ranking one term queries.
 
* <math>idf</math> only has an impact on the ranking of documents for queries with at least two terms.
 
* For the query "capricious person", <math>idf</math> weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
 
</blockquote>
 
 
 
 
 
<br />
 
=====TF-IDF Weighting=====
 
The <math>tf\text{-}idf</math> weight of a term is the product of its <math>tf</math> weight and its <math>idf</math> weight.
 
 
 
<math>
 
tf.idf_t = log_{10}(1 + tf_{t,d}) \times log_{10} \frac{N}{df_t}
 
</math>
 
 
 
 
 
* <math>tf\text{-}idf</math> is the most commonplace weighting scheme in information retrieval. Note that the - in <math>tf\text{-}idf</math> is a hyphen, not a minus sign. Alternative names are: <math>tf.idf</math> and <math>tf \times idf</math>
 
 
 
* <math>tf\text{-}idf</math> increases with the number of occurrences within a document.
 
 
 
* <math>tf\text{-}idf</math> also increases with the rarity of the term in the collection.
 
 
 
 
 
<br />
 
======TF-IDF Final Score======
 
The individual <math>tf.idf</math> scores are summed over the terms that the query and the document have in common:
 
 
 
 
 
<math>
 
Score(q,d) = \sum_{t \in q \cap d} tf.idf_{t,d}
 
</math>
 
 
 
 
 
There are many variants on the basic principle:
 
* How <math>tf</math> is computed (with/without logs)
 
* Whether the terms in the query are also weighted
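A minimal sketch of the scoring formula above (the document frequencies and counts in the example are invented for illustration):

```python
import math

def tf_idf(tf, df, n_docs):
    """tf-idf weight of a single term, using log10(1 + tf) * log10(N / df)."""
    return math.log10(1 + tf) * math.log10(n_docs / df)

def score(query_terms, doc_tf, df, n_docs):
    """Sum tf-idf over the terms shared by the query and the document."""
    return sum(tf_idf(doc_tf[t], df[t], n_docs)
               for t in query_terms if t in doc_tf)

# Hypothetical corpus statistics: "capricious" is rare, "person" is common,
# so "capricious" dominates the final score.
df = {"capricious": 10, "person": 100_000}
doc = {"capricious": 3, "person": 2}
print(round(score(["capricious", "person"], doc, df, 1_000_000), 3))
# → 3.487
```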
 
 
 
 
 
<br />
 
=====Weight Term-Document incidence matrix=====
 
<br />
 
[[File:Weight_Term-Document_incidence_matrix.png|800px|thumb|center|]]
 
 
 
 
 
<br />
 
====Boolean Retrieval====
 
 
 
 
 
 
<br />
 
====Boolean Queries====
 
 
 
 
 
 
<br />
 
==Natural Language Processing==
 
 
 
 
 
<br />
 
===What is NLP===
 
Natural Language Processing (NLP) is a specialized field of computer science and Artificial Intelligence with the goal of enabling computers to interpret, understand, manipulate, and produce human language. It draws upon theory and research from many disparate disciplines, including computer science, information theory, machine learning, and computational linguistics. The primary areas of research in NLP are natural language understanding, natural language generation, and speech recognition.
 
 
 
 
 
"Human language is highly ambiguous... It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language."  (Goldberg and Hirst 2017)
 
 
 
 
 
<br />
 
 
 
===Applications of NLP===
 
* Spelling and grammar checking
 
* Text categorisation
 
* Named-entity recognition
 
* Advanced Information Retrieval
 
* Machine Translation
 
* Sentiment Analysis
 
* Question Generation and Question Answering
 
* Text summarisation
 
* Discourse analysis
 
 
 
 
 
<br />
 
===NLP Challenges===
 
 
 
 
 
<br />
 
====Sentence Boundary Disambiguation====
 
When carrying out Natural Language Processing, it is often necessary to break up a document into sentences prior to carrying out further processing.
 
 
 
Sentence Boundary Disambiguation, also known as sentence splitting, is the process of identifying the beginning and end of sentences.
 
 
 
Determining the boundary between sentences is not quite as simple as it may seem at first, as the sentence delimiter (a period in English) is also used to denote a decimal point, an ellipsis, or an abbreviated word.
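A naive sentence splitter illustrating the problem (the abbreviation list is a small invented sample; real systems use much larger lists or machine-learned models):

```python
import re

ABBREVIATIONS = {"dr", "st", "mr", "e.g", "i.e"}  # tiny hard-coded sample

def split_sentences(text):
    """Split after . ! or ? followed by whitespace and a capital letter,
    unless the period ends a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        words = text[start:m.start()].split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # e.g. "Dr." does not end a sentence
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Dr. Smith paid 3.50 euro. He was pleased."))
# → ['Dr. Smith paid 3.50 euro.', 'He was pleased.']
```

Note that the decimal point in "3.50" is ignored because it is not followed by whitespace and a capital letter.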
 
 
 
 
 
<br />
 
====Part of Speech Tagging====
 
Part of Speech Tagging, also called word-category disambiguation, is the process of annotating each word in a text with its part of speech.
 
 
 
The basic parts of speech are articles, nouns, verbs, adjectives, pronouns, prepositions, adverbs, conjunctions, and interjections.
 
 
 
However, verbs have tenses and aspects that increase that number considerably.
 
 
 
 
 
Common approaches to part-of-speech tagging include the rules-based Brill tagger, Constraint Grammar (also rules-based), and the Baum-Welch algorithm (used to determine the unknown parameters in a Hidden Markov Model). Try the Brill tagger online at https://cst.dk/online/pos_tagger/uk/
 
 
 
 
 
<br />
 
====Named-Entity Recognition====
 
Named Entity Recognition is the task of identifying anything that can be given a proper name:
 
*  people and organizations;
 
*  locations;
 
*  biological species;
 
*  products;
 
*  substances;
 
*  etc.
 
 
 
 
 
There are two fundamental approaches to named-entity recognition. The first relies on hand-crafted rules created by computational-linguists. The second is based on statistical models and machine learning using annotated training data. The former is expensive and precise but can fail to generalize to previously unseen data. The latter lacks precision but generalizes better.
 
 
 
 
 
<br />
 
====Word Sense Disambiguation====
 
* Word Sense Disambiguation is the process of determining the exact meaning of a word when it has many possible meanings.
 
 
 
* For example, the dictionary definition of the word “tie” gives two potential meanings as verbs:
 
:*  to attach or fasten with string or similar cord.
 
:*  to restrict or limit (someone) to a particular situation or place.
 
 
 
* and two meanings as nouns:
 
:*  a piece of string, cord, or similar used for fastening or tying something.
 
:*  a rod or beam holding parts of a structure together.
 
 
 
 
 
Solutions include dictionary-based approaches, supervised machine learning, and unsupervised clustering where the meaning of neighboring words is taken into account.
 
 
 
 
 
<br />
 
====Synonym - Antonym - Hypernym - Hyponym Identification====
 
 
 
A synonym is a word that carries the same (or almost the same) meaning as another word. As an example, the words "freedom" and "liberty" are synonyms. An antonym is a word that carries the opposite (or almost the opposite) meaning to another word. The words "hot" and "cold" are antonyms of each other.
 
 
 
A hypernym is a word or phrase whose more specific meaning can be found in one or more other words or phrases. For example, the word "tool" is a hypernym of the words "hammer", "saw" and "screwdriver". A hyponym is a word or phrase whose generic meaning can be found in another word or phrase. For instance, the words "car", "bus" and "bicycle" are all hyponyms of the word "vehicle".
 
 
 
 
 
Approaches to this problem are largely based on the use of thesauri.
 
 
 
 
 
<br />
 
====Anaphora and Cataphora Resolution====
 
Anaphora are words, usually but not exclusively pronouns, whose meaning depends on a word occurring before them, known as the antecedent. In the phrase "Mary studied hard and she passed her exam", the word "her" is the anaphor and the word "Mary" is the antecedent. In other words, "her" refers to "Mary".
 
 
 
Cataphora are words whose meaning depends on words that occur after them, known as the postcedent. In the phrase "Before his second novel was published, nobody heard of Mike", "Mike" is the postcedent and "his" is the cataphor.
 
 
 
 
 
Approaches to anaphora resolution are many and varied, including rule-based, supervised machine learning, ranking, and statistical methods.
 
 
 
 
 
<br />
 
====Semantic Role Labelling====
 
Semantic Role Labelling is the process of extracting subject-predicate-object triples from a sentence.
 
 
 
It involves the detection of the semantic arguments (subject and object) associated with a predicate and their classification into one or other of the roles.
 
 
 
For example, in the sentence "Mike gave his bicycle to Paul", the predicate is "to give", the subject is "Mike", the object is "his bicycle", and the indirect object is "Paul".
 
 
 
 
 
Try the Semantic Role Labeller at http://cogcomp.org/page/demo_view/srl
 
 
 
 
 
<br />
 
===Topic modelling===
 
Topic modelling provides us with methods to organize, summarise, and extract understanding from large collections of textual information. It can be thought of as a method of identifying groups of words in a corpus that best represent the subject matter of the constituent documents.
 
 
 
It is the discovery of latent (hidden) topical patterns present across a corpus, followed by the annotation of documents in the corpus with these topics. We can then use the annotated documents to search, summarise, and organise documents.
 
 
 
There are numerous methods used to extract topic models. In what follows, an overview of the most widely-used method, Latent Dirichlet Allocation (LDA), is provided.
 
 
 
 
 
<br />
 
====Latent Dirichlet Allocation====
 
Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora.
 
 
 
The objective of LDA is to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus but also assigns high probability to other "similar" documents. It produces bags of words representative of topics within a text corpus.
 
 
 
It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
 
 
 
It assumes that words are generated by topics (fixed conditional distributions) and that these topics are infinitely exchangeable within a document and even within a set of documents.
 
 
 
Words are considered a proxy for "hidden" concepts.
 
 
 
 
 
<br />
 
'''The process works as follows:'''
 
 
 
* For each document, randomly assign each word to one of K topics, where the value of K is chosen beforehand.
 
 
 
* Although this random assignment gives topic representations of all documents and word distributions of all the topics, it is far from optimal.
 
 
 
* This sub-optimal representation is then improved by iterating through each document, and for each word computing:
 
 
 
:* p(topic t | document d): the proportion of words in document d that are assigned to topic t
 
 
 
:* p(word w | topic t): the proportion of assignments to topic t, over all documents d, that come from word w
 
 
 
* Reassign word w a new topic t', where we choose topic t' with probability p(topic t'|document d) × p(word w|topic t').
 
 
 
* This generative model predicts the probability that topic t' generated word w.
 
 
 
* After repeating the last step a large number of times, a steady state is reached where topic assignments are close to optimal.
 
 
 
* These assignments are then used to determine the topic mixtures for each document.
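The procedure above can be sketched as a tiny collapsed Gibbs sampler over the toy corpus used later in this section. This is an illustrative sketch only: the hyperparameters ALPHA and BETA, the number of iterations, and the random seed are arbitrary choices.

```python
import random
from collections import defaultdict

random.seed(0)

docs = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "i like the cute hamster munching on a piece of broccoli".split(),
]
K, ALPHA, BETA = 2, 0.1, 0.01  # arbitrary illustrative hyperparameters
vocab = sorted({w for d in docs for w in d})

# 1. Randomly assign each word occurrence to one of K topics.
assignments = [[random.randrange(K) for _ in doc] for doc in docs]
doc_topic = [[0] * K for _ in docs]           # words per topic in each doc
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = assignments[d][i]
        doc_topic[d][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# 2. Iteratively reassign each word w to topic t' with probability
#    proportional to p(t'|d) * p(w|t'), computed from the current counts.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] -= 1; topic_word[t][w] -= 1; topic_total[t] -= 1
            weights = [
                (doc_topic[d][k] + ALPHA) *
                (topic_word[k][w] + BETA) / (topic_total[k] + BETA * len(vocab))
                for k in range(K)
            ]
            t = random.choices(range(K), weights=weights)[0]
            assignments[d][i] = t
            doc_topic[d][t] += 1; topic_word[t][w] += 1; topic_total[t] += 1

# 3. The topic mixture of each document follows from the final counts.
for d in range(len(docs)):
    total = sum(doc_topic[d])
    print(d, [round(c / total, 2) for c in doc_topic[d]])
```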
 
 
 
 
 
<br />
 
'''Definitions:'''
 
 
 
* A '''Topic''' is a probability distribution over a collection of words
 
 
 
* A '''Topic model''' is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics, i.e. a generative model.
 
 
 
* '''Co-occurrence''' is the occurrence of two words from a corpus either alongside each other or close to each other at a probability greater than pure chance.
 
 
 
 
 
The primary objective of topic modeling is to provide a "thematic summary" of a collection of documents.
 
 
 
 
 
<br />
 
'''Suppose we are trying to extract topics from these data:'''
 
 
 
1. I like to eat broccoli and bananas
 
 
 
2. I ate a banana and spinach smoothie for breakfast
 
 
 
3. Chinchillas and kittens are cute
 
 
 
4. My sister adopted a kitten yesterday
 
 
 
5. I like the cute hamster munching on a piece of broccoli
 
 
 
 
 
 
 
'''What latent themes are there and how do we extract them?'''
 
 
 
* Sentences 1 & 2: 100% Topic A
 
 
 
* Sentences 3 & 4: 100% Topic B
 
 
 
* Sentence 5: 60% Topic A, 40% Topic B
 
 
 
* Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)
 
 
 
* Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)
 
 
 
 
 
<br />
 
===Sentiment Analysis===
 
 
 
 
 
<br />
 
====What is Sentiment Analysis====
 
Sentiment analysis is the computational study of opinions, sentiments, evaluations, and attitudes expressed in the form of text. It is sometimes termed "Opinion Mining".
 
 
 
Sentiment or opinion is often highly subjective, so opinion mining involves summarising the opinions or sentiment of many people.
 
 
 
 
 
<br />
 
Dan Jurafsky:
 
 
 
: Sentiment analysis is the detection of the attitudes of the writer of a piece of text towards a particular subject and involves:
 
 
 
:* the text containing the attitude, which can be as short as a single sentence or as long as an entire document
 
 
 
:* the holder or source of the attitude
 
 
 
:* the target or aspect of the attitude
 
 
 
:* the class or type of the attitude, from a predefined set of types
 
::* positive, negative, neutral (relatively easy)
 
::* like, love, hate, value, desire, etc. (more difficult)
 
 
 
 
 
<br />
 
=====Formal Definition - Liu 2010=====
 
 
 
An opinion or sentiment is a quintuple:
 
 
 
<math>
 
(e_j, a_{jk}, SO_{ijkl}, h_{i}, t_{l})
 
</math>
 
 
 
 
 
Where:
 
* <math>e_j</math> is a target entity
 
 
 
* <math>a_{jk}</math> is an aspect/feature of the entity <math>e_j</math>
 
 
 
* <math>SO_{ijkl}</math> is the sentiment value of the opinion from the opinion holder <math>h_{i}</math>
 
:* <math>SO_{ijkl}</math> is positive, negative, neutral, or a more granular rating.
 
 
 
* <math>h_{i}</math> is an opinion holder
 
 
 
* <math>t_{l}</math> is the time when the opinion is expressed.
 
 
 
 
 
<br />
 
 
 
====Challenges====
 
 
 
 
 
<br />
 
=====Sarcasm and Subtlety=====
 
Consider the following perfume review...
 
 
 
"If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." (from Dan Jurafsky)
 
 
 
 
 
We can see that this sentence is laden with negative sentiment, but it is expressed in such a subtle way that a computer will not be able to recognise it.
 
 
 
 
 
<br />
 
=====Ordering Effects and Thwarted Expectations=====
 
Examples from Dan Jurafsky:
 
 
 
"This film should be <span style="color:blue">brilliant</span>. It sounds like a <span style="color:blue">great</span> plot, the actors are <span style="color:blue">first grade</span>, and the supporting cast is <span style="color:blue">good</span> as well, and Stallone is attempting to deliver a <span style="color:blue">good</span> performance. However, it <span style="color:red">can't hold up</span>."
 
 
 
"Well as usual Keanu Reeves is <span style="color:red">nothing special</span>, but surprisingly, the <span style="color:blue">very talented</span> Laurence Fishbourne is <span style="color:red">not so good</span> either, I was surprised."
 
 
 
 
 
<br />
 
=====Negations - Comparisons - Complex Opinions=====
 
* While <span style="color:red">not good</span> could be considered to be almost synonymous with <span style="color:red">bad</span>, is <span style="color:blue">not bad</span> synonymous with <span style="color:blue">good</span>?
 
 
 
* A comparison is more challenging than regular direct or indirect opinions: For example "iPhone is better than Android"
 
 
 
* Complex opinions can also be a challenge to parse. For example "Wifi on my iPhone doesn't work as well as it does on my Android, but everything else is better"
 
 
 
 
 
<br />
 
 
 
====Approaches====
 
 
 
* '''Rule-based (Lexicon-based):''' These approaches rely on a set of rules, also known as a lexicon or sentiment lexicon, in which words are classified as positive or negative along with a corresponding intensity measure. https://alphabold.com/sentiment-analysis-the-lexicon-based-approach/#:~:text=Rule%20based%20sentiment%20analysis%20refers,with%20their%20corresponding%20intensity%20measure
 
 
 
: The term out-of-the-box (no training needed) is also used for this approach: https://medium.com/@10e/exploring-out-of-the-box-sentiment-analysis-packages-8cb9931ff5a4
 
 
 
 
 
* '''Supervised machine learning''' (Naive Bayes, Support Vector Machines, etc.), trained on pre-labeled examples.
 
 
 
 
 
* '''Hybrid:''' A combination of the lexical analysis and machine learning, generally involving annotating sentences with lexical information prior to applying supervised learning.
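The rule-based approach above can be sketched in a few lines. The tiny lexicon, its weights, and the one-token negation rule below are illustrative assumptions, not a real sentiment lexicon:

```python
# Minimal lexicon-based sentiment scorer (sketch).
# Positive weights mark positive words; negative weights mark negative ones.
LEXICON = {"good": 1, "great": 2, "brilliant": 2, "bad": -1, "terrible": -2}
NEGATIONS = {"not", "never", "no"}

def lexicon_score(text):
    tokens = text.lower().split()
    score = 0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            weight = LEXICON[token]
            # Flip polarity if the previous token is a negation ("not good").
            if i > 0 and tokens[i - 1] in NEGATIONS:
                weight = -weight
            score += weight
    return score

print(lexicon_score("the plot is great but the acting is not good"))  # → 1
```

Note how crude this is against the challenges above: "not bad" scores positive here, but sarcasm, thwarted expectations, and comparisons are invisible to a word-level lexicon.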
 
 
 
 
 
<br />
 
 
 
====[[Social Media Sentiment Analysis using Twitter Data]]====
 
<br />
 
 
 
<br />
 
===Text Classification===
 
https://monkeylearn.com/text-classification/
 
 
 
Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
 
 
 
Unstructured data in the form of text is everywhere: emails, chats, web pages, social media, support tickets, survey responses, and more. Text can be an extremely rich source of information, but extracting insights from it can be hard and time-consuming due to its unstructured nature. Businesses are turning to text classification for structuring text in a fast and cost-efficient way to enhance decision-making and automate processes.
 
 
 
But, what is text classification? How does text classification work? What are the algorithms used for classifying text? What are the most common business applications?
 
 
 
 
 
Text classification (a.k.a. text categorization or text tagging) is the task of assigning a set of predefined categories to free text. Text classifiers can be used to organize, structure, and categorize pretty much anything. For example, news articles can be organized by topic, support tickets by urgency, chat conversations by language, brand mentions by sentiment, and so on.
 
 
 
As an example, take a look at the following text:
 
 
 
"The user interface is quite straightforward and easy to use."
 
 
 
A classifier can take this text as an input, analyze its content, and then automatically assign relevant tags, such as UI and Easy To Use, that represent this text:
 
 
 
[[File:Text-classification-what-it-is2.png|700px|thumb|center|]]
 
 
 
 
 
There are many approaches to automatic text classification, which can be grouped into three different types of systems:
 
 
 
*Rule-based systems
 
*Machine Learning based systems
 
*Hybrid systems
 
 
 
 
 
<br />
 
====Machine Learning Based Systems====
 
Instead of relying on manually crafted rules, text classification with machine learning learns to make classifications based on past observations. By using pre-labeled examples as training data, a machine learning algorithm can learn the associations between pieces of text and the particular output (i.e. tags) expected for a particular input (i.e. text).
 
 
 
The first step towards training a classifier with machine learning is feature extraction: a method is used to '''''transform each text into a numerical representation in the form of a vector'''''. One of the most frequently used approaches is bag of words, where a vector represents the frequency of a word in a predefined dictionary of words.
 
 
 
For example, if we have defined our dictionary to have the following words {This, is, the, not, awesome, bad, basketball}, and we wanted to vectorize the text “This is awesome”, we would have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).
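The bag-of-words vectorization described above can be sketched directly from that example (the helper name is mine):

```python
def vectorize(text, vocabulary):
    # Count how often each vocabulary word appears in the text (case-insensitive).
    tokens = text.lower().split()
    return [tokens.count(word.lower()) for word in vocabulary]

vocab = ["This", "is", "the", "not", "awesome", "bad", "basketball"]
print(vectorize("This is awesome", vocab))  # → [1, 1, 0, 0, 1, 0, 0]
```

Words outside the dictionary are simply dropped, which is why the choice of vocabulary matters.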
 
 
 
Then, the machine learning algorithm is fed with training data that consists of pairs of feature sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:
 
 
 
[[File:text-classification-training.png|869x869px|thumb|center]]
 
 
 
Once it’s trained with enough training samples, the machine learning model can begin to make accurate predictions. The same feature extractor is used to transform unseen text to feature sets which can be fed into the classification model to get predictions on tags (e.g. sports, politics):
 
 
 
[[File:text-classification-predictions2.png|870x870px|thumb|center]]
 
 
 
 
 
<br />
 
=====Text Classification Algorithms=====
 
Some of the most popular machine learning algorithms for creating text classification models include the naive bayes family of algorithms, support vector machines, and deep learning.
 
 
 
* Naive Bayes
 
 
 
 
 
* Support vector machines
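As a sketch of how the train-then-predict pipeline above works with naive Bayes, here is a from-scratch multinomial naive Bayes with add-one (Laplace) smoothing. The tiny training set and its tags are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (text, label) pairs.
    # Returns class counts, per-class word counts, and the vocabulary.
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_nb(text, label_counts, word_counts, vocab):
    total_docs = sum(label_counts.values())
    def log_posterior(label):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (n_words + len(vocab)))
        return score
    return max(label_counts, key=log_posterior)

train = [("the match was a great game", "sports"),
         ("the team won the championship", "sports"),
         ("the senate passed the new bill", "politics"),
         ("the election results were close", "politics")]
model = train_nb(train)
print(predict_nb("the team played a great match", *model))  # → sports
```

In practice a library implementation (e.g. a multinomial naive Bayes over TF-IDF vectors) would replace this sketch, but the structure (vectorize, fit, predict) is the same.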
 
 
 
 
 
<br />
 
====Fake news detection====
 
 
 
 
 
<br />
 
=====[[Supervised Machine Learning for Fake News Detection]]=====
 
 
 
 
 
<br />
 
=====Fake news Challenge=====
 
http://www.fakenewschallenge.org/
 
 
 
Exploring how artificial intelligence technologies could be leveraged to combat fake news.
 
 
 
 
 
<br />
 
======Formal Definition======
 
 
 
*'''Input:''' A headline and a body text - either from the same news article or from two different articles.
 
 
 
*'''Output:''' Classify the stance of the body text relative to the claim made in the headline into one of four categories:
 
**Agrees: The body text agrees with the headline.
 
**Disagrees: The body text disagrees with the headline.
 
**Discusses: The body text discusses the same topic as the headline, but does not take a position.

**Unrelated: The body text discusses a different topic than the headline.
 
 
 
 
 
<br />
 
======Stance Detection dataset for FNC1======
 
https://github.com/FakeNewsChallenge/fnc-1
 
 
 
 
 
<br />
 
======Winner teams======
 
 
 
 
 
<blockquote>
 
'''First place - Team SOLAT in the SWEN'''
 
https://github.com/Cisco-Talos/fnc-1
 
 
 
The data provided is (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:
 
 
 
*'''train_bodies.csv''' : This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)
 
 
 
*'''train_stances.csv''' : This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
 
</blockquote>
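The two CSVs described above can be joined on Body ID with the standard library alone. This is a sketch; the column names come from the description above, and the file paths are whatever you downloaded from the FNC-1 repository:

```python
import csv

def load_fnc_data(bodies_path, stances_path):
    # Map each Body ID to its article text, then join the stances against it,
    # yielding (headline, body text, stance) triples.
    with open(bodies_path, newline="", encoding="utf-8") as f:
        bodies = {row["Body ID"]: row["articleBody"] for row in csv.DictReader(f)}
    with open(stances_path, newline="", encoding="utf-8") as f:
        return [(row["Headline"], bodies[row["Body ID"]], row["Stance"])
                for row in csv.DictReader(f)]
```

Each returned triple is one (headline, body, stance) instance ready for feature extraction.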
 
 
 
 
 
<br />
 
======Distribution of the data======
 
The distribution of <code>Stance</code> classes in <code>train_stances.csv</code> is as follows:
 
{| class="wikitable"
 
!rows
 
!unrelated
 
!discuss
 
!agree
 
!disagree
 
|-
 
|49972
 
|0.73131
 
|0.17828
 
|0.0736012
 
|0.0168094
 
|}
 
 
 
 
 
<br />
 
=====[[Paper: Fake news detection on social media - A data mining perspective]]=====
 
 
 
 
 
<br />
 
=====Paper: Automatic Detection of Fake News in Social Media using Contextual Information=====
 
https://brage.bibsys.no/xmlui/bitstream/handle/11250/2559124/18038_FULLTEXT.pdf?sequence=1&isAllowed=y
 
 
 
 
 
<br />
 
======Linguistic approach======
 
The linguistic or textual approach to detecting false information uses techniques that analyse word frequency, usage, and patterns in the text. This makes it possible to find similarities with known types of text: fake news, for example, uses language similar to satire, and tends to be more emotional and simpler than genuine articles on the same topic.
 
 
 
<blockquote>
 
'''Support Vector Machines'''

A support vector machine (SVM) is a classifier that works by finding a separating hyperplane in an n-dimensional feature space containing the input. It is based on statistical learning theory [59]. Given labeled training data, the algorithm outputs an optimal hyperplane which classifies new examples. The optimal hyperplane is the divider that minimizes noise sensitivity and maximizes generalization.
 
 
 
'''Naive Bayes'''
 
Naive Bayes is a family of linear classifiers that assumes the features in a dataset are mutually independent [46]. It is known for being easy to implement, robust, fast, and accurate. Naive Bayes classifiers are widely used for classification tasks such as diagnosis of diseases and spam filtering in e-mail.
 
 
 
'''Term frequency inverse document frequency'''
 
Term frequency-inverse document frequency (TF-IDF) is a weighting scheme often used in information retrieval that gives a statistical measure of the importance of a word in a document collection or corpus. The importance of a word increases proportionally with how many times it appears in a document, but is offset by the frequency of the word across the collection. Thus a word that appears everywhere will have a low score, while less common words will have a greater value associated with them [28].
 
 
 
'''N-grams'''
 
 
 
'''Sentiment analysis'''
 
</blockquote>
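The TF-IDF weighting described above can be sketched in a few lines (raw counts for TF, the plain log ratio for IDF; real implementations often smooth both, and this sketch assumes the term occurs somewhere in the corpus):

```python
import math

def tf_idf(term, doc, corpus):
    # tf: raw count of the term in the document.
    # idf: log of (number of documents / documents containing the term).
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
print(tf_idf("the", corpus[0], corpus))  # "the" is in every document → 0.0
print(tf_idf("cat", corpus[0], corpus))  # rarer term → positive weight
```

This is exactly the behaviour described above: a word that appears in every document gets weight zero, while rarer words are weighted up.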
 
 
 
 
 
<br />
 
======Contextual approach======
 
Contextual approaches incorporate most of the information that is not text. This includes data about users, such as comments, likes, re-tweets, shares and so on. It can also be information about the origin: who created the content and where it was first published. This kind of information supports a more predictive approach than the linguistic one, which can be more deterministic. The contextual clues give a good indication of how the information is being used, and assumptions can be made on that basis.
 
 
 
This approach relies on structured data to make these assumptions, so its use is for now largely limited to social media, where a large amount of such information is made public: publishers, reactions, origin, shares, and even the age of posts.
 
 
 
<span style="background:DarkKhaki">In addition to this, contextual systems are most often used to increase the quality of existing information and augment linguistic systems,</span> by supplying them with additional signals, such as reputation scores, trust metrics, or other indicators of whether the information is statistically likely to be fake.
 
 
 
Below, a series of contextual methods is presented: a collection of both state-of-the-art and older, proven methods.
 
 
 
<blockquote>
 
'''Logistic regression'''
 
 
 
'''Crowdsourcing algorithms'''
 
 
 
'''Network analysis'''
 
 
 
'''Trust Networks'''
 
 
 
'''Trust Metrics'''
 
 
 
'''Content-driven reputation system'''
 
 
 
'''Knowledge Graphs'''
 
</blockquote>
 
 
 
 
 
<br />
 
=====Paper: Fake News Detection using Machine Learning=====
 
https://www.pantechsolutions.net/machine-learning-projects/fake-news-detection-using-machine-learning
 
 
 
 
 
<br />
 
=====Blog: I trained fake news detection AI with >95% accuracy and almost went crazy=====
 
https://towardsdatascience.com/i-trained-fake-news-detection-ai-with-95-accuracy-and-almost-went-crazy-d10589aa57c
 
 
 
 
 
<br />
 
 
 
==[[Text Analytics in Python]]==
 
<br />
 
 
 
<br />
 
==RapidMiner Examples==
 
 
 
We need to install the «Text Processing» extension from the Marketplace.
 
 
 
There is a very nice example of Sentiment Analysis in the RapidMiner samples directory:
 
/Samples/Templates/Sentiment Analysis
 
 
 
 
 
<br />
 
===Example 1 - Vectorization - Creating a DTM===
 
In this example, we create a DTM after extracting particular parts of the text (nouns in this case). Note that the «Vector creation» parameter of the «Process Documents from Files» operator allows you to configure the DTM type (TF-IDF, Term Frequency, Term Occurrences, Binary Term Occurrences).
 
 
 
 
 
Set the mode of the Tokenize operator to «non-letters» using the Parameters tab. Set the language of the Filter Tokens operator to «English» and the expression to «NN.*» without the quotes. This should extract all nouns.
 
 
 
 
 
<gallery mode=packed-overlay>
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_1.png|Main Process
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_2.png|Process Documents From Files
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_3.png|Results
 
</gallery>
 
<div style="text-align: center;">
 
Screencast at<br />
 
[[File:RapidMinerExample-Vectorization-Creating_a_DTM.mp4]]
 
<br />
 
Data at [[File:RapidMinerExample-Vectorization-Creating_a_DTM-Data.zip]]
 
</div>
 
 
 
 
 
<br />
 
 
 
===Example 2 - Sentiment Analysis===
 
 
 
* '''Install the Aylien Extension:'''
 
: Choose Marketplace (Updates and Extensions) from the Extensions Menu, then search for and install the "Text Analytics By Aylien" extension.
 
: After installing this extension and restarting RapidMiner, I got the following message. However, this exercise was successfully completed with this extension.
 
 
 
[[File:Aylien_extension_notice.png|350px|thumb|center|]]
 
 
 
 
 
* '''Aylien ID and API Key:'''
 
: In order to be able to use the AYLIEN API you will need an App ID and API Key. If you haven't already obtained them you can get them from:
 
: https://developer.aylien.com/signup?source=rapidminer
 
: I have created an account using my Gmail email and a...00
 
 
 
 
 
* '''Connecting to Aylien:'''
 
: From «Connections menu > Legacy Connections > Manage Connections (Legacy)» create a new connection of type «Aylien Text Analysis Connection». Then add your application ID and API Key and click on «Save all changes». See how to do so at
 
: [[File:Connecting_to_Aylien_API.mp4]]
 
 
 
 
 
* '''Create and run the process as shown at:'''
 
: [[File:RapidMinerExample-Sentiment_Analysis.mp4]]
 
: After running the process you should receive a polarity, subjectivity class, and score for the text.
 
 
 
 
 
<br />
 
 
 
==Some important concepts==
 
 
 
*A collection of texts is also sometimes called a '''corpus'''.
 
*'''Tokenization''' is the term used to describe the process of converting normal text strings into a list of tokens (the words we actually want).
 
**Tokens are sometimes further reduced to their base dictionary forms, known as '''lemmas'''.
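As a sketch, a simple regex-based tokenizer that splits on non-letter characters (mirroring the «non-letters» mode used in the RapidMiner example above):

```python
import re

def tokenize(text):
    # Split on runs of non-letter characters and drop empty strings.
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

print(tokenize("The user-interface is easy to use!"))
# → ['The', 'user', 'interface', 'is', 'easy', 'to', 'use']
```

Real tokenizers handle contractions, numbers, and punctuation more carefully, but this captures the basic idea.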
 
 
 
 
 
<br />
 
==References==
 
* Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." J. Mach. Learn. Res. 3 (March): 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.
 
 
 
* Cavnar, William B., and John M. Trenkle. 1994. "N-Gram-Based Text Categorization." In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161–75.
 
 
 
* Feldman, Ronen, and James Sanger. 2006. Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. New York, NY, USA: Cambridge University Press.
 
 
 
* Goldberg, Yoav, and Graeme Hirst. 2017. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers.
 
 
 
* Hearst, Marti A. 1999. "Untangling Text Data Mining." In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, 3–10. ACL '99. Stroudsburg, PA, USA: Association for Computational Linguistics.
 
 
 
* Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
 
 
 
* Kao, Anne, and Steve R. Poteet. 2006. Natural Language Processing and Text Mining. Springer Publishing Company, Incorporated.
 
 
 
* Liu, Bing. 2010. "Sentiment Analysis and Subjectivity." In Handbook of Natural Language Processing, Second Edition, edited by Nitin Indurkhya and Fred J. Damerau. Boca Raton, FL: CRC Press, Taylor & Francis Group.
 
 
 
* Luhn, H. P. 1960. "Key Word-in-Context Index for Technical Literature (KWIC Index)." American Documentation 11 (4): 288–95.
 
 
 
* Rijsbergen, C. J. van, S. E. Robertson, and M. F. Porter. 1980. "New Models in Probabilistic Information Retrieval."
 
 
 
 
 
<br />
 
