Text Analytics / Mining
 
 
 
 
 
<br />
 
We need to install the «Text Processing» extension from the RapidMiner Marketplace.
 
 
 
There is a very nice example of Sentiment Analysis in the RapidMiner Samples directory:
 
/Samples/Templates/Sentiment Analysis
 
 
 
 
 
<br />
 
==Motivation==
 
Somewhere between 80% and 90% of all potentially usable business information may originate in unstructured form. However, only a tiny fraction of those data have been mined or used in predictive analytics tasks.
 
 
 
The quality of the outcomes of predictive analytics is limited by the data that is used in the analysis. Most predictive analytics tasks use data that is stored in a simple, structured format. One of the reasons why text analytics is not yet considered mainstream is that handling unstructured data is difficult and requires additional expertise and approaches.
 
 
 
 
 
<br />
 
==What is Text Mining==
 
"... a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known." (Hearst 1999)
 
 
 
 
 
"... a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools." (Feldman and Sanger 2006)
 
 
 
 
 
"... the discovery and extraction of interesting, non-trivial knowledge from free or unstructured text." (Kao and Poteet 2006)
 
 
 
 
 
<br />
 
==Human language is difficult - Ambiguity Everywhere==
 
"I made her duck"
 
* I cooked waterfowl for her.
 
* I cooked waterfowl belonging to her.
 
* I created the (plaster?) duck she owns.
 
* I caused her to quickly lower her head.
 
* I magically converted her into roast fowl.
 
(Jurafsky and Martin 2009)
 
 
 
 
 
<br />
 
==Components of Text Mining==
 
* '''Information retrieval (IR)''' is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images, or sounds. https://en.wikipedia.org/wiki/Information_retrieval
 
 
 
 
 
* '''Natural language processing (NLP)''' is used to analyze the text using structures and rules based on human language.
 
 
 
 
 
* '''Information extraction (IE)''' involves structuring the data that the NLP system generates.
 
 
 
 
 
* '''Data Mining (DM)''' is the process of identifying patterns in large sets of data, to find new knowledge.
 
 
 
 
 
<br />
 
==Information Retrieval==
 
 
 
 
 
<br />
 
===Tokenisation===
 
* Tokenisation is a lexical process of breaking the text up into usable units.
 
* Depending on the context, tokens can be words, phrases, or symbols.
 
* What constitutes a token can depend on the language, corpus, and even context.
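As a minimal sketch, tokenisation on non-letter characters can be done in a few lines of Python. This is a naive approach: a real tokeniser would also handle hyphens, apostrophes, numbers, and non-Latin scripts.

```python
import re

def tokenise(text):
    """Split text into lowercase word tokens, treating any run of
    non-letter characters as a separator (a naive, language-dependent choice)."""
    return [t for t in re.split(r"[^a-zA-Z]+", text.lower()) if t]

tokens = tokenise("Text mining, like data mining, finds new knowledge.")
# → ['text', 'mining', 'like', 'data', 'mining', 'finds', 'new', 'knowledge']
```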
 
 
 
 
 
<br />
 
===Normalisation===
 
The goal of normalization is to convert different forms of a word to a single normalized form. For example E.U. - > EU and Grafton St. -> Grafton Street.
 
 
 
This can be based on hard-coded rules, such as replacing "St." with "Street", or deleting periods and hyphens. This can be a little hit and miss. Also, the process has to deal with ambiguity. In the "St." example, how can we know if it refers to "Street" or "Saint"?
 
 
 
The most commonplace method of normalization is to create equivalence classes, usually named after one member of the set. For instance, if the tokens «car» and «automobile» are both mapped onto the term automobile, then searches for either term will retrieve documents that contain one or both.
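A minimal sketch of equivalence-class normalisation in Python; the mapping table below is purely illustrative:

```python
# Equivalence classes: each variant form maps onto one canonical term.
# This table is illustrative, not a real normalisation resource.
EQUIV = {
    "car": "automobile",
    "automobile": "automobile",
    "e.u.": "eu",
    "st.": "street",  # ambiguous in practice: "Street" or "Saint"?
}

def normalise(token):
    token = token.lower()
    return EQUIV.get(token, token)

assert normalise("Car") == "automobile"
assert normalise("E.U.") == "eu"
```

A search for either «car» or «automobile» now retrieves documents containing either form, as described above.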
 
 
 
<br />
 
 
 
===Stop Words===
 
 
 
 
 
<br />
 
===Stemming===
 
 
 
 
 
<br />
 
===Lemmatisation===
 
 
 
 
 
<br />
 
===Vectorization===
 
 
 
 
 
<br />
 
====The Bag of Words approach====
 
The bag-of-words approach is a method of representing text based on the frequency of occurrence of words within a document, considering each word count as a feature.
 
 
 
The bag of words approach is based on some naïve assumptions:
 
* Words are independent of each other
 
* Grammar and word order are not important
 
* Documents are exchangeable, so document ordering can be neglected
 
 
 
It is called a "bag" of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document. https://machinelearningmastery.com/gentle-introduction-bag-words-model/
 
 
 
While this might be computationally simple, the assumptions do not hold for any real-life document. The method ignores the context in which a word occurs, and in the process loses the meaning.
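The bag-of-words representation can be sketched in a few lines of Python; the two toy documents are illustrative:

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Fix a vocabulary, then represent each document as a vector of word counts.
vocab = sorted({w for d in docs for w in d.split()})
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]

# vocab      → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# vectors[0] → [1, 0, 1, 1, 1, 2]  (word order within the document is lost)
```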
 
 
 
 
 
<br />
 
====N-Gram Representations====
 
I have also seen the term «bag of n-grams»
 
 
 
 
 
What is an N-Gram? https://en.wikipedia.org/wiki/N-gram
 
 
 
 
 
The n-gram method uses contiguous sequences of 1 or more words to ensure more of the meaning in a document is retained and word dependency is captured. Each word or token is called a "gram".
 
 
 
 
 
"An N-gram is an N-token sequence of words: a 2-gram (more commonly called a bigram) is a two-word sequence of words like 'please turn', 'turn your', or 'your homework', and a 3-gram (more commonly called a trigram) is a three-word sequence of words like 'please turn your', or 'turn your homework'." (Jurafsky and Martin 2009)
 
 
 
 
 
The primary disadvantage of the n-gram approach is that the size of the vocabulary is increased to <math> O(V^N)</math>.
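A minimal Python sketch of n-gram extraction, reusing the Jurafsky and Martin example:

```python
def ngrams(tokens, n):
    """Return the contiguous n-token sequences of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "please turn your homework".split()
bigrams = ngrams(words, 2)
# → [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
```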
 
 
 
 
 
<br />
 
 
 
====N-Grams and language identification====
 
* The top 300 or so N-grams are almost always highly correlated with the language and can be used for language identification. (Cavnar and Trenkle 1994)
 
 
 
* The highest-ranking N-grams are mostly uni-grams (N=1), and simply reflect the distribution of the letters of the alphabet in the document's language.
 
 
 
* After this come the function words and very frequent prefixes and suffixes (morphemes).
 
 
 
* Starting at around rank 300, an N-gram frequency profile begins to show N-grams that are more specific to the subject of the document.
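The profile idea can be sketched in Python. This is a simplified version of the Cavnar and Trenkle method: they use character n-grams of several lengths and pad word boundaries, while here we just rank raw character trigrams and compare profiles by rank displacement.

```python
from collections import Counter

def char_ngram_profile(text, n=3, top=300):
    """Rank the most frequent character n-grams of a text
    (a simplified Cavnar & Trenkle profile)."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(doc_profile, lang_profile):
    """Sum of rank displacements; n-grams missing from the language
    profile incur a fixed maximum penalty."""
    penalty = len(lang_profile)
    return sum(abs(i - lang_profile.index(g)) if g in lang_profile else penalty
               for i, g in enumerate(doc_profile))
```

A document is then assigned the language whose reference profile lies at the smallest distance.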
 
 
 
 
 
<br />
 
====Binary Term-document matrix====
 
The binary term-document incidence matrix
 
 
 
 
 
[[File:Binary_TermDocumentMatrix.png|700px|thumb|center|]]
 
 
 
 
 
* Consider a corpus of <math>N = 10^6</math> documents, each with approximately 1000 tokens.
 
* This gives a total of <math>10^9</math> tokens.
 
* With an average of 6 bytes per token, the size of the document collection would be about <math>6 \times 10^9</math> bytes = 6 GB!
 
* Assume there are <math>M = 500{,}000</math> distinct terms in the collection.
 
* The full matrix would then contain <math>500{,}000 \times 10^6</math> = half a trillion 0s and 1s!
 
* But the matrix will have no more than one billion 1s - it is extremely sparse.
 
* A better representation would be to record only the 1s.
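The incidence matrix can be built directly; the three toy "documents" below are hypothetical word lists echoing the classic Shakespeare example:

```python
# Toy corpus: hypothetical word lists, one per play.
docs = {
    "Antony and Cleopatra": "antony brutus caesar cleopatra mercy worser",
    "Julius Caesar": "antony brutus caesar calpurnia",
    "Hamlet": "brutus caesar mercy worser",
}

terms = sorted({w for text in docs.values() for w in text.split()})
# One row per term, one column per document: 1 if the term occurs, else 0.
matrix = {t: [1 if t in text.split() else 0 for text in docs.values()]
          for t in terms}

# matrix["brutus"]    → [1, 1, 1]
# matrix["calpurnia"] → [0, 1, 0]
```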
 
 
 
 
 
<br />
 
====The inverted index====
 
The inverted index consists of a dictionary, containing the tokens, and for each token a postings list: a sorted list of the IDs of the documents in which it occurs.
 
 
 
: BRUTUS → <math> \{1,2,4,11,31,45,173,174\} </math>
 
: CAESAR → <math> \{1,2,4,5,6,16,57,132\} </math>
 
: CALPURNIA → <math> \{2,31,54,101\} </math>
 
 
 
For large corpora, the space saved using this approach can be considerable.
 
 
 
 
 
Creating the inverted index:
 
# Collect the documents to be indexed
 
# Tokenize the text, turning each document into a list of tokens.
 
# Do linguistic pre-processing, producing a list of normalized tokens, which are the indexing terms.
 
# Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings.
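The four steps above can be sketched as follows; the normalisation step here is deliberately minimal:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc-id -> text. Returns term -> sorted postings list."""
    index = defaultdict(set)
    for doc_id, text in docs.items():          # 1. collect the documents
        for token in text.lower().split():     # 2. tokenize
            term = token.strip(".,;:!?")       # 3. (very light) normalisation
            index[term].add(doc_id)            # 4. record the posting
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({1: "Brutus killed Caesar.",
                              2: "Caesar and Calpurnia."})
# index['caesar'] → [1, 2]
```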
 
 
 
 
 
<br />
 
====Term-document matrix====
 
The term-document incidence matrix is similar to its binary equivalent, except each document is represented as a count vector
 
 
 
 
 
[[File:TermDocumentMatrix.png|700px|thumb|center|]]
 
 
 
 
 
<br />
 
====Zipf's Law====
 
 
 
Zipf's law states that the frequency of occurrence of any word is inversely proportional to its rank in the frequency table:
 
 
 
 
 
<math>
 
f(k;s,N) = \frac{1/k^s}{\sum_{n=1}^N 1/n^s}
 
</math>
 
 
 
 
 
Where <math>N</math> is the size of the vocabulary, <math>k</math> is the rank, and <math>s</math> is a language-specific exponent.
 
 
 
In plain English, this means that (with <math>s = 1</math>) the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
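The law can be checked numerically; with <math>s = 1</math> the predicted ratios between ranks are exact:

```python
import math

def zipf_frequency(k, s, N):
    """Predicted relative frequency of the rank-k word in a
    vocabulary of N words (Zipf's law)."""
    return (1 / k**s) / sum(1 / n**s for n in range(1, N + 1))

# With s = 1, rank 1 is predicted twice as frequent as rank 2
# and three times as frequent as rank 3:
f1, f2, f3 = (zipf_frequency(k, 1, 1000) for k in (1, 2, 3))
assert math.isclose(f1 / f2, 2) and math.isclose(f1 / f3, 3)
```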
 
 
 
 
 
Those words that account for the largest number of occurrences frequently carry less semantic weight, for instance, the articles "the" and "a", conjunctions such as "and", and common verbs such as "be".
 
 
 
Words that rarely occur in documents frequently carry a lot of semantic weight, for example, "transducer" or "terephthalate".
 
 
 
In between the two extremes are the most representative words, those that should be considered for inclusion in a controlled vocabulary.
 
 
 
 
 
<br />
 
====Weighting Word Importance====
 
'''Problem:'''
 
* Across a corpus of documents, some tokens will carry more information than others about the content of a given document.
 
 
 
* Within a single document, not all tokens are equally informative.
 
 
 
 
 
'''Solution:'''
 
: Use two rules-of-thumb:
 
 
 
:* '''The Term Frequency (TF):''' The frequency of occurrence within a document.
 
:* '''The Inverse Document Frequency (IDF):''' The rarity of occurrence across a corpus of documents.
 
 
 
 
 
<br />
 
=====Term Frequency - TF=====
 
The term frequency <math>tf_{t,d}</math> of term <math>t</math> in document <math>d</math> is defined as the number of occurrences of <math>t</math> in <math>d</math>.
 
 
 
 
 
The term frequency is used when computing query-document match scores. Raw <math>tf</math> is not what we want because:
 
 
 
* A document with <math>tf = 10</math> is more relevant than a document with <math>tf = 1</math>, <span style="color:green">BUT NOT 10 TIMES MORE!</span>
 
 
 
* Relevance does not increase proportionally with term frequency.
 
 
 
* In our example, the word 'mercy' appears five times in Hamlet, and only once in Macbeth:
 
:* Is Hamlet more relevant than Macbeth to the query 'mercy'?: <span style="color:green">'''YES'''</span>
 
:* Is Hamlet 5 times more relevant than Macbeth to the query 'mercy'?: <span style="color:green">'''NO'''</span>
 
 
 
* <span style="color:green">'''So, we need some method of weighting more frequently occurring terms'''</span>
 
 
 
 
 
<br />
 
======Log Term Frequency======
 
The log frequency weight of term <math>t</math> in <math>d</math> is given as:
 
 
 
 
 
<math>
 
W_{t,d} =
 
\begin{cases}
 
    1 + log_{10}tf_{t,d}, & \text{if }  tf_{t,d} > 0 \\
 
    0, & \text{otherwise}
 
\end{cases}
 
</math>
 
 
 
 
 
So:
 
 
 
<math>
 
\begin{array}{lcl}
 
tf_{t,d} = 0  & \rightarrow &  W_{t,d} = 0 \\
 
1              & \rightarrow &  1          \\
 
2              & \rightarrow &  1.3        \\
 
10            & \rightarrow &  2          \\
 
1000          & \rightarrow &  4          \\
 
etc ...
 
\end{array}
 
</math>
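The piecewise definition translates directly into code, reproducing the table of values above:

```python
import math

def log_tf_weight(tf):
    """W_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

# Reproduces the table of values above:
for tf, w in [(0, 0), (1, 1), (10, 2), (1000, 4)]:
    assert round(log_tf_weight(tf), 9) == w
assert round(log_tf_weight(2), 1) == 1.3
```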
 
 
 
 
 
<br />
 
 
 
=====Document Frequency=====
 
The intuition behind '''Document Frequency''' is that terms that rarely occur are more likely to be informative than those that frequently occur. This is also the rationale underlying the concept of "stop words".
 
 
 
 
 
The document frequency <math>df_t</math> is the number of documents in a corpus that the term <math>t</math> occurs in (the number of documents that contain <math>t</math>).
 
 
 
 
 
<br />
 
======The Inverse Document Frequency - IDF======
 
We define the <math>idf</math> (inverse document frequency) of <math>t</math> by:
 
 
 
<math>
 
idf_t = log_{10}\frac{N}{df_t}
 
</math>
 
 
 
 
 
The <math>idf</math> is a measure of the relative informativeness of the term <math>t</math>. As we did with <math>tf</math>, we use the log to dampen the effect of <math>idf</math>.
 
 
 
 
 
{| class="wikitable"
 
|+
 
!Term
 
!<math>df_t</math>
 
!<math>idf_t</math>
 
|-
 
|mendacity
 
|1
 
|6
 
|-
 
|animal
 
|100
 
|4
 
|-
 
|sunday
 
|1000
 
|3
 
|-
 
|fly
 
|10000
 
|2
 
|-
 
|under
 
|100000
 
|1
 
|-
 
|the
 
|1000000
 
|0
 
|}
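With <math>N = 10^6</math> documents in the collection, the table's values can be reproduced directly:

```python
import math

N = 1_000_000  # documents in the collection, matching the table above

def idf(df):
    """Inverse document frequency, log-dampened."""
    return math.log10(N / df)

for term, df_t, expected in [("mendacity", 1, 6), ("animal", 100, 4),
                             ("fly", 10_000, 2), ("the", 1_000_000, 0)]:
    assert round(idf(df_t), 6) == expected
```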
 
 
 
 
 
<blockquote>
 
'''The effects of IDF on ranking'''
 
 
 
Does <math>idf</math> have an effect on ranking for one-term queries, like "iPhone"?:
 
 
 
* <math>idf</math> has no effect on ranking one term queries.
 
* <math>idf</math> only has an impact on the ranking of documents for queries with at least two terms.
 
* For the query "capricious person", <math>idf</math> weighting makes occurrences of capricious count for much more in the final document ranking than occurrences of person.
 
</blockquote>
 
 
 
 
 
<br />
 
=====TF-IDF Weighting=====
 
The <math>tf\text{-}idf</math> weight of a term is the product of its <math>tf</math> weight and its <math>idf</math> weight.
 
 
 
<math>
 
tf.idf_t = log_{10}(1 + tf_{t,d}) \times log_{10} \frac{N}{df_t}
 
</math>
 
 
 
 
 
* <math>tf\text{-}idf</math> is the most commonplace weighting scheme in information retrieval. Note that the - in <math>tf\text{-}idf</math> is a hyphen, not a minus sign. Alternative names are <math>tf.idf</math> and <math>tf \times idf</math>.
 
 
 
* <math>tf\text{-}idf</math> increases with the number of occurrences within a document.
 
 
 
* <math>tf\text{-}idf</math> also increases with the rarity of the term in the collection
 
 
 
 
 
<br />
 
======TF-IDF Final Score======
 
The individual <math>tf.idf</math> scores are summed over the terms that occur in both the query and the document:
 
 
 
 
 
<math>
 
Score(q,d) = \sum_{t \in q \cap d} tf.idf_{t,d}
 
</math>
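A sketch of the final scoring, using the log-dampened <math>tf</math> variant given above; the document frequencies and counts below are hypothetical:

```python
import math

def tf_idf(tf, df, N):
    """tf-idf weight using the log-dampened variant given above."""
    return math.log10(1 + tf) * math.log10(N / df)

def score(query_terms, doc_counts, df, N):
    """Sum tf-idf over the terms the query and the document share."""
    return sum(tf_idf(doc_counts[t], df[t], N)
               for t in query_terms if t in doc_counts)

# Hypothetical document frequencies in a corpus of one million documents:
df = {"capricious": 100, "person": 100_000}
doc = {"capricious": 3, "person": 5}  # term counts in one document
s = score(["capricious", "person"], doc, df, N=1_000_000)
# "capricious" dominates the score despite its lower raw count
```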
 
 
 
 
 
There are many variants on the basic principle:
 
* How <math>tf</math> is computed (with/without logs)
 
* Whether the terms in the query are also weighted
 
 
 
 
 
<br />
 
=====Weight Term-Document incidence matrix=====
 
<br />
 
[[File:Weight_Term-Document_incidence_matrix.png|800px|thumb|center|]]
 
 
 
 
 
<br />
 
====Boolean Retrieval====
 
 
 
 
 
 
<br />
 
====Boolean Queries====
 
 
 
 
 
 
<br />
 
==Natural Language Processing==
 
 
 
 
 
<br />
 
===What is NLP===
 
Natural Language Processing (NLP) is a specialized field of computer science and Artificial Intelligence with the goal of enabling computers to interpret, understand, manipulate, and produce human language. It draws upon theory and research from many disparate disciplines, including computer science, information theory, machine learning, and computational linguistics. The primary areas of research in NLP are natural language understanding, natural language generation, and speech recognition.
 
 
 
 
 
"Human language is highly ambiguous... It is also ever changing and evolving. People are great at producing language and understanding language, and are capable of expressing, perceiving, and interpreting very elaborate and nuanced meanings. At the same time, while we humans are great users of language, we are also very poor at formally understanding and describing the rules that govern language."  (Goldberg and Hirst 2017)
 
 
 
 
 
<br />
 
===Applications of NLP===
 
* Spelling and grammar checking
 
* Text categorisation
 
* Named-entity recognition
 
* Advanced Information Retrieval
 
* Machine Translation
 
* Sentiment Analysis
 
* Question Generation and Question Answering
 
* Text summarisation
 
* Discourse analysis
 
 
 
 
 
<br />
 
===NLP Challenges===
 
 
 
 
 
<br />
 
====Sentence Boundary Disambiguation====
 
When carrying out Natural Language Processing, it is often necessary to break up a document into sentences prior to carrying out further processing.
 
 
 
Sentence Boundary Disambiguation, also known as sentence splitting, is the process of identifying the beginning and end of sentences.
 
 
 
Determining the boundary between sentences is not quite as simple as it may seem at first, as the sentence delimiter (a period in English) is also used to denote a decimal point, an ellipsis, or an abbreviated word.
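A naive splitter illustrating the ambiguity: it treats a sentence delimiter followed by whitespace and a capital letter as a boundary, unless the preceding token is a known abbreviation. The abbreviation list is illustrative; real systems use far more robust methods.

```python
import re

# Illustrative abbreviation list; a real system would use a much larger one.
ABBREVIATIONS = {"Dr.", "St.", "Mr.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    # A '.', '!' or '?' followed by whitespace and a capital letter
    # is a candidate boundary; '3.50' is never matched.
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        last_word = text[start:m.end()].split()[-1]
        if last_word not in ABBREVIATIONS:
            sentences.append(text[start:m.start() + 1])
            start = m.end()
    sentences.append(text[start:])
    return sentences

split_sentences("Dr. Smith lives on Grafton St. He paid 3.50 for tea. Nice.")
# → ['Dr. Smith lives on Grafton St. He paid 3.50 for tea.', 'Nice.']
```

Note that "Grafton St. He" still fools the splitter the other way: the abbreviation rule suppresses a genuine boundary, which is exactly the "Street" vs "Saint" style of ambiguity described above.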
 
 
 
 
 
<br />
 
====Part of Speech Tagging====
 
Part-of-Speech Tagging, also called word-category disambiguation, is the process of annotating each word in a text with its part of speech.
 
 
 
The basic parts of speech are articles, nouns, verbs, adjectives, pronouns, prepositions, adverbs, conjunctions, and interjections.
 
 
 
However, verbs have tenses and aspects that increase that number considerably.
 
 
 
 
 
Common approaches to part-of-speech tagging include the rules-based Brill tagger, Constraint Grammar (also rules-based), and the Baum-Welch algorithm (used to determine the unknown parameters in a Hidden Markov Model). Try the Brill tagger online at https://cst.dk/online/pos_tagger/uk/
 
 
 
 
 
<br />
 
====Named-Entity Recognition====
 
Named Entity Recognition is the task of identifying anything that can be given a proper name:
 
*  people and organizations;
 
*  locations;
 
*  biological species;
 
*  products;
 
*  substances;
 
*  etc.
 
 
 
 
 
There are two fundamental approaches to named-entity recognition. The first relies on hand-crafted rules created by computational-linguists. The second is based on statistical models and machine learning using annotated training data. The former is expensive and precise but can fail to generalize to previously unseen data. The latter lacks precision but generalizes better.
 
 
 
 
 
<br />
 
====Word Sense Disambiguation====
 
* Word Sense Disambiguation is the process of determining the exact meaning of a word when it has many possible meanings.
 
 
 
* For example, the dictionary definition of the word “tie” gives two potential meanings as verbs:
 
:*  to attach or fasten with string or similar cord.
 
:*  to restrict or limit (someone) to a particular situation or place.
 
 
 
* and two meanings as nouns:
 
:*  a piece of string, cord, or similar used for fastening or tying something.
 
:*  a rod or beam holding parts of a structure together.
 
 
 
 
 
Solutions include dictionary-based approaches, supervised machine learning, and unsupervised clustering where the meaning of neighboring words is taken into account.
 
 
 
 
 
<br />
 
====Synonym - Antonym - Hypernym - Hyponym Identification====
 
 
 
A synonym is a word that carries the same (or almost the same) meaning as another word. As an example, the words "freedom" and "liberty" are synonyms. An antonym is a word that carries the opposite (or almost the opposite) meaning to another word. The words "hot" and "cold" are antonyms of each other.
 
 
 
A hypernym is a word or phrase whose more specific meaning can be found in one or more other words or phrases. For example, the word "tool" is a hypernym of the words "hammer", "saw" and "screwdriver". A hyponym is a word or phrase whose generic meaning can be found in another word or phrase. For instance, the words "car", "bus" and "bicycle" are all hyponyms of the word "vehicle".
 
 
 
 
 
Approaches to this problem are largely based on the use of thesauri.
 
 
 
 
 
<br />
 
====Anaphora and Cataphora Resolution====
 
Anaphora are words, usually but not exclusively pronouns, whose meaning depends on a word occurring before them, known as the antecedent. In the phrase "Mary studied hard and she passed her exam", the word "her" is the anaphor and the word "Mary" is the antecedent. In other words, "her" refers to "Mary".
 
 
 
Cataphora are words whose meaning depends on words that occur after them, known as the postcedent. In the phrase "Before his second novel was published, nobody heard of Mike", "Mike" is the postcedent and "his" is the cataphor.
 
 
 
 
 
Approaches to anaphora resolution are many and varied, including rule-based, supervised machine learning, ranking, and statistical methods.
 
 
 
 
 
<br />
 
====Semantic Role Labelling====
 
Semantic Role Labelling is the process of extracting subject-predicate-object triples from a sentence.
 
 
 
It involves the detection of the semantic arguments (subject and object) associated with a predicate and their classification into one or other of the roles.
 
 
 
For example, in the sentence "Mike gave his bicycle to Paul", the predicate is "to give", the subject is "Mike", the object is "his bicycle", and the indirect object is "Paul".
 
 
 
 
 
Try the Semantic Role Labeller at http://cogcomp.org/page/demo_view/srl
 
 
 
 
 
<br />
 
===Topic modelling===
 
Topic modelling provides us with methods to organize, summarise, and extract understanding from large collections of textual information. It can be thought of as a method of identifying groups of words in a corpus that best represent the subject matter of the constituent documents.
 
 
 
It is the discovery of latent (hidden) topical patterns that are present across a corpus, and the annotation of documents in the corpus using these topics. We can then use the annotated documents to search, summarise, and organise the collection.
 
 
 
There are numerous methods used to extract topic models. An overview of the most widely used method, Latent Dirichlet Allocation (LDA), follows.
 
 
 
 
 
<br />
 
====Latent Dirichlet Allocation====
 
Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data such as text corpora.
 
 
 
The objective of LDA is to find a probabilistic model of a corpus that not only assigns high probability to members of the corpus but also assigns high probability to other "similar" documents. It produces bags of words representative of topics within a text corpus.
 
 
 
It is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
 
 
 
It assumes that words are generated by topics (fixed conditional distributions) and that these topics are infinitely exchangeable within a document and even within a set of documents.
 
 
 
Words are considered a proxy for "hidden" concepts.
 
 
 
 
 
<br />
 
'''The process works as follows:'''
 
 
 
* For each document, randomly assign each word to one of K topics, where the value of K is chosen beforehand.
 
 
 
* Although this random assignment gives topic representations of all documents and word distributions of all the topics, it is far from optimal.
 
 
 
* This sub-optimal representation is then improved by iterating through each document and, for each word, computing:
 
 
 
:* p(topic t | document d): the proportion of words in document d that are assigned to topic t
 
 
 
:* p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from word w
 
 
 
* Reassign word w a new topic t', where we choose topic t' with probability p(topic t'|document d) × p(word w|topic t').
 
 
 
* This generative model predicts the probability that topic t' generated word w.
 
 
 
* After repeating the last step a large number of times, a steady state is reached where topic assignments are close to optimal.
 
 
 
* These assignments are then used to determine the topic mixtures for each document.
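The steps above can be sketched as a collapsed Gibbs sampler. This is a minimal illustration, not a production implementation: the hyperparameters alpha and beta (Dirichlet smoothing priors) and the toy corpus are our own assumptions, and a real implementation would also extract the topic-word distributions at the end.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA, following the steps above.
    docs is a list of token lists; returns a topic id per word."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [defaultdict(int) for _ in docs]      # document-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # topic totals
    z = []                                      # random initial assignment
    for d, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                     # remove current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # proportional to p(topic t | document d) * p(word w | topic t)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k                     # reassign to the sampled topic
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return z

docs = [["broccoli", "banana"], ["banana", "spinach"],
        ["kitten", "cute"], ["kitten", "chinchilla"]]
assignments = lda_gibbs(docs, K=2)
```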
 
 
 
 
 
<br />
 
'''Definitions:'''
 
 
 
* A '''Topic''' is a probability distribution over a collection of words
 
 
 
* A '''Topic model''' is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics -i.e. a generative model.
 
 
 
* '''Co-occurrence''' is the occurrence of two words from a corpus either alongside each other or close to each other at a probability greater than pure chance.
 
 
 
 
 
The primary objective of topic modeling is to provide a "thematic summary" of a collection of documents.
 
 
 
 
 
<br />
 
'''Suppose we are trying to extract topics from these data:'''
 
 
 
1. I like to eat broccoli and bananas
 
 
 
2. I ate a banana and spinach smoothie for breakfast
 
 
 
3. Chinchillas and kittens are cute
 
 
 
4. My sister adopted a kitten yesterday
 
 
 
5. I like the cute hamster munching on a piece of broccoli
 
 
 
 
 
 
 
'''What latent themes are there and how do we extract them?'''
 
 
 
* Sentences 1 & 2: 100% Topic A
 
 
 
* Sentences 3 & 4: 100% Topic B
 
 
 
* Sentence 5: 60% Topic A, 40% Topic B
 
 
 
* Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)
 
 
 
* Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)
 
 
 
 
 
<br />
 
===Sentiment Analysis===
 
 
 
 
 
<br />
 
====What is Sentiment Analysis====
 
Sentiment analysis is the computational study of opinions, sentiments, evaluations, and attitudes expressed in the form of text. It is sometimes termed "Opinion Mining".
 
 
 
Sentiment or opinion is often highly subjective, so opinion mining involves summarising the opinions or sentiment of many people.
 
 
 
 
 
<br />
 
Dan Jurafsky:
 
 
 
: Sentiment analysis is the detection of the attitudes of the writer of a piece of text towards a particular subject and involves:
 
 
 
:* the text containing the attitude can be as short as a single sentence or as long as an entire document
 
 
 
:* the holder or source of the attitude
 
 
 
:* the target or aspect of the attitude
 
 
 
:* the class or type of the attitude, from a predefined set of types
 
::* positive, negative, neutral (relatively easy)
 
::* like, love, hate, value, desire, etc. (more difficult)
 
 
 
 
 
<br />
 
=====Formal Definition - Liu 2010=====
 
 
 
An opinion or sentiment is a quintuple:
 
 
 
<math>
 
(e_j, a_{jk}, SO_{ijkl}, h_{i}, t_{l})
 
</math>
 
 
 
 
 
Where:
 
* <math>e_j</math> is a target entity
 
 
 
* <math>a_{jk}</math> is an aspect/feature of the entity <math>e_j</math>
 
 
 
* <math>SO_{ijkl}</math> is the sentiment value of the opinion from the opinion holder <math>h_{i}</math>
 
:* <math>SO_{ijkl}</math> is positive, negative, neutral, or a more granular rating.
 
 
 
* <math>h_{i}</math> is an opinion holder
 
 
 
* <math>t_{l}</math> is the time when the opinion is expressed.
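The quintuple translates naturally into a record type; the field names and the example values below are our own illustrative choices:

```python
from dataclasses import dataclass

@dataclass
class Opinion:
    """Liu's opinion quintuple as a record type."""
    entity: str      # e_j, the target entity
    aspect: str      # a_jk, an aspect/feature of the entity
    sentiment: str   # SO_ijkl: positive / negative / neutral, or a rating
    holder: str      # h_i, the opinion holder
    time: str        # t_l, when the opinion was expressed

op = Opinion("iPhone", "battery life", "negative", "user123", "2020-01-15")
```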
 
 
 
 
 
<br />
 
 
 
====Challenges====
 
 
 
 
 
<br />
 
=====Sarcasm and Subtlety=====
 
Consider the following perfume review...
 
 
 
"If you are reading this because it is your darling fragrance, please wear it at home exclusively, and tape the windows shut." (from Dan Jurafsky)
 
 
 
 
 
We can see that this sentence is laden with negative sentiment, but it is expressed in such a subtle way that a computer will not be able to recognise it.
 
 
 
 
 
<br />
 
=====Ordering Effects and Thwarted Expectations=====
 
Examples from Dan Jurafsky:
 
 
 
"This film should be <span style="color:blue">brilliant</span>. It sounds like a <span style="color:blue">great</span> plot, the actors are <span style="color:blue">first grade</span>, and the supporting cast is <span style="color:blue">good</span> as well, and Stallone is attempting to deliver a <span style="color:blue">good</span> performance. However, it <span style="color:red">can't hold up</span>."
 
 
 
"Well as usual Keanu Reeves is <span style="color:red">nothing special</span>, but surprisingly, the <span style="color:blue">very talented</span> Laurence Fishbourne is <span style="color:red">not so good</span> either, I was surprised."
 
 
 
 
 
<br />
 
=====Negations - Comparisons - Complex Opinions=====
 
* While <span style="color:red">not good</span> could be considered to be almost synonymous with <span style="color:red">bad</span>, is <span style="color:blue">not bad</span> synonymous with <span style="color:blue">good</span>?
 
 
 
* A comparison is more challenging than regular direct or indirect opinions: For example "iPhone is better than Android"
 
 
 
* Complex opinions can also be a challenge to parse. For example "Wifi on my iPhone doesn't work as well as it does on my Android, but everything else is better"
 
 
 
 
 
<br />
 
 
 
====Approaches====
 
Approaches to mining sentiment fall into three broad categories:
 
 
 
* Lexical analysis, using Natural Language Processing techniques.
 
 
 
* Supervised machine learning (Naive Bayes, Support Vector Machines, etc.), trained on pre-labeled examples.
 
 
 
* Hybrid: A combination of the lexical analysis and machine learning, generally involving annotating sentences with lexical information prior to applying supervised learning.
 
 
 
 
 
<br />
 
 
 
==RapidMiner Examples==
 
 
 
 
 
<br />
 
===Example 1 - Vectorization - Creating a DTM===
 
In this example, we create a DTM after extracting particular parts of speech from the text (nouns in this case). Note that the «Vector creation» parameter of the «Process Documents from Files» operator allows you to configure the DTM type (TF-IDF, Term Frequency, Term Occurrences, Binary Term Occurrences).
 
 
 
Set the mode of the Tokenize operator to non-letters using the Parameters tab. Set the language of the Filter Tokens operator to English and the expression to "NN.*" without the quotes.  This should extract all nouns.
 
 
 
 
 
<gallery mode=packed-overlay>
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_1.png|Main Process
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_2.png|Process Documents From Files
 
File:RapidMinerExample-Vectorization-Creating_a_DTM_3.png|Results
 
</gallery>
 
<div style="text-align: center;">
 
Screencast at [[File:RapidMinerExample-Vectorization-Creating_a_DTM.mp4]]
 
</div>
 
 
 
 
 
<br />
 
 
 
==References==
 
* Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003. "Latent Dirichlet Allocation." J. Mach. Learn. Res. 3 (March): 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.
 
 
 
* Cavnar, William B., and John M. Trenkle. 1994. "N-Gram-Based Text Categorization." In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, 161–75.
 
 
 
* Feldman, Ronen, and James Sanger. 2006. Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. New York, NY, USA: Cambridge University Press.
 
 
 
* Goldberg, Yoav, and Graeme Hirst. 2017. Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers.
 
 
 
* Hearst, Marti A. 1999. "Untangling Text Data Mining." In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, 3–10. ACL '99. Stroudsburg, PA, USA: Association for Computational Linguistics.
 
 
 
* Jurafsky, Daniel, and James H. Martin. 2009. Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc.
 
 
 
* Kao, Anne, and Steve R. Poteet. 2006. Natural Language Processing and Text Mining. Springer Publishing Company, Incorporated.
 
 
 
* Liu, Bing. 2010. "Sentiment Analysis and Subjectivity." In Handbook of Natural Language Processing, Second Edition, edited by Nitin Indurkhya and Fred J. Damerau. Boca Raton, FL: CRC Press, Taylor & Francis Group.
 
 
 
* Luhn, H. P. 1960. "Key Word-in-Context Index for Technical Literature (KWIC Index)." American Documentation 11 (4): 288–95.
 
 
 
* Rijsbergen, C. J. van, S. E. Robertson, and M. F. Porter. 1980. "New Models in Probabilistic Information Retrieval."
 
 
 
 
 
<br />
 
