The course on which this page is based referred to Natural Language Processing, not Text Analytics. However, I think it covers aspects that are not exclusive to Natural Language Processing.
 
 
 
 
https://en.wikipedia.org/wiki/Natural_language_processing
 
 
 
'''Natural language processing''' ('''NLP''') is concerned with the interactions between computers and human (natural) languages, in particular how to process and analyze large amounts of natural language data.
 
 
 
 
 
Challenges in Natural Language Processing frequently involve '''<code>text classification</code>''', '''<code>speech recognition</code>''', '''<code>natural language understanding</code>''', and '''<code>natural language generation</code>'''.
 
 
 
 
 
Natural Language Processing largely consists of combining machine learning techniques with text, using math and statistics to get that text into a format that machine learning algorithms can understand.
 
 
 
 
 
<br />
 
== Some Resources ==
 
* [http://www.nltk.org/book/ NLTK Book Online]
 
* [https://www.kaggle.com/c/word2vec-nlp-tutorial Kaggle Walkthrough]
 
* [https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html SciKit Learn's Tutorial]
 
 
 
 
 
* https://www.youtube.com/watch?v=O_B7XLfx0ic
 
* https://www.youtube.com/watch?v=xvqsFTUsOmc  (at 1:10)
 
 
 
 
 
* https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184
 
* https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
 
* https://www.udemy.com/course/emotion-and-sentiment-analysis/
 
* https://www.udemy.com/course/data-science-natural-language-processing-in-python/
 
 
 
 
 
<br />
 
==NLTK==
 
https://www.nltk.org/
 
 
 
Online book: http://www.nltk.org/book/
 
 
 
https://en.wikipedia.org/wiki/Natural_Language_Toolkit
 
 
 
 
 
The '''Natural Language Toolkit''', or more commonly '''NLTK''', is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English written in the Python programming language.
 
 
 
 
 
<br />
 
===Installation===
 
http://www.nltk.org/install.html
 
 
 
 
 
<syntaxhighlight lang="shell">
 
conda install nltk  # Installs nltk
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
====Installing NLTK Data====
 
https://www.nltk.org/data.html
 
 
 
 
 
<code>NLTK</code> comes with many corpora, toy grammars, trained models, etc. A complete list is posted at: http://nltk.org/nltk_data/
 
<syntaxhighlight lang="shell">
 
import nltk        # Imports the library
 
nltk.download()    # Download the necessary datasets
 
    Those are some of the important datasets that can be installed:
 
    => all-corpora
 
    => all-nltk
 
</syntaxhighlight>
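

Instead of using the interactive downloader, the required data packages can also be fetched programmatically into a custom directory. Here is a minimal sketch (the target directory <code>/home/adelo/.nltk/nltk_data</code> is just the location used later on this page):

<syntaxhighlight lang="python3">
import nltk

# Download only the packages used in the examples on this page,
# into a custom directory instead of the default $HOME/nltk_data:
for pkg in ['brown', 'stopwords']:
    nltk.download(pkg, download_dir='/home/adelo/.nltk/nltk_data')
</syntaxhighlight>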
 
 
 
After completing the download process as above, the data will be located under $HOME. I have relocated the data to this location:
 
<code>/home/adelo/.nltk/nltk_data</code>
 
 
 
 
 
You can set the <code>NLTK_DATA</code> environment variable to specify the location of the data (note: I haven't implemented it this way; I think I tried it and it didn't work, but I don't remember well):
 
<code>/home/adelo/.bashrc:</code>
 
<syntaxhighlight lang="bash">
 
# Installing NLTK Data
 
export NLTK_DATA=/home/adelo/.nltk/nltk_data
 
</syntaxhighlight>
 
 
 
What I do instead is to add the path where I have placed the nltk_data directory to <code>nltk.data.path</code>. So, in the Python script:
 
<syntaxhighlight lang="bash">
 
nltk.data.path.append('/home/adelo/.nltk/nltk_data')
 
 
 
 
 
# Or, to do it generically for any system:
 
import os
 
HOME = os.environ['HOME']
 
nltk.data.path.append(HOME+'/.nltk/nltk_data')
 
</syntaxhighlight>
 
 
 
 
 
Then, we can test that the data has been installed and the variable properly set as follows (This assumes you downloaded the Brown Corpus):
 
<syntaxhighlight lang="python3">
 
from nltk.corpus import brown
 
brown.words()
 
# Output:
 
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
 
 
 
 
 
from nltk.corpus import stopwords
 
stopwords.words('english')
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
 
 
==First example==
 
 
 
 
 
<br />
 
===Our test dataset===
 
 
 
*We'll be using the SMS Spam Collection DataSet. This dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#
 
 
 
 
 
*Text file: <code>'''smsspamcollection'''</code>
 
 
 
 
 
*The SMS Spam Collection v.1 (hereafter the corpus) is a set of SMS tagged messages that have been collected for SMS spam research. It contains a set of 5,574 SMS messages in English, tagged according to whether they are ham (legitimate) or spam.
 
 
 
 
 
*The SMS Spam Collection v.1  has a total of 4,827 SMS legitimate messages (86.6%) and a total of 747 (13.4%) spam messages.
 
 
 
 
 
*The files contain one message per line. Each line is composed of two columns: one with the label (ham or spam) and the other with the raw text. Here are some examples:
 
<blockquote>
 
<syntaxhighlight lang="shell">
 
ham  What you doing?how are you?
 
ham  Ok lar... Joking wif u oni...
 
ham  dun say so early hor... U c already then say...
 
ham  MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
 
ham  Siva is in hostel aha:-.
 
ham  Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
 
spam  FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
 
spam  Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
 
spam  URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
 
</syntaxhighlight>
 
</blockquote>
 
 
 
 
 
*Note: messages are not chronologically sorted.
 
 
 
 
 
*<span style="color:#FF0000">Using these labeled ham and spam examples, we'll train a machine learning model to learn to discriminate between ham/spam automatically. Then, with a trained model, we'll be able to classify arbitrary unlabeled messages as ham or spam.</span>
 
 
 
 
 
<br />
 
 
 
===Importing the data===
 
<syntaxhighlight lang="python">
 
messages = [line.rstrip() for line in open('smsspamcollection/SMSSpamCollection')]
 
print(len(messages))
 
 
 
# Output:
 
5574
 
</syntaxhighlight>
 
 
 
 
 
Let's print the first ten messages and number them using enumerate:
 
<syntaxhighlight lang="python">
 
for message_no, message in enumerate(messages[:10]):
 
    print(message_no, message)
 
    print('\n')
 
 
 
# Output:
 
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
 
 
 
 
 
1 ham Ok lar... Joking wif u oni...
 
 
 
 
 
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
 
 
 
 
 
3 ham U dun say so early hor... U c already then say...
 
 
 
 
 
4 ham Nah I don't think he goes to usf, he lives around here though
 
 
 
 
 
5 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
 
 
 
 
 
6 ham Even my brother is not like to speak with me. They treat me like aids patent.
 
 
 
 
 
7 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
 
 
 
 
 
8 spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
 
 
 
 
 
9 spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
 
</syntaxhighlight>
 
 
 
 
 
Due to the spacing we can tell that this is a [https://en.wikipedia.org/wiki/Tab-separated_values TSV] ("tab separated values") file, where the first column is a label saying whether the given message is a normal message (commonly known as "ham") or "spam". The second column is the message itself. (Note our numbers aren't part of the file, they are just from the enumerate call).
 
 
 
 
 
Instead of parsing TSV manually using Python, we can just take advantage of pandas! Let's go ahead and import it into a <code>DataFrame</code>:
 
<syntaxhighlight lang="python3">
 
import pandas as pd
 
 
 
messages = pd.read_csv('smsspamcollection/SMSSpamCollection', sep='\t',
 
                          names=["label", "message"])
 
 
 
type(messages)
 
# Output:
 
pandas.core.frame.DataFrame
 
 
 
messages.head()
 
# output:
 
    label                                              message
 
0    ham    Go until jurong point, crazy.. Available only ...
 
1    ham                        Ok lar... Joking wif u oni...
 
2    spam    Free entry in 2 a wkly comp to win FA Cup fina...
 
3    ham    U dun say so early hor... U c already then say...
 
4    ham    Nah I don't think he goes to usf, he lives aro...
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
===Exploratory Data Analysis===
 
{| class="wikitable"
 
! style="width: 15%" |
 
! style="width: 15%" |Method / Operator / Description
 
! style="width: 50%" |Example
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Describe the data</h4>
 
|<code>import pandas</code>
 
 
 
 
 
<code>df.describe()</code>
 
|<syntaxhighlight lang="python3">
 
import pandas
 
 
 
messages.describe()
 
 
 
# Output:
 
        label                message
 
count  5572                    5572
 
unique    2                    5169
 
top      ham  Sorry, I'll call later
 
freq    4825                      30
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Describe by group</h4>
 
|<code>import pandas</code>
 
 
 
 
 
<code>df.groupby('label').describe()</code>
 
 
 
 
 
Let's use '''groupby''' together with '''describe''', grouping by label.
 
 
 
This way we can begin to think about the features that separate ham and spam!
 
|<syntaxhighlight lang="python3">
 
import pandas
 
 
 
messages.groupby('label').describe()
 
 
 
# Output:
 
                                                            message
 
label     
 
ham      count                                                4825
 
        unique                                                4516
 
          top                              Sorry, I'll call later
 
          freq                                                  30
 
spam    count                                                  747
 
        unique                                                  653
 
          top    Please call our customer service representativ...
 
          freq                                                    4
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Text length</h4>
 
|<code>df['length'] = df['column'].apply(len)</code>
 
 
 
 
 
Let's make a new column to detect how long the text messages are.
 
|<syntaxhighlight lang="shell">
 
messages['length'] = messages['message'].apply(len)
 
messages.head()
 
 
 
# Output:
 
  label                                              message  length
 
0    ham    Go until jurong point, crazy.. Available only ...      111
 
1    ham                        Ok lar... Joking wif u oni...      29
 
2  spam    Free entry in 2 a wkly comp to win FA Cup fina...      155
 
3    ham    U dun say so early hor... U c already then say...      49
 
4    ham    Nah I don't think he goes to usf, he lives aro...      61
 
</syntaxhighlight>
 
|-
 
! rowspan="4" style="vertical-align:top;" |<h4 style="text-align:left">Histogram</h4>
 
|Play around with the bin size!
 
 
 
 
 
'''From the <code>Histogram</code>, it looks like text length may be a good feature to think about!''' Let's try to explain why the x-axis of the Histogram goes all the way to 1000ish. This must mean that there is some really long message!
 
|<syntaxhighlight lang="python3">
 
import matplotlib.pyplot as plt
 
import seaborn as sns
 
 
 
%matplotlib inline
 
plt.style.use('bmh')
 
 
 
messages['length'].plot(bins=50, kind='hist', edgecolor="k")
 
</syntaxhighlight>[[File:Nlp1.png|center]]
 
|-
 
|Using <code>describe()</code> over the <code>length</code> column, we can see that there is a message of 910 characters. This is why the x-axis of the histogram above goes all the way to 1000ish.
 
Let's use masking to find this message.
 
|<syntaxhighlight lang="python3">
 
messages.length.describe()
 
 
 
# Output:
 
count    5572.000000
 
mean      80.489950
 
std        59.942907
 
min        2.000000
 
25%        36.000000
 
50%        62.000000
 
75%      122.000000
 
max      910.000000
 
</syntaxhighlight>
 
|-
 
|This way we can find the message of 910 characters.
 
|<syntaxhighlight lang="python3">
 
messages[messages['length'] == 910]['message'].iloc[0]
 
 
 
# Output:
 
"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."
 
</syntaxhighlight>
 
|-
 
|'''Let's focus back on the idea of trying to see if message length is a distinguishing feature between ham and spam.'''
 
 
 
 
 
Very interesting! Through just basic EDA we've been able to discover a trend that spam messages tend to have more characters.
 
|<syntaxhighlight lang="python3">
 
messages.hist(column='length', by='label', bins=50,figsize=(12,4), edgecolor="k")
 
</syntaxhighlight>[[File:Nlp2.png|center]]
 
|}
 
 
 
 
 
<br />
 
===Text Pre-processing===
 
Our main issue with our data is that it is all in text format (strings). The classification algorithms that we've learned about so far need some sort of numerical feature vector in order to perform the classification task. '''There are actually many methods to convert a corpus to a vector format. The simplest is the bag-of-words approach''', where each unique word in a text will be represented by one number.
 
<br />
 
 
 
*In this section we'll convert the raw messages (sequence of characters) into vectors (sequences of numbers).
 
 
 
<br />
 
 
 
*As a first step, let's write a function that will split a message into its individual words and return a list. We'll also remove very common words ('the', 'a', etc.). To do this we will take advantage of the <code>'''NLTK'''</code> library. It's pretty much the standard library in Python for processing text and has a lot of useful features. We'll only use some of the basic ones here.
 
 
 
<br />
 
 
 
*Let's create a function that will process the string in the message column; then we can just use '''<code>apply()</code>''' in pandas to process all the text in the <code>DataFrame</code>.
 
 
 
<br />
 
 
 
*First removing punctuation. We can just take advantage of Python's built-in '''<code>string</code>''' library to get a quick list of all the possible punctuation.
 
 
 
<br />
 
{| class="wikitable"
 
!
 
!
 
!Example
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Removing punctuation</h4>
 
|We can just take advantage of Python's built-in '''string''' library to get a quick list of all the possible punctuation: <code>string.punctuation</code>
 
|<syntaxhighlight lang="python3">
 
import string
 
 
 
mess = 'Sample message! Notice: it has punctuation.'
 
 
 
# Check characters to see if they are in punctuation
 
nopunc = [char for char in mess if char not in string.punctuation]
 
 
 
# Join the characters again to form the string.
 
nopunc = ''.join(nopunc)
 
print(nopunc)
 
 
 
# Output:
 
Sample message Notice it has punctuation
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Remove stopwords</h4>
 
|Stopwords are very common words ('the', 'a', etc..).
 
We can import a list of english stopwords from NLTK (check the documentation for more languages and info).
 
|<syntaxhighlight lang="python3">
 
from nltk.corpus import stopwords
 
stopwords.words('english')[0:10] # Show some stop words
 
# Output:
 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your']
 
 
 
nopunc.split()
 
# Output:
 
['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']
 
 
 
# Now just remove any stopwords
 
clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
 
clean_mess
 
# Output:
 
['Sample', 'message', 'Notice', 'punctuation']
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top; text-align:left;" |<h4 style="text-align:left">Making a function to apply a set of pre-procesteps steps and '''tokarize''' the data</h4>
 
 
 
*'''<code>Remove ''Punctuation''</code>'''
 
*'''<code>''Remove Stopwords''</code>'''
 
*'''<code>Tokenize</code>'''
 
|We can make a function to remove <code>'''''Punctuation'''''</code> and <code>'''''Stopwords'''''</code> and to <code>'''Tokenize'''</code> our messages. This function will be applied to our DataFrame.
 
 
 
 
 
'''<code>Tokenization</code>''' is just the term used to describe the process of converting the normal text strings in to a list of tokens (words that we actually want).
 
 
 
 
 
Notice that this function is returning a <code>'''list'''</code> of words without Punctuation or Stopwords.
 
|<syntaxhighlight lang="python3">
 
def text_process(mess):
 
    """
 
    Takes in a string of text, then performs the following:
 
    1. Remove all punctuation
 
    2. Remove all stopwords
 
    3. Returns a list of the cleaned text
 
    """
 
    # Check characters to see if they are in punctuation
 
    nopunc = [char for char in mess if char not in string.punctuation]
 
 
 
    # Join the characters again to form the string.
 
    nopunc = ''.join(nopunc)
 
   
 
    # Now just remove any stopwords
 
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Applying the function over our DataFrace</h4>
 
|'''Note:''' We may get some warnings or errors for symbols we didn't account for or that weren't in Unicode (like a British pound symbol)
 
|<syntaxhighlight lang="python3">
 
# Show original dataframe
 
messages.head()
 
# Output:
 
  label                                              message  length
 
0    ham    Go until jurong point, crazy.. Available only ...      111
 
1    ham                        Ok lar... Joking wif u oni...      29
 
2  spam    Free entry in 2 a wkly comp to win FA Cup fina...      155
 
3    ham    U dun say so early hor... U c already then say...      49
 
4    ham    Nah I don't think he goes to usf, he lives aro...      61
 
 
 
 
 
# Applying the function
 
messages['message'].head(5).apply(text_process)
 
# Output
 
0    [Go, jurong, point, crazy, Available, bugis, n...
 
1                      [Ok, lar, Joking, wif, u, oni]
 
2    [Free, entry, 2, wkly, comp, win, FA, Cup, fin...
 
3        [U, dun, say, early, hor, U, c, already, say]
 
4    [Nah, dont, think, goes, usf, lives, around, t...
 
</syntaxhighlight>
 
|-
 
! style="vertical-align:top;" |<h4 style="text-align:left">Continuing Normalization</h4>
 
| colspan="2" |There are a lot of ways to continue normalizing this text. Such as '''<code>[[wikipedia:Stemming|Stemming]]</code>''' or '''<code>[http://www.nltk.org/book/ch05.html distinguishing by part of speech]</code>'''.
 
 
 
 
 
NLTK has lots of built-in tools and great documentation on a lot of these methods (a small stemming and part-of-speech tagging sketch is shown right after this table). Sometimes they don't work well for text messages, due to the way a lot of people tend to use abbreviations or shorthand. For example:
 
 
 
 
 
<code>'Nah dawg, IDK! Wut time u headin to da club?'</code>
 
 
 
Vs.
 
 
 
<code>'No dog, I don't know! What time are you heading to the club?'</code>
 
|}
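

As a small illustration of these further normalization steps, here is a minimal sketch using NLTK's '''PorterStemmer''' and '''pos_tag''' (the example tokens are made up; <code>pos_tag</code> needs the <code>averaged_perceptron_tagger</code> NLTK data package):

<syntaxhighlight lang="python3">
from nltk.stem import PorterStemmer
from nltk import pos_tag

tokens = ['Sample', 'messages', 'noticing', 'punctuation']

# Stemming: reduce each token to a crude root form
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # e.g. ['sampl', 'messag', 'notic', 'punctuat']

# Part-of-speech tagging: label each token with its grammatical role
print(pos_tag(tokens))                     # e.g. [('Sample', 'NN'), ('messages', 'NNS'), ...]
</syntaxhighlight>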
 
 
 
 
 
<br />
 
 
 
===Vectorization===
 
Usually, after pre-processing, we have the messages as '''<code>lists of tokens</code>''' (also known as '''<code>lemmas</code>''').
 
 
 
Now we'll convert each message, represented as a list of tokens (lemmas) into a Numeric Vector that machine learning models can understand.
 
 
 
To be able to run a Machine Learning algorithm, we first need to transform each text document into a numerical representation in the form of a vector. This matrix will be the numerical representation that a Machine Learning algorithm is able to understand.
 
 
 
We'll do that in three steps using the '''<code>bag-of-words</code>''' model:
 
 
 
#Create the '''<code>Document Term Matrix (DTM)</code>''' (also known as the '''<code>Term Frequency (TF)</code>''' matrix)''':''' count how many times each word occurs in each text document.
 
#'''<code>Term weighting</code>''': Weigh the counts, so that frequent tokens get lower weight (Inverse Document Frequency).
 
#<code>'''Normalization'''</code>: Normalize the vectors to unit length, to abstract from the original text length (L2 Norm).
 
 
 
 
 
<br />
 
====Document Term Matrix====
 
We will convert a collection of text documents to a matrix of token counts:
 
 
 
*We can imagine the matrix of token counts as a 2-dimensional matrix, where one dimension is the entire vocabulary (one row per word) and the other dimension is the actual documents (in this case, one column per text message).
 
 
 
*Since there are so many messages, we can expect a lot of zero counts (most words do not appear in most messages). Because of this, <code>'''SciKit-Learn'''</code> will output a [[wikipedia:Sparse_matrix|Sparse Matrix]].
 
 
 
*Each column (or row, depending on the approach) of this matrix represents a word in the training data. Thus, each document is defined by the frequencies of the words in the dictionary composed of all the terms in our data. A small toy example is shown after the figure below.
 
 
 
 
 
{| style="border-spacing: 2px; width: 100%;"
 
|+
 
|<figure id="fig:DocumentTermMatrix">
 
<!-- <pdf width="2000" height="630">File:Document-Term Matrix.pdf</pdf> -->
 
[[File:Document-Term Matrix.png|700px|thumb|center|
 
<caption>
 
The Document-Term Matrix <br/>
 
[[File:Document-Term Matrix.pdf]] [[File:Document-Term Matrix.ods]] [[File:Using_text_mining_and_Machine_Learning_to_classify_documents.pdf]]
 
</caption>
 
]]
 
</figure>
 
|[[File:Nlp3.png|center|343x343px]]
 
|}
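

As a quick toy illustration of what a Document-Term Matrix looks like, here is a minimal sketch with two made-up documents (newer scikit-learn versions use <code>get_feature_names_out()</code> instead of <code>get_feature_names()</code>):

<syntaxhighlight lang="python3">
from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the dog ate my homework']

vect = CountVectorizer()
dtm = vect.fit_transform(docs)       # sparse matrix: one row per document, one column per word

print(vect.get_feature_names())      # the learned vocabulary
print(dtm.toarray())                 # dense view of the word counts per document
</syntaxhighlight>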
 
 
 
 
 
<br />
 
=====Using Scikit-learn CountVectorizer method to create a DTM=====
 
In Python, we can use '''<code>Scikit-learn</code>'''<nowiki/>'s '''<code>CountVectorizer</code>''' method to create a <code>'''DTM'''</code>. Let's see how to do so in our example:
 
 
 
<syntaxhighlight lang="python3">
 
from sklearn.feature_extraction.text import CountVectorizer
 
 
 
 
 
# This creates a «Bag-of-Words (bow) transformer object» (it is not the resulting DTM yet)
 
# There are a lot of arguments and parameters that can be passed to the CountVectorizer. In this case we will just specify the analyzer to be our own previously defined function «text_process»:
 
# Might take a while...
 
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message'])
 
 
 
 
 
# Print total number of vocab words:
 
print(len(bow_transformer.vocabulary_))
 
# Output:
 
11425
 
 
 
 
 
# Let's take one text message and get its bag-of-words counts as a vector, putting to use our new bow_transformer:
 
message4 = messages['message'][3]
 
print(message4)
 
# Output:
 
U dun say so early hor... U c already then say...
 
 
 
 
 
# Now let's see its vector representation:
 
bow4 = bow_transformer.transform([message4])
 
print(bow4)
 
print(bow4.shape)
 
# Output:
 
(0, 4068)  2
 
(0, 4629)  1
 
(0, 5261)  1
 
(0, 6204)  1
 
(0, 6222)  1
 
(0, 7186)  1
 
(0, 9554)  2
 
(1, 11425)
 
# This means that there are seven unique words in message number 4 (after removing common stop words). Two of them appear twice, the rest only once.
 
 
 
 
 
# Let's go ahead and check and confirm which ones appear twice:
 
print(bow_transformer.get_feature_names()[4068])
 
print(bow_transformer.get_feature_names()[9554])
 
# Output:
 
U
 
say
 
 
 
 
 
# Now we can use «.transform» on our «Bag-of-Words (bow) transformed object» and transform the entire DataFrame of messages. Let's go ahead and check out how the bag-of-words counts for the entire SMS corpus is a large, sparse matrix:
 
messages_bow = bow_transformer.transform(messages['message'])
 
 
 
 
 
print('Shape of Sparse Matrix: ', messages_bow.shape)
 
print('Amount of Non-Zero occurences: ', messages_bow.nnz)
 
# Output:
 
Shape of Sparse Matrix:  (5572, 11425)
 
Amount of Non-Zero occurences:  50795
 
 
 
 
 
sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))

print('sparsity: {}%'.format(round(sparsity, 2)))

# Output:

sparsity: 0.08%
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
====Term weighting and Normalization using TF-IDF====
 
<blockquote>
 
In general terms, the process of '''<code>weighting</code>''' involves emphasizing the contribution of particular aspects of a phenomenon (or of a set of data) over others to a final outcome or result; thereby highlighting those aspects in comparison to others in the analysis. That is, rather than each variable in the data set contributing equally to the final result, some of the data is adjusted to make a greater contribution than others. https://en.wikipedia.org/wiki/Weighting
 
 
 
 
 
'''<code>TF-IDF</code>''', short for '''Term Frequency–Inverse Document Frequency''', and the '''<code>TF-IDF Weight</code>''', is a statistical measure used to evaluate '''''how important a word is to a document in a collection or corpus'''''. It has many uses, most importantly in automated text analysis. It is often used as a '''<code>weighting factor</code>''' in machine learning algorithms for Natural Language Processing.
 
 
 
 
 
 
 
Typically, the '''<code>TF-IDF Weight</code>''' is computed as the product of the '''<code>TF</code>''' and the '''<code>IDF</code>''':
 
 
 
*<code>'''The normalized Term Frequency (TF)'''</code>, which is the number of times a word appears in a document, divided by the total number of words in that document.
 
**Why <code>Normalization</code>?: Since every document is different in length, it is probable that a term will appear many more times in long documents than in shorter ones. Thus, the Term Frequency is often divided by the document length (total number of words in that document) as a way of <code>Normalization</code>:
 
**<math>TF(t) = \frac{\text{Number of times term  } t \text{  appears in a document}}{\text{Total number of terms in that document}}</math>
 
 
 
 
 
*'''<code>Inverse Document Frequency (IDF)</code>'''. The <code>IDF</code> measures how important a term is. While computing <code>TF</code>, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. '''Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following:'''
 
**The <code>IDF</code> is computed as the logarithm (here, base 10) of the number of documents in the corpus divided by the number of documents where the specific term appears.

**<math>IDF(t) = \log_{10}\Bigl(\frac{\text{Total number of documents}}{\text{Number of documents with term }  t  \text{ in it}}\Bigr) </math>
 
 
 
 
 
*<math>TF\text{-}IDF(t) = TF(t) \times IDF(t)</math>
 
 
 
 
 
 
 
'''Example:'''
 
 
 
*Consider a document containing 100 words wherein the word cat appears 3 times.
 
*Then, the normalized Term Frequency for cat is:
 
 
 
:*<math>TF(cat) = \frac{3}{100} = 0.03</math>
 
 
 
 
 
*Now, assume we have 10 million documents and the word cat appears in one thousand of these.
 
*Then, the Inverse Document Frequency for cat is:
 
 
 
:*<math>IDF(cat) = \log_{10} \Bigl( \frac{10000000}{1000} \Bigr) = 4</math>
 
 
 
 
 
*Finally, the '''<code>TF-IDF Weight</code>''' is the product of these quantities:
 
 
 
:*<math>TF\text{-}IDF(cat) = TF(cat) \times IDF(cat) = 0.03 \times 4 = 0.12</math>
 
 
 
</blockquote>
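

As a quick sanity check of the example above, the same numbers can be reproduced directly in Python (a minimal sketch using the base-10 logarithm, as in the example):

<syntaxhighlight lang="python3">
import math

tf_cat  = 3 / 100                          # term frequency of "cat" in the document
idf_cat = math.log10(10_000_000 / 1_000)   # inverse document frequency of "cat"

print(tf_cat, idf_cat, tf_cat * idf_cat)   # 0.03 4.0 0.12
</syntaxhighlight>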
 
 
 
 
 
<br />
 
=====Using TfidfTransformer method from Scikit-learn to compute the  TF-IDF=====
 
<blockquote>
 
<code>'''Term weighting'''</code> and <code>'''Normalization'''</code> can be done with '''<code>TF-IDF</code>''', using '''<code>scikit-learn</code>'''<nowiki/>'s <code>'''TfidfTransformer'''</code>.
 
 
 
 
 
<syntaxhighlight lang="python3">
 
from sklearn.feature_extraction.text import TfidfTransformer
 
 
 
tfidf_transformer = TfidfTransformer().fit(messages_bow)
 
tfidf4 = tfidf_transformer.transform(bow4)
 
print(tfidf4)
 
# Output:
 
(0, 9554)    0.5385626262927564
 
(0, 7186)    0.4389365653379857
 
(0, 6222)    0.3187216892949149
 
(0, 6204)    0.29953799723697416
 
(0, 5261)    0.29729957405868723
 
(0, 4629)    0.26619801906087187
 
(0, 4068)    0.40832589933384067
 
</syntaxhighlight>
 
 
 
 
 
Let's go ahead and check what the IDF (inverse document frequency) of the word <code>"u"</code> and of the word <code>"university"</code> is:
 
<syntaxhighlight lang="python3">
 
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['u']])
 
print(tfidf_transformer.idf_[bow_transformer.vocabulary_['university']])
 
# Output:
 
3.28005242674
 
8.5270764989
 
</syntaxhighlight>
 
 
 
 
 
To transform the entire bag-of-words corpus into a TF-IDF corpus at once:
 
<syntaxhighlight lang="python3">
 
messages_tfidf = tfidf_transformer.transform(messages_bow)
 
print(messages_tfidf.shape)
 
# Output:
 
(5572, 11425)
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
=== Training the model ===
 
With messages represented as vectors, we can finally train our spam/ham classifier. Now we can actually use almost any sort of classification algorithms. For a [http://www.inf.ed.ac.uk/teaching/courses/inf2b/learnnotes/inf2b-learn-note07-2up.pdf variety of reasons], the Naive Bayes classifier algorithm is a good choice.
 
 
 
 
 
<br />
 
==== Naive Bayes classifier using scikit-learn ====
 
<syntaxhighlight lang="python3">
 
from sklearn.naive_bayes import MultinomialNB
 
spam_detect_model = MultinomialNB().fit(messages_tfidf, messages['label'])
 
</syntaxhighlight>
 
 
 
 
 
Let's try classifying our single random message and checking how we do:
 
<syntaxhighlight lang="python3">
 
print('predicted:', spam_detect_model.predict(tfidf4)[0])
 
print('expected:', messages.label[3])
 
# Output:
 
predicted: ham
 
expected: ham
 
</syntaxhighlight>
 
 
 
 
 
<br />
 
 
 
=== Model Evaluation ===
 
<syntaxhighlight lang="python3">
 
all_predictions = spam_detect_model.predict(messages_tfidf)
 
print(type(all_predictions))
 
print(len(all_predictions))
 
print(all_predictions)
 
 
 
# Output:
 
<class 'numpy.ndarray'>
 
5572
 
['ham' 'ham' 'spam' ... 'ham' 'ham' 'ham']
 
</syntaxhighlight>
 
 
 
 
 
We can use <code>SciKit-Learn</code>'s built-in classification report, which returns [[wikipedia:Precision_and_recall|precision]], [[wikipedia:Precision_and_recall|recall]], [[wikipedia:F1_score|f1-score]], and a column for support (the number of true instances of each label). Check out the links for more detailed info on each of these metrics and the figure below:
 
 
 
[[File:Precisionrecall.svg|thumb|900|Precision and recall]]
 
 
 
<syntaxhighlight lang="python3">
 
from sklearn.metrics import classification_report
 
print (classification_report(messages['label'], all_predictions))
 
 
 
# Output:
 
              precision    recall  f1-score  support
 
 
 
        ham        0.98      1.00      0.99      4825
 
      spam        1.00      0.85      0.92      747
 
 
 
avg / total        0.98      0.98      0.98      5572
 
</syntaxhighlight>
 
 
 
 
 
There are quite a few possible metrics for evaluating model performance. Which one is the most important depends on the task and the business effects of decisions based off of the model. For example, the cost of mis-predicting "spam" as "ham" is probably much lower than mis-predicting "ham" as "spam".
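

To see exactly where those two kinds of errors occur, we can also look at the confusion matrix for the predictions above (a minimal sketch using scikit-learn; rows are the true labels, columns the predicted labels):

<syntaxhighlight lang="python3">
from sklearn.metrics import confusion_matrix

# Rows: true labels, columns: predicted labels, in the order ['ham', 'spam']
print(confusion_matrix(messages['label'], all_predictions, labels=['ham', 'spam']))
</syntaxhighlight>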
 
 
 
In the above "evaluation", we evaluated accuracy on the same data we used for training. '''You should never actually evaluate on the same dataset you train on!'''
 
 
 
A proper way is to split the data into a training/test set, where the model only ever sees the '''training data''' during its model fitting and parameter tuning. The '''test data''' is never used during training. This way, our final evaluation on the test data is representative of true predictive performance.
 
 
 
 
 
<br />
 
=== Train Test Split ===
 
<syntaxhighlight lang="python3">
 
from sklearn.model_selection import train_test_split
 
 
 
msg_train, msg_test, label_train, label_test = \
 
train_test_split(messages['message'], messages['label'], test_size=0.2)
 
 
 
print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))
 
 
 
# Output
 
4457 1115 5572
 
</syntaxhighlight>
 
 
 
The test size is 20% of the entire dataset (1115 messages out of the total 5572), and the training set is the rest (4457 out of 5572). Note that if <code>test_size</code> had not been specified, the default split would have been 25/75.
 
 
 
 
 
<br />
 
=== Creating a Data Pipeline ===
 
Let's run our model again and then predict off the test set. We will use SciKit Learn's pipeline capabilities to store a pipeline of our workflow. This will allow us to set up all the transformations that we will do to the data for future use. Let's see an example of how it works:
 
 
 
<syntaxhighlight lang="python3">
 
from sklearn.pipeline import Pipeline
 
 
 
pipeline = Pipeline([
 
    ('bow', CountVectorizer(analyzer=text_process)),  # strings to token integer counts
 
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
 
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
 
])
 
</syntaxhighlight>
 
 
 
Now we can directly pass message text data and the pipeline will do our pre-processing for us! We can treat it as a model/estimator API:
 
<syntaxhighlight lang="python3">
 
pipeline.fit(msg_train,label_train)
 
 
 
# Output:
 
Pipeline(steps=[('bow', CountVectorizer(analyzer=<function text_process at 0x11e795bf8>, binary=False,
 
        decode_error='strict', dtype=<class 'numpy.int64'>,
 
        encoding='utf-8', input='content', lowercase=True, max_df=1.0,
 
        max_features=None, min_df=1, ngram_range=(1, 1), preprocessor=None,...f=False, use_idf=True)), ('classifier', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
 
</syntaxhighlight>
 
 
 
<syntaxhighlight lang="python3">
 
predictions = pipeline.predict(msg_test)
 
</syntaxhighlight>
 
 
 
<syntaxhighlight lang="python3">
 
print(classification_report(predictions,label_test))
 
 
 
# Output:
 
                precision    recall  f1-score  support
 
 
 
        ham          1.00      0.96      0.98      1001
 
      spam          0.75      1.00      0.85      114
 
 
 
avg / total          0.97      0.97      0.97      1115
 
</syntaxhighlight>
 
 
 
Now we have a classification report for our model on a true testing set!
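

Since the pipeline encapsulates the whole workflow (tokenization, TF-IDF, classifier), we can also feed it brand-new raw messages directly. A minimal sketch (the example messages are made up and the predicted labels depend on the trained model):

<syntaxhighlight lang="python3">
new_messages = [
    "Congratulations! You have won a FREE prize, call now to claim.",
    "Hey, are we still meeting for lunch tomorrow?"
]

print(pipeline.predict(new_messages))   # e.g. ['spam' 'ham']
</syntaxhighlight>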
 
 
 
 
 
There is a lot more to Natural Language Processing than what we've covered here, and its vast expanse of topics could fill up several college courses!
 
 
 
 
 
<br />
 
==Sentiment Analysis==
 
 
 
 
 
<br />
 
===Rule-based sentiment analysis===
 
Two of the most popular Sentiment Analysis solutions for Python are '''TextBlob''' and '''Vader Sentiment'''.
 
 
 
 
 
https://pythonprogramming.net/sentiment-analysis-python-textblob-vader/
 
 
 
https://medium.com/analytics-vidhya/rule-based-sentiment-analysis-with-python-for-turkeys-stock-market-839f85d7daaf
 
 
 
https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f
 
 
 
https://rpubs.com/RRoger_Yu/Validating_Out_of_the_Box_Algorithms_Sentiment_Analysis
 
 
 
https://medium.com/@10e/exploring-out-of-the-box-sentiment-analysis-packages-8cb9931ff5a4
 
 
 
https://www.linkedin.com/pulse/out-box-sentiment-analysis-katuru-venkata-sai/
 
 
 
https://www.iflexion.com/blog/sentiment-analysis-python
 
 
 
 
 
<br />
 
====TextBlob====
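

A minimal sketch of what a TextBlob sentiment call looks like (assuming the <code>textblob</code> package is installed, e.g. with <code>pip install textblob</code>):

<syntaxhighlight lang="python3">
from textblob import TextBlob

blob = TextBlob("NLTK makes text analysis easy and fun!")

# polarity is in [-1, 1] (negative to positive); subjectivity is in [0, 1]
print(blob.sentiment)
print(blob.sentiment.polarity)
</syntaxhighlight>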
 
 
 
 
 
<br />
 
====Vader Sentiment====
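

A minimal sketch using the VADER implementation shipped with NLTK (assuming the <code>vader_lexicon</code> NLTK data package has been downloaded; the standalone <code>vaderSentiment</code> package exposes a very similar API):

<syntaxhighlight lang="python3">
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# Returns negative/neutral/positive proportions and a normalized 'compound' score in [-1, 1]
print(sia.polarity_scores("NLTK makes text analysis easy and fun!"))
</syntaxhighlight>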
 
 
 
 
 
<br />
 
 
 
===Emotion Lexicon===
 
*https://towardsdatascience.com/basic-nlp-on-the-texts-of-harry-potter-sentiment-analysis-1b474b13651d
 
 
 
:https://github.com/raffg/harry_potter_nlp/blob/master/sentiment_analysis.ipynb
 
 
 
:http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
 
 
 
 
 
* http://jonathansoma.com/lede/algorithms-2017/classes/more-text-analysis/nrc-emotional-lexicon/
 
 
 
 
 
<br />
 
