===Applying Bayes' Theorem - Example 1===


https://stats.stackexchange.com/questions/66079/naive-bayes-classifier-gives-a-probability-greater-than-1


Let's say that we are training a spam classifier. We need information about the frequency of words in spam and ham (non-spam) emails.


In this first example, we will assume that we know the frequency of only one word (Viagra) in spam and non-spam emails. Let's assume these numbers:


{| class="wikitable" style="border: 0px"
 +
|+
 +
|style="background:white; border: 0px"|
 +
! colspan="2" |Viagra
 +
! colspan="2" |Money
 +
! colspan="2" |Groceries
 +
! colspan="2" |Unsubscribe
 +
|style="background:white; border: 0px"|
 +
|-
 +
|style="background:white; border: 0px"|
 +
|'''Yes'''
 +
|'''No'''
 +
|'''Yes'''
 +
|'''No'''
 +
|'''Yes'''
 +
|'''No'''
 +
|'''Yes'''
 +
|'''No'''
 +
|'''Total'''
 +
|-
 +
|'''Spam'''
 +
|4/20
 +
|16/20
 +
|10/20
 +
|10/20
 +
|0/20
 +
|20/20
 +
|12/20
 +
|8/20
 +
|'''20'''
 +
|-
 +
|'''Ham'''
 +
|1/80
 +
|79/80
 +
|14/80
 +
|66/80
 +
|8/80
 +
|72/80
 +
|23/80
 +
|57/80
 +
|'''80'''
 +
|-
 +
|'''Total'''
 +
|5/100
 +
|95/100
 +
|24/100
 +
|76/100
 +
|8/100
 +
|92/100
 +
|35/100
 +
|65/100
 +
|'''100'''
 +
|}


{|
|
* <math>\bold{P(Viagra|Spam)}</math>
|
|(Likelihood)
|<div style="margin:  5pt">:</div>
|The probability that a spam message contains the term <math>Viagra</math>
|<div style="margin: 10pt"><math>\rightarrow</math></div>
|<math>4/20 = 0.20 = 20\%</math>
|-
|
* <math>\bold{P(Viagra)}</math>
|
|(Marginal likelihood)
|<div style="margin:  5pt">:</div>
|The probability that the word <math>Viagra</math> appeared in any email (spam or ham)
|<div style="margin: 10pt"><math>\rightarrow</math></div>
|<math>5/100 = 0.05 = 5\%</math>
|-
|
* <math>\bold{P(Spam)}</math>
|
|(Prior probability)
|<div style="margin:  5pt">:</div>
|The probability that an email is spam
|<div style="margin: 10pt"><math>\rightarrow</math></div>
|<math>20/100 = 0.20 = 20\%</math>
|-
|
* <math>\bold{P(Spam|Viagra)}</math>
|
|(Posterior probability)
|<div style="margin:  5pt">:</div>
|The probability that an email is spam given that it contains the word <math>Viagra</math>
|<div style="margin: 10pt"><math>\rightarrow</math></div>
|<math>\frac{P(Viagra|Spam)P(Spam)}{P(Viagra)} = \frac{0.2 \times 0.2}{0.05} = 0.80 = 80\%</math>
|-
| colspan="7" |
* '''The probability that a message is spam, given that it contains the word "Viagra", is <math>\bold{80\%}</math>. Therefore, any message containing this term should be filtered.'''
|}
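

As a quick sanity check, this single-word posterior can be reproduced in a few lines of Python. This is a minimal sketch assuming only the counts from the table above; the variable names are illustrative, not from any library:

<syntaxhighlight lang="python">
# Counts taken from the likelihood table above
spam_with_viagra = 4    # spam emails containing "Viagra"
spam_total = 20         # total spam emails
any_with_viagra = 5     # emails (spam or ham) containing "Viagra"
emails_total = 100      # total emails

likelihood = spam_with_viagra / spam_total  # P(Viagra|Spam) = 0.20
marginal = any_with_viagra / emails_total   # P(Viagra)      = 0.05
prior = spam_total / emails_total           # P(Spam)        = 0.20

# Bayes' theorem: P(Spam|Viagra) = P(Viagra|Spam) * P(Spam) / P(Viagra)
posterior = likelihood * prior / marginal
print(posterior)  # 0.8
</syntaxhighlight>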


<br />

===Applying Bayes' Theorem - Example 2===
 
* Let's extend our spam filter by adding a few additional terms to be monitored: "money", "groceries", and "unsubscribe".
* We will assume that the Naïve Bayes learner was trained by constructing a likelihood table for the appearance of these four words in 100 emails, as shown in the following table:


[[File:ApplyingBayesTheorem-Example.png]]


As new messages are received, the posterior probability must be calculated to determine whether the messages are more likely to be spam or ham, given the likelihood of the words found in the message text.


We can define the problem as shown in the equation below, which captures the probability that a message is spam, given that the words "Viagra" and "Unsubscribe" are present and that the words "Money" and "Groceries" are not:


<math>P(Spam|Viagra \cap \neg Money \cap \neg Groceries \cap Unsubscribe) = \frac{P(Viagra \cap \neg Money \cap \neg Groceries \cap Unsubscribe|Spam)\,P(Spam)}{P(Viagra \cap \neg Money \cap \neg Groceries \cap Unsubscribe)}</math>

For a number of reasons, this is computationally difficult to solve. As additional features are added, tremendous amounts of memory are needed to store probabilities for all of the possible intersecting events. Therefore, class-conditional independence can be assumed to simplify the problem.
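

To see the scale of the problem, here is a rough back-of-the-envelope sketch (our own illustration, not from the original text): the full joint approach needs one probability per present/absent pattern of the monitored words, which grows exponentially, whereas under independence only one conditional probability per word is needed:

<syntaxhighlight lang="python">
n_words = 4
print(2 ** n_words)   # 16 joint word patterns per class for 4 words
print(2 ** 100)       # astronomically many patterns for 100 words
print(100)            # vs. only 100 per-word probabilities under independence
</syntaxhighlight>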



====Class-conditional independence====

The work becomes much easier if we can exploit the fact that Naïve Bayes assumes independence among events. Specifically, Naïve Bayes assumes class-conditional independence, which means that events are independent so long as they are conditioned on the same class value.

Assuming conditional independence allows us to simplify the equation using the probability rule for independent events, <math>P(A \cap B) = P(A)P(B)</math>. This results in a much easier-to-compute formulation:


<math>P(Spam|w_1 \cap w_2 \cap \ldots \cap w_n) \propto P(Spam) \prod_{i=1}^{n} P(w_i|Spam)</math>

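

In code, this simplification amounts to multiplying the per-word conditional probabilities by the class prior. The sketch below is a minimal illustration, with a function name of our own choosing:

<syntaxhighlight lang="python">
from math import prod

def naive_bayes_score(word_likelihoods, prior):
    """Unnormalised class score under class-conditional independence:
    P(class) multiplied by the product of the P(w_i|class) terms."""
    return prior * prod(word_likelihoods)

# e.g. the spam score for the message pattern worked out below:
# naive_bayes_score([4/20, 10/20, 20/20, 12/20], 20/100) -> 0.012
</syntaxhighlight>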
For example, suppose that a message contains the terms Viagra and Unsubscribe, but does not contain either Money or Groceries:


<math>P(Spam|Viagra \cap \neg Money \cap \neg Groceries \cap Unsubscribe) \propto P(Viagra|Spam)\,P(\neg Money|Spam)\,P(\neg Groceries|Spam)\,P(Unsubscribe|Spam)\,P(Spam)</math>

<math>P(Ham|Viagra \cap \neg Money \cap \neg Groceries \cap Unsubscribe) \propto P(Viagra|Ham)\,P(\neg Money|Ham)\,P(\neg Groceries|Ham)\,P(Unsubscribe|Ham)\,P(Ham)</math>

The presentation shows this example as follows. I think there are mistakes in this presentation:

[[File:ApplyingBayesTheorem-ClassConditionalIndependance.png]]

Using the values in the likelihood table, we can start filling numbers into these equations. Because the denominator is the same in both cases, it can be ignored for now. The overall likelihood of spam is then:


<math>\frac{4}{20} \times \frac{10}{20} \times \frac{20}{20} \times \frac{12}{20} \times \frac{20}{100} = 0.012</math>

While the likelihood of ham given the occurrence of these words is:


<math>\frac{1}{80} \times \frac{66}{80} \times \frac{72}{80} \times \frac{23}{80} \times \frac{80}{100} = 0.002</math>

Because 0.012/0.002 = 6, we can say that this message is six times more likely to be spam than ham. However, to convert these numbers to probabilities, we need one last step.


The probability of spam is equal to the likelihood that the message is spam divided by the likelihood that the message is either spam or ham:


<math>\frac{0.012}{0.012 + 0.002} = 0.857</math>

The probability that the message is spam is 0.857. As this is over the threshold of 0.5, the message is classified as spam.
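

Putting the whole calculation together, a short Python sketch (again with illustrative names of our own, using the counts from the likelihood table) reproduces these numbers:

<syntaxhighlight lang="python">
# P(word present | class) for each monitored word, from the likelihood table
spam_probs = {"viagra": 4/20, "money": 10/20, "groceries": 0/20, "unsubscribe": 12/20}
ham_probs = {"viagra": 1/80, "money": 14/80, "groceries": 8/80, "unsubscribe": 23/80}
prior_spam, prior_ham = 20/100, 80/100

# The message contains "viagra" and "unsubscribe" but not "money" or "groceries"
message = {"viagra": True, "money": False, "groceries": False, "unsubscribe": True}

def class_likelihood(cond_probs, prior, message):
    """Unnormalised likelihood of a class under class-conditional independence."""
    score = prior
    for word, present in message.items():
        p = cond_probs[word]
        score *= p if present else (1 - p)
    return score

l_spam = class_likelihood(spam_probs, prior_spam, message)  # 0.012
l_ham = class_likelihood(ham_probs, prior_ham, message)     # ~0.0021

print(l_spam / l_ham)             # ~5.6 (the text's ratio of 6 rounds ham to 0.002)
print(l_spam / (l_spam + l_ham))  # ~0.849 (the text's 0.857 uses the rounded 0.002)
</syntaxhighlight>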