Difference between revisions of "Supervised Machine Learning for Fake News Detection"

From Sinfronteras
(Gofaas Web App)
(Summary of Results)
Line 27: Line 27:
 
===Procedure===
 
===Procedure===
  
* The Dataset
+
*The Dataset
  
* Splitting the data into Train and Test data
+
*Splitting the data into Train and Test data
  
* Cleaning the data
+
*Cleaning the data
  
* Building the Document-Term Matrix
+
*Building the Document-Term Matrix
  
* Model Building
+
*Model Building
  
* Cross validation
+
*Cross validation
  
* Making predictions from the model created and displaying a Confusion matrix
+
*Making predictions from the model created and displaying a Confusion matrix
  
 
<br />
 
<br />
Line 50: Line 50:
  
 
{| class="wikitable"
 
{| class="wikitable"
! rowspan="3" |[[Establishing an authenticity of sport news by Machine Learning Models#Algorithms|Algorithms]]
+
! rowspan="4" |[[Establishing an authenticity of sport news by Machine Learning Models#Algorithms|Algorithms]]
! rowspan="3" |Author
+
! rowspan="4" |Author
! rowspan="3" |Package
+
! rowspan="4" |Package
! rowspan="3" |Keyword
+
! rowspan="4" |Keyword
! colspan="6" |Accuracy
+
! colspan="8" |Accuracy
! colspan="3" |Running time for the entire data
 
 
|-
 
|-
 
! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
 
! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news challenge dataset]]
+
20,800 rows
 
! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news Detector dataset|Fake news Detector dataset]]
 
! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news Detector dataset|Fake news Detector dataset]]
! colspan="3" |'''Using a GoogleCloud VM: 64 vCPUs, 240 GB memory (Ubuntu 18.04)'''
+
10,000 rows
 +
! colspan="4" |Gofaaas Fake News Dataset
 +
500 rows
 +
|-
 +
! rowspan="2" |Test data (70% of the dataset)
 +
! rowspan="2" |Cross validation
 +
! rowspan="2" |Test data (70% of the dataset)
 +
! rowspan="2" |Cross validation
 +
! rowspan="2" |Test data (70% of the dataset)
 +
! rowspan="2" |Cross validation
 +
! rowspan="2" |Using the Kaggle Model^
 +
! rowspan="2" |Using the Detector Model^^
 
|-
 
|-
!1000 rows
 
!The entire data:
 
20,800 rows
 
!1000 rows
 
!The entire data: 49,972 rows<br />
 
!1000 rows
 
!The entire data:<br />
 
![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
 
![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news challenge dataset]]
 
![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news Detector dataset]]
 
 
|-
 
|-
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Extreme Gradient Boosting|Naive Bayes]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Extreme Gradient Boosting|Naive Bayes]]
 
|[[wikipedia:Thomas_Bayes|Bayes, Thomas]]
 
|[[wikipedia:Thomas_Bayes|Bayes, Thomas]]
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/e1071/index.html e1071]
+
|[https://cran.r-project.org/web/packages/e1071/index.html e1071]
 
|NB*
 
|NB*
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_naivebayes|<math>{\color{blue}69,03%}</math>]]
 
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_naivebayes|<math>{\color{blue}76.41%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_naivebayes|<math>{\color{blue}76.41%}</math>]]
|[[Establishing an authenticity of sport news by Machine Learning Models#fnc_1000_naivebayes|<math>{\color{blue}73.91%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnc_naivebayes|<math>{\color{blue}93.95%}</math>]]
 
 
|
 
|
 
|
 
|
|<math>43 secs</math>
+
|
|<math>36 secs</math>
+
|
 +
|
 +
|
 
|
 
|
 
|-
 
|-
Line 91: Line 90:
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/e1071/index.html e1071]
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/e1071/index.html e1071]
 
|SVM*
 
|SVM*
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_svm|<math>{\color{blue}81%}</math>]]
 
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_svm|<math>{\color{blue}95.42%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_svm|<math>{\color{blue}95.42%}</math>]]
 
|
 
|
Line 97: Line 95:
 
|
 
|
 
|
 
|
|<math>59 mins</math>
+
|
 
|
 
|
 
|
 
|
Line 103: Line 101:
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Random forest|Random forest]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Random forest|Random forest]]
 
|Liaw and Wiener, 2002
 
|Liaw and Wiener, 2002
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/randomForest/index.html randomForest]
+
|[https://cran.r-project.org/web/packages/randomForest/index.html randomForest]
 
|RF
 
|RF
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_rf|<math>{\color{blue}84%}</math>]]
 
 
|
 
|
 
|
 
|
Line 111: Line 108:
 
|
 
|
 
|
 
|
|<math>0 secs</math>
 
|
 
|
 
|-
 
|[[Establishing an authenticity of sport news by Machine Learning Models#General linearized models|General linearized models]]
 
|Friedman et al., 2010
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/glmnet/index.html glmnet]
 
|GLMNET*
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_glmnet|<math>{\color{blue}91%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_glmnet|<math>{\color{blue}94.58%}</math>]]
 
 
|
 
|
|
 
|
 
|
 
|<math>15 secs</math>
 
|
 
|
 
|-
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Maximum entropy|Maximum entropy]]
 
|Jurka, 2012
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/maxent/index.html maxent]
 
|MAXENT*
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_maxent|<math>{\color{blue}87.33%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_maxent|<math>{\color{blue}96.09%}</math>]]
 
|
 
|
 
|
 
|
 
|<math>54 secs</math>
 
 
|
 
|
 
|
 
|
Line 147: Line 116:
 
|[[Establishing an authenticity of sport news by Machine Learning Models#The XGBoost package|xgboost]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#The XGBoost package|xgboost]]
 
|XGBOOST*
 
|XGBOOST*
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_xgboost|<math>{\color{red}93%}</math>]]
 
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_xgboost|<math>{\color{red}97.42%}</math>]]
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_xgboost|<math>{\color{red}97.42%}</math>]]
|
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnc_xgboost|<math>{\color{red}95.36%}</math>]]
 
|
 
|
 
|<math>9 mins</math>
 
|<math>21 mins</math>
 
|
 
|-
 
|[[Establishing an authenticity of sport news by Machine Learning Models#Classification or regression tree|Classification or regression tree]]
 
|Ripley., 2012
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/tree/index.html tree]
 
|TREE
 
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_tree|<math>{\color{blue}91.33%}</math>]]
 
 
|
 
|
 
|
 
|
Line 167: Line 122:
 
|
 
|
 
|
 
|
|<math>0 secs</math>
 
 
|
 
|
 
|
 
|
 
|-
 
|-
|[[Establishing an authenticity of sport news by Machine Learning Models#Boosting|Boosting]]
+
|[[Establishing an authenticity of sport news by Machine Learning Models#General linearized models|General linearized models]]
|Tuszynski, 2012
+
|Friedman et al., 2010
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/caTools/index.html caTools]
+
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/glmnet/index.html glmnet]
|BOOSTING
+
|GLMNET*
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_boosting|<math>{\color{blue}92.67%}</math>]]
+
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_glmnet|<math>{\color{blue}94.58%}</math>]]
 
|
 
|
 
|
 
|
Line 181: Line 135:
 
|
 
|
 
|
 
|
|<math>0 secs</math>
 
 
|
 
|
 
|
 
|
 
|-
 
|-
|[[Establishing an authenticity of sport news by Machine Learning Models#Neural networks|Neural networks]]
+
|[[Establishing an authenticity of sport news by Machine Learning Models#Maximum entropy|Maximum entropy]]
|Venables and Ripley, 2002
+
|Jurka, 2012
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on
+
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/maxent/index.html maxent]
[https://cran.r-project.org/web/packages/nnet/index.html nnet]
+
|MAXENT*
|NNET
+
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_maxent|<math>{\color{blue}96.09%}</math>]]
|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_nnet|<math>{\color{blue}00.00%}</math>]]
 
 
|
 
|
 
|
 
|
Line 196: Line 148:
 
|
 
|
 
|
 
|
|<math>0 secs</math>
 
 
|
 
|
 
|
 
|
 
|-
 
|-
|Bagging
+
| colspan="12" |* Low-memory algorithm
|Peters and Hothorn, 2012
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/ipred/index.html ipred]
 
|BAGGING**
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
|Scaled linear discriminant analysis
 
|Peters and Hothorn, 2012
 
|We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/ipred/index.html ipred]
 
|SLDA**
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|
 
|-
 
| colspan="13" |* Low-memory algorithm
 
 
<nowiki>**</nowiki> Very high-memory algorithm
 
<nowiki>**</nowiki> Very high-memory algorithm
 +
 +
^
 +
 +
^^
 
|}
 
|}
  
Line 341: Line 268:
  
 
<br />
 
<br />
 
  
  

Revision as of 19:18, 27 April 2019

Declaration


Acknowledgement

Thanks to Muhammad, Graham and Mark.


Abstract


Introduction


Chapter 1


Chapter 2 - Training a Supervised Machine Learning Model for fake news detection

Supervised Text Classification for Fake News Detection Using Machine Learning Models


Procedure

  • The Dataset
  • Splitting the data into Train and Test data
  • Cleaning the data
  • Building the Document-Term Matrix
  • Model Building
  • Cross validation
  • Making predictions from the model created and displaying a Confusion matrix
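
The steps above can be sketched in a few lines of code. This is only an illustration written in plain Python with the standard library, not the RTextTools workflow used in this project; every function name here (clean, document_term_matrix, train_test_split, k_folds, confusion_matrix, accuracy) is a hypothetical helper introduced for the example, and the model-building step is left out.

```python
import random
import re
from collections import Counter


def clean(text):
    """Lowercase the text, drop everything except letters, and tokenize."""
    return re.sub(r"[^a-z\s]", " ", text.lower()).split()


def document_term_matrix(docs):
    """Return (vocab, rows); rows[i][j] counts term vocab[j] in document i."""
    tokens = [clean(d) for d in docs]
    vocab = sorted({t for doc in tokens for t in doc})
    column = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in tokens:
        row = [0] * len(vocab)
        for term in doc:
            row[column[term]] += 1
        rows.append(row)
    return vocab, rows


def train_test_split(n, train_fraction, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(n * train_fraction)
    return idx[:cut], idx[cut:]


def k_folds(n, k):
    """Yield (train_idx, validation_idx) for each of k cross-validation folds."""
    for fold in range(k):
        validation = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, validation


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of predictions that fall on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Toy headlines, invented for this example only.
docs = ["Breaking: Aliens WIN the cup!", "Team wins the cup after penalties"]
vocab, dtm = document_term_matrix(docs)
```

Each row of dtm is one document and each column one term of vocab, which is the shape a classifier expects as input. The predictions of any trained model can then be passed to confusion_matrix together with the true labels.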


Results


Summary of Results

Algorithms                 Author                 Package                                      Keyword
Naive Bayes                Bayes, Thomas          e1071                                        NB*
Support vector machine     Meyer et al., 2012     We used RTextTools, which depends on e1071   SVM*
Random forest              Liaw and Wiener, 2002  randomForest                                 RF
Extreme Gradient Boosting  Chen & Guestrin, 2016  xgboost                                      XGBOOST*
General linearized models  Friedman et al., 2010  We used RTextTools, which depends on glmnet  GLMNET*
Maximum entropy            Jurka, 2012            We used RTextTools, which depends on maxent  MAXENT*

Accuracy columns, grouped by dataset:

  • Kaggle fake news dataset (20,800 rows): Test data (70% of the dataset); Cross validation
  • Fake news Detector dataset (10,000 rows): Test data (70% of the dataset); Cross validation
  • Gofaaas Fake News Dataset (500 rows): Test data (70% of the dataset); Cross validation; Using the Kaggle Model^; Using the Detector Model^^

* Low-memory algorithm

** Very high-memory algorithm

^

^^



Evaluation of Results

We evaluate our approach in different settings. First, we perform cross-validation on our noisy training set; second, and more importantly, we train models on the training set and validate them against a manually created gold standard. Moreover, we evaluate two variants, i.e., including and excluding user features. [smb:home/adelo/1-system/1-disco_local/1-mis_archivos/1-pe/1-ciencia/1-computacion/2-data_analysis-machine_learning/gofaaaz-machine_learning/5-References/7-Weakly_supervised_searning_for_fake_news_detection_on_twitter.pdf]


The Gofaaas-Fake News Detector R Package


Installation


Functions


Datasets used


Kaggle Fake News Dataset

https://www.kaggle.com/c/fake-news/data


Distribution of the data:

The distribution of Stance classes in train_stances.csv is as follows:

rows    unrelated   discuss   agree       disagree
49972   0.73131     0.17828   0.0736012   0.0168094
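
A class distribution like the one above is the share of rows carrying each label. The following is a minimal sketch in Python, assuming the labels have already been read into a list; the function name class_distribution and the toy labels are hypothetical and are not taken from train_stances.csv.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of rows with that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels, invented for this example only.
stances = ["unrelated"] * 7 + ["discuss"] * 2 + ["agree"]
distribution = class_distribution(stances)
```

A strongly skewed distribution, such as the one reported for the Stance classes, is worth keeping in mind when reading accuracy figures, because a model that always predicts the largest class already scores as high as that class's share.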



Fake News Detector Dataset


Gofaaas Fake News Dataset


Algorithms


Naive Bayes

Naïve Bayes is based on Bayes' theorem; therefore, to understand Naïve Bayes it is important to first understand Bayes' theorem.

Bayes' theorem is a mathematical formula for determining conditional probability, which is the probability of something happening given that something else has already occurred.


P(c|x) = P(x|c) · P(c) / P(x)
  • P(c|x) is the posterior probability of class (target) given predictor (attribute).
  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.


Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected.

Posterior probability is the revised probability of an event occurring after taking into consideration new information.

In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.
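
As a worked illustration of these definitions, the sketch below applies Bayes' theorem to a toy fake-news example in Python. All probabilities and words are invented for the example, and the function names posterior and naive_bayes_posteriors are hypothetical. The "naive" part is the assumption that features are independent given the class, so their likelihoods can simply be multiplied.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


def naive_bayes_posteriors(priors, likelihoods, features):
    """Posterior over classes, assuming features are independent given the class.

    priors:      {class: P(c)}
    likelihoods: {class: {feature: P(feature|c)}}
    features:    the observed features x_1..x_n
    """
    joint = {}
    for c, p_c in priors.items():
        p = p_c
        for f in features:
            p *= likelihoods[c][f]
        joint[c] = p  # P(x|c) * P(c)
    evidence = sum(joint.values())  # P(x), by the law of total probability
    return {c: p / evidence for c, p in joint.items()}


# Invented numbers: 40% of headlines are fake, and the word "shocking"
# appears in 50% of fake headlines but only 10% of real ones.
priors = {"fake": 0.4, "real": 0.6}
likelihoods = {
    "fake": {"shocking": 0.5, "official": 0.1},
    "real": {"shocking": 0.1, "official": 0.4},
}
post = naive_bayes_posteriors(priors, likelihoods, ["shocking"])
```

With these numbers, P(x) = 0.5 · 0.4 + 0.1 · 0.6 = 0.26, so the posterior probability of "fake" after seeing the word is 0.2 / 0.26, roughly 0.77, up from a prior of 0.4.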



Support vector machine


Random forest


Extreme Gradient Boosting


The RTextTools package

RTextTools - A Supervised Learning Package for Text Classification:


Chapter 3 - Gofaas Web App

A way to interact, test and display the model results


Conclusion