Difference between revisions of "Supervised Machine Learning for Fake News Detection"

Revision as of 18:36, 27 April 2019

Declaration

Acknowledgement

Thanks to Muhammad, Graham and Mark.

Abstract

Introduction

Chapter 1

Chapter 2 - Training a Supervised Machine Learning Model for fake news detection

Supervised text Classification for fake news detection Using Machine Learning Models

Procedure

The Dataset

Splitting the data into Train and Test data

Cleaning the data

Building the Document-Term Matrix

Model Building

Cross validation

Making predictions from the model created and displaying a Confusion matrix

Results

Summary of Results

Algorithms	Author	Package	Keyword	Accuracy: Kaggle fake news dataset (entire data: 20,800 rows)	Accuracy: Fake news challenge dataset (entire data: 49,972 rows)	Accuracy: Fake news Detector dataset (entire data)
Naive Bayes	Bayes, Thomas	We used RTextTools, which depends on e1071	NB*	${\color {blue}76.41\%}$	${\color {blue}93.95\%}$
Support vector machine	Meyer et al., 2012	We used RTextTools, which depends on e1071	SVM*	${\color {blue}95.42\%}$
Random forest	Liaw and Wiener, 2002	We used RTextTools, which depends on randomForest	RF
Generalized linear models	Friedman et al., 2010	We used RTextTools, which depends on glmnet	GLMNET*	${\color {blue}94.58\%}$
Maximum entropy	Jurka, 2012	We used RTextTools, which depends on maxent	MAXENT*	${\color {blue}96.09\%}$
Extreme Gradient Boosting	Chen & Guestrin, 2016	xgboost	XGBOOST*	${\color {red}97.42\%}$	${\color {red}95.36\%}$
Classification or regression tree	Ripley, 2012	We used RTextTools, which depends on tree	TREE
Boosting	Tuszynski, 2012	We used RTextTools, which depends on caTools	BOOSTING
Neural networks	Venables and Ripley, 2002	We used RTextTools, which depends on nnet	NNET
Bagging	Peters and Hothorn, 2012	We used RTextTools, which depends on ipred	BAGGING**
Scaled linear discriminant analysis	Peters and Hothorn, 2012	We used RTextTools, which depends on ipred	SLDA**
* Low-memory algorithm ** Very high-memory algorithm

Algorithms	Author	Package	Keyword	Accuracy
				Kaggle fake news dataset 20,800 rows		Fake news Detector dataset 10,000 rows		Gofaaas Fake News Dataset 500 rows
				Test data (70% of the dataset)	Cross validation	Test data (70% of the dataset)	Cross validation	Test data (70% of the dataset)	Cross validation	Using the Kaggle Model^	Using the Detector Model^^
Naive Bayes	Bayes, Thomas	e1071	NB*	${\color {blue}76.41\%}$
Support vector machine	Meyer et al., 2012	We used RTextTools, which depends on e1071	SVM*	${\color {blue}95.42\%}$
Random forest	Liaw and Wiener, 2002	randomForest	RF
Extreme Gradient Boosting	Chen & Guestrin, 2016	xgboost	XGBOOST*	${\color {red}97.42\%}$
Generalized linear models	Friedman et al., 2010	We used RTextTools, which depends on glmnet	GLMNET*	${\color {blue}94.58\%}$
Maximum entropy	Jurka, 2012	We used RTextTools, which depends on maxent	MAXENT*	${\color {blue}96.09\%}$
* Low-memory algorithm ** Very high-memory algorithm ^ ^^

Evaluation of Results

We evaluate our approach in different settings. First, we perform cross-validation on our noisy training set; second, and more importantly, we train models on the training set and validate them against a manually created gold standard. Moreover, we evaluate two variants, i.e., including and excluding user features. [Source: "Weakly Supervised Learning for Fake News Detection on Twitter"]
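As a rough illustration of the first setting, a k-fold cross-validation loop might look like the sketch below. It is written in plain Python rather than with the R packages listed in the results tables, and it is not the code behind the reported accuracies: the helper names `k_fold_indices` and `cross_validate`, the majority-class stand-in model, and the toy labels are all assumptions made for the example.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(labels, k):
    """Per-fold accuracy of a majority-class baseline (hypothetical stand-in model)."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k):
        held_out_set = set(held_out)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out_set]
        # "Training": remember the most common label in the training folds.
        majority = Counter(train_labels).most_common(1)[0][0]
        # "Testing": score that constant prediction on the held-out fold.
        correct = sum(1 for i in held_out if labels[i] == majority)
        accuracies.append(correct / len(held_out))
    return accuracies


# Toy labels (assumed data): 1 = fake, 0 = reliable.
toy_labels = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
fold_accuracies = cross_validate(toy_labels, k=5)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(fold_accuracies, mean_accuracy)
```

Each row is held out exactly once, so the mean of the per-fold accuracies uses all of the data for testing without ever scoring a row that the model was fitted on. A real run would replace the majority-class stand-in with one of the classifiers from the results tables.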

The Gofaaas-Fake News Detector R Package

Installation

Functions

Datasets used

Kaggle Fake News Dataset

https://www.kaggle.com/c/fake-news/data

Distribution of the data:

The distribution of Stance classes in train_stances.csv is as follows:

rows	unrelated	discuss	agree	disagree
49972	0.73131	0.17828	0.0736012	0.0168094
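The proportions above can be converted back into approximate row counts, which makes the class imbalance easier to see. The following is a small Python sketch that uses only the figures in the table; the variable names are illustrative.

```python
# Figures copied from the distribution table above.
total_rows = 49972
proportions = {
    "unrelated": 0.73131,
    "discuss": 0.17828,
    "agree": 0.0736012,
    "disagree": 0.0168094,
}

# Approximate number of rows per stance class.
counts = {label: round(total_rows * share) for label, share in proportions.items()}

# The largest class sets the accuracy of a trivial "always predict this class" baseline.
majority_label = max(proportions, key=proportions.get)

print(counts)
print(sum(counts.values()))
print(majority_label, proportions[majority_label])
```

Because roughly 73% of the rows are "unrelated", a classifier that always predicts that class would already reach about 73% accuracy on this data, which is worth keeping in mind when reading accuracy figures for this dataset.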

Fake News Detector Dataset

Gofaaas Fake News Dataset

Algorithms

Naive Bayes

Naïve Bayes is based on Bayes' theorem, so to understand Naïve Bayes it is important to first understand Bayes' theorem.

Bayes' theorem is a mathematical formula for determining conditional probability, which is the probability of something happening given that something else has already occurred:

$P(c|x) = \frac{P(x|c)\,P(c)}{P(x)}$

where:

P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.

Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected.

Posterior probability is the revised probability of an event occurring after taking into consideration new information.

In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.
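A short worked example may help to connect these terms. The Python sketch below applies Bayes' theorem to a single word in a two-class setting (fake versus reliable news). All of the probabilities are assumed toy values chosen for the illustration, not estimates from any of the datasets described here, and the function name `posterior` is hypothetical.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood_x_given_c * prior_c / evidence_x


# Assumed toy values (not taken from any dataset in this document).
p_fake = 0.5                 # prior P(c): half of all articles are fake
p_real = 1.0 - p_fake
p_word_given_fake = 0.20     # likelihood P(x|c): word appears in 20% of fake articles
p_word_given_real = 0.05     # word appears in 5% of reliable articles

# P(x) by the law of total probability over both classes.
p_word = p_word_given_fake * p_fake + p_word_given_real * p_real

p_fake_given_word = posterior(p_fake, p_word_given_fake, p_word)
print(round(p_fake_given_word, 3))  # prints 0.8
```

Seeing the word raises the probability that the article is fake from the prior of 0.5 to a posterior of 0.8. Naïve Bayes repeats this update for every word in a document, under the simplifying assumption that the words are independent given the class.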

@@ Line 57: / Line 57: @@
 ! rowspan="3" |Package
 ! rowspan="3" |Keyword
-! colspan="6" |Accuracy
+! colspan="3" |Accuracy
-! colspan="3" |Running time for the entire data
 |-
-! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
+![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
-! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news challenge dataset]]
+![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news challenge dataset]]
-! colspan="2" |[[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news Detector dataset|Fake news Detector dataset]]
+![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news Detector dataset|Fake news Detector dataset]]
-! colspan="3" |'''Using a GoogleCloud VM: 64 vCPUs, 240 GB memory (Ubuntu 18.04)'''
 |-
-!1000 rows
 !The entire data:
 20,800 rows
-!1000 rows
 !The entire data: 49,972 rows<br />
-!1000 rows
 !The entire data:<br />
-![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Kaggle fake news dataset|Kaggle fake news dataset]]
-![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news challenge dataset]]
-![[Establishing an authenticity of sport news by Machine Learning Models#Description of the Fake news challenge dataset|Fake news Detector dataset]]
 |-
 |[[Establishing an authenticity of sport news by Machine Learning Models#Extreme Gradient Boosting|Naive Bayes]]
@@ Line 80: / Line 72: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/e1071/index.html e1071]
 |NB*
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_naivebayes|<math>{\color{blue}69,03%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnd_naivebayes|<math>{\color{blue}76.41%}</math>]]
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnc_1000_naivebayes|<math>{\color{blue}73.91%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnc_naivebayes|<math>{\color{blue}93.95%}</math>]]
-|
-|
-|<math>43 secs</math>
-|<math>36 secs</math>
 |
 |-
@@ Line 94: / Line 80: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/e1071/index.html e1071]
 |SVM*
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_svm|<math>{\color{blue}81%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnd_svm|<math>{\color{blue}95.42%}</math>]]
-|
-|
-|
-|
-|<math>59 mins</math>
 |
 |
@@ Line 108: / Line 88: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/randomForest/index.html randomForest]
 |RF
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_rf|<math>{\color{blue}84%}</math>]]
-|
 |
-|
-|
-|
-|<math>0 secs</math>
 |
 |
@@ Line 122: / Line 96: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/glmnet/index.html wglmnet]
 |GLMNET*
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_glmnet|<math>{\color{blue}91%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnd_glmnet|<math>{\color{blue}94.58%}</math>]]
-|
-|
-|
-|
-|<math>15 secs</math>
 |
 |
@@ Line 136: / Line 104: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/maxent/index.html maxent]
 |MAXENT*
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_maxent|<math>{\color{blue}87.33%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnd_maxent|<math>{\color{blue}96.09%}</math>]]
-|
-|
-|
-|
-|<math>54 secs</math>
 |
 |
@@ Line 150: / Line 112: @@
 |[[Establishing an authenticity of sport news by Machine Learning Models#The XGBoost package|xgboost]]
 |XGBOOST*
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_xgboost|<math>{\color{red}93%}</math>]]
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnd_xgboost|<math>{\color{red}97.42%}</math>]]
-|
 |[[Establishing an authenticity of sport news by Machine Learning Models#fnc_xgboost|<math>{\color{red}95.36%}</math>]]
-|
-|
-|<math>9 mins</math>
-|<math>21 mins</math>
 |
 |-
@@ Line 164: / Line 120: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/tree/index.html tree]
 |TREE
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_tree|<math>{\color{blue}91.33%}</math>]]
 |
-|
-|
-|
-|
-|<math>0 secs</math>
 |
 |
@@ Line 178: / Line 128: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/caTools/index.html caTools]
 |BOOSTING
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_boosting|<math>{\color{blue}92.67%}</math>]]
-|
-|
-|
 |
-|
-|<math>0 secs</math>
 |
 |
@@ Line 193: / Line 137: @@
 [https://cran.r-project.org/web/packages/nnet/index.html nnet]
 |NNET
-|[[Establishing an authenticity of sport news by Machine Learning Models#fnd_1000_nnet|<math>{\color{blue}00.00%}</math>]]
-|
-|
-|
-|
 |
-|<math>0 secs</math>
 |
 |
@@ Line 207: / Line 145: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/ipred/index.html ipred]
 |BAGGING**
-|
-|
-|
-|
-|
-|
 |
 |
@@ Line 221: / Line 153: @@
 |We used [[Establishing an authenticity of sport news by Machine Learning Models#The RTextTools package|RTextTools]], which depends on [https://cran.r-project.org/web/packages/ipred/index.html ipred]
 |SLDA**
-|
-|
-|
-|
-|
-|
 |
 |
 |
 |-
-| colspan="13" |* Low-memory algorithm
+| colspan="7" |* Low-memory algorithm
 <nowiki>**</nowiki> Very high-memory algorithm
 |}

Contents

Declaration

Acknowledgement

Abstract

Introduction

Chapter 1

Chapter 2 - Training a Supervised Machine Learning Model for fake news detection

Procedure

Results

Summary of Results

Summary of Results

Evaluation of Results

The Gofaaas-Fake News Detector R Package

Installation

Functions

Datasets used

Kaggle Fake News Dataset

Fake News Detector Dataset

Gofaaas Fake News Dataset

Algorithms

Naive Bayes

Support vector machine

Random forest

Extreme Gradient Boosting

The XGBoost R package

The RTextTools package

Chapter 3 - Gofaas Web App

Conclusion
