Difference between revisions of "Supervised Machine Learning for Fake News Detection"

  
 
<br />
 
<br />
 
====Results for the Kaggle fake news dataset====
 
 
 
<br />
 
=====Using a small portion of the dataset - 1000 rows=====
 
 
 
======Using GLMNET<span id="fnd_1000_glmnet"></span>======

[[File:Fnd_1000rows_GLMNET.png|760x760px|thumb|center]]


======Using SVM<span id="fnd_1000_svm"></span>======

[[File:Fnd_1000rows_SVM.png|760px|thumb|center|]]


======Using MAXENT<span id="fnd_1000_maxent"></span>======

[[File:Fnd_1000rows_MAXENT.png|760px|thumb|center|]]


======Using TREE<span id="fnd_1000_tree"></span>======

[[File:Fnd_1000rows_TREE.png|760px|thumb|center|]]


======Using RF<span id="fnd_1000_rf"></span>======

[[File:Fnd_1000rows_RF.png|760px|thumb|center|]]


======Using BOOSTING<span id="fnd_1000_boosting"></span>======

[[File:Fnd_1000rows_BOOSTING.png|760px|thumb|center|]]


======Using NNET<span id="fnd_1000_nnet"></span>======

[[File:Fnd_1000rows_NNET.png|760px|thumb|center|]]


======Using XGBOOST<span id="fnd_1000_xgboost"></span>======

[[File:Fnd_1000rows_XGBOOST_10000.png|760px|thumb|center|]]


======Using Naive Bayes<span id="fnd_1000_naivebayes"></span>======

[[File:NaiveBayes kaggle1000r.png|center|thumb|471x471px]]
 
 
 
=====Using the entire dataset - 20800 rows=====
 
 
======Using GLMNET<span id="fnd_glmnet"></span>======

[[File:Fnd_GLMNET.png|760px|thumb|center|]]


======Using SVM<span id="fnd_svm"></span>======

[[File:Fnd_SVM.png|760px|thumb|center|]]


======Using MAXENT<span id="fnd_maxent"></span>======

[[File:Fnd_MAXENT.png|760px|thumb|center|]]


======Using TREE<span id="fnd_tree"></span>======

[[File:Fnd_TREE.png|760px|thumb|center|]]


======Using RF<span id="fnd_rf"></span>======

[[File:Fnd_RF.png|760px|thumb|center|]]


======Using BOOSTING<span id="fnd_boosting"></span>======

[[File:Fnd_BOOSTING.png|760px|thumb|center|]]


======Using NNET======

<span id="fnd_nnet"></span>[[File:Fnd_NNET.png|760px|thumb|center|]]


======Using XGBOOST======

<span id="fnd_xgboost"></span>[[File:Fnd_XGBOOST_10000.png|760px|thumb|center|]]


======Using Naive Bayes<span id="fnd_naivebayes"></span>======

[[File:NaiveBayes kagglefull.png|center|thumb|431x431px]]
 
 
  
 
====Results for the Fake news challenge dataset====
 

Revision as of 17:39, 27 April 2019

Introduction


Chapter 1 - Project proposal


Chapter 2 - Training a Supervised Machine Learning Model for fake news detection

Supervised text Classification for fake news detection Using Machine Learning Models


Procedure


Evaluation

We evaluate our approach in different settings. First, we perform cross-validation on our noisy training set; second, and more importantly, we train models on the training set and validate them against a manually created gold standard. Moreover, we evaluate two variants, i.e., including and excluding user features. [smb:home/adelo/1-system/1-disco_local/1-mis_archivos/1-pe/1-ciencia/1-computacion/2-data_analysis-machine_learning/gofaaaz-machine_learning/5-References/7-Weakly_supervised_searning_for_fake_news_detection_on_twitter.pdf]
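The cross-validation step mentioned above can be illustrated with a minimal sketch. It assumes plain k-fold splitting over row indices without shuffling; the function name `k_fold_indices` and the values 10 and 3 are hypothetical and are not taken from the cited paper or from RTextTools.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical example: 10 rows split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each row is used for testing exactly once and for training in the remaining folds. In practice the rows would usually be shuffled first so that each fold has a similar class mix.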


Results


Summary of Results

Columns: Algorithms | Author | Package | Keyword | Accuracy | Running time for the entire data
Accuracy sub-columns: Kaggle fake news dataset (1000 rows; the entire data: 20,800 rows) | Fake news challenge dataset (1000 rows; the entire data: 49,972 rows) | Fake news Detector dataset (1000 rows; the entire data)
Running time sub-columns (using a GoogleCloud VM: 64 vCPUs, 240 GB memory, Ubuntu 18.04): Kaggle fake news dataset | Fake news challenge dataset | Fake news Detector dataset

Algorithms | Author | Package | Keyword
Naive Bayes | Bayes, Thomas | We used RTextTools, which depends on e1071 | NB*
Support vector machine | Meyer et al., 2012 | We used RTextTools, which depends on e1071 | SVM*
Random forest | Liaw and Wiener, 2002 | We used RTextTools, which depends on randomForest | RF
General linearized models | Friedman et al., 2010 | We used RTextTools, which depends on glmnet | GLMNET*
Maximum entropy | Jurka, 2012 | We used RTextTools, which depends on maxent | MAXENT*
Extreme Gradient Boosting | Chen & Guestrin, 2016 | xgboost | XGBOOST*
Classification or regression tree | Ripley, 2012 | We used RTextTools, which depends on tree | TREE
Boosting | Tuszynski, 2012 | We used RTextTools, which depends on caTools | BOOSTING
Neural networks | Venables and Ripley, 2002 | We used RTextTools, which depends on nnet | NNET
Bagging | Peters and Hothorn, 2012 | We used RTextTools, which depends on ipred | BAGGING**
Scaled linear discriminant analysis | Peters and Hothorn, 2012 | We used RTextTools, which depends on ipred | SLDA**

* Low-memory algorithm

** Very high-memory algorithm



Results for the Fake news challenge dataset


Using a portion of the dataset - 1000 rows
Using GLMNET
Using SVM
Using MAXENT
Using TREE
Using RF
Using BOOSTING
Using NNET
Using Naive Bayes
NaiveBayes fnc1000.png



Using the entire dataset - 50000 rows
Using GLMNET
Using SVM
Using MAXENT
Using TREE
Using RF
Using BOOSTING
Using NNET
Using XGBOOST
Fnc XGBOOST 5000.png


Using Naive Bayes
NaiveBayes4.png

Datasets used

Kaggle fake news competition

https://www.kaggle.com/c/fake-news/data


Description of the Kaggle fake news dataset



Fake news Challenge

http://www.fakenewschallenge.org/

Exploring how artificial intelligence technologies could be leveraged to combat fake news.


Formal Definition
  • Input: A headline and a body text - either from the same news article or from two different articles.
  • Output: Classify the stance of the body text relative to the claim made in the headline into one of four categories:
    • Agrees: The body text agrees with the headline.
    • Disagrees: The body text disagrees with the headline.
    • Discusses: The body text discusses the same topic as the headline, but does not take a position.
    • Unrelated: The body text discusses a different topic than the headline.


Winner team
First place - Team SOLAT in the SWEN

https://github.com/Cisco-Talos/fnc-1

The data provided is (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:

  • train_bodies.csv : This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)
  • train_stances.csv : This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
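As a minimal sketch of how the two files fit together, the snippet below joins each stance row to its article body through the Body ID column, producing (headline, body, stance) instances. The column names come from the description above; the sample rows and the helper name `load_instances` are made up for illustration, and in-memory strings stand in for the real CSV files.

```python
import csv
import io

# Tiny made-up stand-ins for train_bodies.csv and train_stances.csv.
BODIES_CSV = """Body ID,articleBody
0,Text of the first article.
1,Text of the second article.
"""

STANCES_CSV = """Headline,Body ID,Stance
Headline about the first article,0,agree
Headline about something else,1,unrelated
Another headline on the first article,0,discuss
"""


def load_instances(bodies_file, stances_file):
    """Return a list of (headline, body, stance) tuples joined on Body ID."""
    bodies = {row["Body ID"]: row["articleBody"] for row in csv.DictReader(bodies_file)}
    instances = []
    for row in csv.DictReader(stances_file):
        instances.append((row["Headline"], bodies[row["Body ID"]], row["Stance"]))
    return instances


instances = load_instances(io.StringIO(BODIES_CSV), io.StringIO(STANCES_CSV))
for headline, body, stance in instances:
    print(stance, "|", headline, "|", body)
```

With the real files, the two `io.StringIO` objects would be replaced by file handles opened with `newline=""` and an appropriate encoding.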


Description of the Fake news challenge dataset

The Stance Detection dataset for FNC1 can be found at: https://github.com/FakeNewsChallenge/fnc-1


Distribution of the data

The distribution of Stance classes in train_stances.csv is as follows:

rows unrelated discuss agree disagree
49972 0.73131 0.17828 0.0736012 0.0168094
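The proportions above can be converted back into approximate row counts, which also shows how unbalanced the classes are. This is a minimal sketch: the counts are obtained by rounding, and the "majority-class baseline" (always predicting the most frequent stance) is an illustration added here, not a result reported on this page.

```python
TOTAL_ROWS = 49972

# Class proportions as listed for train_stances.csv.
PROPORTIONS = {
    "unrelated": 0.73131,
    "discuss": 0.17828,
    "agree": 0.0736012,
    "disagree": 0.0168094,
}

# Approximate number of rows per class (rounded to the nearest integer).
counts = {label: round(TOTAL_ROWS * p) for label, p in PROPORTIONS.items()}

# A classifier that always predicts the most frequent class would reach
# this accuracy on the training data.
majority_label = max(PROPORTIONS, key=PROPORTIONS.get)
baseline_accuracy = PROPORTIONS[majority_label]

print(counts)
print(majority_label, baseline_accuracy)
```

Because about 73% of the pairs are labeled unrelated, accuracy figures for this dataset are easier to interpret when compared against that baseline.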



Description of the Fake news Detector dataset



Algorithms


General linearized models


Support vector machine


Maximum entropy


Classification or regression tree


Random forest


Boosting


Neural networks


Extreme Gradient Boosting


Naive Bayes

Naïve Bayes is based on Bayes' theorem; therefore, to understand Naïve Bayes it is important to first understand Bayes' theorem.

Bayes' theorem is a mathematical formula for determining conditional probability, which is the probability of something happening given that something else has already occurred.


P(c|x) = P(x|c) · P(c) / P(x)
  • P(c|x) is the posterior probability of class (target) given predictor (attribute).
  • P(c) is the prior probability of class.
  • P(x|c) is the likelihood which is the probability of predictor given class.
  • P(x) is the prior probability of predictor.


Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected.

Posterior probability is the revised probability of an event occurring after taking into consideration new information.

In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.
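The relation P(c|x) = P(x|c) · P(c) / P(x) can be worked through numerically. The sketch below uses two classes and a single predictor; the class names and all probabilities are invented for illustration, and P(x) is obtained by summing P(x|c) · P(c) over the classes.

```python
def posterior(prior, likelihood):
    """Return P(c|x) for every class c, given P(c) and P(x|c)."""
    # P(x): total probability of the predictor across all classes.
    evidence = sum(prior[c] * likelihood[c] for c in prior)
    return {c: prior[c] * likelihood[c] / evidence for c in prior}


# Hypothetical numbers: P(c) for each class, and P(x|c) for one word x.
prior = {"fake": 0.4, "real": 0.6}
likelihood = {"fake": 0.5, "real": 0.1}

post = posterior(prior, likelihood)
print(post)
```

Here P(x) = 0.4 · 0.5 + 0.6 · 0.1 = 0.26, so the posterior for "fake" is 0.20 / 0.26, roughly 0.77, even though its prior was only 0.4. Naïve Bayes applies the same rule with several predictors by assuming they are independent given the class.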


The RTextTools package

RTextTools - A Supervised Learning Package for Text Classification:


Gofaaas Fake News detector Web App


Conclusion