Difference between revisions of "Supervised Machine Learning for Fake News Detection"
Adelo Vieira (talk | contribs) (→Gofaaas Fake News detector Web App) |
Revision as of 17:54, 27 April 2019
Contents
- 1 Front Page
- 2 Declaration
- 3 Acknowledgement
- 4 Abstract
- 5 Introduction
- 6 Chapter 1 - Project proposal
- 7 Chapter 2 - Training a Supervised Machine Learning Model for fake news detection
- 8 Gofaas Web App
- 9 Conclusion
Front Page
Declaration
Acknowledgement
Thanks to Muhammad, Graham and Mark
Abstract
Introduction
Chapter 1 - Project proposal
Chapter 2 - Training a Supervised Machine Learning Model for fake news detection
Supervised text Classification for fake news detection Using Machine Learning Models
Procedure
Evaluation
We evaluate our approach in different settings. First, we perform cross-validation on our noisy training set; second, and more importantly, we train models on the training set and validate them against a manually created gold standard. Moreover, we evaluate two variants, i.e., including and excluding user features. [smb:home/adelo/1-system/1-disco_local/1-mis_archivos/1-pe/1-ciencia/1-computacion/2-data_analysis-machine_learning/gofaaaz-machine_learning/5-References/7-Weakly_supervised_searning_for_fake_news_detection_on_twitter.pdf]
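The cross-validation step mentioned in the passage above can be illustrated with a minimal Python sketch. The fold count, the toy labels and the majority-class stand-in model are all assumptions made for illustration; this is not the procedure of the quoted paper, and the project itself relies on R packages. Rows are assumed to be already shuffled.

```python
from collections import Counter


def k_fold_indices(n_rows, k):
    """Split row indices 0..n_rows-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(labels, k):
    """Return the per-fold accuracy of a majority-class stand-in model.

    The stand-in predicts the most frequent training label for every test row;
    a real classifier would be trained and applied in its place.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        test_set = set(test_idx)
        train_labels = [y for i, y in enumerate(labels) if i not in test_set]
        majority = Counter(train_labels).most_common(1)[0][0]
        correct = sum(1 for i in test_idx if labels[i] == majority)
        accuracies.append(correct / len(test_idx))
    return accuracies


# Hypothetical toy labels (not taken from any dataset in this report).
toy_labels = ["fake", "real", "real", "fake", "real",
              "real", "fake", "real", "real", "real"]
fold_accuracies = cross_validate(toy_labels, k=5)
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
```

Each row is used for testing exactly once, and the mean of the per-fold accuracies is the cross-validated estimate.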
Results
Summary of Results
Algorithm | Author | Package | Keyword | Accuracy: Kaggle fake news dataset, 1000 rows | Accuracy: Kaggle fake news dataset, entire data (20,800 rows) | Accuracy: Fake news challenge dataset, 1000 rows | Accuracy: Fake news challenge dataset, entire data (49,972 rows) | Accuracy: Fake news Detector dataset, 1000 rows | Accuracy: Fake news Detector dataset, entire data | Running time for the entire data: Kaggle fake news dataset | Running time for the entire data: Fake news challenge dataset | Running time for the entire data: Fake news Detector dataset |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Naive Bayes | Bayes, Thomas | We used RTextTools, which depends on e1071 | NB* | | | | | | | | | |
Support vector machine | Meyer et al., 2012 | We used RTextTools, which depends on e1071 | SVM* | | | | | | | | | |
Random forest | Liaw and Wiener, 2002 | We used RTextTools, which depends on randomForest | RF | | | | | | | | | |
General linearized models | Friedman et al., 2010 | We used RTextTools, which depends on glmnet | GLMNET* | | | | | | | | | |
Maximum entropy | Jurka, 2012 | We used RTextTools, which depends on maxent | MAXENT* | | | | | | | | | |
Extreme Gradient Boosting | Chen & Guestrin, 2016 | xgboost | XGBOOST* | | | | | | | | | |
Classification or regression tree | Ripley, 2012 | We used RTextTools, which depends on tree | TREE | | | | | | | | | |
Boosting | Tuszynski, 2012 | We used RTextTools, which depends on caTools | BOOSTING | | | | | | | | | |
Neural networks | Venables and Ripley, 2002 | We used RTextTools, which depends on | NNET | | | | | | | | | |
Bagging | Peters and Hothorn, 2012 | We used RTextTools, which depends on ipred | BAGGING** | | | | | | | | | |
Scaled linear discriminant analysis | Peters and Hothorn, 2012 | We used RTextTools, which depends on ipred | SLDA** | | | | | | | | | |
Running time for the entire data: using a Google Cloud VM with 64 vCPUs and 240 GB memory (Ubuntu 18.04).
* Low-memory algorithm
** Very high-memory algorithm
Datasets used
Kaggle fake news competition
https://www.kaggle.com/c/fake-news/data
Description of the Kaggle fake news dataset
Fake news Challenge
http://www.fakenewschallenge.org/
Exploring how artificial intelligence technologies could be leveraged to combat fake news.
Formal Definition
- Input: A headline and a body text - either from the same news article or from two different articles.
- Output: Classify the stance of the body text relative to the claim made in the headline into one of four categories:
- Agrees: The body text agrees with the headline.
- Disagrees: The body text disagrees with the headline.
- Discusses: The body text discusses the same topic as the headline, but does not take a position.
- Unrelated: The body text discusses a different topic than the headline.
Winning team
First place - Team SOLAT in the SWEN
https://github.com/Cisco-Talos/fnc-1
The data provided is (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSVs:
- train_bodies.csv : This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)
- train_stances.csv : This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
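The two files described above can be combined by matching each stance row to its article body through the shared Body ID column. The following is a minimal Python sketch using only the column names listed above; the function name `load_fnc_pairs` and the two-row sample data are hypothetical and stand in for the real CSV files.

```python
import csv
import io


def load_fnc_pairs(bodies_file, stances_file):
    """Join stances to article bodies on the shared "Body ID" column.

    Returns a list of (headline, body_text, stance) tuples.
    """
    bodies = {row["Body ID"]: row["articleBody"]
              for row in csv.DictReader(bodies_file)}
    pairs = []
    for row in csv.DictReader(stances_file):
        pairs.append((row["Headline"], bodies[row["Body ID"]], row["Stance"]))
    return pairs


# Hypothetical two-row stand-ins for train_bodies.csv and train_stances.csv.
sample_bodies = io.StringIO(
    "Body ID,articleBody\n"
    "0,Text of the first article.\n"
    "1,Text of the second article.\n"
)
sample_stances = io.StringIO(
    "Headline,Body ID,Stance\n"
    "A headline about the first article,0,discuss\n"
    "A headline about something else,1,unrelated\n"
)
pairs = load_fnc_pairs(sample_bodies, sample_stances)
```

With the real data, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="", encoding="utf-8")`.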
Description of the Fake news challenge dataset
The Stance Detection dataset for FNC1 can be found at: https://github.com/FakeNewsChallenge/fnc-1
Distribution of the data
The distribution of Stance classes in train_stances.csv is as follows:
rows | unrelated | discuss | agree | disagree |
---|---|---|---|---|
49972 | 0.73131 | 0.17828 | 0.0736012 | 0.0168094 |
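Because the classes are strongly imbalanced, it may help to convert the proportions in the table above into approximate row counts and to note the accuracy that a trivial classifier would reach. The short Python sketch below uses only the values from the table; the rounded counts are derived here and are not reported in the source.

```python
# Values copied from the table above.
total_rows = 49972
proportions = {
    "unrelated": 0.73131,
    "discuss": 0.17828,
    "agree": 0.0736012,
    "disagree": 0.0168094,
}

# Approximate row counts implied by the proportions (rounded to whole rows).
counts = {stance: round(share * total_rows)
          for stance, share in proportions.items()}

# Accuracy of a baseline that always predicts the most frequent stance.
majority_stance = max(proportions, key=proportions.get)
baseline_accuracy = proportions[majority_stance]
```

A model that always answers "unrelated" would already be right on about 73% of the rows, so any accuracy reported on this dataset is best read against that baseline.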
Description of the Fake news Detector dataset
Algorithms
Naive Bayes
Naïve Bayes is based on Bayes' theorem; therefore, to understand Naïve Bayes it is important to first understand Bayes' theorem.
Bayes' theorem is a mathematical formula for determining conditional probability, that is, the probability of something happening given that something else has already occurred:
P(c|x) = P(x|c) P(c) / P(x)
where:
- P(c|x) is the posterior probability of class (target) given predictor (attribute).
- P(c) is the prior probability of class.
- P(x|c) is the likelihood which is the probability of predictor given class.
- P(x) is the prior probability of predictor.
Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected.
Posterior probability is the revised probability of an event occurring after taking into consideration new information.
In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.
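To show how the prior and the likelihood combine into a classification decision, here is a minimal Python sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a generic illustration and not the e1071 or RTextTools implementation used in this project; the function names and the toy training sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """Collect class counts, per-class word counts and the vocabulary
    from a list of (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in documents:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class c maximising log P(c) + sum of log P(word|c).

    P(x) is the same for every class, so it is dropped from the comparison.
    Add-one (Laplace) smoothing avoids zero likelihoods for unseen words.
    """
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior, log P(c)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)  # log likelihood, log P(word|c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, invented for illustration only.
toy_documents = [
    ("shocking miracle cure doctors hate", "fake"),
    ("you will not believe this shocking secret", "fake"),
    ("government publishes annual budget report", "real"),
    ("city council approves new budget", "real"),
]
model = train_naive_bayes(toy_documents)
prediction = predict("shocking secret cure", model)
```

The "naive" part is the assumption that words are conditionally independent given the class, which is what allows the per-word likelihoods to be multiplied (here, added in log space).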
Support vector machine
Random forest
Extreme Gradient Boosting
The XGBoost R package
XGBoost - Extreme Gradient Boosting:
- https://xgboost.readthedocs.io/en/latest/
- https://cran.r-project.org/web/packages/xgboost/index.html
The RTextTools package
RTextTools - A Supervised Learning Package for Text Classification:
- https://journal.r-project.org/archive/2013/RJ-2013-001/RJ-2013-001.pdf
- http://www.rtexttools.com/
- https://cran.r-project.org/web/packages/RTextTools/index.html
Gofaas Web App
A way to interact with, test and display the model results
Conclusion