Difference between revisions of "Supervised Machine Learning for Fake News Detection"

Revision as of 17:54, 27 April 2019

Front Page

Declaration

Acknowledgement

Thanks to Muhammad, Graham and Mark

Abstract

Introduction

Chapter 1 - Project proposal

Chapter 2 - Training a Supervised Machine Learning Model for fake news detection

Supervised text Classification for fake news detection Using Machine Learning Models

Procedure

Evaluation

We evaluate our approach in different settings. First, we perform cross-validation on our noisy training set; second, and more importantly, we train models on the training set and validate them against a manually created gold standard. Moreover, we evaluate two variants, i.e., including and excluding user features. [smb:home/adelo/1-system/1-disco_local/1-mis_archivos/1-pe/1-ciencia/1-computacion/2-data_analysis-machine_learning/gofaaaz-machine_learning/5-References/7-Weakly_supervised_searning_for_fake_news_detection_on_twitter.pdf]
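The cross-validation step mentioned above can be sketched as follows. This is a minimal illustration only: the number of folds, the helper names (`k_fold_splits`, `majority_class_model`, `cross_validate`) and the stand-in majority-class model are assumptions made for this example and are not taken from the cited paper.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def majority_class_model(train_labels):
    """Hypothetical stand-in for a real classifier: always predicts the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _text: most_common


def cross_validate(texts, labels, k=5):
    """Return the mean accuracy over k folds (k=5 is an assumed default)."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(texts), k):
        model = majority_class_model([labels[i] for i in train_idx])
        correct = sum(model(texts[i]) == labels[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

In practice the stand-in model would be replaced by one of the classifiers listed in the results table, and the data would usually be shuffled before splitting.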

Results

Summary of Results

Algorithm	Author	Package	Keyword	Accuracy, Kaggle fake news dataset (1000 rows)	Accuracy, Kaggle fake news dataset (entire data: 20,800 rows)	Accuracy, Fake news challenge dataset (1000 rows)	Accuracy, Fake news challenge dataset (entire data: 49,972 rows)	Accuracy, Fake news Detector dataset (1000 rows)	Accuracy, Fake news Detector dataset (entire data)	Running time, Kaggle fake news dataset (entire data)	Running time, Fake news challenge dataset (entire data)	Running time, Fake news Detector dataset (entire data)
Naive Bayes	Bayes, Thomas	RTextTools (depends on e1071)	NB*	${\color {blue}69.03\%}$	${\color {blue}76.41\%}$	${\color {blue}73.91\%}$	${\color {blue}93.95\%}$			43 secs	36 secs
Support vector machine	Meyer et al., 2012	RTextTools (depends on e1071)	SVM*	${\color {blue}81\%}$	${\color {blue}95.42\%}$					59 mins
Random forest	Liaw and Wiener, 2002	RTextTools (depends on randomForest)	RF	${\color {blue}84\%}$						0 secs
Generalized linear models	Friedman et al., 2010	RTextTools (depends on glmnet)	GLMNET*	${\color {blue}91\%}$	${\color {blue}94.58\%}$					15 secs
Maximum entropy	Jurka, 2012	RTextTools (depends on maxent)	MAXENT*	${\color {blue}87.33\%}$	${\color {blue}96.09\%}$					54 secs
Extreme Gradient Boosting	Chen & Guestrin, 2016	xgboost	XGBOOST*	${\color {red}93\%}$	${\color {red}97.42\%}$		${\color {red}95.36\%}$			9 mins	21 mins
Classification or regression tree	Ripley, 2012	RTextTools (depends on tree)	TREE	${\color {blue}91.33\%}$						0 secs
Boosting	Tuszynski, 2012	RTextTools (depends on caTools)	BOOSTING	${\color {blue}92.67\%}$						0 secs
Neural networks	Venables and Ripley, 2002	RTextTools (depends on nnet)	NNET	${\color {blue}00.00\%}$						0 secs
Bagging	Peters and Hothorn, 2012	RTextTools (depends on ipred)	BAGGING**
Scaled linear discriminant analysis	Peters and Hothorn, 2012	RTextTools (depends on ipred)	SLDA**
* Low-memory algorithm ** Very high-memory algorithm
Running times are for the entire data, using a Google Cloud VM: 64 vCPUs, 240 GB memory (Ubuntu 18.04).
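For reference, the sketch below shows one common way to obtain figures of the kind reported in the table. It assumes that accuracy means the fraction of correctly classified rows and that running time is wall-clock time; the page does not state how either was measured, and the helper names `accuracy` and `timed` are hypothetical.

```python
import time


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels (assumed definition)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

For example, `accuracy(["fake", "real", "fake", "fake"], ["fake", "real", "real", "fake"])` counts three matches out of four rows.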

Datasets used

Kaggle fake news competition

https://www.kaggle.com/c/fake-news/data

Description of the Kaggle fake news dataset

Fake news Challenge

http://www.fakenewschallenge.org/

Exploring how artificial intelligence technologies could be leveraged to combat fake news.

Formal Definition

Input: A headline and a body text - either from the same news article or from two different articles.

Output: Classify the stance of the body text relative to the claim made in the headline into one of four categories:
- Agrees: The body text agrees with the headline.
- Disagrees: The body text disagrees with the headline.
- Discusses: The body text discusses the same topic as the headline, but does not take a position.
- Unrelated: The body text discusses a different topic than the headline.
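The input and output defined above can be expressed as a small typed interface. The sketch below is illustrative only: the names `Stance`, `HeadlineBodyPair` and `classify_stance` are hypothetical, and the word-overlap rule is a deliberately naive placeholder, not the method of any challenge participant.

```python
from dataclasses import dataclass
from enum import Enum


class Stance(Enum):
    """The four output categories of the stance detection task."""
    AGREE = "agree"
    DISAGREE = "disagree"
    DISCUSS = "discuss"
    UNRELATED = "unrelated"


@dataclass(frozen=True)
class HeadlineBodyPair:
    """Task input: a headline and a body text, possibly from different articles."""
    headline: str
    body: str


def classify_stance(pair: HeadlineBodyPair) -> Stance:
    """Placeholder rule: no shared words means UNRELATED, otherwise DISCUSS.

    A real system would also have to separate AGREE and DISAGREE.
    """
    headline_words = set(pair.headline.lower().split())
    body_words = set(pair.body.lower().split())
    if not headline_words & body_words:
        return Stance.UNRELATED
    return Stance.DISCUSS
```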

Winning team

First place - Team SOLAT in the SWEN

https://github.com/Cisco-Talos/fnc-1

The data consist of (headline, body, stance) instances, where stance is one of {unrelated, discuss, agree, disagree}. The dataset is provided as two CSV files:

train_bodies.csv : This file contains the body text of articles (the articleBody column) with corresponding IDs (Body ID)

train_stances.csv : This file contains the labeled stances (the Stance column) for pairs of article headlines (Headline) and article bodies (Body ID, referring to entries in train_bodies.csv).
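A minimal sketch of how the two files could be joined into (headline, body, stance) instances is shown below. It relies only on the column names given above; the function name `load_instances` and the in-memory sample rows are made up for illustration.

```python
import csv
import io


def load_instances(bodies_file, stances_file):
    """Join stances to article bodies on the shared 'Body ID' column."""
    bodies = {row["Body ID"]: row["articleBody"] for row in csv.DictReader(bodies_file)}
    instances = []
    for row in csv.DictReader(stances_file):
        instances.append((row["Headline"], bodies[row["Body ID"]], row["Stance"]))
    return instances


# Tiny in-memory stand-ins for train_bodies.csv and train_stances.csv.
sample_bodies = io.StringIO('Body ID,articleBody\n0,"Some article text."\n')
sample_stances = io.StringIO('Headline,Body ID,Stance\n"A headline",0,unrelated\n')
instances = load_instances(sample_bodies, sample_stances)
```

With the real files, the two arguments would be file objects opened with `open("train_bodies.csv", newline="", encoding="utf-8")` and the equivalent call for `train_stances.csv`.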

Description of the Fake news challenge dataset

The Stance Detection dataset for FNC-1 can be found at: https://github.com/FakeNewsChallenge/fnc-1

Distribution of the data

The distribution of Stance classes in train_stances.csv is as follows:

rows	unrelated	discuss	agree	disagree
49972	0.73131	0.17828	0.0736012	0.0168094
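The sketch below shows how such a distribution can be computed from a list of stance labels, and converts the reported proportions back into approximate row counts. The function name `stance_distribution` is hypothetical, and the counts are derived by rounding, not read from the dataset.

```python
from collections import Counter


def stance_distribution(stances):
    """Return the share of each stance label in the given list."""
    counts = Counter(stances)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Approximate row counts implied by the proportions reported above.
TOTAL_ROWS = 49972
REPORTED = {
    "unrelated": 0.73131,
    "discuss": 0.17828,
    "agree": 0.0736012,
    "disagree": 0.0168094,
}
implied_counts = {label: round(share * TOTAL_ROWS) for label, share in REPORTED.items()}
```

The classes are heavily imbalanced: a model that always predicts "unrelated" would already reach about 73.1% accuracy on this file, which is a useful baseline when reading accuracy figures for this dataset.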

Description of the Fake news Detector dataset

Algorithms

Naive Bayes

Naïve Bayes is based on Bayes' theorem; therefore, in order to understand Naïve Bayes, it is important to first understand Bayes' theorem.

Bayes' theorem is a mathematical formula for determining conditional probability, which is the probability of something happening given that something else has already occurred: $P(c|x) = \frac{P(x|c)\,P(c)}{P(x)}$, where:

P(c|x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P(x|c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
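As a worked illustration of these four terms, the sketch below applies Bayes' theorem, P(c|x) = P(x|c) P(c) / P(x), to made-up numbers. All probabilities are hypothetical and chosen only to make the arithmetic easy to follow; here c is the class "fake" and x is the presence of some word in an article.

```python
def posterior(prior_c, likelihood_x_given_c, evidence_x):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    if evidence_x <= 0:
        raise ValueError("P(x) must be positive")
    return likelihood_x_given_c * prior_c / evidence_x


# Hypothetical inputs, not measured on any dataset.
p_fake = 0.4              # P(c): prior probability that an article is fake
p_word_given_fake = 0.3   # P(x|c): the word appears in a fake article
p_word_given_real = 0.05  # the word appears in a real article

# P(x) by the law of total probability over the two classes.
p_word = p_word_given_fake * p_fake + p_word_given_real * (1 - p_fake)

# P(c|x): probability that an article is fake given that it contains the word.
p_fake_given_word = posterior(p_fake, p_word_given_fake, p_word)
```

With these numbers, P(x) = 0.12 + 0.03 = 0.15 and the posterior is 0.12 / 0.15 = 0.8, so observing the word raises the probability of "fake" from the prior 0.4 to 0.8.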

Prior probability, in Bayesian statistical inference, is the probability of an event before new data is collected.

Posterior probability is the revised probability of an event occurring after taking into consideration new information.

In statistical terms, the posterior probability is the probability of event A occurring given that event B has occurred.
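To connect these definitions to text classification, here is a minimal from-scratch sketch of a multinomial Naïve Bayes classifier. It is not the e1071 implementation used through RTextTools in this project; the class name `NaiveBayesText`, the whitespace tokenisation and the add-one (Laplace) smoothing are assumptions made for this example. The "naïve" part is that words are treated as independent given the class, so the per-word likelihoods are simply multiplied (added in log space).

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes over whitespace-separated words (illustrative only)."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)       # used for the priors P(c)
        self.word_counts = defaultdict(Counter)   # used for the likelihoods P(x|c)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocabulary.update(words)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of log likelihoods. P(x) is the same for
            # every class, so it can be dropped when comparing posteriors.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A toy usage example: after fitting on two "fake" headlines such as "shocking secret miracle cure" and two "real" ones such as "parliament passes budget bill", `predict` picks the class whose training words best match the new text.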

@@ Line 1: / Line 1: @@
+==Front Page==
+<br />
+==Declaration==
+<br />
+==Acknowledgement==
+Thanks to Muhammad, Graham and Mark
+<br />
+==Abstract==
+<br />
 ==Introduction==
 <br />