Paper: Fake news detection on social media - A data mining perspective


Latest revision as of 23:27, 15 June 2020

On this page I have written the points that interest me most from this paper: https://www.kdd.org/exploration_files/19-1-Article2.pdf



Fake news detection

In the previous section, we introduced the conceptual characterization of traditional fake news and fake news in social media. Based on this characterization, we further explore the problem definition and proposed approaches for fake news detection.



Problem Definition

In this subsection, we present the details of the mathematical formulation of fake news detection on social media. Specifically, we will introduce the definition of the key components of fake news and then present the formal definition of fake news detection.


The basic notations are defined below:

  • Let a refer to a News Article. It consists of two major components: Publisher and Content:
    • Publisher p_a includes a set of profile features to describe the original author, such as name, domain, age, among other attributes.
    • Content c_a consists of a set of attributes that represent the news article and includes headline, text, image, etc.
  • We also define Social News Engagements as a set of tuples E = {e_it} to represent the process of how news spreads over time among n users U = {u_1, u_2, ..., u_n} and their corresponding posts P = {p_1, p_2, ..., p_n} on social media regarding news article a. Each engagement e_it = {u_i, p_i, t} represents that a user u_i spreads news article a using p_i at time t. Note that we set t = Null if the article does not have any engagement yet, and thus u_i represents the publisher.


Definition 2 (Fake News Detection): Given the social news engagements E among n users for news article a, the task of fake news detection is to predict whether the news article a is a fake news piece or not, i.e., to learn F: E → {0, 1} such that:

    F(a) = 1, if a is a piece of fake news
    F(a) = 0, otherwise

where F is the prediction function we want to learn. Note that we define fake news detection as a binary classification problem for the following reason: fake news is essentially a distortion bias on information manipulated by the publisher. According to previous research about media bias theory [26], distortion bias is usually modeled as a binary classification problem.
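The notation above can be made concrete with a minimal data model. This is only an illustrative sketch in Python; the class name `Engagement`, the field names, and the placeholder rule inside `F` are my own, not from the paper:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Engagement:
    """One engagement tuple e_it = {u_i, p_i, t}: user u_i spreads the article via post p_i at time t."""
    user: str
    post: str
    time: Optional[float]  # None stands for Null: no real engagement yet, so `user` is the publisher

def F(engagements: List[Engagement]) -> int:
    """Prediction function F: E -> {0, 1}, where 1 = fake and 0 = real.
    The body is a placeholder rule, only to make the signature concrete."""
    return 1 if all(e.time is None for e in engagements) else 0

# Toy engagement set for one article: the publisher entry plus one real spread event.
E = [Engagement("publisher", "p0", None), Engagement("u1", "p1", 10.0)]
print(F(E))  # prints 0: the toy rule only flags articles nobody has engaged with
```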


Next, we propose a general data mining framework for fake news detection which includes two phases:

  • (i) Feature extraction: The feature extraction phase aims to represent news content and related auxiliary information in a formal mathematical structure.
  • (ii) Model construction: The model construction phase further builds machine learning models to better differentiate fake news and real news based on the feature representations.
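The two-phase framework can be sketched end to end: a feature-extraction step that maps raw content to a vector, followed by a learned model. This is a toy illustration only; the three features, the hand-rolled perceptron, and the two-document corpus are all invented for the example:

```python
def extract_features(text: str) -> list:
    """Phase (i): represent raw news content as a formal mathematical structure (a feature vector)."""
    words = text.split()
    return [
        len(words),                                        # total words
        sum(len(w) for w in words) / max(len(words), 1),   # characters per word
        text.count("!"),                                   # crude sensationalism signal
    ]

def train_perceptron(X, y, epochs=20, lr=0.1):
    """Phase (ii): build a model that separates fake from real based on the feature representations."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
            err = target - pred
            b += lr * err
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w, b

def predict(w, b, x):
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

corpus = ["SHOCKING! You won't believe this!!", "Parliament passed the budget bill today."]
labels = [1, 0]  # toy labels: 1 = fake, 0 = real
X = [extract_features(t) for t in corpus]
w, b = train_perceptron(X, labels)
```

On this tiny corpus the perceptron separates the two documents mainly through the exclamation-mark feature, which mirrors the intuition that sensational style correlates with fake news.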



Feature Extraction

Fake news detection on traditional news media mainly relies on news content, while in social media, extra social context information can be used as auxiliary information to help detect fake news. Thus, we will present the details of how to extract and represent useful features from news content and social context.



News Content Features

News content features c_a describe the meta information related to a piece of news. A list of representative news content attributes is given below:

  • Source: Author or publisher of the news article
  • Headline: Short title text that aims to catch the attention of readers and describes the main topic of the article
  • Body Text: Main text that elaborates the details of the news story; there is usually a major claim that is specifically highlighted and that shapes the angle of the publisher
  • Image/Video: Part of the body content of a news article that provides visual cues to frame the story

Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, the news content we are looking at will mostly be linguistic-based and visual-based, described in more detail below.



Linguistic-based

Since fake news pieces are intentionally created for financial or political gain rather than to report objective claims, they often contain opinionated and inflammatory language, crafted as "clickbait" (i.e., to entice users to click on the link to read the full article) or to incite confusion [13]. Thus, it is reasonable to exploit linguistic features that capture the different writing styles and sensational headlines to detect fake news.

Linguistic-based features are extracted from the text content in terms of document organizations from different levels, such as characters, words, sentences, and documents. In order to capture the different aspects of fake news and real news, existing work utilized both common linguistic features and domain-specific linguistic features.



Common linguistic features

Common linguistic features are often used to represent documents for various tasks in natural language processing. Typical common linguistic features are:

  • (i) Lexical features: Including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words.
  • (ii) Syntactic features: Including sentence-level features, such as frequency of function words and phrases (i.e., "n-grams" and bag-of-words approaches [24]) or punctuation and parts-of-speech (POS) tagging.
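A rough illustration of how features (i) and (ii) might be computed in plain Python; the "large word" threshold of 6 characters is arbitrary, and POS tagging is omitted because it requires an external tagger (e.g., the one in NLTK):

```python
from collections import Counter

def lexical_features(text):
    """(i) Lexical: character- and word-level statistics of a document."""
    words = text.lower().split()
    return {
        "total_words": len(words),
        "chars_per_word": sum(len(w) for w in words) / max(len(words), 1),
        "unique_words": len(set(words)),
        "large_word_freq": sum(len(w) > 6 for w in words) / max(len(words), 1),
    }

def syntactic_features(text, n=2):
    """(ii) Syntactic: bag-of-words and word n-gram frequency counts."""
    words = text.lower().split()
    bow = Counter(words)
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return bow, ngrams

feats = lexical_features("You will never believe these unbelievable findings")
bow, bigrams = syntactic_features("the cat sat on the mat")
```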



Domain-specific linguistic features

These are specifically aligned to the news domain, such as quoted words, external links, number of graphs, and the average length of graphs [62].


Moreover, other features can be specifically designed to capture the deceptive cues in writing styles to differentiate fake news, such as lying-detection features [1].



Visual-based

Visual cues have been shown to be an important manipulator for fake news propaganda. As we have characterized, fake news exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional responses from consumers. Visual-based features are extracted from visual elements (e.g., images and videos) to capture the different characteristics of fake news.

Fake images have been identified based on various user-level and tweet-level hand-crafted features using a classification framework [28]. Recently, various visual and statistical features have been extracted for news verification [38]:

  • Visual features include:
    • Clarity score
    • Coherence score
    • Similarity distribution histogram
    • Diversity score
    • Clustering score
  • Statistical features include: count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc.



Social Context Features

In addition to features related directly to the content of the news articles, additional social context features can also be derived from the user-driven social engagements of news consumption on social media platforms. Social engagements represent the news proliferation process over time, which provides useful auxiliary information to infer the veracity of news articles. Note that few papers exist in the literature that detect fake news using social context features. However, because we believe this is a critical aspect of successful fake news detection, we introduce a set of common features utilized in similar research areas, such as rumor veracity classification on social media.


Generally, there are three major aspects of the social media context that we want to represent:

  • Users,
  • Generated posts, and
  • Networks.

Below, we investigate how we can extract and represent social context features from these three aspects to support fake news detection:



User-based

As we mentioned in Section 2.3, fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs. Thus, capturing users’ profiles and characteristics by user-based features can provide useful information for fake news detection.
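As a sketch, user-based features could be collected from profile metadata like this; the `profile` dictionary and every field name in it are hypothetical, not a real platform API:

```python
def user_features(profile):
    """Derive bot-indicative user-based features from a (hypothetical) profile record."""
    age_days = profile.get("account_age_days", 1)
    return {
        "account_age_days": age_days,
        # bots often follow many accounts but attract few followers
        "follower_following_ratio": profile.get("followers", 0) / max(profile.get("following", 1), 1),
        "is_verified": int(profile.get("verified", False)),
        # hyperactive posting rates are another common bot signal
        "posts_per_day": profile.get("post_count", 0) / max(age_days, 1),
    }

# A very young, hyperactive, low-follower account: typical bot-like signals.
f = user_features({"account_age_days": 10, "followers": 5, "following": 2000,
                   "post_count": 900, "verified": False})
```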



Post-based

People express their emotions or opinions towards fake news through social media posts, such as skeptical opinions, sensational reactions, etc. Thus, it is reasonable to extract post-based features to help find potential fake news via reactions from the general public as expressed in posts.



Network-based

Users form different networks on social media in terms of interests, topics, and relations. As mentioned before, fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection. Network-based features are extracted via constructing specific networks among the users who published related social media posts.
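One simple instance of such a network is a co-engagement graph, where two users are linked whenever they posted about the same article; dense clusters of such links can hint at echo chambers. A hedged sketch (the `(user, article_id)` input format is invented for illustration):

```python
from collections import defaultdict
from itertools import combinations

def build_co_engagement_network(engagements):
    """Connect two users with an edge whenever they both posted about the same article.
    `engagements` is a list of (user, article_id) pairs."""
    by_article = defaultdict(set)
    for user, article in engagements:
        by_article[article].add(user)
    edges = set()
    for users in by_article.values():
        for u, v in combinations(sorted(users), 2):
            edges.add((u, v))
    return edges

def degrees(edges):
    """Per-user degree: how many distinct users each user co-engaged with."""
    d = defaultdict(int)
    for u, v in edges:
        d[u] += 1
        d[v] += 1
    return dict(d)

edges = build_co_engagement_network(
    [("alice", "a1"), ("bob", "a1"), ("carol", "a1"), ("bob", "a2"), ("dave", "a2")])
```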



Model Construction

In the previous section, we introduced features extracted from different sources, i.e., news content and social context, for fake news detection. In this section, we discuss the details of the model construction process for several existing approaches. Specifically, we categorize existing methods based on their main input sources as: News Content Models and Social Context Models.



News Content Models

In this subsection, we focus on news content models, which mainly rely on news content features and existing factual sources to classify fake news. Specifically, existing approaches can be categorized as Knowledge-based and Style-based.



Knowledge-based

Knowledge-based approaches aim to use external sources to fact-check proposed claims in news content. The goal of fact-checking is to assign a truth value to a claim in a particular context [83]. Fact-checking has attracted increasing attention, and many efforts have been made to develop a feasible automated fact-checking system.

Existing fact-checking approaches can be categorized as: expert-oriented, crowdsourcing-oriented, and computational-oriented.



Expert-oriented approaches

Expert-oriented fact-checking heavily relies on human domain experts to investigate relevant data and documents to construct the verdicts of claim veracity, for example PolitiFact [11], Snopes [12], etc. However, expert-oriented fact-checking is an intellectually demanding and time-consuming process, which limits the potential for high efficiency and scalability.



Crowdsourcing-oriented approaches

Crowdsourcing-oriented fact-checking exploits the "wisdom of crowd" to enable normal people to annotate news content; these annotations are then aggregated to produce an overall assessment of the news veracity. For example, Fiskkit [13] allows users to discuss and annotate the accuracy of specific parts of a news article. As another example, an anti-fake news bot named "For real" is a public account in the instant communication mobile application LINE [14], which allows people to report suspicious news content which is then further checked by editors.



Computational-oriented approaches

These approaches aim to provide an automatic, scalable system to classify true and false claims. Previous computational-oriented fact-checking methods try to solve two major issues: (i) identifying check-worthy claims (i.e., identifying which statements need to be checked) and (ii) discriminating the veracity of fact claims.

. . .



Style-based

Style-based approaches try to detect fake news by capturing the manipulators in the writing style of news content. There are mainly two typical categories of style-based methods: Deception-oriented and Objectivity-oriented:



Deception-oriented

These stylometric methods capture deceptive statements or claims from news content. The motivation of deception detection originates from forensic psychology (i.e., the Undeutsch Hypothesis) [82], and various forensic tools, including Criteria-based Content Analysis [84] and Scientific-based Content Analysis [45], have been developed.

More recently, advanced natural language processing models have been applied to spot deception from the following perspectives: Deep syntax and Rhetorical structure.

  • Deep syntax models have been implemented using Probabilistic context-free grammar (PCFG), with which sentences can be transformed into rules that describe the syntax structure. Based on the PCFG, different rules can be developed for deception detection, such as unlexicalized/lexicalized production rules and grandparent rules [22].
  • Rhetorical structure theory can be utilized to capture the differences between deceptive and truthful sentences [68].
  • Deep network models, such as convolutional neural networks (CNN), have also been applied to classify fake news veracity [90].



Objectivity-oriented

Objectivity-oriented approaches capture style signals that can indicate a decreased objectivity of news content and thus the potential to mislead consumers, such as: hyperpartisan styles and yellow-journalism.

  • Hyperpartisan styles represent extreme behavior in favor of a particular political party, which often correlates with a strong motivation to create fake news. Linguistic-based features can be applied to detect hyperpartisan articles [62].
  • Yellow-journalism represents those articles that do not contain well-researched news, but instead rely on eye-catching headlines (i.e., clickbait) with a propensity for exaggeration, sensationalization, scare-mongering, etc.



Social Context Models

The nature of social media provides researchers with additional resources to supplement and enhance News Content Models. Social context models include relevant user social engagements in the analysis, capturing this auxiliary information from a variety of perspectives. We can classify existing approaches for social context modeling into two categories: Stance-based and Propagation-based. Note that very few existing fake news detection approaches have utilized social context models. Thus, we also introduce similar methods for rumor detection using social media, which have potential application for fake news detection.



Stance-based

Stance-based approaches utilize users' viewpoints from relevant post contents to infer the veracity of original news articles.

The stance of users' posts can be represented either explicitly or implicitly:

  • Explicit stances are direct expressions of emotion or opinion, such as the "thumbs up" and "thumbs down" reactions on Facebook.
  • Implicit stances can be automatically extracted from social media posts. Stance detection is the task of automatically determining from a post whether the user is in favor of, neutral toward, or against some target entity, event, or idea [53].
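Explicit stances can be aggregated directly into a veracity signal. The sketch below uses an arbitrary majority threshold and made-up stance labels, purely for illustration:

```python
def infer_veracity_from_stances(stances, threshold=0.5):
    """Aggregate per-post stances toward an article into a coarse veracity signal.
    `stances` are labels in {"favor", "neutral", "against"}; the threshold is arbitrary."""
    opinionated = [s for s in stances if s != "neutral"]
    if not opinionated:
        return "unknown"  # no opinionated posts, nothing to infer from
    # a majority of skeptical ("against") posts is treated as evidence of fakeness
    against_ratio = opinionated.count("against") / len(opinionated)
    return "likely fake" if against_ratio > threshold else "likely real"

print(infer_veracity_from_stances(["against", "against", "favor", "neutral"]))  # prints "likely fake"
```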



Propagation-based