Paper: Fake news detection on social media - A data mining perspective
On this page I have written down the points that interest me most about this paper: https://www.kdd.org/exploration_files/19-1-Article2.pdf
- 1 Fake news detection
- 1.1 Problem Definition
- 1.2 Feature Extraction
- 1.3 Model Construction
- 1.3.1 News Content Models
- 1.3.2 Social Context Models
Fake news detection
In the previous section, we introduced the conceptual characterization of traditional fake news and fake news in social media. Based on this characterization, we further explore the problem definition and proposed approaches for fake news detection.
In this subsection, we present the mathematical formulation of fake news detection on social media. Specifically, we introduce the definitions of the key components of fake news and then present the formal definition of fake news detection.
The basic notations are defined below:
- Let $a$ refer to a News Article. It consists of two major components: Publisher and Content:
- Publisher $\vec{p}_a$ includes a set of profile features to describe the original author, such as name, domain, and age, among other attributes.
- Content $\vec{c}_a$ consists of a set of attributes that represent the news article and includes headline, text, image, etc.
- We also define Social News Engagements as a set of tuples $\mathcal{E} = \{e_{it}\}$ to represent how news spreads over time among $n$ users $\mathcal{U} = \{u_1, u_2, \ldots, u_n\}$ and their corresponding posts $\mathcal{P} = \{p_1, p_2, \ldots, p_n\}$ on social media regarding news article $a$. Each engagement $e_{it} = \{u_i, p_i, t\}$ represents that a user $u_i$ spreads news article $a$ using post $p_i$ at time $t$. Note that we set $t = \text{Null}$ if the article $a$ does not have any engagement yet, and thus $u_i$ represents the publisher.
Definition 2 (Fake News Detection): Given the social news engagements $\mathcal{E}$ among $n$ users for news article $a$, the task of fake news detection is to predict whether the news article $a$ is a fake news piece or not, i.e., $\mathcal{F}: \mathcal{E} \to \{0, 1\}$ such that:
$$\mathcal{F}(a) = \begin{cases} 1, & \text{if } a \text{ is a piece of fake news,} \\ 0, & \text{otherwise,} \end{cases}$$
where $\mathcal{F}$ is the prediction function we want to learn. Note that we define fake news detection as a binary classification problem for the following reason: fake news is essentially a distortion bias on information manipulated by the publisher. According to previous research about media bias theory, distortion bias is usually modeled as a binary classification problem.
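The problem definition above can be mirrored in code. The following is a minimal sketch, assuming each engagement is stored as a (user, post, time) record and the detector is any function from an article's engagements to a 0/1 label. The names `Engagement`, `Detector`, and `always_real` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Engagement:
    """One engagement record: who spread the article, with which post, and when."""
    user: str               # the spreading user (the publisher when time is None)
    post: Optional[str]     # the post used to spread the article, if any
    time: Optional[float]   # timestamp, or None when there is no engagement yet


# The prediction function maps the engagements of one article to 1 (fake) or 0 (not fake).
Detector = Callable[[Sequence[Engagement]], int]


def always_real(engagements: Sequence[Engagement]) -> int:
    """Hypothetical placeholder detector: labels every article as not fake."""
    return 0


engagements = [
    Engagement(user="publisher", post=None, time=None),
    Engagement(user="u1", post="Is this true?", time=1.0),
]
label = always_real(engagements)
```

Any learned model would replace `always_real` while keeping the same signature, which keeps the binary-classification framing explicit.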
Next, we propose a general data mining framework for fake news detection which includes two phases:
- (i) Feature extraction: The feature extraction phase aims to represent news content and related auxiliary information in a formal mathematical structure.
- (ii) Model construction: The model construction phase further builds machine learning models to better differentiate fake news and real news based on the feature representations.
Fake news detection on traditional news media mainly relies on news content, while in social media, extra social context can be used as auxiliary information to help detect fake news. Thus, we will present the details of how to extract and represent useful features from news content and social context.
News Content Features
News content features describe the meta information related to a piece of news. Representative news content attributes are listed below:
- Source: Author or publisher of the news article
- Headline: Short title text that aims to catch the attention of readers and describes the main topic of the article
- Body Text: Main text that elaborates the details of the news story; there is usually a major claim that is specifically highlighted and that shapes the angle of the publisher
- Image/Video: Part of the body content of a news article that provides visual cues to frame the story
Based on these raw content attributes, different kinds of feature representations can be built to extract discriminative characteristics of fake news. Typically, the news content we are looking at will mostly be linguistic-based and visual-based, described in more detail below.
Since fake news pieces are intentionally created for financial or political gain rather than to report objective claims, they often contain opinionated and inflammatory language, crafted as "clickbait" (i.e., to entice users to click on the link to read the full article) or to incite confusion. Thus, it is reasonable to exploit linguistic features that capture the different writing styles and sensational headlines to detect fake news.
Linguistic-based features are extracted from the text content in terms of document organizations from different levels, such as characters, words, sentences, and documents. In order to capture the different aspects of fake news and real news, existing work utilized both common linguistic features and domain-specific linguistic features.
Common linguistic features
Common linguistic features are often used to represent documents for various tasks in natural language processing. Typical common linguistic features are:
- (i) Lexical features: Including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words.
- (ii) Syntactic features: Including sentence-level features, such as frequency of function words and phrases (i.e., "n-grams" and bag-of-words approaches) or punctuation and parts-of-speech (POS) tagging.
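The lexical features listed above are simple enough to compute directly. Below is a minimal sketch using only the Python standard library. The tokenization rule, the `large_word_len` threshold of 7 characters, and the function names are assumptions made for illustration, not values from the paper.

```python
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", text.lower())


def lexical_features(text: str, large_word_len: int = 7) -> dict:
    """Compute total words, characters per word, large-word frequency, unique words."""
    words = tokenize(text)
    total = len(words)
    if total == 0:
        return {"total_words": 0, "chars_per_word": 0.0,
                "large_word_freq": 0.0, "unique_words": 0}
    return {
        "total_words": total,
        "chars_per_word": sum(len(w) for w in words) / total,
        # Share of words with at least `large_word_len` characters.
        "large_word_freq": sum(len(w) >= large_word_len for w in words) / total,
        "unique_words": len(set(words)),
    }


def word_ngrams(text: str, n: int = 2) -> Counter:
    """Count word n-grams, a simple phrase-level representation."""
    words = tokenize(text)
    return Counter(zip(*(words[i:] for i in range(n))))


features = lexical_features("Shocking news shocks the shocking world")
bigrams = word_ngrams("Shocking news shocks the shocking world")
```

POS tagging is left out because it needs an external tagger; its tag counts would be added to the same feature dictionary.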
Domain-specific linguistic features
These are specifically aligned to the news domain, such as quoted words, external links, number of graphs (paragraphs), and the average length of graphs.
Moreover, other features can be specifically designed to capture the deceptive cues in writing styles to differentiate fake news, such as lying-detection features.
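The domain-specific features mentioned above can be sketched in the same spirit. This example assumes that "graphs" means paragraphs, that paragraphs are separated by blank lines, and that quotes use straight double-quote characters. The function name `domain_features` is hypothetical.

```python
import re


def domain_features(article: str) -> dict:
    """Count quoted words, external links, paragraphs, and average paragraph length."""
    # Text between straight double quotes is treated as quoted material.
    quoted_spans = re.findall(r'"([^"]*)"', article)
    quoted_words = sum(len(span.split()) for span in quoted_spans)

    # Anything starting with http:// or https:// counts as an external link.
    links = re.findall(r"https?://\S+", article)

    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", article) if p.strip()]
    if paragraphs:
        avg_len = sum(len(p.split()) for p in paragraphs) / len(paragraphs)
    else:
        avg_len = 0.0

    return {
        "quoted_words": quoted_words,
        "external_links": len(links),
        "num_paragraphs": len(paragraphs),
        "avg_paragraph_len": avg_len,
    }


article = 'The mayor said "this is false" today.\n\nSee https://example.com/report for details.'
stats = domain_features(article)
```

Real articles would need more careful handling of typographic quotes and HTML markup, which this sketch deliberately ignores.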
Visual cues have been shown to be an important manipulator for fake news propaganda. As we have characterized, fake news exploits the individual vulnerabilities of people and thus often relies on sensational or even fake images to provoke anger or other emotional response of consumers. Visual-based features are extracted from visual elements (e.g. images and videos) to capture the different characteristics for fake news.
Fake images have been identified based on various user-level and tweet-level hand-crafted features using a classification framework. Recently, various visual and statistical features have been extracted for news verification:
- Visual features include:
- Clarity score
- Coherence score
- Similarity distribution histogram
- Diversity score
- Clustering score.
- Statistical features include count, image ratio, multi-image ratio, hot image ratio, long image ratio, etc.
Social Context Features
In addition to features related directly to the content of the news articles, additional social context features can also be derived from the user-driven social engagements of news consumption on social media platforms. Social engagements represent the news proliferation process over time, which provides useful auxiliary information to infer the veracity of news articles. Note that few papers exist in the literature that detect fake news using social context features. However, because we believe this is a critical aspect of successful fake news detection, we introduce a set of common features utilized in similar research areas, such as rumor veracity classification on social media.
Generally, there are three major aspects of the social media context that we want to represent:
- Users,
- Generated posts, and
- Networks.
Below, we investigate how we can extract and represent social context features from these three aspects to support fake news detection.
As we mentioned in Section 2.3, fake news pieces are likely to be created and spread by non-human accounts, such as social bots or cyborgs. Thus, capturing users’ profiles and characteristics by user-based features can provide useful information for fake news detection.
People express their emotions or opinions towards fake news through social media posts, such as skeptical opinions, sensational reactions, etc. Thus, it is reasonable to extract post-based features to help find potential fake news via reactions from the general public as expressed in posts.
Users form different networks on social media in terms of interests, topics, and relations. As mentioned before, fake news dissemination processes tend to form an echo chamber cycle, highlighting the value of extracting network-based features to represent these types of network patterns for fake news detection. Network-based features are extracted via constructing specific networks among the users who published related social media posts.
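As a rough illustration of network-based features, the sketch below builds an undirected network among the users who posted about one article and summarizes it with a few numbers. The choice of a follower-style edge list as input, the specific statistics, and the name `network_features` are all assumptions. The paper does not prescribe a particular network or feature set here.

```python
from collections import defaultdict


def network_features(posting_users, follows):
    """Summarize the network induced on users who posted about an article.

    posting_users: iterable of user ids who published related posts.
    follows: iterable of (user_a, user_b) relation pairs, treated as undirected.
    """
    users = set(posting_users)
    neighbors = defaultdict(set)
    for a, b in follows:
        # Keep only edges whose endpoints both posted about the article.
        if a in users and b in users and a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    n = len(users)
    # Each undirected edge appears in two neighbor sets.
    m = sum(len(v) for v in neighbors.values()) // 2
    return {
        "nodes": n,
        "edges": m,
        "density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "avg_degree": 2 * m / n if n else 0.0,
    }


net = network_features(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "d")])
```

A densely connected group of spreaders would show up as a high `density` value, which is one simple way to look for echo-chamber-like patterns.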
In the previous section, we introduced features extracted from different sources, i.e., news content and social context, for fake news detection. In this section, we discuss the details of the model construction process for several existing approaches. Specifically, we categorize existing methods based on their main input sources as: News Content Models and Social Context Models.
News Content Models
In this subsection, we focus on news content models, which mainly rely on news content features and existing factual sources to classify fake news. Specifically, existing approaches can be categorized as Knowledge-based and Style-based.
Knowledge-based approaches aim to use external sources to fact-check proposed claims in news content. The goal of fact-checking is to assign a truth value to a claim in a particular context. Fact-checking has attracted increasing attention, and many efforts have been made to develop a feasible automated fact-checking system.
Existing fact-checking approaches can be categorized as: expert-oriented, crowdsourcing-oriented, and computational-oriented.
Expert-oriented fact-checking heavily relies on human domain experts to investigate relevant data and documents to construct the verdicts of claim veracity, for example PolitiFact, Snopes, etc. However, expert-oriented fact-checking is an intellectually demanding and time-consuming process, which limits the potential for high efficiency and scalability.
Crowdsourcing-oriented fact-checking exploits the "wisdom of the crowd" to enable normal people to annotate news content; these annotations are then aggregated to produce an overall assessment of the news veracity. For example, Fiskkit allows users to discuss and annotate the accuracy of specific parts of a news article. As another example, an anti-fake news bot named "For real" is a public account in the instant communication mobile application LINE, which allows people to report suspicious news content which is then further checked by editors.
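The aggregation step in crowdsourcing-oriented fact-checking can be as simple as a majority vote. The sketch below is one possible aggregation rule, chosen for illustration. It is not how Fiskkit or the LINE bot actually work, and the function name `aggregate_annotations` is hypothetical.

```python
from collections import Counter


def aggregate_annotations(labels):
    """Return the majority label, or None when there is no clear majority."""
    counts = Counter(labels)
    if not counts:
        return None
    ranked = counts.most_common()  # sorted by count, highest first
    top_label, top_count = ranked[0]
    # A tie between the two most frequent labels gives no overall assessment.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None
    return top_label


verdict = aggregate_annotations(["false", "false", "true"])
```

Returning `None` on ties leaves room for escalating the item to an editor, similar in spirit to the editor check described above.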
Computational-oriented fact-checking aims to provide an automatic, scalable system to classify true and false claims. Previous computational-oriented fact-checking methods try to solve two major issues: (i) identifying check-worthy claims (that is, identifying the sentences that need to be checked) and (ii) discriminating the veracity of fact claims.
. . .
Style-based approaches try to detect fake news by capturing the manipulators in the writing style of news content. There are mainly two typical categories of style-based methods: Deception-oriented and Objectivity-oriented:
Deception-oriented stylometric methods capture deceptive (misleading) statements or claims from news content. The motivation of deception detection originates from forensic psychology (i.e., the Undeutsch Hypothesis), and various forensic tools including Criteria-based Content Analysis and Scientific-based Content Analysis have been developed.
More recently, advanced natural language processing models have been applied to spot deceptive phrases from the following perspectives: deep syntax and rhetorical structure.
- Deep syntax models have been implemented using probabilistic context-free grammar (PCFG), with which sentences can be transformed into rules that describe the syntax structure. Based on the PCFG, different rules can be developed for deception detection, such as unlexicalized/lexicalized production rules and grandparent rules.
- Rhetorical structure theory can be utilized to capture the differences between deceptive and truthful sentences.
- Deep network models, such as convolutional neural networks (CNN), have also been applied to classify fake news veracity.
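To make the deep syntax idea more concrete, the sketch below turns a parse tree into production-rule counts that could serve as features. It assumes the parse tree is already available as a bracketed string from some external parser, and it assumes that "unlexicalized" means rules without terminal words while "lexicalized" also includes the word-level rules. The function names are illustrative, and grandparent rules are not covered.

```python
from collections import Counter


def parse_tree(bracketed: str):
    """Parse '(S (NP (DT the) (NN cat)) (VP (VBD sat)))' into (label, children) tuples.

    Leaves (words) are kept as plain strings.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read()


def production_rules(tree, lexicalized: bool = False) -> Counter:
    """Count production rules; word-level rules are included only if lexicalized."""
    label, children = tree
    rules = Counter()
    if all(isinstance(c, str) for c in children):
        # Pre-terminal node such as (NN cat): a rule that rewrites to a word.
        if lexicalized:
            rules[f"{label} -> {' '.join(children)}"] += 1
        return rules
    rhs = " ".join(c[0] if isinstance(c, tuple) else c for c in children)
    rules[f"{label} -> {rhs}"] += 1
    for child in children:
        if isinstance(child, tuple):
            rules += production_rules(child, lexicalized)
    return rules


tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
unlex = production_rules(tree)
lex = production_rules(tree, lexicalized=True)
```

The resulting counts can be fed to any classifier as a sparse feature vector. Estimating rule probabilities, which is what makes the grammar probabilistic, is a separate step not shown here.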
Objectivity-oriented approaches capture style signals that can indicate a decreased objectivity of news content and thus the potential to mislead consumers, such as: hyperpartisan styles and yellow-journalism.
- Hyperpartisan styles represent extreme behavior in favor of a particular political party, which often correlates with a strong motivation to create fake news. Linguistic-based features can be applied to detect hyperpartisan articles.
- Yellow-journalism represents those articles that do not contain well-researched news, but instead rely on eye-catching headlines (i.e., clickbait) with a propensity for exaggeration, sensationalization, scare-mongering, etc.
Social Context Models
The nature of social media provides researchers with additional resources to supplement and enhance News Content Models. Social context models include relevant user social engagements in the analysis, capturing this auxiliary information from a variety of perspectives. We can classify existing approaches for social context modeling into two categories: Stance-based and Propagation-based. Note that very few existing fake news detection approaches have utilized social context models. Thus, we also introduce similar methods for rumor detection using social media, which have potential application for fake news detection.
Stance-based approaches utilize users' viewpoints from relevant post contents to infer the veracity of original news articles.
The stance of users' posts can be represented either explicitly or implicitly:
- Explicit stances are direct expressions of emotion or opinion, such as the "thumbs up" and "thumbs down" reactions expressed on Facebook.
- Implicit stances can be automatically extracted from social media posts. Stance detection is the task of automatically determining from a post whether the user is in favor of, neutral toward, or against some target entity, event, or idea.
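Once each post has a stance label, whether explicit or detected, the labels can be combined into a single signal per article. The sketch below averages stances on a -1 to 1 scale. The numeric mapping and the name `stance_score` are assumptions, and how such a score should relate to veracity is not specified in these notes.

```python
# Hypothetical numeric encoding of the three stance classes.
STANCE_VALUE = {"favor": 1, "neutral": 0, "against": -1}


def stance_score(stances) -> float:
    """Average stance toward an article: 1.0 is all in favor, -1.0 is all against."""
    stances = list(stances)
    if not stances:
        return 0.0
    return sum(STANCE_VALUE[s] for s in stances) / len(stances)


score = stance_score(["favor", "against", "against", "neutral"])
```

In this encoding, a "thumbs up" would map to `favor` and a "thumbs down" to `against`, so explicit and implicit stances can share one score.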