CS3244 Group 19
Bernard Koh (A0139498J) | Fong Mei Yee Flora (A0158869A) | Haw Zhe Hao Elroy (A0155801L)
Ng Kheng Yi (A0155159Y) | Yong Lin Han (A0139498J) | Yuan Quan (A0160785X)
Abstract
Imagine a world where you are uncertain of whether what you read is real or fake. We live in that world now, thanks largely to the Internet. Despite its endless possibilities, it has brought with it a proliferation of fake news that can be hard to identify at first glance. This can cause serious problems, be it in the financial markets, where billions could be lost, or in politics, where decisions could be influenced. With the help of machine learning, we can develop models that help the public detect whether a news article is real or fake before even reading it.
Data
Text processing: A pre-labelled data set containing the title and body of news articles was obtained from Kaggle. To extract useful information and reduce noise, we cleaned the data set using the NLTK library, removing punctuation, non-English characters, and stop words such as "the" and "is".
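The cleaning step described above can be sketched as follows. This is a minimal illustration: the stop-word list here is a small hand-picked set for demonstration, whereas the project used NLTK's full English stop-word list.

```python
import re

# Illustrative stop-word set; the actual project used NLTK's English list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean_text(text: str) -> list:
    """Lower-case, strip punctuation and non-English characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only English letters and spaces
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The market IS crashing -- billions could be lost!"))
# -> ['market', 'crashing', 'billions', 'could', 'be', 'lost']
```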
Feature selection and engineering: The title and body of the news articles were used as the features for our models. We trained the models on each of them after applying various word-embedding techniques, both frequency based and prediction based. Count Vector and Term Frequency-Inverse Document Frequency (TF-IDF) were employed for the former, and Word2Vec, which makes use of Continuous Bag of Words, for the latter; such prediction-based embedding is highly beneficial for text classification since words may not have much meaning standalone. Among the models trained was a Support Vector Machine with a linear kernel, which seeks the optimal boundary line between the real and fake classes.
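The frequency-based embeddings mentioned above can be illustrated with a small pure-Python sketch. In practice scikit-learn's CountVectorizer and TfidfVectorizer would typically be used; the function and variable names below are our own, and the prediction-based Word2Vec embedding is omitted here.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Frequency-based embedding: each document becomes a vector of raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    return vocab, [[Counter(d.split())[w] for w in vocab] for d in docs]

def tfidf_vectors(docs):
    """TF-IDF: term frequency scaled by inverse document frequency,
    down-weighting words that appear in every document."""
    vocab, counts = count_vectors(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return vocab, [[c * i for c, i in zip(row, idf)] for row in counts]

docs = ["fake news spreads fast", "real news reports facts"]
vocab, tfidf = tfidf_vectors(docs)
# "news" appears in both documents, so its idf (and TF-IDF weight) is zero.
```

Note how TF-IDF zeroes out terms shared by all documents, leaving the discriminative words to drive classification.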
Results
Overall, the model that yields the highest accuracy is the Recurrent Neural Network, specifically the LSTM, with over 96% accuracy on the test data. The text body of the article was used to achieve this: since the body contains far more words than the title, it allowed the model to learn much more. Most of the other models presented above also achieve over 90% accuracy on the test data.

Future Works
The current project's implementation and models are trained on the US political data set provided by Kaggle. In the future, subject to data availability, we will ideally train them on various data sets from around the world to see how the performance fares; this will give us a better idea of how good our model actually is. At the same time, we will also look at other variants of neural networks and experiment with different layers to further improve our results.
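The LSTM architecture described in the results can be sketched as below, assuming Keras/TensorFlow. The vocabulary size, sequence length, and layer sizes here are illustrative placeholders, not the group's actual hyperparameters.

```python
# Minimal LSTM classifier sketch for real-vs-fake article bodies.
# All hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 5000, 200  # assumed vocabulary and padded sequence length

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),       # learned word embeddings
    layers.LSTM(64),                        # sequence model over the article body
    layers.Dense(1, activation="sigmoid"),  # probability the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy batch: 8 padded token-id sequences, just to show the prediction shape.
x = np.random.randint(0, VOCAB_SIZE, size=(8, MAX_LEN))
preds = model.predict(x, verbose=0)
print(preds.shape)  # (8, 1): one probability per article
```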