
Fake News Detection

CS3244 Group 19
Bernard Koh (A0139498J) | Fong Mei Yee Flora (A0158869A) | Haw Zhe Hao Elroy (A0155801L)
Ng Kheng Yi (A0155159Y) | Yong Lin Han (A0139498J) | Yuan Quan (A0160785X)
Problem
Imagine a world where you cannot be sure whether what you read is real or fake. We live in that world now, thanks to the Internet. Despite its endless possibilities, it has brought with it a wide spread of fake news that can be hard to identify at first glance. This can cause serious problems, whether in financial markets, where billions could be lost, or in politics, where decisions could be influenced. With the help of machine learning, we can develop models that help the general public detect whether a news article is real or fake before even reading it.

Data
Text processing: A pre-labelled data set containing the title and body of news articles was obtained from Kaggle. To extract useful information and reduce noise, we cleaned the data set using the NLTK library by first removing punctuation, non-English characters, and stop words such as “the”, “is” and “are”. We then converted all words to lowercase for uniformity.
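
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_text` is hypothetical, and the small inline stop-word set is an assumption standing in for NLTK's English stop-word list so that the example runs without any downloads.

```python
import re

# Tiny stand-in stop-word set (assumption); the project used NLTK's English list.
STOP_WORDS = {"the", "is", "are", "a", "an", "of", "and", "to", "in"}

def clean_text(text):
    """Drop punctuation and non-English characters, lowercase, remove stop words."""
    # Keep only ASCII letters and whitespace; everything else becomes a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    # Lowercase before filtering so capitalised stop words ("The") are caught too.
    tokens = letters_only.lower().split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(clean_text("The President is NOT resigning!!! 100%"))
# -> ['president', 'not', 'resigning']
```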

Feature selection and engineering: The title and body of the news article were used as the features for our models. We trained the models on each of them after applying various word embedding techniques. These techniques include frequency-based and prediction-based embedding. Count Vector and Term Frequency-Inverse Document Frequency (TF-IDF) were used for the former, and Word2Vec, which makes use of the Continuous Bag of Words and Skip-gram models, was used for the latter.
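
To make the two frequency-based embeddings concrete, here is a small sketch in plain Python. It uses the textbook weighting idf = log(N / df); library implementations typically add smoothing and normalisation, so their numbers will differ. The function names and the toy documents are assumptions for illustration, and Word2Vec is not covered here.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Return (vocabulary, one raw term-count vector per tokenised document)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors

def tfidf_vectors(docs):
    """Weight each term count by idf = log(N / document frequency)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return vocab, [[c * w for c, w in zip(row, idf)] for row in counts]

# Toy tokenised headlines (assumption, not taken from the Kaggle data set).
docs = [["aliens", "endorse", "senator"], ["senator", "signs", "bill"]]
vocab, counts = count_vectors(docs)
_, tfidf = tfidf_vectors(docs)
```

In this toy example "senator" appears in both documents, so its TF-IDF weight is zero, while words unique to one document keep a positive weight.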

Machine Learning Models


Naive Bayes Classifier: The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions between features. Because of its simplicity, it is a common technique for text classification. Specifically, Gaussian Naive Bayes was used in this project as it yielded the highest accuracy compared to its counterparts.
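
As a rough illustration of how Gaussian Naive Bayes works, the sketch below fits a per-class mean and variance for every feature and predicts the class with the highest log posterior. The function names, the variance floor, and the toy feature vectors are assumptions for illustration; this is not the project's code.

```python
import math

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the log prior and per-feature mean/variance for each class."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor  # floor avoids log(0)
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy two-feature vectors (assumption), e.g. weights of two indicative words.
X = [[1.0, 0.1], [1.2, 0.0], [0.1, 1.0], [0.0, 1.3]]
y = ["fake", "fake", "real", "real"]
nb_model = fit_gaussian_nb(X, y)
```

Summing per-feature log likelihoods is exactly where the independence assumption enters: each feature contributes on its own, regardless of the others.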

[Figure: Common fake words | Common real words]

Logistic Regression: Another commonly used model for binary text classification is Logistic Regression. It is fitted by Maximum Likelihood Estimation and uses the sigmoid function to output the probability that a news article is real or fake.
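
The sketch below shows one simple way to carry out that maximum-likelihood fit: batch gradient ascent on the log-likelihood, with the sigmoid turning a linear score into a probability. The learning rate, epoch count, function names, and toy data are assumptions; libraries normally use faster solvers and add regularisation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=500):
    """Maximise the Bernoulli log-likelihood with batch gradient ascent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, target in zip(X, y):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = target - p  # gradient of the log-likelihood w.r.t. the score
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w + lr * g / len(X) for w, g in zip(weights, grad_w)]
        bias += lr * grad_b / len(X)
    return weights, bias

def predict_proba(weights, bias, x):
    """Probability of the class labelled 1."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Toy data (assumption): label 1 = fake, 0 = real.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, 0, 0]
lr_weights, lr_bias = fit_logistic(X, y)
```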

Linear Support Vector Classifier: LinearSVC is similar to SVC with a linear kernel. This model seeks the decision boundary with the maximum margin, which allows for better generalization. This is beneficial for text classification since the test set may contain words that did not appear in the training set.
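
The maximum-margin idea can be illustrated with a linear classifier trained by subgradient descent on the regularised hinge loss. This is a swapped-in teaching technique, not the solver that LinearSVC itself uses, and the hyperparameters, function names, and toy data are assumptions.

```python
def fit_linear_svm(X, y, reg=0.01, lr=0.1, epochs=200):
    """Minimise reg/2 * ||w||^2 + mean hinge loss; labels must be +1 or -1."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [reg * w for w in weights]
        grad_b = 0.0
        for x, label in zip(X, y):
            margin = label * (bias + sum(w * v for w, v in zip(weights, x)))
            if margin < 1:  # point is inside the margin or misclassified
                grad_w = [g - label * v / n for g, v in zip(grad_w, x)]
                grad_b -= label / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

def decision(weights, bias, x):
    """Signed score; its sign is the predicted class."""
    return bias + sum(w * v for w, v in zip(weights, x))

# Toy data (assumption): +1 = fake, -1 = real.
X = [[2, 0], [1, 0], [0, 1], [0, 2]]
y = [1, 1, -1, -1]
svm_weights, svm_bias = fit_linear_svm(X, y)
```

Only points with a margin below 1 contribute to the update, so the boundary is shaped by the examples closest to it.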

Ensembles: Various ensembles were tested in the project, namely the Extra Trees, Adaptive Boosting, and Random Forest classifiers. Ensemble learning combines multiple learners with the aim of obtaining better accuracy in text classification than any single learner on its own.
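
As an illustration of one of these ensembles, here is a compact sketch of Adaptive Boosting with decision stumps (single-threshold rules) as the weak learners. The function names and the one-dimensional toy data are assumptions chosen so that no single stump can classify every point, while three boosted stumps can.

```python
import math

def best_stump(X, y, weights):
    """Find the (error, feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for threshold in sorted({x[j] for x in X}):
            for polarity in (1, -1):
                preds = [polarity if x[j] > threshold else -polarity for x in X]
                err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, threshold, polarity)
    return best

def fit_adaboost(X, y, rounds=3):
    """Labels must be +1 or -1. Returns a list of (alpha, feature, threshold, polarity)."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, threshold, polarity = best_stump(X, y, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, threshold, polarity))
        # Up-weight the examples this stump got wrong, then renormalise.
        preds = [polarity if x[j] > threshold else -polarity for x in X]
        weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict_adaboost(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x[j] > thr else -pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data (assumption): the middle points belong to the other class.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
boosted = fit_adaboost(X, y, rounds=3)
```

On this toy set the best single stump still misclassifies two of the six points, whereas the weighted vote of three stumps classifies all six correctly.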

Deep Learning Models


Recurrent Neural Network: This type of neural network deals with sequential data, which suits text classification since the words in a sentence are related to each other. Specifically, the Long Short-Term Memory (LSTM) was used in our project. Our network consists of an input layer, an embedding layer, a bidirectional LSTM layer, and a dense layer.
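
To show how data flows through such a network, the sketch below runs a forward pass only, in plain Python: token ids are looked up in an embedding table, read by one LSTM left to right and by another right to left, and the two final states are concatenated and passed to a dense sigmoid unit. The layer sizes, random untrained weights, function names, and the reading of the output as a "fake" probability are all assumptions; a real model would be built and trained with a deep learning framework.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_lstm(input_dim, hidden_dim, rng):
    """One weight row per hidden unit and gate: [input..., recurrent..., bias]."""
    size = input_dim + hidden_dim + 1
    return {
        gate: [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(hidden_dim)]
        for gate in ("input", "forget", "output", "candidate")
    }

def lstm_last_state(params, sequence, hidden_dim):
    """Run an LSTM over the sequence and return the final hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        features = x + h + [1.0]  # current input, previous hidden state, bias term
        pre = {
            gate: [sum(w * v for w, v in zip(row, features)) for row in rows]
            for gate, rows in params.items()
        }
        i_gate = [sigmoid(z) for z in pre["input"]]
        f_gate = [sigmoid(z) for z in pre["forget"]]
        o_gate = [sigmoid(z) for z in pre["output"]]
        cand = [math.tanh(z) for z in pre["candidate"]]
        # Cell state: keep part of the old memory, add part of the new candidate.
        c = [f * ck + i * g for f, ck, i, g in zip(f_gate, c, i_gate, cand)]
        h = [o * math.tanh(ck) for o, ck in zip(o_gate, c)]
    return h

def predict_probability(token_ids, embedding, fwd, bwd, dense, hidden_dim):
    """Embedding -> bidirectional LSTM -> dense sigmoid output."""
    vectors = [embedding[t] for t in token_ids]
    state = (lstm_last_state(fwd, vectors, hidden_dim)
             + lstm_last_state(bwd, vectors[::-1], hidden_dim))
    return sigmoid(sum(w * s for w, s in zip(dense, state + [1.0])))

rng = random.Random(0)
VOCAB, EMBED_DIM, HIDDEN_DIM = 10, 4, 3  # toy sizes, chosen arbitrarily
embedding = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(VOCAB)]
fwd = init_lstm(EMBED_DIM, HIDDEN_DIM, rng)
bwd = init_lstm(EMBED_DIM, HIDDEN_DIM, rng)
dense = [rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN_DIM + 1)]
prob = predict_probability([1, 5, 2, 7], embedding, fwd, bwd, dense, HIDDEN_DIM)
```

Because the weights are random and untrained, the resulting value is meaningless as a prediction; the point is only the order of the layers and the shape of what passes between them.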

Overall Evaluation
Overall, the model that yields the highest accuracy is the Recurrent Neural Network, specifically the LSTM, with over 96% accuracy on the test data. The text body of the article was used to achieve this. Since the body contains many more words than the title, it allowed the model to learn much more. On a side note, most of the other models presented above also achieve over 90% accuracy on the test data.

Future Works
The current implementation and models are trained on the US political data set provided by Kaggle. In the future, subject to data availability, we would ideally train them on data sets from around the world to see how the performance fares. This will give us a better idea of how good our model actually is. At the same time, we will also look at other variants of neural networks and experiment with different layers to further improve our results.
