CS3244 Group 19
Bernard Koh (A0139498J) | Fong Mei Yee Flora (A0158869A) | Haw Zhe Hao Elroy (A0155801L)
Ng Kheng Yi (A0155159Y) | Yong Lin Han (A0139498J) | Yuan Quan (A0160785X)
Abstract
Imagine a world where you are uncertain of whether what you read is real or fake. We live in that world now, thanks largely to the Internet. Despite its endless possibilities, it has brought with it a proliferation of fake news that can be hard to identify at first glance. This can cause serious problems, be it in the financial markets, where billions could be lost, or in politics, where decisions could be influenced. With the help of machine learning, we can develop models that help the public detect whether a news article is real or fake before even reading it.
Data
Text processing: A pre-labelled data set containing the title and body of news articles was obtained from Kaggle. To extract useful information and reduce noise, we cleaned the data set using the NLTK library, removing punctuation, non-English characters, and stop words such as "the" and "is".
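The cleaning step described above can be sketched as follows. This is a minimal illustration: the stop-word list here is a small hand-picked set for demonstration, whereas the project used NLTK's full English stop-word list.

```python
import re

# Illustrative stop-word set; the actual project used NLTK's English list.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}

def clean_text(text: str) -> list:
    """Lower-case, strip punctuation and non-English characters, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only English letters and spaces
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(clean_text("The market IS crashing -- billions could be lost!"))
# -> ['market', 'crashing', 'billions', 'could', 'be', 'lost']
```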
Feature selection and engineering: The title and body of the news articles were used as the features for our models. We trained the models on each of them after applying various word-embedding techniques, both frequency based and prediction based. Count Vector and Term Frequency-Inverse Document Frequency (TF-IDF) were employed for the former, and Word2Vec, which makes use of Continuous Bag of Words, for the latter; such prediction-based embedding is highly beneficial for text classification since words may not have much meaning standalone. Among the models trained was a Support Vector Machine with a linear kernel, which seeks the optimal boundary line between the real and fake classes.
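The frequency-based embeddings mentioned above can be illustrated with a small pure-Python sketch. In practice scikit-learn's CountVectorizer and TfidfVectorizer would typically be used; the function and variable names below are our own, and the prediction-based Word2Vec embedding is omitted here.

```python
import math
from collections import Counter

def count_vectors(docs):
    """Frequency-based embedding: each document becomes a vector of raw term counts."""
    vocab = sorted({w for d in docs for w in d.split()})
    return vocab, [[Counter(d.split())[w] for w in vocab] for d in docs]

def tfidf_vectors(docs):
    """TF-IDF: term frequency scaled by inverse document frequency,
    down-weighting words that appear in every document."""
    vocab, counts = count_vectors(docs)
    n = len(docs)
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return vocab, [[c * i for c, i in zip(row, idf)] for row in counts]

docs = ["fake news spreads fast", "real news reports facts"]
vocab, tfidf = tfidf_vectors(docs)
# "news" appears in both documents, so its idf (and TF-IDF weight) is zero.
```

Note how TF-IDF zeroes out terms shared by all documents, leaving the discriminative words to drive classification.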
Results
Overall, the model that yields the highest accuracy is the Recurrent Neural Network, specifically the LSTM, with over 96% accuracy on the test data. The text body of the article was used to achieve this: since the body contains far more words than the title, it allowed the model to learn much more. Most of the other models presented above also achieve over 90% accuracy on the test data.

Future Works
The current project's implementation and models are trained on the US political data set provided by Kaggle. In the future, subject to data availability, we will ideally train them on various data sets from around the world to see how the performance fares; this will give us a better idea of how good our model actually is. At the same time, we will also look at other variants of neural networks and experiment with different layers to further improve our results.
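The LSTM architecture described in the results can be sketched as below, assuming Keras/TensorFlow. The vocabulary size, sequence length, and layer sizes here are illustrative placeholders, not the group's actual hyperparameters.

```python
# Minimal LSTM classifier sketch for real-vs-fake article bodies.
# All hyperparameters are illustrative assumptions.
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN = 5000, 200  # assumed vocabulary and padded sequence length

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),       # learned word embeddings
    layers.LSTM(64),                        # sequence model over the article body
    layers.Dense(1, activation="sigmoid"),  # probability the article is fake
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy batch: 8 padded token-id sequences, just to show the prediction shape.
x = np.random.randint(0, VOCAB_SIZE, size=(8, MAX_LEN))
preds = model.predict(x, verbose=0)
print(preds.shape)  # (8, 1): one probability per article
```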