CS3244 Group 19
Bernard Koh (A0139498J) | Fong Mei Yee Flora (A0158869A) | Haw Zhe Hao Elroy (A0155801L)
Ng Kheng Yi (A0155159Y) | Yong Lin Han (A0139498J) | Yuan Quan (A0160785X)
Problem
Imagine a world where you cannot be sure whether what you read is real or fake. We live in that world now, largely because of the Internet. Despite its
endless possibilities, it has brought with it a flood of fake news that can be hard to spot at first glance. This can cause serious
problems, whether in financial markets, where billions could be lost, or in politics, where decisions could be influenced. With the help of machine
learning, we can develop models that help the vast majority of readers detect whether a news article is real or fake before even reading it.
Data
Text processing: A pre-labelled data set containing the title and body of news articles was obtained from Kaggle. To extract useful information and reduce
noise, we cleaned the data set using the NLTK library by first removing punctuation, non-English characters, and stop words such as “the”, “is” and
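The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the group's actual pipeline: the function names `load_stop_words` and `clean_text` are hypothetical, dropping non-ASCII characters is only a rough stand-in for "removing non-English characters", and a small fallback stop-word list is used when NLTK or its stop-word corpus is not installed.

```python
import string

# Tiny fallback list (an assumption for illustration), used only when
# NLTK or its stop-word corpus is unavailable.
FALLBACK_STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in"}


def load_stop_words():
    """Return NLTK's English stop words, or the small fallback set."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        # ImportError: NLTK not installed; LookupError: corpus not downloaded.
        return set(FALLBACK_STOP_WORDS)


def clean_text(text, stop_words):
    """Lowercase, strip non-ASCII characters and punctuation, drop stop words."""
    text = text.lower()
    # Rough proxy for removing non-English characters: keep ASCII only.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace and filter out stop words.
    return [tok for tok in text.split() if tok not in stop_words]


if __name__ == "__main__":
    stop_words = load_stop_words()
    print(clean_text("The news 新闻 is fake!", stop_words))
```

The same function would be applied to both the title and the body of each article before any embedding is computed.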
Feature selection and engineering: The title and body of each news article were used as the features for our models. We trained the models on each of
them after applying various word embedding techniques. These techniques include frequency-based and prediction-based embedding. Count Vector and
Term Frequency-Inverse Document Frequency (TF-IDF) were employed for the former, and Word2Vec, which makes use of Continuous Bag of Words
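To make the two frequency-based embeddings concrete, here is a small pure-Python sketch of a count vector and a TF-IDF vector. It assumes the textbook weighting, tf = count / document length and idf = log(N / df); library implementations often use smoothed or normalised variants, so exact values may differ from what the group's tooling produced. The function names `count_vectors` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def count_vectors(docs):
    """Map each tokenised document to raw term counts over a shared vocabulary."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Weight term frequencies by inverse document frequency (unsmoothed)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    vectors = []
    for doc, row in zip(docs, counts):
        total = len(doc) or 1  # guard against empty documents
        vectors.append([(c / total) * w for c, w in zip(row, idf)])
    return vocab, vectors


if __name__ == "__main__":
    docs = [["fake", "news", "fake"], ["real", "news"]]
    print(count_vectors(docs))
    print(tfidf_vectors(docs))
```

In this toy corpus, "news" appears in every document, so its TF-IDF weight is zero, while "fake" and "real" keep positive weights. That is the intended effect: terms shared by all articles carry no discriminating signal.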
with a linear kernel. This model seeks the optimal boundary line
is highly beneficial for text classification since words may not have
standalone.
Future Works
The current project's implementation and models are trained on the US