Scalable Fake News Detection: Implementing NLP and Embedding Models for Large-Scale Data
Abstract
This paper describes a scalable approach to fake news detection that employs Natural Language Processing and word embedding models on large datasets. The work combines several embedding techniques (Bag of Words, TF-IDF, Word2Vec, and Bidirectional Encoder Representations from Transformers, BERT) to capture not only word frequency but also semantic relations within news articles. These embeddings are paired with machine learning classifiers, including logistic regression, random forests, and neural networks, to evaluate how the different models perform. The resulting system relies on distributed processing frameworks to handle large volumes of data and to enable large-scale model training. Experiments on widely adopted fake news datasets, including PolitiFact and the LIAR dataset, show strong classification results, particularly when employing deep learning-based embeddings such as BERT, which outperform traditional methods in accuracy and recall. We also investigate the effect of text preprocessing methods (e.g., stop-word removal and tokenization) on classification results. Our findings highlight the trade-offs involved in deploying large-scale fake news detection systems, balancing model complexity against computational efficiency.
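As a rough illustration of the embedding-plus-classifier pairing the abstract describes (not the authors' exact implementation), the following minimal Python sketch combines TF-IDF features with a logistic regression classifier and applies the kind of preprocessing discussed above; the toy corpus and labels are hypothetical stand-ins for the LIAR or PolitiFact data.

```python
# Illustrative sketch only: TF-IDF embedding + logistic regression classifier,
# one of the pairings evaluated in the paper. The corpus below is a made-up
# placeholder; in practice the LIAR/PolitiFact splits would be loaded instead.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical toy articles and labels (0 = real, 1 = fake).
texts = [
    "Senator claims unemployment fell to a record low last year",
    "Miracle cure hidden by doctors revealed in leaked memo",
    "City council approves new budget for public transit expansion",
    "Celebrity secretly replaced by body double, insiders say",
]
labels = [0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Stop-word removal and tokenization are handled inside the vectorizer,
# mirroring the preprocessing choices whose impact the paper examines.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("recall:", recall_score(y_test, preds))
```

Swapping the TF-IDF step for Word2Vec or BERT sentence embeddings, and the classifier for a random forest or neural network, yields the other configurations the paper compares; the distributed, large-scale training infrastructure is outside the scope of this sketch.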