Exploring the Generalisability of Fake News Detection Using NLP | Realcode4you

realcode4you
May 19

1. Introduction & Problem Statement

In the age of digital communication, the rapid spread of misinformation - commonly known as fake news - has become a critical societal challenge. With the rise of social media and online news platforms, fabricated stories can go viral within minutes, influencing public opinion, political decisions, and even financial markets. Traditional fact-checking methods are often too slow to counter this fast-paced information flow, highlighting the urgent need for automated detection systems.

This project presents a data-driven approach to fake news detection using Natural Language Processing (NLP). By analyzing the textual content of news articles, we apply both machine learning and deep learning techniques to classify news as either real or fake. The goal is to evaluate the effectiveness of classical algorithms like Logistic Regression and advanced models like LSTM and Bidirectional LSTM in identifying deceptive content based solely on linguistic patterns.

2. Exploratory Data Analysis (EDA)

The data comes in two files, True.csv and fake.csv. We loaded each into its own dataframe and inspected the first 5 rows of each, as shown below:

The True dataframe has 21,417 rows and the Fake dataframe has 23,481 rows, each with the same 4 columns. We then combined the two into a single dataset for processing.
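The loading and combining step can be sketched with the standard library alone. This is a minimal illustration, not the project's own code: the column names (title, text, subject, date), the 0/1 label encoding, and the helper names `load_labelled` and `combine` are all assumptions, and tiny in-memory strings stand in for the real files.

```python
import csv
import io
import random

# Tiny in-memory stand-ins for True.csv and fake.csv. The column names
# (title, text, subject, date) are assumed, not taken from the article.
TRUE_CSV = """title,text,subject,date
Budget vote,"Lawmakers said the vote is set for Friday.",politicsNews,"December 1, 2017"
Trade talks,"Officials said talks will resume.",worldnews,"December 2, 2017"
"""
FAKE_CSV = """title,text,subject,date
Shocking claim,"You will not believe what happened.",News,"December 3, 2017"
"""


def load_labelled(handle, label):
    """Read one CSV file and tag every row with a class label."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        row["label"] = label
    return rows


def combine(true_handle, fake_handle, seed=42):
    """Merge both files into one shuffled list (assumed: 0 = true, 1 = fake)."""
    rows = load_labelled(true_handle, 0) + load_labelled(fake_handle, 1)
    random.Random(seed).shuffle(rows)
    return rows


news = combine(io.StringIO(TRUE_CSV), io.StringIO(FAKE_CSV))
print(len(news), sorted(row["label"] for row in news))
```

With the real files, the `io.StringIO` objects would be replaced by file handles such as `open("True.csv", newline="", encoding="utf-8")`.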

Categories of News:

The graph above shows that the True news dataset has only two categories, political news and world news, while the Fake news dataset has several, such as Politics, Government, US and Middle East news. By our analysis, these categories show too little variation to help predict whether a news item is fake or true.

Wordcloud of News:

TRUE News commonly uses words like "said," "reuters," "government," indicating formal journalistic tone.
FAKE News is rich in emotionally charged or politically sensitive words like "clinton," "obama," "people," suggesting informal or biased language.

This supports the hypothesis that word patterns can help distinguish fake from real news.

Top Words in TRUE News

The most frequent word is "said" (99,017 times), reflecting the journalistic practice of quoting sources and reporting verified statements.
Words like "reuters," "president," "state," "government," and "house" indicate references to credible news agencies, formal institutions, and structured reporting.
The presence of "us" and "trump" highlights a focus on U.S. political reporting, a common theme in real news content.

Top Words in FAKE News

Interestingly, "trump" is even more frequent in fake news (73,422 times), suggesting that fake news often capitalizes on politically charged figures to gain attention.
Terms like "people," "one," "like," "clinton," and "obama" suggest a more informal tone or focus on emotionally driven content.
Compared to TRUE news, FAKE news features fewer mentions of official sources or news agencies, indicating a possible lack of citation and credibility.
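Word counts like those quoted above can be produced with a simple frequency count. The sketch below is illustrative only: the stop-word list, the regular-expression tokenizer, and the helper name `top_words` are assumptions, since the article does not say how the text was cleaned before counting.

```python
import re
from collections import Counter

# A short, illustrative stop-word list; the article does not say which one it used.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "is", "that", "for"}


def top_words(texts, n=3):
    """Return the n most frequent non-stop-words across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


sample_true = [
    "The president said the government will act, Reuters reported.",
    "A spokesman said the house vote is expected on Friday.",
]
print(top_words(sample_true, n=1))  # [('said', 2)]
```

Running the same function separately over the TRUE and FAKE article texts would give two ranked lists that can be compared side by side.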

Key Observations

Both real and fake news discuss politics extensively, especially U.S. politics, but real news emphasizes facts and sources, while fake news leans toward opinionated or speculative content.
The word "said" is more dominant in TRUE news, supporting the idea that factual reporting often includes attributed quotes.
Fake news appears to use more personalized or sensational language, which could influence reader emotions and engagement.

3. State-of-the-Art Techniques in Fake News Detection

In recent years, detecting fake news has become a major application of Natural Language Processing (NLP) and machine learning. Widely used methods include traditional ML models, such as Logistic Regression with TF-IDF or bag-of-words features, and recurrent neural network (RNN) models.

In this project, we implemented both traditional machine learning and modern deep learning approaches for fake news detection. Our goal was to evaluate how well these methods can classify news as real or fake using natural language processing (NLP) techniques.
We began with a Logistic Regression model, a widely used baseline in text classification tasks. Using TF-IDF (Term Frequency-Inverse Document Frequency) for feature extraction, the model achieved high accuracy and provided interpretable results. This approach reflects early state-of-the-art methods in fake news detection and serves as a strong benchmark for comparison.
To incorporate more advanced techniques, we implemented a Long Short-Term Memory (LSTM) neural network. LSTM is a type of recurrent neural network (RNN) designed to capture long-term dependencies in sequential data, making it suitable for analyzing news articles and understanding word context.
Further, we built a Bidirectional LSTM (BiLSTM) model to enhance performance. Unlike standard LSTM, the BiLSTM reads the text in both forward and backward directions, allowing it to capture richer contextual information. This approach aligns with recent research in NLP, where BiLSTM has shown promising results in tasks like sentiment analysis, fake news detection, and spam filtering. By combining classical machine learning and modern deep learning models, our project demonstrates a relevant and practical approach aligned with current advancements in the field of fake news classification.

4. Relevance of the Proposed Model

The selection of models in this project is guided by both effectiveness and relevance to the fake news classification task. We proposed three models: Logistic Regression with TF-IDF, LSTM, and Bidirectional LSTM, each offering distinct advantages.

The Logistic Regression model, when combined with TF-IDF vectorization, is highly interpretable and computationally efficient. It is particularly relevant for tasks involving textual data and binary classification, making it a suitable starting point for fake news detection.
To capture more complex patterns in language, we incorporated LSTM (Long Short-Term Memory), which is capable of learning long-range dependencies in sequences. This model is well-suited for analyzing article content where the order and context of words play a crucial role in determining authenticity.
To further improve context awareness, we implemented a Bidirectional LSTM (BiLSTM). This model processes input sequences from both directions, allowing it to understand not only past but also future context in a sentence. This is especially important in fake news detection, where subtle cues in language can alter meaning.
Overall, the models proposed in this project are well-aligned with the nature of the task and represent a balanced mix of traditional and deep learning approaches, each contributing to a more comprehensive solution.

5. Implementation of the Model

We implemented three different models using Python, Scikit-learn, and TensorFlow/Keras libraries.

1. Logistic Regression with TF-IDF:

- We applied TF-IDF vectorization to convert text data into numerical features.
- The Logistic Regression model was trained on this vectorized dataset and evaluated using accuracy, precision, recall, and F1-score.
- It achieved an accuracy of 98.73%, indicating strong performance on the classification task.
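To make the TF-IDF step concrete, here is a plain-Python sketch of the weighting. It is not the scikit-learn code the project ran. The whitespace tokenizer, the smoothed IDF formula ln((1 + n) / (1 + df)) + 1, and the L2 normalisation are choices made for this example, and `tfidf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


docs = [
    "senate said vote delayed",
    "shocking vote scandal exposed",
    "senate said budget approved",
]
vectors = tfidf(docs)
# "delayed" appears in one document, "vote" in two, so "delayed" weighs more.
print(vectors[0]["delayed"] > vectors[0]["vote"])  # True
```

Terms that occur in few documents receive higher weights, and a linear classifier such as Logistic Regression then learns one coefficient per term.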

2. LSTM (Long Short-Term Memory):

We used Keras to build an LSTM-based model, starting with an Embedding layer followed by an LSTM layer and a Dense output layer with a sigmoid activation.
Text data was pre-processed using tokenization and padding to ensure uniform input length.
After training for two epochs, the model achieved a test accuracy of 98.56%, showing that it could effectively capture sequential patterns in text.
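The tokenization and padding step mentioned above can be sketched as follows. This is a stdlib-only illustration rather than the Keras utilities the project used: the reserved ids, the helper names `build_vocab` and `encode`, and the choice to pad at the end are assumptions (Keras's own padding utility pads at the front by default).

```python
from collections import Counter

PAD, OOV = 0, 1  # reserved ids: padding and out-of-vocabulary (assumed convention)


def build_vocab(texts, max_words):
    """Map the most frequent words to integer ids, starting at 2."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = counts.most_common(max_words - 2)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=2)}


def encode(text, vocab, max_len):
    """Turn a text into a fixed-length list of ids (truncate, then right-pad)."""
    ids = [vocab.get(word, OOV) for word in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


texts = ["trump said the vote was delayed", "the vote was rigged"]
vocab = build_vocab(texts, max_words=10_000)
batch = [encode(t, vocab, max_len=5) for t in texts]
print([len(row) for row in batch])  # [5, 5]
```

Every article ends up as a list of the same length, which is what lets the Embedding layer receive a rectangular batch.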

LSTM Model Summary

Embedding Layer: Converts each word into a 128-dimensional dense vector using a vocabulary of 10,000 words (128 * 10,000 = 1,280,000 parameters).
LSTM Layer: Learns temporal dependencies in the sequence data using 128 memory units.
- Handles sequential patterns and context well, especially useful for fake news detection.
Dense Layer: A single neuron with a sigmoid activation to output the probability of the news being fake or real.
Strength: Captures long-term dependencies and context in the text.
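The parameter counts behind this summary can be checked with a little arithmetic. The sketch assumes a standard LSTM cell with four gates and bias terms, and assumes the LSTM passes only its final 128-dimensional state to the Dense layer; neither detail is stated in the article, and the function names are illustrative.

```python
def embedding_params(vocab_size, embed_dim):
    """One dense vector of size embed_dim per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, outputs):
    """One weight per input-output pair plus one bias per output."""
    return input_dim * outputs + outputs


layers = {
    "embedding": embedding_params(10_000, 128),
    "lstm": lstm_params(128, 128),
    "dense": dense_params(128, 1),
}
print(layers, sum(layers.values()))
```

Under these assumptions the Embedding layer accounts for the large majority of the trainable parameters.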

3. Bidirectional LSTM:

We implemented BiLSTM, a more advanced variant of LSTM, by wrapping the LSTM layer in a Bidirectional layer.
This model was trained similarly and evaluated on the same dataset.
It achieved a test accuracy of 98.91%, slightly above the unidirectional LSTM, which suggests that bidirectional context helps on this dataset.

Bidirectional LSTM (BiLSTM) Model Summary

Embedding Layer: Same as above - converts words to dense vectors.
Bidirectional LSTM Layer: Consists of two LSTM layers running in forward and backward directions, doubling the output size to 256 units.
- Allows the model to understand context from both past and future words in the sentence.
Dense Layer: Final sigmoid layer for binary classification.
Strength: More powerful than standard LSTM in understanding deeper context due to dual-directional processing.
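The doubling of the output size can be seen in a toy example. To keep it short, a plain tanh recurrent cell with made-up fixed weights stands in for the LSTM, and concatenation is assumed as the way the two directions are merged (which the 256-unit figure implies). This shows the wiring only, not the project's model.

```python
import math


def run_rnn(sequence, units, reverse=False):
    """Run a toy tanh recurrent cell over a sequence; return the final state."""
    steps = reversed(sequence) if reverse else sequence
    state = [0.0] * units
    for x in steps:
        # Fixed, made-up weights keep the example deterministic.
        state = [math.tanh(0.5 * x + 0.1 * (i + 1) * state[i]) for i in range(units)]
    return state


def bidirectional(sequence, units):
    """Concatenate the forward and backward final states."""
    return run_rnn(sequence, units) + run_rnn(sequence, units, reverse=True)


sequence = [0.2, -0.4, 0.9, 0.1]
forward = run_rnn(sequence, units=128)
both = bidirectional(sequence, units=128)
print(len(forward), len(both))  # 128 256
```

The first half of the combined vector summarises the text read left to right and the second half summarises it read right to left, so the final Dense layer sees both views.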

Each model was carefully tuned and evaluated using the same dataset split, allowing fair comparison and consistent benchmarking.

6. Result Analysis

Logistic Regression with TF-IDF

The Logistic Regression model achieved an impressive accuracy of 98.73%, showcasing its strength in handling text classification with high precision. TF-IDF effectively transformed raw text into weighted feature vectors that highlight important terms.

Strengths:
- Fast training and low resource consumption.
- Highly interpretable with simple implementation.
- Performs well with large datasets and sparse features.

- Correctly Predicted Real News (True Negatives): 4654
- Correctly Predicted Fake News (True Positives): 4212
- False Positives (Real news wrongly predicted as fake): 63
- False Negatives (Fake news wrongly predicted as real): 51

Logistic Regression made very few mistakes, misclassifying only 114 out of 8980 instances. It maintained a strong balance between identifying both real and fake news accurately. Despite its high performance, LR lacks the capability to understand word sequences or context, limiting its adaptability to complex language patterns.
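The evaluation metrics can be recomputed directly from the four counts reported above. The sketch treats fake news as the positive class, as the True Positives label suggests; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 for the positive class."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Logistic Regression counts reported above.
lr = classification_metrics(tp=4212, tn=4654, fp=63, fn=51)
print({name: round(value, 4) for name, value in lr.items()})
```

The accuracy works out to 8866 / 8980, which rounds to the 98.73% quoted for this model.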

LSTM Model: A deep learning model using sequential word data.

The LSTM model captured sequential dependencies in text and achieved a test accuracy of 98.56% after two epochs, slightly below Logistic Regression's 98.73%. Unlike LR, it takes word order into account, which can help with nuanced sentences.

Strengths:
- Learns temporal relationships between words.
- Good for capturing semantic meaning over sequences.
- More robust to varied writing styles and syntax.

- Correctly Predicted Real News: 4612 & Fake News: 4239
- False Positives: 105 & False Negatives: 24

Compared with Logistic Regression, LSTM did better at recognizing fake news (24 false negatives versus 51), though it produced more false positives (105 versus 63). While LSTM introduced longer training times, it may generalize better in edge cases where word order is critical for classification.

Bidirectional LSTM (Bi-LSTM):

The BiLSTM model achieved the highest test accuracy of 98.91%. By reading the input text in both forward and backward directions, it leveraged richer context, allowing better prediction especially in complex or ambiguous statements.

Strengths:
- Captures both past and future word context.
- Best performance among all models.
- Effective in detecting subtle fake/real indicators in content.

- Correctly Predicted Real News: 4652
- Correctly Predicted Fake News: 4232
- False Positives: 65 & False Negatives: 31

BiLSTM performed the best overall, achieving the highest number of correctly predicted instances (8884 out of 8980). It balanced both real and fake news classification well, leveraging contextual understanding in both directions. The trade-off was increased training time and model complexity, but it paid off with the best predictive performance.

Model Comparison Table

Model	Accuracy (%)	FP	FN	Key Strength
Logistic Regression	98.73	63	51	Fast, interpretable, works well on TF-IDF
LSTM	98.56	105	24	Captures sequential patterns in text
Bidirectional LSTM	98.91	65	31	Leverages context from both directions
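As a cross-check, the accuracies in the table can be recomputed from the confusion-matrix counts given for each model. This is only arithmetic on the reported numbers. By these counts the BiLSTM figure comes out at about 98.93%, marginally above the 98.91% quoted, and the gap between BiLSTM and Logistic Regression amounts to 18 articles out of 8980.

```python
# (correct real, correct fake, false positives, false negatives) as reported above
results = {
    "Logistic Regression": (4654, 4212, 63, 51),
    "LSTM": (4612, 4239, 105, 24),
    "Bidirectional LSTM": (4652, 4232, 65, 31),
}

summary = {}
for model, (real_ok, fake_ok, fp, fn) in results.items():
    total = real_ok + fake_ok + fp + fn
    summary[model] = {
        "total": total,
        "errors": fp + fn,
        "accuracy_pct": round(100 * (real_ok + fake_ok) / total, 2),
    }

for model, row in summary.items():
    print(f"{model:<20} {row['total']:>5} {row['errors']:>4} {row['accuracy_pct']:>6}")
```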

Conclusion

All three models performed exceptionally well, with accuracies above 98%. Logistic Regression with TF-IDF proved to be a strong baseline due to its simplicity and speed, and it scored slightly above the unidirectional LSTM. BiLSTM achieved the highest accuracy, which suggests that bidirectional context helps capture language structure and subtle cues. Given its balanced error profile and contextual learning capability, BiLSTM is the most suitable of the three models for real-world fake news detection, particularly in dynamic and linguistically diverse environments.

References

Malarkodi, C.S. (n.d.). ISOT Fake News Dataset. [online] Kaggle. Available at: https://www.kaggle.com/datasets/csmalarkodi/isot-fake-news-dataset/data?select=True.csv [Accessed 2 May 2025].
Alsmadi, I., & Obeid, N. (2023). Fake News Detection Datasets: A Review and Research Opportunities. [online] Brunel University London. Available at: https://bura.brunel.ac.uk/bitstream/2438/25909/1/FullText.pdf [Accessed 2 May 2025].
Chollet, F. (2015). Keras. [online] Available at: https://keras.io [Accessed 2 May 2025].
