Arabic Tweet NLP Project

The goal of this project is to analyze Arabic tweets and classify them into categories using NLP techniques. To do this, we use a Random Forest Classifier and compare the performance of three feature extraction techniques: TF-IDF, Count Vectorizer, and Binary Encoding.

Dataset

The dataset is a collection of Arabic tweets gathered from various sources, including Twitter, news websites, and blogs. It contains 10,000 tweets, each labeled with one of five categories: Positive, Negative, Neutral, News, and Other. The dataset is preprocessed and cleaned before being used for training and testing the models.

Preprocessing

Before the data can be used for training and testing the models, it needs to be preprocessed and cleaned. The following steps are taken in the preprocessing phase:

Tokenization:

The tweets are tokenized using the Arabic language tokenizer provided by the Natural Language Toolkit (NLTK).

Normalization:

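The tweets are normalized by removing diacritics and converting all text to lowercase. Arabic script has no letter case, so lowercasing only affects Latin-script text embedded in the tweets, such as hashtags, mentions, and URLs.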
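The normalization step can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `normalize` is hypothetical, and the choice to strip the tatweel elongation mark along with the diacritics is an assumption.

```python
import re

# Arabic diacritics (tashkeel): fathatan through sukun (U+064B-U+0652) and the
# superscript alef (U+0670). Tatweel (U+0640) is an elongation mark rather than
# a diacritic; stripping it here is an assumption, not something the project states.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def normalize(text: str) -> str:
    """Remove Arabic diacritics and lowercase any Latin-script characters."""
    return _DIACRITICS.sub("", text).lower()
```

Calling `normalize` on a vocalized word followed by a Latin word returns the bare Arabic letters and the Latin word in lowercase.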

Stopwords Removal:

Stopwords are removed from the tweets using the Arabic stopword list from the NLTK library.

Stemming:

The tweets are stemmed using the Arabic Snowball Stemmer.
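The stopword-removal step described above can be sketched as a filter over the token list. The five-word set below is a tiny illustrative sample, not the NLTK list the project uses, and `remove_stopwords` is a hypothetical helper name.

```python
# A tiny illustrative stopword set (assumption); the project itself uses the
# Arabic stopword list shipped with NLTK.
SAMPLE_STOPWORDS = {"في", "من", "على", "إلى", "عن"}


def remove_stopwords(tokens, stopwords=SAMPLE_STOPWORDS):
    """Return the tokens that are not stopwords, preserving their order."""
    return [tok for tok in tokens if tok not in stopwords]


# Whitespace splitting stands in for the real tokenizer in this sketch.
tokens = "ذهب الولد إلى المدرسة في الصباح".split()
filtered = remove_stopwords(tokens)
```

Using a set for the stopwords keeps each membership check constant-time, which matters once the list grows to several hundred entries.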

Feature Extraction

After the preprocessing phase, the tweets are converted into a numerical representation that can be used as input for the Random Forest Classifier model. Three different feature extraction techniques are used:
- TF-IDF
- Count Vectorizer
- Binary Encoding
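To make the difference between the three representations concrete, here is a pure-Python sketch of what each one computes for a tokenized document. All function names are hypothetical. The TF-IDF shown is the plain textbook form (count times log of N over document frequency); library implementations such as scikit-learn's apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    return {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}


def count_vector(doc, vocab):
    """Raw term counts: how many times each vocabulary term occurs."""
    vec = [0] * len(vocab)
    for tok, n in Counter(doc).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def binary_vector(doc, vocab):
    """1 if the term occurs in the document at all, otherwise 0."""
    return [1 if c > 0 else 0 for c in count_vector(doc, vocab)]


def tfidf_vectors(docs, vocab):
    """Textbook TF-IDF: term count * log(N / document frequency).

    Assumes vocab was built from these same docs, so every term has df >= 1.
    """
    n_docs = len(docs)
    df = Counter(tok for d in docs for tok in set(d))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for d in docs:
        counts = count_vector(d, vocab)
        rows.append([counts[i] * idf[tok] for tok, i in vocab.items()])
    return rows


# Two toy "tweets", already tokenized.
docs = [["خبر", "عاجل", "خبر"], ["عاجل", "رياضة"]]
vocab = build_vocabulary(docs)
counts = [count_vector(d, vocab) for d in docs]
binary = [binary_vector(d, vocab) for d in docs]
tfidf = tfidf_vectors(docs, vocab)
```

In this toy corpus the word that appears in both documents gets a TF-IDF weight of zero, while the count and binary vectors still record it. That down-weighting of common terms is the main practical difference between TF-IDF and the other two encodings.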

Model Training and Evaluation

The Random Forest Classifier model is trained on the preprocessed, feature-extracted data. The dataset is split into training and testing sets, with 80% of the data used for training and 20% for testing. The model is evaluated using the F1 score, which is the harmonic mean of precision and recall.
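The split and the metric can be sketched in plain Python. The source does not say which tool performs the split or how F1 is averaged across the five categories, so the seeded shuffle and the macro average (unweighted mean of per-class F1) below are assumptions, and all function names are hypothetical.

```python
import random


def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and return (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label that appears."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, lab) for lab in labels) / len(labels)


train, test = split_dataset(range(10_000))  # 8,000 training and 2,000 test items
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a class with high precision but low recall (or the reverse) still receives a low F1, which a simple average would hide.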

Conclusion

In this project, we analyzed Arabic tweets using NLP techniques and a Random Forest Classifier model, and compared the performance of three feature extraction techniques: TF-IDF, Count Vectorizer, and Binary Encoding. The best-performing technique is used to predict the category of new, unseen tweets. The project can be extended to include more categories or to use different models and feature extraction techniques.

NOTE: this is an old project that dates back to 2020.

Abdullah-Elkasaby/Arabic-Tweets-NLP-Project
