Sentiment analysis (or opinion mining) is a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.
Text documents are essential because they are one of the richest sources of data for businesses. They often contain crucial information that can shape market trends or influence investment flows. Therefore, companies often hire analysts to monitor trends via articles posted online, tweets on social media platforms such as Twitter, or newspaper articles. However, some companies may wish to focus only on articles related to technology and politics. Thus, filtering articles into different categories is required.
Here, our aim is to categorize unseen articles into five categories, namely Sport, Tech, Business, Entertainment and Politics.
├── Datasets : Contains the dataset files used in the project
├── Saved_models : Contains the models saved from the .py scripts in .pkl/.json/.h5 format
├── Statics : Contains all saved images (graphs/heatmap/TensorBoard screenshots)
├── __pycache__ : .pyc files
├── logs/20220726-160113 : TensorBoard log folder
├── .gitattributes : .gitattributes
├── Article_categorization_Analysis.py : Training code in Python format
├── GitHub_url.txt : GitHub URL in .txt format
├── README.md : Project description
├── aca_module.py : Module file in Python format
└── model.png : Model architecture picture
This project was created using Spyder as the main IDE. The main frameworks used in this project are Pandas, Matplotlib, Seaborn, scikit-learn, TensorFlow and TensorBoard.
This project contains two .py files: the training script, Article_categorization_Analysis.py, and the module file, aca_module.py. The flow of the project is as follows:
Step 1) Loading the data:

Data preparation is the primary step for any deep learning problem. The dataset can be obtained from this [link](https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv). It consists of the texts that will be used for training.
```python
CSV_URL = 'https://raw.githubusercontent.com/susanli2016/PyCon-Canada-2019-NLP-Tutorial/master/bbc-text.csv'
df = pd.read_csv(CSV_URL)
```
Step 2) Data Inspection:
```python
df.head()
df.tail()

# Duplicates can be checked even in NLP data
# There are 99 duplicated texts
df.duplicated().sum()
```
Step 3) Data Cleaning:
```python
# Remove the duplicated data
df = df.drop_duplicates()

# Assign variables to the dataset columns
article = df['text'].values       # features, X
category = df['category'].values  # target, y

# Back up the dataset
article_backup = article.copy()
category_backup = category.copy()
```
Step 4) Feature Selection
- No feature selection was made.
Step 5) Data Preprocessing
- Convert into lower case: no upper case characters were detected in the text, so this step can be skipped. A quick check is sketched below.
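A minimal sketch of how this could be verified, assuming `article` already holds the raw texts (this check is not part of the original scripts):

```python
import re

# Look for any upper-case character across all articles; if none is found,
# the lower-casing step can safely be skipped.
has_upper = any(re.search(r'[A-Z]', text) for text in article)
print(has_upper)  # expected to print False for this dataset
```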
- Tokenizing
Here, we convert the text into numbers; the tokenizer learns all the words (the vocabulary) in this step.
```python
# The sequences must not contain empty lists; the text needs to be converted to numbers
vocab_size = 10000
oov_token = '<OOV>'

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(article)  # learn all the words

word_index = tokenizer.word_index
# To show items 10 to 20 only, put the slice after the list
print(dict(list(word_index.items())[10:20]))

# Convert the articles into sequences of integers
article_int = tokenizer.texts_to_sequences(article)

# Check the length of every article
for i in range(len(article_int)):
    print(len(article_int[i]))
```
Padding & truncating
```python
# To decide the length of the padding, use the median sequence length
length_article = []
for i in range(len(article_int)):
    length_article.append(len(article_int[i]))

np.median(length_article)

# The same median, written as a list comprehension
max_len = np.median([len(article_int[i]) for i in range(len(article_int))])
max_len

# maxlen needs to be an integer
padded_article = pad_sequences(article_int,
                               maxlen=int(max_len),
                               padding='post',
                               truncating='post')
```
One Hot Encoding for the target
```python
ohe = OneHotEncoder(sparse=False)
category = ohe.fit_transform(np.expand_dims(category, axis=-1))
```
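As a side note, the fitted encoder can later map one-hot vectors back to their category names; a small illustrative example (not in the original script):

```python
# Recover the original category label of the first article from its one-hot vector
print(ohe.inverse_transform(category[:1]))
```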
Train test split
```python
X_train, X_test, y_train, y_test = train_test_split(padded_article,
                                                     category,
                                                     test_size=0.3,
                                                     random_state=123)
```
Model Development
The model is structured using Sequential, Embedding, Bidirectional LSTM, Dropout and Dense layers. The model definition can be viewed in the aca_module.py file; an illustrative sketch is shown after the code below.
```python
input_shape = np.shape(X_train)[1:]
nb_class = len(np.unique(category, axis=0))
out_dim = 128

# Model
from aca_module import ModelDevelopment

md = ModelDevelopment()
model = md.simple_dl_model(input_shape, nb_class, vocab_size, out_dim)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
```
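For illustration only, a minimal sketch of what `ModelDevelopment.simple_dl_model` in aca_module.py might look like, based on the layers listed above; the node count and dropout rate here are assumptions, and the actual architecture in the repository may differ:

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

class ModelDevelopment:
    def simple_dl_model(self, input_shape, nb_class, vocab_size, out_dim,
                        nb_node=128, dropout_rate=0.3):
        # Embedding -> stacked Bidirectional LSTM -> Dense softmax classifier
        model = Sequential()
        model.add(Embedding(vocab_size, out_dim, input_length=input_shape[0]))
        model.add(Bidirectional(LSTM(nb_node, return_sequences=True)))
        model.add(Dropout(dropout_rate))
        model.add(Bidirectional(LSTM(nb_node)))
        model.add(Dropout(dropout_rate))
        model.add(Dense(nb_class, activation='softmax'))
        model.summary()
        return model
```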
Model Training
The training uses TensorBoard, ModelCheckpoint and EarlyStopping callbacks to reduce overfitting. The model is trained for only 5 epochs.
```python
# TensorBoard callback
tensorboard_callback = TensorBoard(log_dir=LOGS_PATH, histogram_freq=1)

# ModelCheckpoint
mdc = ModelCheckpoint(BEST_MODEL_PATH,
                      monitor='val_acc',
                      save_best_only=True,
                      mode='max',
                      verbose=1)

# EarlyStopping
early_callback = EarlyStopping(monitor='val_loss', patience=3)

hist = model.fit(X_train, y_train,
                 epochs=5,
                 validation_data=(X_test, y_test),
                 callbacks=[mdc, tensorboard_callback, early_callback])
```
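After training, the best checkpoint written by ModelCheckpoint could be reloaded; a brief sketch, assuming `BEST_MODEL_PATH` points to an .h5 file like those in the Saved_models folder:

```python
from tensorflow.keras.models import load_model

# Reload the weights of the best epoch saved by the ModelCheckpoint callback
best_model = load_model(BEST_MODEL_PATH)
```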
The visualization of the model architecture is presented in the figure below (model.png):
Model Evaluation
In this section, a classification report and a confusion matrix are used as part of the model evaluation and analysis. Around 92% accuracy is achieved.
```python
y_pred = np.argmax(model.predict(X_test), axis=1)
y_actual = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_actual, y_pred)
cr = classification_report(y_actual, y_pred)
print(cm)
print(cr)

disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.show()
```
The classification report is shown below.
The confusion matrix is shown below.
Plotting the graph
Although there is a sign of overfitting towards the end of training, early stopping keeps the overfitting in check. The training and validation curves can be plotted from the training history, as sketched below.
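A minimal sketch of how the curves could be plotted from the `hist` object returned by `model.fit` (the key names follow the `acc` metric compiled above):

```python
import matplotlib.pyplot as plt

# Training vs validation loss
plt.figure()
plt.plot(hist.history['loss'], label='training loss')
plt.plot(hist.history['val_loss'], label='validation loss')
plt.legend()
plt.show()

# Training vs validation accuracy
plt.figure()
plt.plot(hist.history['acc'], label='training acc')
plt.plot(hist.history['val_acc'], label='validation acc')
plt.legend()
plt.show()
```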
This project is made possible by the data provided by susanli2016.