
Articles-Categorization-Analysis-Project 💬

Text classification is a natural language processing (NLP) technique used to assign a piece of text to one of a set of predefined categories. It is often performed on textual data to help businesses organize large volumes of documents, monitor topics of interest, and route articles to the right audience.

Project Descriptions 📝

Text documents are essential as they are one of the richest sources of data for businesses. They often contain crucial information which might shape market trends or influence investment flows. Therefore, companies often hire analysts to monitor these trends via articles posted online, tweets on social media platforms such as Twitter, or newspaper articles. However, some companies may wish to focus only on articles related to technology and politics. Thus, filtering the articles into different categories is required.

Here, our aim is to categorize unseen articles into 5 categories, namely Sport, Tech, Business, Entertainment and Politics.

Project Organization 📁

├── Datasets                                    : Contains the dataset used in the project
├── Saved_models                                : Contains models saved from the .py scripts in .pkl/.json/.h5 format
├── Statics                                     : Contains all saved images (graphs/heatmap/tensorboard)
├── __pycache__                                 : Compiled .pyc files
├── logs/20220726-160113                        : TensorBoard log folder
├── .gitattributes                              : .gitattributes
├── Article_categorization_Analysis.py          : Training script in Python format
├── GitHub_url.txt                              : GitHub URL in .txt
├── README.md                                   : Project description
└── model.png                                   : Model architecture image
    aca_module.py                               : Module file in Python format

Requirements 💻

This project was created using Spyder as the main IDE. The main frameworks used in this project are Pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow and TensorBoard.
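
All of these are available on PyPI; a typical environment setup (exact versions are not pinned in this repository, so this is only a suggestion) is pip install pandas matplotlib seaborn scikit-learn tensorflow.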

Methodology 🏃

This project contains two .py files: Article_categorization_Analysis.py (the training script) and aca_module.py (the module file). The flow of the project is as follows:

  • Step 1 - Loading the data:

    • Data preparation is the primary step for any deep learning problem. The dataset can be obtained from the URL in the snippet below; it consists of texts that will be used for training.

       import pandas as pd

       CSV_URL = ('https://raw.githubusercontent.com/susanli2016/PyCon-Canada-'
                  '2019-NLP-Tutorial/master/bbc-text.csv')

       df = pd.read_csv(CSV_URL)
    
  • Step 2 - Data Inspection:

       df.head()
       df.tail()

       # Duplicates can be checked even in NLP text data
       # There are 99 duplicated texts
       df.duplicated().sum()
    
  • Step 3 - Data Cleaning:

       # Remove the duplicated data
       df = df.drop_duplicates()

       # Assign variables to the dataset columns
       article = df['text'].values       # features, X
       category = df['category'].values  # target, y

       # Back up the dataset
       article_backup = article.copy()
       category_backup = category.copy()
    
  • Step 4 - Features Selection

    • No feature selection was performed.
  • Step 5 - Data Preprocessing

      1. Convert into lower case.
        • No upper case was detected in the text, so this step can be skipped; a quick check is sketched below.
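
        • A minimal sanity check (a sketch, assuming article is the feature array from Step 3):

           # count how many articles contain any uppercase character
           print(sum(any(c.isupper() for c in text) for text in article))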
      2. Tokenizing
        • Here, we convert the text into numbers; the tokenizer learns all the words in this step.

           from tensorflow.keras.preprocessing.text import Tokenizer

           # Convert the text to numbers; the input must not contain empty lists
           vocab_size = 10000
           oov_token = '<OOV>'

           tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
           tokenizer.fit_on_texts(article)  # learn all the words
           word_index = tokenizer.word_index

           # Slice the items to show only entries 10 to 20
           print(dict(list(word_index.items())[10:20]))

           # Convert the texts into sequences of integers
           article_int = tokenizer.texts_to_sequences(article)

           # Check the length of every tokenized article
           for i in range(len(article_int)):
             print(len(article_int[i]))
          
      3. Padding & truncating

         import numpy as np
         from tensorflow.keras.preprocessing.sequence import pad_sequences

         # To decide the padding length, use the median article length
         length_article = []
         for i in range(len(article_int)):
           length_article.append(len(article_int[i]))

         # The median length is used as the maximum length for padding
         np.median(length_article)

         # The same value via a list comprehension
         max_len = np.median([len(article_int[i]) for i in range(len(article_int))])
         max_len  # needs to be converted to an integer before padding

         padded_article = pad_sequences(article_int,
                                        maxlen=int(max_len),
                                        padding='post',
                                        truncating='post')
        
      4. One Hot Encoding for the target

           from sklearn.preprocessing import OneHotEncoder

           ohe = OneHotEncoder(sparse=False)
           category = ohe.fit_transform(np.expand_dims(category, axis=-1))
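
        • The fitted encoder keeps the label order in ohe.categories_[0], which can later be used to map predicted class indices back to category names (see the inference sketch at the end of the Model Evaluation section).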
        
      5. Train test split

         from sklearn.model_selection import train_test_split

         X_train, X_test, y_train, y_test = train_test_split(padded_article,
                                                             category,
                                                             test_size=0.3,
                                                             random_state=123)
        
  • Model Development

    • The model is structured using Sequential, Embedding, Bidirectional LSTM, Dropout and Dense layers.

    The full model definition can be viewed in the aca_module.py file.

           input_shape = np.shape(X_train)[1:]
           nb_class = len(np.unique(category,axis=0))
           out_dim = 128
    
           # Model 
           from aca_module import ModelDevelopment
           md = ModelDevelopment()
           model = md.simple_dl_model(input_shape, nb_class, vocab_size, out_dim)
    
           model.compile(optimizer='adam',loss='categorical_crossentropy',
                         metrics=['acc'])
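
    • For reference, below is a minimal sketch of what simple_dl_model in aca_module.py might look like. The layer order follows the description above, but the dropout rate and other details are assumptions; consult aca_module.py for the actual architecture.

           from tensorflow.keras import Sequential
           from tensorflow.keras.layers import (Embedding, Bidirectional,
                                                LSTM, Dropout, Dense)

           class ModelDevelopment:
               def simple_dl_model(self, input_shape, nb_class, vocab_size,
                                   out_dim, drop_rate=0.3):
                   # Embedding -> Bidirectional LSTM -> Dropout -> softmax output
                   model = Sequential()
                   model.add(Embedding(vocab_size, out_dim,
                                       input_length=input_shape[0]))
                   model.add(Bidirectional(LSTM(out_dim)))
                   model.add(Dropout(drop_rate))  # assumed dropout rate
                   model.add(Dense(nb_class, activation='softmax'))
                   model.summary()
                   return model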
    
  • Model Training

    • This model includes TensorBoard, ModelCheckpoint and EarlyStopping callbacks to reduce overfitting during training. The model is trained for only 5 epochs.

         from tensorflow.keras.callbacks import (TensorBoard, ModelCheckpoint,
                                                 EarlyStopping)

         # TensorBoard callback
         tensorboard_callback = TensorBoard(log_dir=LOGS_PATH, histogram_freq=1)

         # ModelCheckpoint
         mdc = ModelCheckpoint(BEST_MODEL_PATH, monitor='val_acc',
                               save_best_only=True,
                               mode='max', verbose=1)
         # EarlyStopping
         early_callback = EarlyStopping(monitor='val_loss', patience=3)

         hist = model.fit(X_train, y_train,
                          epochs=5,
                          validation_data=(X_test, y_test),
                          callbacks=[mdc, tensorboard_callback, early_callback])
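
    • The snippet above assumes LOGS_PATH and BEST_MODEL_PATH are defined earlier in the training script. A plausible setup, based on this repository's folder layout (the exact values are assumptions):

         import os
         import datetime

         # hypothetical path definitions; adjust to match the actual script
         LOGS_PATH = os.path.join(os.getcwd(), 'logs',
                                  datetime.datetime.now().strftime('%Y%m%d-%H%M%S'))
         BEST_MODEL_PATH = os.path.join(os.getcwd(), 'Saved_models',
                                        'best_model.h5')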
      
    • The visualization of the model architecture is presented in the figure below:

    [Model architecture diagram: model.png]

  • Model Evaluation

    • In this section, a classification report and a confusion matrix are used as part of the model evaluation and analysis. Around 92% accuracy was achieved.

          from sklearn.metrics import (confusion_matrix, classification_report,
                                       ConfusionMatrixDisplay)
          import matplotlib.pyplot as plt

          y_pred = np.argmax(model.predict(X_test), axis=1)
          y_actual = np.argmax(y_test, axis=1)
          cm = confusion_matrix(y_actual, y_pred)
          cr = classification_report(y_actual, y_pred)

          print(cm)
          print(cr)

          disp = ConfusionMatrixDisplay(confusion_matrix=cm)
          disp.plot(cmap=plt.cm.Blues)
          plt.show()
      
    • The classification report is shown below.

    [Classification report: see Statics]

    • The confusion matrix is shown below.

    [Confusion matrix: see Statics]
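
    • Since the aim is to categorize unseen articles, a minimal inference sketch is shown below. It reuses the fitted tokenizer, max_len, ohe and trained model from the steps above; the sample text is made up for illustration:

          # a made-up unseen article, for illustration only
          new_article = ['shares rallied after the company reported record profits']

          # apply the same preprocessing as training: tokenize, then pad/truncate
          new_int = tokenizer.texts_to_sequences(new_article)
          new_padded = pad_sequences(new_int, maxlen=int(max_len),
                                     padding='post', truncating='post')

          # predict, then map the class index back to a category name
          pred_idx = np.argmax(model.predict(new_padded), axis=1)
          print(ohe.categories_[0][pred_idx])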

Results and Discussion 📝

  • Plotting the graph

    • Although there is a sign of overfitting towards the end of the training, early stopping prevented the overfitting from worsening.

    [Training curves: epoch_acc and epoch_loss, see Statics]
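
    • The same curves can be inspected interactively by running tensorboard --logdir logs from the project root and opening the printed URL in a browser.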

Credits 📂

This project is made possible by the dataset provided by susanli2016 (PyCon-Canada-2019-NLP-Tutorial).
