This lab's goal is to familiarize you with natural language processing by building a sentiment analysis model using PyTorch.
Prerequisites (you should probably follow 42-AI's instructions here)
- Python 3.5+
- PyTorch 0.2+
- Download and unzip a pre-trained embedding file, preferably the Wikipedia EN fastText embeddings wiki-news-300d-1M.vec.zip. Note: this file will be made available on a server internal to 42.
The dataset is an extract from the Stanford Sentiment Treebank. It consists of fragments of English sentences about a movie, each paired with a sentiment label. The label is 0 for sentences that express a negative opinion about the movie and 1 for positive reviews. E.g., "a gorgeous , witty , seductive movie" is labeled 1 and "forced , familiar and thoroughly condescending" is labeled 0.
- Fork this repo and add your fork's URL to this document
- Put the embedding file described in the prerequisites into a folder called `data` inside your repo root.
- Create a Python script that loads embeddings using `embedding.py` and runs two tasks (a minimal sketch of both follows this list):
  - Nearest-neighbor search: print the 10 nearest neighbors of the word 'geek'.
  - Analogy: retrieve the embeddings closest to a combination of embeddings that corresponds to an analogy, e.g. `'Tokyo' + 'Spain' - 'Japan' = ?`. Caveat: you will need to explicitly remove the embeddings corresponding to the query words (in this case `Tokyo`, `Spain` and `Japan`) from the tokens you retrieve.
- Find the best analogy you can and add it to the analogy tab of the Google Sheet. Be creative!
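Here is a minimal, standalone sketch of both queries. It reads the raw `.vec` file directly with NumPy instead of going through the provided `embedding.py` loader, and the file path, the `limit` cutoff and the helper names (`load_vectors`, `nearest`) are assumptions made for illustration only; adapt them to the loader's actual API.

```python
import numpy as np

def load_vectors(path, limit=100000):
    """Read the first `limit` vectors of a fastText .vec file into a matrix."""
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        next(f)  # the first line only holds "<vocab_size> <dim>"
        for i, line in enumerate(f):
            if i >= limit:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.asarray(parts[1:], dtype=np.float32))
    mat = np.stack(vecs)
    mat /= np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize once for cosine similarity
    return words, mat

def nearest(query, words, mat, exclude=(), k=10):
    """Return the k words whose embeddings are closest to `query` by cosine similarity."""
    sims = mat @ (query / np.linalg.norm(query))
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] not in exclude][:k]

words, mat = load_vectors("data/wiki-news-300d-1M.vec")
idx = {w: i for i, w in enumerate(words)}

# Task 1: the 10 nearest neighbors of 'geek'
print(nearest(mat[idx["geek"]], words, mat, exclude={"geek"}))

# Task 2: the analogy 'Tokyo' + 'Spain' - 'Japan' = ?  (query words must be excluded)
query = mat[idx["Tokyo"]] + mat[idx["Spain"]] - mat[idx["Japan"]]
print(nearest(query, words, mat, exclude={"Tokyo", "Spain", "Japan"}))
```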
- Run the baseline model using `train.py`. After the data is loaded, training starts and the loss and accuracy on the train and dev sets are reported. Remember: the goal is to reach the highest test accuracy possible. At each epoch, the script writes a `model.pth` file in your repo that you will need to push along with the rest of your code for your work to be evaluated.
- Try to improve the model. Several options come to mind:
  1. The baseline uses only the first word of the sentence to perform classification. Edit `model.py` so that it averages all the embeddings in the sentence (see the sketch after this list).
  2. (requires 1. above) Use an importance weighting scheme for each word in the sentence, e.g. scale each embedding in inverse proportion to the frequency of the token in the documents. This will reduce the importance of very common words such as "the" or "and". After that, you could go for more advanced weighting schemes, e.g. TF-IDF.
  3. Change the optimizer parameters: you could use different learning rates or different optimizers (see the sketch below).
  4. Change the model altogether: you could go for a recurrent neural network such as an LSTM.
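For option 1 (and as a starting point for option 2), here is a minimal sketch of mean pooling. It assumes the model receives a `(batch, seq_len, embed_dim)` tensor of pretrained embeddings and a `(batch, seq_len)` padding mask; the actual class, argument names and shapes in `model.py` may differ, so treat everything here as a placeholder.

```python
import torch
import torch.nn as nn

class AveragingClassifier(nn.Module):
    """Mean-pools the pretrained embeddings of a sentence, then classifies."""

    def __init__(self, embed_dim=300, n_classes=2):
        super().__init__()
        self.fc = nn.Linear(embed_dim, n_classes)

    def forward(self, embeddings, mask):
        # embeddings: (batch, seq_len, embed_dim); mask: (batch, seq_len), 1 for real tokens
        mask = mask.unsqueeze(-1).float()
        summed = (embeddings * mask).sum(dim=1)   # zero out padding before summing
        counts = mask.sum(dim=1).clamp(min=1.0)   # avoid dividing by zero on empty rows
        sentence = summed / counts                # average embedding of the sentence
        # For option 2, replace the 0/1 mask with per-token weights,
        # e.g. 1 / document frequency or TF-IDF scores.
        return self.fc(sentence)
```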
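For option 3, the change is a one-liner in `train.py`. The baseline's optimizer and hyperparameters are not shown here, so the stand-in model and the values below are only illustrative starting points.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(300, 2)  # stand-in; use the real model built from model.py

# Swap the optimizer or tune its hyperparameters, e.g.:
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# optimizer = optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
```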
Good luck!