CLTK models for Middle English

How the word2vec model was trained

We first need to install some mathematics and machine learning packages.

pip install scipy sklearn gensim

We import a corpus of Middle English texts

import pickle
with open('./middle_english_txt-.p','rb') as p:
    corpus = pickle.load(p)

The normalization process involves removal of punctuation, special characters and numbers.

CHARACTERS_TO_REMOVE = "..."

def remove_punc(text):
    for ele in text:
        if ele in CHARACTERS_TO_REMOVE:
            return text.lower().replace(ele, "")
        else:
            return text.lower()
            
new_data = []

for texts in corpus:
    part_text = []
    for word in texts:
        part_text.append(remove_punc(word))
    new_data.append(part_text)

The data was in a form of a list of lists of strings or a list of sentences, where a sentence is a list of words.

Then we use Word2Vec class from gensim

# time to try gensim to create word2vecs
# see NLP in action, 6.2.4
from gensim.models.word2vec import Word2Vec

num_features = 50
min_word_count = 30
num_workers = 2
window_size = 20
subsampling = 1e-3

model = Word2Vec(
    new_data,
    workers=num_workers,
    vector_size=num_features,
    min_count=min_word_count,
    window=window_size,
    sample=subsampling)

model.init_sims(replace=True)
me_w2v_model = "me_word_embeddings_model.bin"
model.save(me_w2v_model)

The model is now saved in the file me_word_embeddings_model.bin.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

CLTK models for Middle English

How the word2vec model was trained

Files

README.md

Latest commit

History

README.md

File metadata and controls

CLTK models for Middle English

How the word2vec model was trained