A novel method to compare the phonetic similarity between words based on phonetic features. This is the official repository for the paper https://arxiv.org/pdf/2109.14796.pdf
Download the CMU Pronouncing Dictionary into the data directory.
wget -P data http://svn.code.sf.net/p/cmusphinx/code/trunk/cmudict/cmudict-0.7b
Download the SOTA language model vocabulary from the NLP for Hindi git repository.
wget -O data/hindi_lm_large.vocab "https://drive.google.com/uc?export=download&id=1P6r8UBcegvVmr1kBDjqcYppmt_WgnbNt"
Add the missing words to the CMU dictionary.
cat data/cmudict-0.7b res/cmudict_missing_words >> data/cmudict-0.7b-with-vitz-nonce
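As a quick sanity check on the merged file, its entries can be parsed into a word-to-pronunciation mapping. The sketch below is not part of this repository: parse_cmudict is a hypothetical helper, and it assumes the usual CMUdict layout (comment lines starting with ";;;", a word followed by its ARPAbet phones, and alternate pronunciations written as WORD(1)). The sample lines are illustrative only.

```python
import re


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (each a list of ARPAbet phones).

    Hypothetical helper, assuming the usual CMUdict text layout.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and ";;;" comment lines.
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        # Alternate pronunciations are written as WORD(1), WORD(2), ...
        word = re.sub(r"\(\d+\)$", "", word)
        entries.setdefault(word, []).append(phones)
    return entries


# Illustrative lines in the assumed format.
sample = [
    ";;; comment line",
    "RELATION  R IY0 L EY1 SH AH0 N",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
entries = parse_cmudict(sample)
print(entries["READ"])
```

To run it on the real file, pass an open file handle instead of the sample list. cmudict-0.7b may not decode cleanly as UTF-8, so opening it with encoding="latin-1" is a reasonable starting assumption.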
Install all the dependencies.
pip install -r src/requirements.txt
Generate the Hindi dictionary from the LM vocab.
python src/preprocess/vocab2dict.py res/hindi_phones.csv data/hindi_lm_large.vocab data/dict_hindi
results_method.ipynb contains the results for the algorithm. The results include:
Comparison between unigram, bigram, bigram with penalty, and bigram with penalty & vowel weight.
How we obtained the penalty of 2.5.
Comparison between Vitz and Winkler (1973), Parrish's Embeddings (2017), and our methods (with and without vowel weights).
^ The Parrish's Embeddings (PSSVec) results are generated from the author's provided git code using numpy.seed(0) in generate.py. We cannot use the author-provided pretrained vectors because the dictionary they used is missing the word BELATION, which is used in the RELATION dataset by Vitz and Winkler (1973).
The similarity vectors used by us for calculating PSSVec can be downloaded using
wget -O data/cmudict-0.7b-simvecs "https://drive.google.com/uc?export=download&id=1gCvwI8ldxGM52vCoN70wUKmJfFMdapNl"
Embedding scores can be re-generated using src/embedding.py by providing the learned embedding file and the output file.
python src/embedding.py data/cmudict-0.7b-simvecs res/PSSVec_results.csv
python src/embedding.py embedding_english/simvecs res/embedding_score.csv
^ These files are used to generate scores in the result section using results_method.ipynb.
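For readers unfamiliar with how an embedding file can be turned into word-pair scores, here is a minimal sketch. It is not the code in src/embedding.py. It assumes that each line of the embedding file holds a word followed by whitespace-separated floats, and that pairs are scored with cosine similarity. Both points are assumptions, load_vectors and cosine_similarity are hypothetical helpers, and the vectors are toy values.

```python
import math


def load_vectors(lines):
    """Parse 'WORD v1 v2 ...' lines into a dict of word -> list of floats.

    The line format is an assumption, not taken from the repository.
    """
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors, not real embedding values.
sample = [
    "RELATION 1.0 0.0 2.0",
    "BELATION 1.0 0.5 2.0",
    "UNDERSTAND -2.0 1.0 0.0",
]
vectors = load_vectors(sample)
sim_close = cosine_similarity(vectors["RELATION"], vectors["BELATION"])
sim_far = cosine_similarity(vectors["RELATION"], vectors["UNDERSTAND"])
print(sim_close, sim_far)
```

With these toy values the near-identical pair scores close to 1 and the unrelated pair scores below 0. This is the kind of ordering such scores are meant to capture when compared against human similarity judgments.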
TSNE Plot for some English words
TSNE Plot for some Hindi words
Pun Dataset (see docs/puns.md for more details)
Docker is supported for development and training.
make build
make develop
This will give you a command prompt inside the Docker container. The current directory will be mounted at /workspace.
The container will be destroyed on exit, but all files and changes made in the directory will persist.
You can also start it with GPU support:
make develop_gpu
make clean
Remember that this will not delete the base image. To clean the base image, run:
make clean_base
This project is licensed under the MIT License - see the LICENSE file for details
- Hat tip to anyone whose code was used