Instructor: Dr. A. Nikabadi
Course content: CS276, Stanford University
Semester: Fall 2022
This project was built for the Information Retrieval course. It implements a search engine that supports both phrase queries and free-text queries on the Fars News dataset.
- Preprocessing the data (normalization, tokenization, stemming, stopword removal)
- Working with the two most widely used Persian NLP toolkits: hazm and parsivar
- Created a positional inverted index
- Used Zipf's law
- Used Heaps' law
- Searching with normal queries, phrase queries (used a permuterm index), and Boolean queries
- Ranking results
- Show words in vector representation
- Compute tf-idf
- Compute cosine similarity between query terms and documents
- Used index elimination techniques such as creating champion lists
- Rank results by relevance
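
The indexing and ranking steps listed above can be illustrated with a minimal sketch. It is not the project's actual code: the function names (`build_positional_index`, `phrase_search`, `tfidf_vector`, `cosine`, `rank`) are hypothetical, the documents are assumed to be already normalized and tokenized (for example with hazm or parsivar), and the `(1 + log10 tf) * log10(N / df)` weighting is one common tf-idf variant chosen for illustration. Toy English tokens stand in for the Persian text.

```python
import math
from collections import Counter, defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for pre-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents that contain the phrase tokens consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[phrase[0]])
    for term in phrase[1:]:
        candidates &= set(index[term])
    hits = []
    for doc_id in candidates:
        # A match starts at `start` if the i-th term occurs at position start + i.
        for start in index[phrase[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(phrase)):
                hits.append(doc_id)
                break
    return sorted(hits)


def tfidf_vector(term_counts, index, n_docs):
    """Weight each term as (1 + log10 tf) * log10(N / df); unindexed terms are skipped."""
    vec = {}
    for term, tf in term_counts.items():
        df = len(index.get(term, {}))
        if df:
            vec[term] = (1 + math.log10(tf)) * math.log10(n_docs / df)
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(docs, index, query, k=10):
    """Return up to k (doc_id, score) pairs sorted by descending cosine similarity."""
    n_docs = len(docs)
    q_vec = tfidf_vector(Counter(query), index, n_docs)
    scores = []
    for doc_id, tokens in docs.items():
        score = cosine(q_vec, tfidf_vector(Counter(tokens), index, n_docs))
        if score > 0:
            scores.append((doc_id, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


# Toy corpus (assumption: tokens are already normalized and stemmed).
docs = {
    1: ["tehran", "stock", "market", "rises"],
    2: ["stock", "prices", "fall", "in", "tehran", "market"],
    3: ["football", "league", "results"],
}
index = build_positional_index(docs)
phrase_hits = phrase_search(index, ["stock", "market"])  # only doc 1 has the two words adjacent
ranked = rank(docs, index, ["tehran", "stock"])          # doc 1 scores above doc 2
```

In this sketch `rank` scores every document. A champion list would instead precompute, for each term, the few documents with the highest weights and score only those, trading some recall for speed.
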
Contributors: Rojina Kashefi & Leili Barekatein