A simple implementation of ranking for search-based systems using semantic similarity.
https://ciir.cs.umass.edu/downloads/WebAP/
Note: a more detailed write-up will be added soon.
-
Acquired the dataset through Slack.
-
Pre-processed the dataset
- Removed stop words
- Lemmatized the corpus and saved it for future reference
- Created an inverted index (for demonstration purposes)
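
The pre-processing steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code used in this project: the stop-word list, the `preprocess` and `build_inverted_index` helpers, and the three sample documents are all assumptions made up for the example, and lemmatization is left out because it needs an external library.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}


def preprocess(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Made-up documents, not taken from the dataset.
docs = {
    "d1": "The tumor is a sign of cancer.",
    "d2": "Ovarian cancer in the clinic",
    "d3": "A cloud in the sky",
}
index = build_inverted_index(docs)
print(index["cancer"])
```

Looking up a term in `index` returns the IDs of the documents that mention it, which is the basic lookup an inverted index exists to provide.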
-
Converted corpus to vectors using Word2Vec
-
Tested semantic similarity on random query words using the model.
Words most similar to the query:
modelW2V.wv.similarity('cancer', 'tumor') #0.8035345
modelW2V.wv.similarity('cancer', 'ovarian') #0.860453
Words least similar to the query:
modelW2V.wv.similarity('cancer', 'cloud') #0.8035345
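
For reference, gensim's `wv.similarity` returns the cosine similarity between the two word vectors. The sketch below shows that computation on plain Python lists; the `cosine_similarity` helper and the toy two-dimensional vectors are assumptions for illustration only and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (hypothetical): same direction, orthogonal, and opposite.
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same, orthogonal, opposite)
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.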
-
Converted corpus to vectors using Doc2Vec
-
Found the most similar documents for a given query
new_sentence = "i love dogs".split(" ")  # query = {i, love, dogs}
# Select the top n documents
model.docvecs.most_similar(positive=[model.infer_vector(new_sentence)], topn=5)
# Result:
# [('5235', 0.7422172427177429),
#  ('4870', 0.7328481674194336),
#  ('95', 0.7185875773429871),
#  ('5868', 0.7118589878082275),
#  ('1954', 0.6987151503562927)]
# Format: ('DocID', cosine similarity between the document and the query)
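
Conceptually, the ranking step scores every document vector against the query vector and keeps the `topn` best. The sketch below illustrates that idea with the standard library only; it is not how gensim implements `most_similar` internally, and the `rank_documents` helper, the document IDs, and the toy vectors are all hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_documents(query_vec, doc_vecs, topn=5):
    """Return up to `topn` (doc_id, score) pairs, highest score first."""
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return heapq.nlargest(topn, scored, key=lambda pair: pair[1])


# Toy document vectors (hypothetical); a real system would use inferred Doc2Vec vectors.
doc_vecs = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
}
top = rank_documents([1.0, 0.1], doc_vecs, topn=2)
print(top)
```

Here document "a" ranks first because it points in almost the same direction as the query, followed by "c"; document "b" is cut off by `topn=2`.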
Cheers!