Efficiently Distributed Representation of Words and Phrases using Negative Sampling for Regional Languages
This paper introduces multilingual word semantic similarity, which measures the semantic similarity of word pairs within each of five languages: English, Hindi, Kannada, German, and Italian. The model was trained efficiently on high-quality datasets; the total dataset size across all five languages is about 90 GB. We propose a computationally efficient technique for measuring the semantic similarity of word pairs by building a neural network model. We also use negative sampling to improve the accuracy of the model, and we propose a technique for detecting phrases to improve the model's accuracy further. The results show that combining statistical knowledge from the text corpus (word embeddings) gives very high accuracy.
Keywords - Word Embeddings, Negative Sampling, Phrase Detection.