Paper Title
Performance Comparison of Similarity Functions For Document Retrieval System
Abstract
Measuring the similarity of documents plays an important role in text-related research and
applications such as document clustering, plagiarism detection, information retrieval, machine translation and automatic
essay scoring. Many methods have been proposed to solve this problem. They can be grouped into three main approaches:
String-based, Corpus-based and Knowledge-based similarity. The String-based approach is further divided into the character-based
approach and the term-based approach. Some existing similarity measures cannot properly determine the similarity of a document
pair in some circumstances. This paper therefore proposes a new similarity approach, called KSD (Keyword Similarity
Distance), based on a term-based similarity function, to properly determine the similarity score of each document pair. The
KSD function takes the keyword similarity distance between each pair of documents and then computes average similarity scores
for all documents. In the paper, the proposed function returns a more accurate list of related documents than the existing similarity
functions. Three similarity functions (cosine, overlap and the proposed KSD) are applied to evaluate the
performance of the similarity scores. The keyword extraction process and the similarity calculation are implemented in C#. According
to the experimental results, the proposed function outperforms the other similarity functions.
Keywords— Similarity function, KSD, Cosine, Overlap.
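
As a point of reference for the two baseline functions named above, the following is a minimal sketch of the textbook cosine and overlap measures over extracted keywords. It is written in Python for brevity rather than the C# used in the paper, and it makes several assumptions: the function names and sample keyword lists are hypothetical, cosine is computed on raw term frequencies, and overlap is computed on keyword sets. The paper's exact variants may differ, and KSD is not shown because the abstract does not define it.

```python
import math
from collections import Counter


def cosine_similarity(terms_a, terms_b):
    """Textbook cosine similarity on raw term-frequency vectors (assumed variant)."""
    tf_a, tf_b = Counter(terms_a), Counter(terms_b)
    # Dot product over the terms the two documents share.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def overlap_similarity(terms_a, terms_b):
    """Textbook overlap coefficient on keyword sets (assumed variant)."""
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


# Hypothetical keyword lists, for illustration only.
doc_a = ["similarity", "document", "retrieval"]
doc_b = ["document", "retrieval"]

cos_score = cosine_similarity(doc_a, doc_b)
ovl_score = overlap_similarity(doc_a, doc_b)
print(cos_score, ovl_score)
```

In this toy pair every keyword of the shorter document also appears in the longer one, so the overlap coefficient reaches its maximum of 1 while the cosine score stays below 1. The two baselines can therefore rank the same document pair differently.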