Paper Title
Entity Conflation In Large Structured Datasets & Post Conflation Database Reconciliation
Abstract
Large real-world databases often store a single entity (a person, a place, a country, etc.) as two or more separate
records. The resulting duplication and redundancy can produce irrelevant or undesired output when these datasets are
processed to derive meaningful results. For instance, a database of country names may hold ‘South Africa’ and
‘Republic of South Africa’ as two separate entities. This paper proposes an approach to map such entities, purge the
duplicates, and reconcile the database so that all foreign key references to the purged entities are updated to point to
the entities that are retained. Our experiments on large real-world databases with more than a million entries yielded
results with high coverage and precision.
Keywords: Conflation, Confidence Score, Fuzzy String Match, Dice’s Coefficient, Reconciliation, Purging
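
To make the fuzzy matching idea named in the keywords concrete, the following is a minimal sketch of Dice's coefficient computed over character bigrams, applied to the country-name example from the abstract. It is an illustration under stated assumptions, not necessarily the formulation used in this paper: the bigram tokenisation, the lowercasing and whitespace normalisation, and the function names `bigrams` and `dice_coefficient` are all hypothetical choices made for this example.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Return the multiset of character bigrams of a normalised string.

    Assumption: names are lowercased and runs of whitespace are collapsed
    before comparison; the paper may normalise differently.
    """
    normalised = " ".join(text.lower().split())
    return Counter(normalised[i:i + 2] for i in range(len(normalised) - 1))


def dice_coefficient(a: str, b: str) -> float:
    """Dice's coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over bigram multisets."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to yield any bigram; fall back to
        # a plain equality check on the normalised text.
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection keeps min counts
    return 2.0 * overlap / total


if __name__ == "__main__":
    candidates = ["Republic of South Africa", "South Korea"]
    for name in candidates:
        score = dice_coefficient("South Africa", name)
        print(f"South Africa vs {name}: {score:.3f}")
```

Under these assumptions, ‘Republic of South Africa’ scores higher against ‘South Africa’ than ‘South Korea’ does, which is the kind of ranking a confidence score would need before two entries are treated as the same entity. Any threshold for accepting a match would have to be chosen empirically.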