Entity Conflation In Large Structured Datasets & Post Conflation Database Reconciliation

Large real-world databases often store a single entity (a person, a place, a country, etc.) as two or more separate entities. The resulting duplication and redundancy can introduce irrelevant or undesired information when these datasets are processed to produce meaningful results. For instance, a database that stores country names may hold ‘South Africa’ and ‘Republic of South Africa’ as two separate entities. This paper proposes an approach to map such entities, purge the duplicates, and reconcile the database so that all foreign key references to the purged entities are updated to point to the entities that are retained. Our experiments on large real-world databases with more than a million entries yielded results with high coverage and precision.

Keywords: Conflation, Confidence Score, Fuzzy String Match, Dice’s Coefficient, Reconciliation, Purging
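
To make the map, purge, and reconcile steps concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It assumes Dice's coefficient is computed over character bigrams and used directly as the match score. The function names, the greedy keep-the-lowest-id rule, and the threshold of 0.6 are all hypothetical choices made for illustration.

```python
from collections import Counter


def bigrams(text):
    """Multiset of character bigrams of the lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a, b):
    """Dice's coefficient: 2 * |shared bigrams| / (|bigrams of a| + |bigrams of b|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are too short to form a bigram; fall back to exact match.
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    shared = sum((x & y).values())  # multiset intersection
    return 2.0 * shared / total


def conflate(entities, threshold=0.6):
    """Map each duplicate entity id to the id of the entity that is kept.

    `entities` is a dict of id -> name. The threshold is an assumed value,
    not one taken from the paper.
    """
    kept = []
    remap = {}
    for eid in sorted(entities):
        for kid in kept:
            if dice_coefficient(entities[eid], entities[kid]) >= threshold:
                remap[eid] = kid  # eid will be purged in favour of kid
                break
        else:
            kept.append(eid)
    return remap


def reconcile(foreign_keys, remap):
    """Rewrite foreign key references so purged ids point to the kept ids."""
    return [remap.get(fk, fk) for fk in foreign_keys]


countries = {1: "South Africa", 2: "Republic of South Africa", 3: "India"}
remap = conflate(countries)
references = reconcile([1, 2, 3, 2], remap)
```

Here ‘Republic of South Africa’ (id 2) is mapped onto ‘South Africa’ (id 1), and every reference to id 2 is rewritten to id 1 before the duplicate row is purged. A pairwise comparison like this is quadratic in the number of entities, so a database with more than a million entries would need some form of candidate pruning before scoring.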