Motivating Entity Resolution for Data Science

Why Entity Resolution?

Data is the new oil, and analytical models are the new combustion engines. Just as an engine runs efficiently only on good fuel, a model outputs sensible results only when it is fed quality data. Hence, raw data needs to be refined before it is used. This has led to a movement called Data-Centric AI, led by Prof. Andrew Ng.

Enterprise data science is largely driven by relational databases. As organizations grow, however, information systems become siloed, leading to duplicate records. Data quality issues, schema variations, and differing data collection practices add to the ambiguity.

In a DBMS, an entity is a real-world object such as a customer. Over time, this customer entity can accumulate multiple versions. Within the same database, the entity may have multiple records with different hospital types, addresses, etc.

Further, across databases, the entity may vary in structure and semantics. So, how do we reconcile these variations? The answer is Entity Resolution.

What is Entity Resolution?

Entity Resolution is the disambiguation of records that correspond to entities in the real world, within and across databases. The three primary tasks involved in entity resolution are deduplication, record linkage, and canonicalization:

  1. Deduplication: To identify duplicate data within the same source.
  2. Record linkage: To identify records that reference the same entity across sources.
  3. Canonicalization: To convert data with multiple representations into a standard form.

To make this concrete, let’s take an example of customer records in two databases, D1 and D2.

Database D1:

Let’s take an example of customer address records:

Customer Name  | Address1           | City      | State     | Zip
Aarogya Health | 3rd cross, MG Road | Bengaluru | Karnataka | 560093
aarogya Health | 3rd Cross, MG Rd   | Bangalore | KN        | 560093

We can make out that both rows refer to the same customer. However, there are minor variations in every column except Zip.

Database D2:

Let’s take an example of customer contact records:

Customer Name  | Website           | Email             | Contact
Aarogya Health | aarogyahealth.com | [email protected] | Dr ABC
aarogya Health | aarogyahealth.com | [email protected] | Dr AB

The two records point to the same Customer.

In these examples, identifying and marking similar records within the same database, either D1 or D2, is deduplication. Identifying and marking similar records across databases D1 and D2 is record linkage.
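As a rough illustration of these two tasks, here is a minimal Python sketch that compares customer names with a string similarity ratio from the standard library. The records come from the D1 and D2 examples (the Email column is left out). The helper names (similarity, deduplicate, link) and the 0.8 threshold are assumptions made for this illustration, not part of any library or of the tutorial referenced later.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Records copied from the D1 and D2 examples (Email column left out).
D1 = [
    {"name": "Aarogya Health", "address1": "3rd cross, MG Road",
     "city": "Bengaluru", "state": "Karnataka", "zip": "560093"},
    {"name": "aarogya Health", "address1": "3rd Cross, MG Rd",
     "city": "Bangalore", "state": "KN", "zip": "560093"},
]
D2 = [
    {"name": "Aarogya Health", "website": "aarogyahealth.com", "contact": "Dr ABC"},
    {"name": "aarogya Health", "website": "aarogyahealth.com", "contact": "Dr AB"},
]

THRESHOLD = 0.8  # hypothetical cut-off; a real project would tune this


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def deduplicate(records, key="name"):
    """Deduplication: index pairs within ONE source whose key fields look alike."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a[key], b[key]) >= THRESHOLD
    ]


def link(left, right, key="name"):
    """Record linkage: index pairs ACROSS two sources whose key fields look alike."""
    return [
        (i, j)
        for i, a in enumerate(left)
        for j, b in enumerate(right)
        if similarity(a[key], b[key]) >= THRESHOLD
    ]


duplicates_d1 = deduplicate(D1)  # pairs of row indexes inside D1
links = link(D1, D2)             # pairs of (D1 row index, D2 row index)
print(duplicates_d1, links)
```

Comparing every pair of records like this is quadratic in the number of rows, so it only suits small examples.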

Lastly, we can see that one customer name starts with a lowercase letter, while the other is in title case. Moreover, Address1 varies in case and in its use of short forms (Rd for Road). Bringing all the records to one standard form (e.g. lower case, expanded abbreviations) is called canonicalization.
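A minimal Python sketch of canonicalization for the two D1 records could look like the following. The lookup tables (ABBREVIATIONS, CITY_ALIASES, STATE_ALIASES) and the function names are hypothetical and cover only the variations seen in this example; a real system would need far larger reference data.

```python
import re

# Hypothetical lookup tables, covering only the variations in the D1 example.
ABBREVIATIONS = {"rd": "road"}
CITY_ALIASES = {"bangalore": "bengaluru"}
STATE_ALIASES = {"kn": "karnataka"}


def canonical_text(value, replacements=None):
    """Lower-case, drop punctuation, and expand known short forms token by token."""
    replacements = replacements or {}
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return " ".join(replacements.get(token, token) for token in tokens)


def canonicalize(record):
    """Return a copy of an address record with every field in one standard form."""
    return {
        "name": canonical_text(record["name"]),
        "address1": canonical_text(record["address1"], ABBREVIATIONS),
        "city": canonical_text(record["city"], CITY_ALIASES),
        "state": canonical_text(record["state"], STATE_ALIASES),
        "zip": record["zip"].strip(),
    }


row_a = {"name": "Aarogya Health", "address1": "3rd cross, MG Road",
         "city": "Bengaluru", "state": "Karnataka", "zip": "560093"}
row_b = {"name": "aarogya Health", "address1": "3rd Cross, MG Rd",
         "city": "Bangalore", "state": "KN", "zip": "560093"}

canon_a = canonicalize(row_a)
canon_b = canonicalize(row_b)
print(canon_a == canon_b)
```

Once both rows are in the same standard form, a plain equality check is enough to see that they describe the same customer.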

How to perform Entity Resolution?

With such minor variations in data, it is difficult to find duplicates within or across databases. Moreover, the problem gets worse as the scale of data grows. Hence, rule-based engines become impractical to build and maintain.

Fortunately, Machine Learning makes probabilistic entity matching possible. We strongly recommend that you read our article, Using Machine Learning to De-Duplicate Data. It is a hands-on tutorial on deduplication with active machine learning, using the pandas-dedupe library. Additionally, to read more about record linkage using the same library, refer to this link. Note that this implementation does not scale well.
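To give a feel for what "probabilistic" matching means, here is a minimal Python sketch in the style of the classical Fellegi-Sunter model: each field contributes evidence for or against a match, and the evidence is combined into a match probability. This is a stand-in for illustration only and is not taken from pandas-dedupe. The m and u probabilities, the prior, and the second customer (Sanjeevani Clinic) are hypothetical values chosen by hand, and the sketch assumes the fields are conditionally independent.

```python
import math

# Hypothetical per-field probabilities, chosen by hand for illustration:
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are NOT a match)
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "city": (0.90, 0.10),
    "zip": (0.95, 0.02),
}
PRIOR_MATCH = 0.01  # hypothetical prior probability that a random pair matches


def agrees(a, b):
    """Field-level comparison: exact match after trimming and lower-casing."""
    return a.strip().lower() == b.strip().lower()


def match_probability(rec_a, rec_b):
    """Combine field agreements into a posterior match probability.

    Starts from the prior log-odds and adds one log-likelihood ratio per
    field, assuming fields are conditionally independent.
    """
    log_odds = math.log(PRIOR_MATCH / (1 - PRIOR_MATCH))
    for field, (m, u) in FIELD_PARAMS.items():
        if agrees(rec_a[field], rec_b[field]):
            log_odds += math.log(m / u)              # agreement supports a match
        else:
            log_odds += math.log((1 - m) / (1 - u))  # disagreement counts against
    return 1 / (1 + math.exp(-log_odds))


row_a = {"name": "Aarogya Health", "city": "Bengaluru", "zip": "560093"}
row_b = {"name": "aarogya Health", "city": "Bangalore", "zip": "560093"}
row_c = {"name": "Sanjeevani Clinic", "city": "Mysuru", "zip": "570001"}  # hypothetical

p_same = match_probability(row_a, row_b)
p_diff = match_probability(row_a, row_c)
print(p_same, p_diff)
```

Unlike a hard rule, the score degrades gracefully: the two Aarogya Health rows disagree on City, yet the agreeing Name and Zip still outweigh that disagreement. In practice the m and u values would be estimated from labelled or sampled data rather than set by hand.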

Conclusion

Finally, this is not a comprehensive guide to Entity Resolution, since it is a big subject. We will expand on this topic in the future. Please note that this article is for information only; we make no guarantees regarding its accuracy or completeness. Any names that occur here are purely imaginary, and any resemblance to real names and places is purely coincidental.

 



I am a Data Scientist with 6+ years of experience.

