Entity Resolution
Enigma brings all of these data sources together into a single profile of a business through a process Enigma calls entity resolution. Enigma's entity resolution involves state-of-the-art machine learning models that are regularly validated with human labeling.
Entity Resolution is a critical part of how Enigma processes SMB data. This section provides insights into why Entity Resolution is so important and the mechanics of how it works.
What is Entity Resolution?
Enigma’s SMB data is integrated from a wide variety of disparate sources. Once ingested, the data is mapped to Enigma’s SMB ontology, i.e. the data model for the small-business world. The data is processed through Enigma’s data pipeline to produce each business profile.
A crucial step in this process is what Enigma refers to as Entity Resolution. A given business in the real world may have several different representations across disparate data sources. Entity resolution ensures that these separate “instances” of a business are combined into a single record containing the full set of attributes observed in the underlying data.
Deeper Dive
Entity Resolution is the name for the process of linking and managing the identity of records between datasets that come from different sources.
Entity Resolution for business data specifically refers to taking pairs of business records and comparing their known information to see if they refer to the same business entity. The records in each pair come from various data sources, each offering a different view into the characteristics and activity of a business.
For instance, one dataset will provide information about whether a business is associated with an SBA Loan, another will say if it’s legally registered as a corporation, another will indicate whether it has revenues from credit cards, and so forth. By connecting pairs of business records that refer to the same entity, Enigma is able to get a linked view of the known activity of a business and construct a full business profile.
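As an illustration of how pairwise links can add up to a linked view, the sketch below groups records into entities with a union-find structure: if record 1 matches record 2 and record 2 matches record 3, all three land in the same group. This is only a minimal sketch of one common approach, not a description of Enigma's pipeline. The record ids and the `group_records` function are hypothetical.

```python
from collections import defaultdict


def group_records(record_ids, matched_pairs):
    """Group record ids into entities, given pairs judged to be the same business."""
    # Each record starts out as its own group.
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the group's root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for rid in record_ids:
        groups[find(rid)].append(rid)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical record ids from three made-up sources.
records = ["sba:1", "registry:7", "cards:3", "registry:9"]
pairs = [("sba:1", "registry:7"), ("registry:7", "cards:3")]
entities = group_records(records, pairs)
print(entities)
```

Here the first three records are linked transitively into one entity, while `registry:9` has no matches and stays on its own.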
For Example
Suppose Enigma observes information for a given business - let’s say, Enigma Technologies - across a dozen data sources. In each source, the metadata identifying the business as Enigma Technologies may look somewhat different. In a simple example, the entity name and address might have slight variations:
Data Source A
Enigma Technologies
245 5th Ave, New York, NY
Year Incorporated: 2011
Data Source B
Enigma Tech Inc.
245 5th Ave, 17th Fl, New York, NY 10016
Industry: Computer Software
Enigma’s entity resolution process runs machine learning models that determine whether these two records - the first from Data Source A and the second from Data Source B - do in fact represent the same business.
In this case, the models would resolve the two records into a single business profile representing Enigma Technologies. This ensures that the attributes coming from data sources A and B - Year Incorporated and Industry, respectively - are associated with a single, correct profile for Enigma.
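To make the merge step concrete, the sketch below combines the two example records into one profile once they have been judged to be the same business. It is a simplified illustration under assumed names: the record layout, the field names, and `merge_records` are hypothetical, and keeping every observed value alongside its source is just one possible design.

```python
def merge_records(records):
    """Combine matched records into one profile, keeping each value and its source."""
    profile = {}
    for rec in records:
        for field, value in rec["attributes"].items():
            profile.setdefault(field, []).append((value, rec["source"]))
    return profile


# The two records from the example above, in a hypothetical layout.
record_a = {
    "source": "A",
    "attributes": {
        "name": "Enigma Technologies",
        "address": "245 5th Ave, New York, NY",
        "year_incorporated": 2011,
    },
}
record_b = {
    "source": "B",
    "attributes": {
        "name": "Enigma Tech Inc.",
        "address": "245 5th Ave, 17th Fl, New York, NY 10016",
        "industry": "Computer Software",
    },
}

profile = merge_records([record_a, record_b])
for field, values in profile.items():
    print(field, values)
```

The resulting profile carries both name and address variants, plus Year Incorporated from source A and Industry from source B.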
How the Model Works
Business records are compared using a random forest model that considers factors such as:
- What is the string distance between the full/cleaned company names?
- What are the semantic similarities between the company names?
- What are the string distances between the full addresses and address components?
- What are the shared tokens between company names and addresses?
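The sketch below shows how comparison features like these might be computed for one pair of records using only Python's standard library. It is an assumption-laden illustration, not Enigma's feature set: the function names, the suffix list, and the choice of `difflib.SequenceMatcher` as the string-similarity measure are all hypothetical. Semantic similarity is left out because it would require a language model or embedding lookup.

```python
import re
from difflib import SequenceMatcher

# Illustrative list of legal suffixes to strip when cleaning a name.
LEGAL_SUFFIXES = {"inc", "llc", "corp", "co", "ltd"}


def tokens(text):
    """Lowercase alphanumeric tokens of a string."""
    return re.findall(r"[a-z0-9]+", text.lower())


def clean_name(name):
    """Company name with punctuation and legal suffixes removed."""
    return " ".join(t for t in tokens(name) if t not in LEGAL_SUFFIXES)


def similarity(a, b):
    """String similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a, b):
    """Jaccard overlap of the token sets of two strings."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pair_features(rec_a, rec_b):
    """Feature dictionary for one candidate pair of business records."""
    return {
        "name_similarity": similarity(rec_a["name"], rec_b["name"]),
        "clean_name_similarity": similarity(
            clean_name(rec_a["name"]), clean_name(rec_b["name"])
        ),
        "address_similarity": similarity(rec_a["address"], rec_b["address"]),
        "name_token_overlap": token_overlap(rec_a["name"], rec_b["name"]),
        "address_token_overlap": token_overlap(rec_a["address"], rec_b["address"]),
    }


rec_a = {"name": "Enigma Technologies", "address": "245 5th Ave, New York, NY"}
rec_b = {
    "name": "Enigma Tech Inc.",
    "address": "245 5th Ave, 17th Fl, New York, NY 10016",
}
features = pair_features(rec_a, rec_b)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

Feature values like these, computed for many labeled pairs, are the kind of input a classifier such as scikit-learn's `RandomForestClassifier` could be trained on.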
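The model then predicts whether the records represent the same business entity. The model is set to achieve a precision of 97% or higher in the resolution process, meaning that at least 97% of the record pairs it resolves as the same business are in fact the same business. Enigma validates this precision on an ongoing basis using human labeling.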
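As a worked illustration of what a precision figure means, the sketch below computes precision from a sample of pairs that have both a model prediction and a human label. The sample counts are invented for the example and the function name is hypothetical. Precision is the share of pairs the model called a match that the human labelers also called a match.

```python
def precision(labeled_pairs):
    """Precision of match predictions.

    labeled_pairs is a list of (model_says_match, human_says_match) booleans.
    Returns None when the model predicted no matches at all.
    """
    true_pos = sum(1 for pred, truth in labeled_pairs if pred and truth)
    false_pos = sum(1 for pred, truth in labeled_pairs if pred and not truth)
    if true_pos + false_pos == 0:
        return None
    return true_pos / (true_pos + false_pos)


# Invented sample: 98 correct matches, 2 incorrect matches,
# 5 missed matches, and 20 correct non-matches.
sample = (
    [(True, True)] * 98
    + [(True, False)] * 2
    + [(False, True)] * 5
    + [(False, False)] * 20
)
observed = precision(sample)
meets_target = observed >= 0.97
print(f"precision = {observed:.2%}, meets 97% target: {meets_target}")
```

Missed matches do not lower precision. They affect recall, which is a separate measure.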