Enigma’s SMB data is integrated from a wide variety of disparate sources. Once ingested, we map the data to our SMB ontology, i.e. our data model for the small-business world. We then process the data through our unique data pipeline to produce the robust business records that you access via the Businesses API.
A crucial step in this process is what we call Entity Resolution. A given business in the real world may have several different representations across the disparate data sources we compile. Entity resolution ensures that these separate “instances” of a business are combined into a single record containing the full set of attributes observed in the underlying data.
Entity Resolution is the name for the process of linking and managing the identity of records between datasets that come from different sources. When our algorithm decides that a set of records all refer to the same business, we create an identifier that refers to the underlying real-world entity. The process of managing this identity as we link or unlink records is Entity Resolution.
Entity Resolution for our SMB data means taking pairs of business records and comparing their known information to see if they are referring to the same entity. These record pairs of companies come from our various source data sets, each containing a different view into the characteristics and activity of a business.
For instance, one data set will tell you if the business is associated with a firearms license, another will say if it’s legally registered as a corporation, another will indicate whether it has been permanently closed, and so forth. By connecting pairs of business records that refer to the same entity, we are able to get a linked view of the known activity of a business and construct a full business profile.
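One way to picture how pairwise links become a single entity is to group records transitively: if record A matches B and B matches C, all three end up in one cluster. The sketch below uses a small union-find structure to do that. It is only an illustration under that transitivity assumption; the function name `cluster_records` and the record IDs are hypothetical and do not describe Enigma's internal implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_records(
    record_ids: Iterable[str],
    matched_pairs: Iterable[Tuple[str, str]],
) -> List[Set[str]]:
    """Group record IDs into clusters given pairs judged to be the same entity."""
    ids = list(record_ids)
    parent: Dict[str, str] = {rid: rid for rid in ids}

    def find(rid: str) -> str:
        # Walk up to the cluster root, shortening the path along the way.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # link the two clusters

    clusters: Dict[str, Set[str]] = defaultdict(set)
    for rid in ids:
        clusters[find(rid)].add(rid)
    return list(clusters.values())


# Hypothetical record IDs from four sources; only the first three are linked.
clusters = cluster_records(
    ["a1", "b1", "c1", "d1"],
    [("a1", "b1"), ("b1", "c1")],
)
```

Here `a1`, `b1`, and `c1` land in one cluster even though `a1` and `c1` were never compared directly, while `d1` remains its own entity.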
Suppose we observe information for a given business - let’s say, Enigma Technologies - across a dozen data sources. In each, the metadata identifying the business as Enigma Technologies may look somewhat different. In a simple example, the entity name and address might have slight variations:
Data Source A
245 5th Ave, New York, NY
Year Incorporated: 2011

Data Source B
Enigma Tech Inc.
245 5th Ave, 17th Fl, New York, NY 10016
Industry: Computer Software
Our entity resolution process runs machine learning models that determine whether these two records - the first from Data Source A and the second from Data Source B - do in fact represent the same business.
In this case, the models would resolve the two records into a single business profile representing Enigma Technologies. This ensures that the attributes coming from data sources A and B - Year Incorporated and Industry, respectively - are associated with a single, correct profile for Enigma.
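As a rough sketch of what combining resolved records might look like, the snippet below merges the attributes of the two example records into one profile. The field names and the "first non-empty value wins" rule are assumptions made for illustration only, not Enigma's actual merge logic.

```python
from typing import Dict, List


def merge_records(records: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine attributes from records already resolved to the same business.

    Illustrative rule: the first non-empty value seen for an attribute is kept.
    """
    profile: Dict[str, str] = {}
    for record in records:
        for key, value in record.items():
            if value and key not in profile:
                profile[key] = value
    return profile


# The two example records, using hypothetical field names.
source_a = {
    "address": "245 5th Ave, New York, NY",
    "year_incorporated": "2011",
}
source_b = {
    "name": "Enigma Tech Inc.",
    "address": "245 5th Ave, 17th Fl, New York, NY 10016",
    "industry": "Computer Software",
}

profile = merge_records([source_a, source_b])
```

The resulting `profile` carries both `year_incorporated` (from Data Source A) and `industry` (from Data Source B). A production system would likely need a more careful rule for conflicting values such as the two address variants.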
As a result, all the attributes we compile and compute for Enigma can be pulled simply by calling the business in our API.
(Recall that we initially assumed that Enigma Technologies appears in a dozen data sources. Without entity resolution, we would have 12 distinct Enigma records! This would make our business data pretty unwieldy.)
Business records are compared using a random forest model (similar to our Match Models) that considers factors such as:
- What is the string distance between the full/cleaned company names?
- What are the semantic similarities between the company names?
- What are the string distances between the full addresses and address components?
- What are the shared tokens between company names and addresses?
The model will then predict whether the records represent the same business entity. We consider two business records to be the same if they have a functionally equivalent company name and USPS address.
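To make the kinds of factors listed above more concrete, here is a minimal sketch of pairwise features computed with the Python standard library. It uses `difflib.SequenceMatcher` as a stand-in for string distance and Jaccard overlap for shared tokens; both choices, the function names, and the name used for the first record are assumptions for illustration. Semantic similarity and the random forest itself are omitted, and the real feature set may differ.

```python
import re
from difflib import SequenceMatcher
from typing import Dict, Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def string_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pair_features(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> Dict[str, float]:
    """Feature values that a classifier could consume for one record pair."""
    return {
        "name_similarity": string_similarity(rec_a["name"], rec_b["name"]),
        "address_similarity": string_similarity(rec_a["address"], rec_b["address"]),
        "name_token_overlap": token_overlap(rec_a["name"], rec_b["name"]),
        "address_token_overlap": token_overlap(rec_a["address"], rec_b["address"]),
    }


# The name for record A is assumed here for the sake of the example.
record_a = {"name": "Enigma Technologies", "address": "245 5th Ave, New York, NY"}
record_b = {
    "name": "Enigma Tech Inc.",
    "address": "245 5th Ave, 17th Fl, New York, NY 10016",
}

features = pair_features(record_a, record_b)
```

Each value falls between 0 and 1, so the features can be passed as a fixed-length vector to whatever model makes the final same-or-different prediction.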
As we link various records to one another via Entity Resolution, a unique Enigma ID is assigned to the resulting business profile. This ID is then persisted in our database and can be used to retrieve data about a given business using the ID Endpoint of the Businesses API.
Occasionally, business profiles can be merged or split based on new information; in such cases, we persist the Enigma ID so that future references to old IDs will still be valid, and simply map to the current correct ID.
We think of the Enigma ID like a bookmark to a website. The content of the website might change, and even the page that you land on when you click the bookmark might change, but the intended reference will stay the same.
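The idea of old IDs mapping to the current correct ID can be sketched as a redirect table that is followed until a live ID is reached. The `IdRegistry` class, its method names, and the ID strings below are hypothetical; this is a minimal illustration of the merge case only, not a description of how Enigma stores or resolves IDs.

```python
from typing import Dict, Set


class IdRegistry:
    """Keeps retired IDs resolvable by redirecting them to a surviving ID."""

    def __init__(self) -> None:
        self._redirects: Dict[str, str] = {}

    def merge(self, retired_id: str, surviving_id: str) -> None:
        """Record that retired_id now refers to surviving_id."""
        self._redirects[retired_id] = surviving_id

    def resolve(self, enigma_id: str) -> str:
        """Follow redirects until reaching an ID that has not been retired."""
        seen: Set[str] = set()
        while enigma_id in self._redirects:
            if enigma_id in seen:
                raise ValueError("redirect cycle detected")
            seen.add(enigma_id)
            enigma_id = self._redirects[enigma_id]
        return enigma_id


registry = IdRegistry()
registry.merge("B2", "B1")  # profile B2 was merged into B1
registry.merge("B1", "B0")  # later, B1 was merged into B0

current_id = registry.resolve("B2")
```

In this sketch a caller holding the old ID `B2` still reaches the current profile `B0`, and an ID that was never retired resolves to itself.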