A crucial step in any SMB data integration is matching - the notion of identifying and surfacing the right business from Enigma’s records. When you hit the Businesses API Match endpoint, our system runs match models in the background to find the SMB record that most closely matches your API query. This page provides more detail on why the match models are so important and how they work.
Essentially, matching means that you don’t need perfect information about a business to find the correct SMB record in Enigma’s data. All you have to do is provide a combination of the 3 required inputs - business name, address, and associated person - and our match models will identify the best result. The models are optimized for the relevance of the data returned: when you see a match, you can be very confident that it’s the right business, and when there are no relevant results, no match is returned.
When we return a match, the business details are accompanied by a match_confidence score that indicates, on a scale of 0 to 1, how closely the business record matched your input. In the case of a match, the score helps you interpret the result - whether you can be fully confident that you’ve identified the right business, or whether some additional investigation may be required. A lower score can also help you identify incorrect data in your inputs (for example, typos or wrong business addresses).
By default, the Businesses API considers a match to be a result with a score of at least 0.5.
When you make a request to the Match endpoint, we first perform a simple search against our database using the inputs provided. This initial search is conducted based on how closely your inputs matched business identifiers in our data, as well as how rare those identifiers are in the set of records. This generates a list of records that can now be filtered further for matching.
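To make the idea of weighting by rarity concrete, here is a minimal sketch of a candidate search that scores records by the tokens they share with the query, giving rarer tokens more weight. It is an illustration only, not Enigma's implementation: the function names, the inverse-document-frequency weighting, and the sample records are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Split text into lowercase whitespace-separated tokens."""
    return text.lower().split()


def build_idf(records):
    """Weight each token by how rare it is across the record set (assumed scheme)."""
    doc_freq = Counter()
    for rec in records:
        doc_freq.update(set(tokenize(rec)))
    n = len(records)
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def search(query, records, idf, top_k=3):
    """Rank records by the summed rarity weight of the tokens shared with the query."""
    q_tokens = set(tokenize(query))
    scored = []
    for rec in records:
        shared = q_tokens & set(tokenize(rec))
        score = sum(idf[tok] for tok in shared)
        if score > 0:
            scored.append((score, rec))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:top_k]]


# Made-up records for illustration only.
records = [
    "Acme Coffee LLC 12 Main St Springfield",
    "Acme Hardware Inc 40 Main St Springfield",
    "Zephyr Bakery 7 Oak Ave Shelbyville",
]
idf = build_idf(records)
candidates = search("zephyr bakery main st", records, idf)
```

In this toy data, "zephyr" and "bakery" each appear in one record while "main" and "st" appear in two, so the bakery record ranks first even though every record shares two tokens with the query.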
We now refine the initial result set by going through each of the top outputs and using a machine-learning model to ask whether the surfaced business truly matches the one specified in your input. More specifically, each record pair - your search input and an Enigma SMB record - goes through a model that compares the pair using questions such as:
- What is the string distance between the full/cleaned company names?
- What are the semantic similarities between the company names?
- What are the string distances between the full addresses and address components?
- What are the shared tokens between company names and addresses?
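The questions above can be sketched as simple pairwise features. This is a hedged illustration using only Python's standard library, not the features Enigma actually computes: the dictionary keys, the cleaning rule, and the use of a similarity ratio in place of a string distance are assumptions, and semantic similarity is left out because it would require an embedding model.

```python
from difflib import SequenceMatcher


def clean(text):
    """Lowercase and drop punctuation so comparisons ignore formatting noise."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return " ".join(kept.split())


def string_similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the cleaned strings are identical."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def token_overlap(a, b):
    """Jaccard overlap between the token sets of two strings."""
    ta, tb = set(clean(a).split()), set(clean(b).split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pair_features(query, record):
    """Build one feature dictionary for a (query, record) pair."""
    return {
        "name_similarity": string_similarity(query["name"], record["name"]),
        "address_similarity": string_similarity(query["address"], record["address"]),
        "name_token_overlap": token_overlap(query["name"], record["name"]),
        "address_token_overlap": token_overlap(query["address"], record["address"]),
    }


# Hypothetical input and record, for illustration only.
query = {"name": "Acme Coffee, LLC", "address": "12 Main St."}
record = {"name": "ACME COFFEE LLC", "address": "12 Main Street"}
features = pair_features(query, record)
```

Here the names are identical once cleaned, while the addresses differ only in "St." versus "Street", so the address features come out lower than the name features without dropping to zero.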
Next, the answers to these questions are turned into features that are fed into a neural network model. (The model is trained and tuned using vast quantities of data that have been labeled to indicate true vs. false matches.) The model then returns its best prediction for the true matching entity - this is the business you obtain in response to your Match endpoint request.
Note that this entire process completes in under a second!
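The final selection step can be pictured as choosing the highest-scoring candidate and returning nothing when that score falls below the match threshold. The sketch below assumes each candidate has already been scored by a model. The candidate dictionaries and the select_match function are hypothetical; only the match_confidence field name and the 0.5 default come from this page.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.5  # default match threshold described on this page


def select_match(scored_candidates, threshold=DEFAULT_THRESHOLD) -> Optional[dict]:
    """Return the highest-scoring candidate, or None if it is below the threshold."""
    if not scored_candidates:
        return None
    best = max(scored_candidates, key=lambda c: c["match_confidence"])
    return best if best["match_confidence"] >= threshold else None


# Hypothetical scored candidates, for illustration only.
scored = [
    {"name": "Acme Coffee LLC", "match_confidence": 0.93},
    {"name": "Acme Hardware Inc", "match_confidence": 0.41},
]
best = select_match(scored)          # the 0.93 candidate clears the threshold
no_match = select_match(scored[1:])  # 0.41 is below 0.5, so nothing is returned
```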
As discussed above, by default a match is considered to be a result whose match_confidence score is at least 0.5. How did we come up with this threshold? We did a lot of QA work to determine the effect of different match thresholds on prediction accuracy, precision, and recall. After extensive research, we determined that a threshold of 0.5 results in the greatest accuracy while also maintaining the optimal precision-recall tradeoff.
For those less familiar with predictive analytics, here’s a breakdown of these terms:
- Accuracy: The share of observations classified correctly (as matches vs. non-matches)
- Precision: The share of predicted matches that are in fact true matches
- Recall: The share of true matches that are classified as such
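To see how these three metrics respond to the choice of threshold, here is a small sketch that computes them from scored pairs with known labels. The scores and labels are made up for illustration and say nothing about Enigma's actual evaluation data; the function name is likewise an assumption.

```python
def classification_metrics(scores, labels, threshold=0.5):
    """Compute accuracy, precision, and recall at a given match threshold."""
    predicted = [s >= threshold for s in scores]
    pairs = list(zip(predicted, labels))
    tp = sum(p and y for p, y in pairs)              # predicted match, true match
    fp = sum(p and not y for p, y in pairs)          # predicted match, not a match
    fn = sum((not p) and y for p, y in pairs)        # missed true match
    tn = sum((not p) and (not y) for p, y in pairs)  # correctly rejected
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up scores and ground-truth labels (True = real match).
scores = [0.95, 0.80, 0.60, 0.45, 0.30, 0.10]
labels = [True, True, False, True, False, False]

at_default = classification_metrics(scores, labels)                # threshold 0.5
at_strict = classification_metrics(scores, labels, threshold=0.9)
at_loose = classification_metrics(scores, labels, threshold=0.4)
```

On this toy data, raising the threshold to 0.9 leaves only one predicted match, which is correct, so precision rises while recall falls. Lowering it to 0.4 recovers every true match at the cost of keeping one false match.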
Recall that the Match endpoint of the Businesses API is configurable. While 0.5 presents the optimal tradeoff, the threshold can be adjusted on a call-by-call basis. Additionally, you can choose whether to show non-matches and/or return multiple results.
For more information on this, refer to the Match Endpoint guide.