Run a low latency match (LLM) job to find matches on records or on associated clusters of records.
The request body is a series of records to match, separated by newlines, which have the same attributes as the unified dataset for the project. The request will look the same whether you are matching records or clusters. For instance:
{"recordId":"8793219","record":{"NAME":["MANNY'S CAR WASH"],"CITY":["OAKLAND"],"ZIP":["94603"],"PHONE":["5556325115"],"STATE_CODE":["CA", "MA"]}}
{"recordId":"8800364","record":{"NAME":["BEST BEAUTY SALON"],"CITY":["SONORA"],"ZIP":["95370"],"PHONE":["5555324000"],"STATE_CODE":["CA"]}}
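As a minimal sketch of how such a body could be assembled, the Python below writes one JSON object per line and wraps each attribute value in a list, as in the example above. The helper name `build_request_body` is hypothetical and not part of Tamr, and the trailing newline is an assumption.

```python
import json

def build_request_body(records):
    """Serialize (record_id, attributes) pairs as newline-delimited JSON.

    Hypothetical helper: scalar attribute values are wrapped in a list so
    that every attribute is an array, as in the example request body.
    """
    lines = []
    for record_id, attributes in records:
        payload = {
            "recordId": str(record_id),
            "record": {
                name: value if isinstance(value, list) else [value]
                for name, value in attributes.items()
            },
        }
        # json.dumps escapes any newline inside a value, so each record
        # always occupies exactly one line.
        lines.append(json.dumps(payload, separators=(",", ":")))
    return "\n".join(lines) + "\n"

body = build_request_body([
    ("8793219", {"NAME": "MANNY'S CAR WASH", "CITY": "OAKLAND", "STATE_CODE": ["CA", "MA"]}),
    ("8800364", {"NAME": "BEST BEAUTY SALON", "CITY": "SONORA", "STATE_CODE": "CA"}),
])
print(body, end="")
```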
Requirements for Streaming Records
If you need to provide multiple records as an input, or stream records, use these tips:
- Swagger endpoints available within Tamr do not support a streaming response. To add multiple input records, use curl for this endpoint.
- When making an LLM match request with curl, use the '--data-binary' option instead of the '-d' option, because '-d' strips the newlines that separate records when it reads the body from a file.
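For clients that do not use curl, the same idea (send the body bytes unchanged, then read the response line by line) can be sketched in Python with the standard library. The URL, the credentials, and the `Content-Type` value below are placeholders and assumptions, not documented Tamr values; `make_match_request` and `iter_response_lines` are hypothetical helper names.

```python
import io
import urllib.request

def make_match_request(url, body, auth_header):
    """Build a POST request that sends the body bytes unchanged.

    `url` and `auth_header` are supplied by the caller; this sketch does
    not assume any particular endpoint path or credential scheme.
    """
    data = body.encode("utf-8")  # newlines between records are preserved
    return urllib.request.Request(
        url,
        data=data,
        method="POST",
        headers={
            "Content-Type": "application/json",  # assumption, verify for your deployment
            "Authorization": auth_header,
        },
    )

def iter_response_lines(response):
    """Yield decoded, non-empty lines from a file-like streaming response."""
    for raw in response:
        line = raw.decode("utf-8").strip()
        if line:
            yield line

body = '{"recordId":"1","record":{"NAME":["A"]}}\n{"recordId":"2","record":{"NAME":["B"]}}\n'
request = make_match_request("https://tamr.example.com/placeholder", body, "placeholder-credentials")

# A BytesIO object stands in for the HTTP response so the sketch runs offline.
fake_response = io.BytesIO(b'{"queryRecordId":"1"}\n\n{"queryRecordId":"2"}\n')
received = list(iter_response_lines(fake_response))
print(request.get_method(), len(received))
```

Against a real server, the object returned by `urllib.request.urlopen(request)` can be passed to `iter_response_lines` in place of the stand-in.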
Response Fields
The response body looks different depending on whether the posted records were matched against records or clusters.
Output
Matching record information is returned as a response stream, so matches are returned as soon as the first batch of match records is processed. For records, the response is similar to the following example:
{"queryRecordId":"8793219","matchedRecordId":"7117244409972542111","matchedOriginSourceId":"source1.csv","matchedOriginRecordId":"rec-654-org","suggestedLabel":"MATCH","suggestedLabelConfidence":1.0,"attributeSimilarities":{"name_default_cosine":1.0,"city_default_cosine":1.0,"phone_default_cosine":1.0}}
{"queryRecordId":"8800364","matchedRecordId":"7117244409972542111","matchedOriginSourceId":"source1.csv","matchedOriginRecordId":"rec-6541-org","suggestedLabel":"NON_MATCH","suggestedLabelConfidence":1.0,"attributeSimilarities":{"name_default_cosine":1.0,"city_default_cosine":0.0,"phone_default_cosine":1.0}}
For clusters of records, the response looks similar to this example:
{"entityId": "8793219", "clusterId": "c3", "avgMatchProb": 0.73}
{"entityId": "8800364", "clusterId": "c2", "avgMatchProb": 0.89}
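Because both response shapes arrive as one JSON object per line, a client can parse each line as it is received. The sketch below tells the two shapes apart by the keys shown in the examples above; `parse_match_line` is a hypothetical helper, and the sample lines are shortened copies of those examples.

```python
import json

def parse_match_line(line):
    """Parse one response line and tag it as a record or cluster result."""
    result = json.loads(line)
    if "clusterId" in result:
        kind = "cluster"
    elif "matchedRecordId" in result:
        kind = "record"
    else:
        kind = "unknown"  # shape not covered by the examples above
    return kind, result

sample_lines = [
    '{"queryRecordId":"8793219","matchedRecordId":"7117244409972542111","suggestedLabel":"MATCH","suggestedLabelConfidence":1.0}',
    '{"entityId": "8793219", "clusterId": "c3", "avgMatchProb": 0.73}',
]
parsed = [parse_match_line(line) for line in sample_lines]
for kind, result in parsed:
    print(kind, result)
```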
Record Parameters
Field | Description |
---|---|
queryRecordId | The ID of the record from the POST body. |
matchedRecordId | The Tamr ID of the record returned as a match. |
matchedOriginSourceId | The origin dataset of the record returned as a match. |
matchedOriginRecordId | The origin ID of the record returned as a match. |
suggestedLabel | MATCH or NON_MATCH for the record. |
suggestedLabelConfidence | The confidence level of the label. |
attributeSimilarities | A JSON object that maps each compared attribute to its similarity score. |
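One way a client might use these fields is to keep only confident matches and group them by query record. This is a sketch: `confident_matches` is a hypothetical helper, and the 0.9 threshold is an arbitrary illustration, not a recommended value.

```python
from collections import defaultdict

def confident_matches(results, min_confidence=0.9):
    """Group matched record IDs by query record, keeping confident MATCH rows."""
    grouped = defaultdict(list)
    for result in results:
        is_match = result["suggestedLabel"] == "MATCH"
        if is_match and result["suggestedLabelConfidence"] >= min_confidence:
            grouped[result["queryRecordId"]].append(result["matchedRecordId"])
    return dict(grouped)

# Rows shaped like the record response example, reduced to the fields used here.
results = [
    {"queryRecordId": "8793219", "matchedRecordId": "7117244409972542111",
     "suggestedLabel": "MATCH", "suggestedLabelConfidence": 1.0},
    {"queryRecordId": "8800364", "matchedRecordId": "7117244409972542111",
     "suggestedLabel": "NON_MATCH", "suggestedLabelConfidence": 1.0},
]
matches = confident_matches(results)
print(matches)
```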
Cluster Parameters
Field | Description |
---|---|
entityId | The ID of the record from the POST body. |
clusterId | The ID of the cluster the record was compared against. |
avgMatchProb | The average of the matching probability for the record against each record in the cluster. |
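The `avgMatchProb` description amounts to an arithmetic mean over the pair probabilities for one input record and one cluster. A minimal sketch follows, with made-up probabilities chosen only to illustrate the arithmetic; `avg_match_prob` is a hypothetical name.

```python
def avg_match_prob(pair_probabilities):
    """Average the match probabilities of one record against a cluster's records.

    Returns None for an empty cluster rather than dividing by zero.
    """
    if not pair_probabilities:
        return None
    return sum(pair_probabilities) / len(pair_probabilities)

# Illustrative inputs only: one input record compared with three cluster members.
example = avg_match_prob([0.5, 0.75, 1.0])
print(example)
```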
API Properties
- Request Type: Synchronous. Match requests use the Mastering project's most recent model.
- Request Processing: Streaming
- Response Processing: Streaming
- Implementation Details: The following datasets are materialized:
  - Features of unified source (tokens, parsed numbers)
  - Binning data of the unified source
  - Clustering of the unified source
Steps in the LLM Process
The matching operation performs these steps:
- Runs pre-processing for similarity functions. Tokenizes records by treating numbers as numbers and converting text records to tokens. Bins records.
- Generates record pairs, that is, generates pairs of (input-record, existing-record) that pass the binning model.
- Predicts match or no match using the current matching model:
  - If using a 'record' match, rolls up pair match probabilities to get input record, or existing cluster associations.
  - If using a 'cluster' match, for each input record, selects the existing cluster with the highest similarity.
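The steps above can be illustrated with a toy pipeline. Everything in this sketch is a stand-in: the binning rule (share at least one token) and the Jaccard score replace Tamr's binning model and trained matching model, whose internals this page does not describe, and all function names are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(value):
    """Split text into lowercase tokens; purely numeric tokens become ints."""
    tokens = re.findall(r"[a-z0-9]+", value.lower())
    return {int(t) if t.isdigit() else t for t in tokens}

def jaccard(a, b):
    """Stand-in pair score: token overlap, NOT a trained matching model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def generate_pairs(input_tokens, existing_tokens):
    """Toy binning: keep existing records that share at least one token."""
    return [rid for rid, toks in existing_tokens.items() if input_tokens & toks]

def best_cluster(input_value, existing, clusters):
    """Return (cluster_id, average score) for the highest-scoring cluster.

    Only records that pass the toy binning step contribute to an average.
    Returns None when no existing record passes binning.
    """
    input_tokens = tokenize(input_value)
    existing_tokens = {rid: tokenize(text) for rid, text in existing.items()}
    scores = defaultdict(list)
    for rid in generate_pairs(input_tokens, existing_tokens):
        scores[clusters[rid]].append(jaccard(input_tokens, existing_tokens[rid]))
    if not scores:
        return None
    averages = {cid: sum(vals) / len(vals) for cid, vals in scores.items()}
    return max(averages.items(), key=lambda item: item[1])

# Hypothetical existing records and their cluster assignments.
existing = {
    "r1": "Manny's Car Wash Oakland",
    "r2": "Manny Car Wash",
    "r3": "Best Beauty Salon Sonora",
}
clusters = {"r1": "c3", "r2": "c3", "r3": "c2"}

result = best_cluster("MANNY'S CAR WASH OAKLAND 94603", existing, clusters)
print(result)
```

The sketch keeps tokenizing, binning, scoring, and roll-up in separate functions so that each stand-in can be swapped out independently.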