August 15, 2023

Scenario: Creating a clean data set from the suspect pairs labeled by tMatchPredict and the unique rows computed by tMatchPairing – Docs for ESB 6.x

This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.

In this example, there are two sources of input data:

  • The suspect records labeled as duplicates and grouped by
    tMatchPredict.

    You can find an example of how to label suspect
    pairs with assigned labels on Talend Help Center (https://help.talend.com).

  • The unique rows computed by tMatchPairing.

    You can find examples of how to compute unique rows
    from source data on Talend Help Center (https://help.talend.com).

The use case described here uses two subjobs:

  • In the first subjob, tRuleSurvivorship processes the
    records labeled as duplicates and grouped by
    tMatchPredict, to create a single
    representation of each group of duplicates.

  • In the second subjob, tUnite merges the survivors and
    the unique rows to create a clean and deduplicated data set to be used with
    the tMatchIndex component.

The output file contains the clean and deduplicated data. You can index this reference data
set in Elasticsearch using the tMatchIndex component.
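
The data flow of the two subjobs can be sketched outside Talend Studio in plain Python. This is only a conceptual illustration, not Talend code or a Talend API: the field names (group_id, name, email), the function names, and the "keep the most complete record" rule are all assumptions made for the example. In the actual Job, the survivorship rules are whatever you configure in tRuleSurvivorship.

```python
from collections import defaultdict


def pick_survivor(group):
    """Hypothetical survivorship rule: keep the record with the most
    non-empty fields. This stands in for the rules that would be
    configured in tRuleSurvivorship; on a tie, the first record wins."""
    return max(
        group,
        key=lambda rec: sum(1 for v in rec.values() if v not in (None, "")),
    )


def build_clean_data_set(suspect_records, unique_rows, group_key="group_id"):
    """Subjob 1: reduce each group of duplicates to one survivor.
    Subjob 2: append the unique rows to the survivors (the tUnite step)."""
    groups = defaultdict(list)
    for rec in suspect_records:
        groups[rec[group_key]].append(rec)

    survivors = []
    for members in groups.values():
        survivor = dict(pick_survivor(members))
        survivor.pop(group_key, None)  # the group id is not needed downstream
        survivors.append(survivor)

    return survivors + list(unique_rows)


# Illustrative input: duplicates already labeled and grouped (assumed schema).
suspects = [
    {"group_id": "g1", "name": "Ann Lee", "email": ""},
    {"group_id": "g1", "name": "Ann Lee", "email": "ann@example.com"},
    {"group_id": "g2", "name": "Bob Roy", "email": "bob@example.com"},
    {"group_id": "g2", "name": "Bob Roy", "email": None},
]
# Illustrative input: rows that had no suspected duplicate (assumed schema).
uniques = [{"name": "Cy Tan", "email": "cy@example.com"}]

clean = build_clean_data_set(suspects, uniques)
for row in clean:
    print(row)
```

Each group of duplicates contributes exactly one row to the result and every unique row passes through unchanged, which is the property the reference data set needs before it is indexed.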


Source: Talend Help Center (https://help.talend.com)