Scenario: Creating a clean data set from the suspect pairs labeled by tMatchPredict and the
unique rows computed by tMatchPairing
This scenario applies only to a subscription-based Talend Platform solution with Big Data or to Talend Data Fabric.
This scenario uses two sets of input data:
- The suspect records labeled as duplicates and grouped by tMatchPredict.
  You can find an example of how to label suspect pairs with assigned labels on Talend Help Center (https://help.talend.com).
- The unique rows computed by tMatchPairing.
  You can find examples of how to compute unique rows from source data on Talend Help Center (https://help.talend.com).
The Job consists of two subjobs:
- In the first subjob, tRuleSurvivorship processes the records labeled as duplicates and grouped by tMatchPredict, to create a single representation (the survivor) of each group of duplicates.
- In the second subjob, tUnite merges the survivors and the unique rows to create a clean and deduplicated data set to be used with the tMatchIndex component.
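The logic of the two subjobs can be sketched outside Talend Studio as follows. This is only an illustration in plain Python, not the code generated by the components: the column names (group_id, name, city), the sample records, and the "keep the longest non-empty value" rule are assumptions made for the example. In a real Job, the survivorship rules are the ones you define in tRuleSurvivorship.

```python
from collections import defaultdict


def build_survivors(duplicates, group_key="group_id"):
    """Collapse each group of duplicates into a single survivor record.

    Hypothetical rule used here: for every column, keep the longest
    non-empty value found in the group.
    """
    # Group the labeled records by their group identifier.
    groups = defaultdict(list)
    for record in duplicates:
        groups[record[group_key]].append(record)

    survivors = []
    for members in groups.values():
        survivor = {}
        for column in members[0]:
            if column == group_key:
                continue  # the group identifier is not part of the clean data
            values = [m[column] for m in members if m.get(column)]
            survivor[column] = max(values, key=len) if values else ""
        survivors.append(survivor)
    return survivors


def unite(survivors, unique_rows):
    """Concatenate the survivors and the unique rows (second subjob)."""
    return survivors + unique_rows


# Hypothetical sample data standing in for the two input data sets.
duplicates = [
    {"group_id": "g1", "name": "J. Smith", "city": "Paris"},
    {"group_id": "g1", "name": "John Smith", "city": ""},
    {"group_id": "g2", "name": "Ann Lee", "city": "Lyon"},
    {"group_id": "g2", "name": "Ann Lee", "city": "Lyon"},
]
unique_rows = [{"name": "Bob Ray", "city": "Nice"}]

clean = unite(build_survivors(duplicates), unique_rows)
for row in clean:
    print(row)
```

Each group of duplicates contributes exactly one row to the result, and the unique rows pass through unchanged, so the merged data set no longer contains any of the labeled duplicates.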
The output file contains clean and deduplicated data. You can index this reference data
set in Elasticsearch using the tMatchIndex component.