Scenario 1: Computing suspect pairs and writing a sample in Talend Data Stewardship
This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.
Finding duplicate records is hard and time-consuming, especially when you are dealing with
a huge volume of data. In this example, tMatchPairing uses a
blocking key to compute the pairs of suspect duplicates in a long list of early
childhood education centers in Chicago coming from ten different sources.
It also computes a sample of the suspect duplicates and writes it in the form of tasks
into a Grouping campaign on the Talend Data Stewardship server.
Authorized data stewards can then review the data sample and decide whether the suspect
pairs are duplicates.
You can then use the labeled sample to compute a matching model and apply it on all
suspect duplicates in the context of machine learning on Spark.
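
The general idea behind a blocking key can be sketched in a few lines of Python. This is only a simplified illustration of blocking, not the actual algorithm used by tMatchPairing: the key function, the field names and the sample records below are all hypothetical. Records are grouped by a cheap key, and only records that share a key are compared, which avoids comparing every record with every other record.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the name (upper-cased) plus the ZIP code.
    return record["name"][:3].upper() + "|" + record["zip"]


def suspect_pairs(records):
    # Group the records into blocks that share the same blocking key.
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)

    # Only records inside the same block are paired as suspect duplicates.
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


# Made-up records, for illustration only.
centers = [
    {"id": 1, "name": "Little Stars Learning Center", "zip": "60608"},
    {"id": 2, "name": "Little Stars Lrng Ctr", "zip": "60608"},
    {"id": 3, "name": "Bright Horizons", "zip": "60611"},
    {"id": 4, "name": "Little Angels", "zip": "60614"},
]

pairs = suspect_pairs(centers)
pair_ids = [(a["id"], b["id"]) for a, b in pairs]
print(pair_ids)
```

With these records, only records 1 and 2 share a key, so they form the single suspect pair. Record 4 starts with the same letters but has a different ZIP code and therefore lands in another block.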
- You have been assigned the Campaign Owner role in Talend Administration Center, which grants you
  access to the campaigns on the server.
- You have created the Grouping campaign in Talend Data Stewardship and defined
  the schema which corresponds to the structure of the education centers file. For further information, see the online publication about
  Grouping campaigns on Talend Help Center (https://help.talend.com).