Scenario 2: Computing suspect pairs and suspect sample from source data
This scenario applies only to a subscription-based Talend Platform solution with Big Data or Talend Data Fabric.
In this example, tMatchPairing uses a blocking key to compute the
pairs of suspect duplicates in a list of early childhood education centers in
Chicago.
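The idea of a blocking key can be illustrated with a minimal Python sketch. This is not how tMatchPairing is implemented. It only shows the general technique of grouping records by a key and comparing records within the same group. The key definition (first three letters of the name plus the ZIP code), the field names `name` and `zip`, and the sample records are all hypothetical and do not come from the source file.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name plus the ZIP code.
    name = record["name"].strip().lower()
    return name[:3] + "|" + record["zip"]


def suspect_pairs(records):
    # Group record indexes by blocking key.
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        blocks[blocking_key(record)].append(index)

    # Only records that share a block are paired as suspect duplicates.
    pairs = []
    for indexes in blocks.values():
        pairs.extend(combinations(indexes, 2))
    return sorted(pairs)


# Made-up records for illustration only.
records = [
    {"name": "Little Stars Academy", "zip": "60601"},
    {"name": "Little Stars Acad.", "zip": "60601"},
    {"name": "Bright Horizons", "zip": "60614"},
    {"name": "Little Stars Academy", "zip": "60614"},
]

pairs = suspect_pairs(records)
print(pairs)  # [(0, 1)]
```

Because records are compared only inside a block, the number of candidate pairs stays far below the number needed to compare every record with every other record. The trade-off is that a poorly chosen key can separate true duplicates into different blocks, as happens here with the fourth record.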
The use case described here uses:
- a tFileInputDelimited component to read the source file, which contains a list of early childhood education centers in Chicago coming from ten different sources;
- a tMatchPairing component to pre-analyze the data, compute pairs of suspect duplicates and generate a pairing model which is used by the tMatchPredict component;
- three tFileOutputDelimited components to output the suspect duplicates, a sample of suspect pairs and the unique records; and
- a tLogRow component to output the exact duplicates.