August 15, 2023

Computing suspect pairs and unique rows – Docs for ESB 6.x

  1. Double-click the first tFileOutputDelimited
    component to display the Basic settings view and
    define the component properties.

    You already agreed to propagate the schema to the output
    components when you defined the input component.
  2. Clear the Define a storage configuration component
    check box to use the local system as your target file system.
  3. Click the […] button next to Edit schema and use the [+] button in the
    dialog box to add the columns from the reference data set to the schema.

    Append _ref to the name of each column you add for the
    suspect duplicates output. In this example, the added columns are
    Original_id_ref, Source_ref, Site_name_ref and Address_ref.

    use_case_tmatchindexpredict6.png

  4. In the Folder field, set the path to the folder
    which will hold the output data.
  5. From the Action list, select the operation for
    writing data:

    • Select Create when you run the Job for the
      first time.

    • Select Overwrite to replace the file every
      time you run the Job.

  6. Set the row and field separators in the corresponding fields.
  7. Select the Merge results to single file check box,
    and in the Merge file path field, set the path of the
    file that will hold the suspect record pairs.
  8. Double-click the second tFileOutputDelimited
    component and define the component properties in the Basic settings
    view, as you did for the first component.

    This component creates the file which holds the unique rows generated
    from the input data.


  9. Press F6 to save and execute the
    Job.
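
As a rough illustration of the naming rule in step 3 and the field separator in step 6, the following Python sketch derives the _ref column names and builds a header line for a delimited file. It is not code generated by the Studio. The unsuffixed column names are inferred from the example above, and the semicolon separator, the column order, and both function names are assumptions made for illustration; the actual output may contain additional columns.

```python
import csv
import io

# Input column names inferred from the example by dropping "_ref" (assumption).
INPUT_COLUMNS = ["Original_id", "Source", "Site_name", "Address"]


def with_ref_suffix(columns):
    """Return the reference-side column names: each input name plus "_ref"."""
    return [f"{name}_ref" for name in columns]


def suspect_pair_header(columns, field_separator=";"):
    """Build one header line: the input columns followed by their _ref columns.

    The separator and the column order are illustrative assumptions.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=field_separator, lineterminator="\n")
    writer.writerow(list(columns) + with_ref_suffix(columns))
    return buffer.getvalue()


if __name__ == "__main__":
    print(with_ref_suffix(INPUT_COLUMNS))
    print(suspect_pair_header(INPUT_COLUMNS), end="")
```

Keeping the suffix rule in one helper makes it easy to check that every reference column in the schema has a matching input column.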

tMatchIndexPredict groups each input record with its matching records
from the reference data set indexed in Elasticsearch, and labels the
resulting suspect pairs.

use_case_tmatchindexpredict7.png

tMatchIndexPredict excludes the unique records from the suspect pairs
and writes them to a separate file.

use_case_tmatchindexpredict8.png
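
The routing described above (records with matches go to the suspect pairs output, records without matches go to the unique rows output) can be pictured with the toy Python sketch below. It is not how tMatchIndexPredict is implemented: the component queries an Elasticsearch index and labels the pairs, whereas this sketch uses a hypothetical in-memory reference list and a hypothetical matcher, same_site_name, that only compares the Site_name value case-insensitively. Labeling is not modeled.

```python
def same_site_name(record, reference):
    """Hypothetical matcher: case-insensitive equality on Site_name."""
    return record["Site_name"].lower() == reference["Site_name"].lower()


def split_records(input_records, reference_records, is_suspect_match):
    """Route each input record to suspect pairs or to unique rows.

    A record with at least one matching reference record yields one
    (record, reference) pair per match; a record with no match is unique.
    """
    suspect_pairs = []
    unique_rows = []
    for record in input_records:
        matches = [ref for ref in reference_records if is_suspect_match(record, ref)]
        if matches:
            suspect_pairs.extend((record, ref) for ref in matches)
        else:
            unique_rows.append(record)
    return suspect_pairs, unique_rows


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    inputs = [{"Site_name": "Alpha"}, {"Site_name": "Beta"}]
    references = [{"Site_name": "alpha"}]
    pairs, unique = split_records(inputs, references, same_site_name)
    print(pairs)
    print(unique)
```

Passing the matcher as a parameter keeps the routing logic separate from the matching rule, which in the real Job is handled by the component itself.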

You can now clean and deduplicate the unique rows and use
tMatchIndex to add them to the reference data set stored in
Elasticsearch.


Source: Talend documentation, https://help.talend.com