July 30, 2023

tFuzzyUniqRow – Docs for ESB 7.x

tFuzzyUniqRow

Compares columns in the input flow by using a defined matching method and collects
the encountered duplicates.

tFuzzyUniqRow Standard properties

These properties are used to configure tFuzzyUniqRow running in the Standard Job framework.

The Standard
tFuzzyUniqRow component belongs to the Data Quality family.

This component is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

Basic settings

Schema and Edit schema

A schema is a row description, it defines the number of fields to be processed
and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

 

Built-in: You create and store the schema
locally for this component only. Related topic: see
Talend Studio User
Guide
.

 

Repository: You have already created the schema
and stored it in the Repository. Thus, you can reuse it in various projects and job
designs. Related topic: see
Talend Studio User
Guide
.

Column

List of all columns in the input flow.

Key attribute

Select the check boxes next to the columns you want to check.

Matching type

Select the relevant matching algorithm from the list:

Exact Match: matches each processed entry to all
possible reference entries with exactly the same value.

Levenshtein: Based on the edit distance theory. It
calculates the number of insertion, deletion or substitution required for an entry
to match the reference entry.

Metaphone: Based on a phonetic algorithm
for indexing entries by their pronunciation. It first loads the phonetics of
all entries of the lookup reference and checks all entries of the main flow
against the entries of the reference flow. It does not support Chinese characters.

Double Metaphone: a new version of the
Metaphone phonetic algorithm that produces more accurate results than the
original algorithm. It can return both a primary and a secondary code for a
string. This accounts for some ambiguous cases as well as for multiple variants
of surnames with common ancestry. It does not support Chinese characters.

Min. Distance

Only for Levenshtein. Set the minimum number of
changes allowed to match the reference. If set to 0, only perfect matches(Exact Match) are returned.

Max. Distance

Only for Levenshtein. Set the maximum number of
changes allowed to match the reference.

Advanced settings

tStat
Catcher
Statistics

Select this check box to collect log data at the component level.

Global Variables

Global Variables

NB_UNIQUES: the number of unique rows. This is an After
variable and it returns an integer.

NB_DUPLICATES: the number of duplicate rows. This is an
After variable and it returns an integer.

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see
Talend Studio

User Guide.

Usage

Usage rule

This component is not startable (green background) and it requires an input
component and two output components.

Comparing four columns using different matching methods and collecting
encountered duplicates

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a four-component job aiming at collecting in two separate files
all unique entries and all duplicate entries from few defined processed columns based on the
Levenshtein and Double Metaphone matching types.

The input file in this example looks like the following:

Setting up the Job

  1. Drop tFileInputDelimited, tFuzzyUniqRow, and two tFileOutputExcel
    components from the Palette onto the design
    workspace.
  2. Connect tFileInputDelimited to tFuzzyUniqRow using the Main
    link, tFuzzyUniqRow to the tFileOutputExcel components using the Uniques link with one and the Duplicates
    link with the other.

    tFuzzyUniqRow_1.png

Configuring the components

  1. Double-click tFileInputDelimited to open its
    Basic settings view and define its properties.

    tFuzzyUniqRow_2.png

  2. Click the three-dot button next to the File Name
    field to browse to the file holding the input data.
  3. If needed, set Header, Footer, and Limit.

    For this scenario, set Header to 1. Footer and
    limit for the number of processed rows are not set.
  4. Click Edit schema to open a dialog box where you
    can describe the data structure of the source delimited file.

    tFuzzyUniqRow_3.png

    In this scenario, the source schema is made of the following columns: ID, Status, FirstName, Email,
    City, Initial,
    and ZipCode.
  5. Double click tFuzzyUniqRow to display its Basic settings view and define its properties.
  6. In the Key Attribute column, select the check boxes
    next to the columns you want to check using the defined matching method,
    Firstname, Email, City,
    and ZipCode in this example.

    tFuzzyUniqRow_4.png

  7. In the Matching Type column, set the matching
    methods you want to use on each of the selected columns.

    In this example, Leveshtein is to be used as the matching
    method for the FirstName, Email, and
    ZipCode columns, Double Metaphone is to be
    used as the matching method for the City column.
    Then set the minimum and maximum distances for the Levenshtein
    method. In this method, the distance is the number of character changes (insertion,
    deletion or substitution) that needs to be carried out in order for the entry to fully
    match the reference. In this example, we want the min. distance to be 0 and the max.
    distance to be 2. This will output all entries in the FirstName,
    Email, and ZipCode columns that exactly
    match or that have maximum two character changes. There is no minimum nor maximum
    distance to set for Double Metaphone because this matching method
    is based on phonetic discrepancies in the input data.
  8. Double click the first tFileOutputExcel to display
    its Basic settings view and define its
    properties.

    tFuzzyUniqRow_5.png

  9. Set the destination file name as well as the Sheet name and select the Include header check box.
  10. Do the same for the second tFileOutputExcel.

Executing the Job

Save your Job and click F6 to execute it.

tFuzzyUniqRow uses the Levenshtein
method to compare each of the three defined columns separately, it uses the Double
Metaphone
method to compare data in the City column, and
finally passes the unique and duplicate rows to the defined output files. In our example,
the first two rows match, hence the second row will go in the “duplicates” output.

tFuzzyUniqRow_6.png

The generated FID column gives a reference identifier
of the original record which the current record refers to.

The third row is unique and will go in the “uniques” output.

tFuzzyUniqRow_7.png

The generated UID column is an identifier generated for
the main record.


Document get from Talend https://help.talend.com
Thank you for watching.
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x