Warning
This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.
Component family |
Data Quality |
|
Function |
tFuzzyUniqRow compares columns in the input |
|
Purpose |
Helps ensuring the data quality of any source data. |
|
Basic settings |
Schema and Edit schema |
A schema is a row description, it defines the number of fields to be processed Since version 5.6, both the Built-In mode and the Repository mode are |
|
|
Built-in: You create and store the schema |
|
|
Repository: You have already created the schema |
|
Column |
List of all columns in the input flow. |
|
Key attribute |
Select the check boxes next to the columns you want to check. |
|
Matching type |
Select the relevant matching algorithm from the list:
Exact Match: matches each processed entry to all
Levenshtein: Based on the edit distance theory. It
Metaphone: Based on a phonetic algorithm for
Double Metaphone: a new version of the Metaphone |
|
Min. Distance |
Only for Levenshtein. Set the minimum number of |
|
Max. Distance |
Only for Levenshtein. Set the maximum number of |
Advanced settings |
tStatCatcher |
Select this check box to collect log data at the component level. |
Global Variables |
NB_UNIQUES: the number of unique rows. This is an After NB_DUPLICATES: the number of duplicate rows. This is an ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see Talend Studio |
|
Usage |
This component is not startable (green background) and it requires an input |
|
Limitation/prerequisite |
n/a |
This scenario describes a four-component job aiming at collecting in two separate files
all unique entries and all duplicate entries from few defined processed columns based on the
Levenshtein and Double Metaphone matching types.
The input file in this example looks like the following:
1 2 3 4 |
ID;Status;FirstName;Email;City;Initial;ZipCode 1;married;Paul;pnewman@comp.com;New York;P.N.;55677 2;single;Raul;rnewman@comp.com;New Ork;R.N.;55677 3;single;Mary;mnewman@comp.com;Chicago;M.N;66898 |
-
Drop tFileInputDelimited, tFuzzyUniqRow, and two tFileOutputExcel
components from the Palette onto the design
workspace. -
Connect tFileInputDelimited to tFuzzyUniqRow using the Main
link, tFuzzyUniqRow to the tFileOutputExcel components using the Uniques link with one and the Duplicates
link with the other.
-
Double-click tFileInputDelimited to open its
Basic settings view and define its properties. -
Click the three-dot button next to the File Name
field to browse to the file holding the input data. -
If needed, set Header, Footer, and Limit.
For this scenario, set Header to 1. Footer and
limit for the number of processed rows are not set. -
Click Edit schema to open a dialog box where you
can describe the data structure of the source delimited file.In this scenario, the source schema is made of the following columns: ID, Status, FirstName, Email,
City, Initial,
and ZipCode. -
Double click tFuzzyUniqRow to display its Basic settings view and define its properties.
-
In the Key Attribute column, select the check boxes
next to the columns you want to check using the defined matching method,
Firstname, Email, City,
and ZipCode in this example. -
In the Matching Type column, set the matching
methods you want to use on each of the selected columns.In this example, Leveshtein is to be used as the matching
method for the FirstName, Email, and
ZipCode columns, Double Metaphone is to be
used as the matching method for the City column.Then set the minimum and maximum distances for the Levenshtein
method. In this method, the distance is the number of character changes (insertion,
deletion or substitution) that needs to be carried out in order for the entry to fully
match the reference. In this example, we want the min. distance to be 0 and the max.
distance to be 2. This will output all entries in the FirstName,
Email, and ZipCode columns that exactly
match or that have maximum two character changes. There is no minimum nor maximum
distance to set for Double Metaphone because this matching method
is based on phonetic discrepancies in the input data. -
Double click the first tFileOutputExcel to display
its Basic settings view and define its
properties. -
Set the destination file name as well as the Sheet name and select the Include header check box.
-
Do the same for the second tFileOutputExcel.
-
Save your Job and click F6 to execute it.
tFuzzyUniqRow uses the Levenshtein
method to compare each of the three defined columns separately, it uses the Double
Metaphone method to compare data in the City column, and
finally passes the unique and duplicate rows to the defined output files. In our example,
the first two rows match, hence the second row will go in the “duplicates” output.
The generated FID column gives a reference identifier
of the original record which the current record refers to.
The third row is unique and will go in the “uniques” output.
The generated UID column is an identifier generated for
the main record.