August 17, 2023

tGenKey – Docs for ESB 5.x

tGenKey

tWindowKey_icon32_white.png

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

tGenKey properties

Component family

Data Quality

 

Function

tGenKey
enables you to apply several kinds of
algorithms on each input column and use the
computed results to generate a functional key.
These algorithms can be key or optional
algorithms.

The values returned by the key algorithms
will be concatenated, according to the column
order in the Key
composition
table.

Purpose

tGenKey
generates a functional key from the
input columns, by applying different types of
algorithms on each column and grouping the
computed results in one key. It outputs this key
together with the input columns.

This component helps, using the generated
functional key, to narrow down your data
filter/matching results for example.

Basic settings

Schema and
Edit schema

A schema is a row description, it defines
the number of fields to be processed and passed on
to the next component. The schema is either
Built-in or
stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

Click Sync
columns
to retrieve the schema from the
previous component in the Job.

 

 

Built-in:
You create and store the schema locally for this
component only. Related topic: see
Talend Studio User Guide.

 

 

Repository:
You have already created and stored the schema in
the Repository. You can reuse it in other projects
and job designs. Related topic: see
Talend Studio User Guide.

 

match_rule_import_icon.png

Click the import icon to import blocking
keys from the match rules that are defined and
saved in the Studio repository.

When you click the import icon, a [Match Rule Selector]
wizard is opened to help you import blocking keys
from the match rules listed in the Studio
repository and use them in your Job.

You can import blocking keys only from match
rules that are defined with the VSR algorithm and
saved in the Studio repository. For further
information, see Importing match rules from the studio repository

Algorithm

Column

Select the column(s) from the main flow on
which you want to define certain algorithms to set
the functional key.

Note

When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.

For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
yyyy” in the Date
Pattern
field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.

 

Pre-Algorithm

If required, select the relevant matching
algorithm from the list:

remove diacritical
marks:
removes any diacritical
mark.

remove diacritical marks and
lower case
: removes any diacritical
mark and converts to lower case before generating
the code of the column.

remove diacritical marks and
upper case:
removes any diacritical
mark and converts to upper case before generating
the code of the column.

lower case:
converts the field to lower case before applying
the key algorithm.

upper case:
converts the field to upper case before applying
the key algorithm.

add left position
character
: enables you to add a
character to the left of the column.

add right position
character
: enables you to add a
character to the right of the column.

 

Value

Set the algorithm value, where
applicable.

 

Algorithm

Select the relevant algorithm from the
list:

first character of each
word
: includes in the functional key
the first character of each word in the
column.

N first characters of each
word
: includes in the functional key N
first characters of each word in the column.

first N characters of the
string
: includes in the functional key
N first characters of the string.

last N characters of the
string
: includes in the functional key
N last characters of the string.

first N consonants of the
string
: includes in the functional key
N first consonants of the string.

first N vowels of the
string:
includes in the functional key
N first vowels of the string.

pick characters:
includes in the functional key the characters
located at a fixed position (corresponding to the
set digital/range).

exact: includes
in the functional key the full string.

substring(a,b):
includes in the functional key character according
to the set index.

soundex code:
generates a code according to a standard English
phonetic algorithm. This code represents the
character string that will be included in the
functional key.

metaphone code:
generates a code according to the character
pronunciation. This code represents the character
string that will be included in the functional
key.

double-metaphone
code
: generates a code according to the
character pronunciation using a new version of the
Metaphone phonetic algorithm, that produces more
accurate results than the original algorithm. This
code represents the character string that will be
included in the functional key.

fingerPrintkey: generates the
functional key from a string value through the
following sequential process:

  1. remove leading and trailing
    whitespace,

  2. change all characters to their lowercase
    representation,

  3. remove all punctuation and control
    characters,

  4. split the string into whitespace-separated
    tokens,

  5. sort the tokens and remove
    duplicates,

  6. join the tokens back together,

    Because the string parts are sorted, the
    given order of tokens does not matter. So,
    Cruise, Tom and Tom
    Cruise
    both end up with a fingerprint
    cruise tom and therefore end
    up in the same cluster.

  7. normalize extended western characters to
    their ASCII representation, for example
    gödel to
    godel.

    This reproduce data entry mistakes performed
    when entering extended characters with an
    ASCII-only keyboard. However, this procedure can
    also lead to false positives, for example
    gödel and
    godél would both end up with
    godel as their fingerprint
    but they are likely to be different names. So this
    might work less effectively for datasets where
    extended characters play substantial
    differentiation role.

nGramkey:
this algorithm is similar to the fingerPrintkey method
described above. But instead of using whitespace
separated tokens, it uses n-grams, where the
n can be specified by the
user. This method generates the functional key
from a string value through the following
sequential process:

  1. change all characters to their lowercase
    representation,

  2. remove all punctuation and control
    characters,

  3. obtain all the string n-grams,

  4. sort the n-grams and remove
    duplicates,

  5. join the sorted n-grams back
    together,

  6. normalize extended western characters to
    their ASCII representation, for example
    gödel to
    godel.

    For example, the 2-gram fingerprint of
    Paris is
    arispari and the 1-gram
    fingerprint is aiprs.

    The delivered implementation of this
    algorithm is 2-grams.

Note

If the column on which you want to use the
nGramkey
algorithm can have data, with only
0 or 1
characters, you must filter this data before
generating the functional key. This way you avoid
potentially comparing records to each other that
should not be match candidates.

colognPhonetic: a soundex phonetic
algorithm optimized for the German language. It
encodes a string into a Cologne phonetic value.
This code represents the character string that
will be included in the functional key.

 

Value

Set the algorithm value, where
applicable.

 

Post-Algorithm

If required, select the relevant matching
algorithm from the list:

use default value
(string): enables you to choose a string to
replace null or empty data.

add left position
character
: enables you to add a
character to the left of the column.

add right position
character
: enables you to add a
character to the right of the column.

 

Value

Set the option value, where
applicable.

 

Show help

Select this check box to display
instructions on how to set algorithms/options
parameters.

Advanced settings

tStat
Catcher
Statistics

Select this check box to collect log data at
the component level. Note that this check box is
not available in the Map/Reduce version of the
component.

Global
Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is an intermediary step. It
requires an input flow as well as an output. This
component can be used with other components, such
as tMatchGroup,
in order to create a blocking key.

Usage in Map/Reduce
Jobs

If you have subscribed to one of the Talend solutions with Big Data, you can also
use this component as a Map/Reduce component. In a Talend Map/Reduce Job, this
component is used as an intermediate step and other components used along with it must be
Map/Reduce components, too. They generate native Map/Reduce code that can be executed
directly in Hadoop.

For further information about a Talend Map/Reduce Job, see the sections
describing how to create, convert and configure a Talend Map/Reduce Job of the
Talend Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data
integration Jobs, and non Map/Reduce Jobs.

Limitation/prerequisite

n/a

Scenario 1: Generating functional keys in the output flow

This three-component scenario describes a basic Job that generates a
functional key for each of the data records using one algorithm on one of
the input columns, PostalCode.

This functional key can be used in different ways to narrow down the results
of data filtering or data matching, for example. So the tGenKey component can be used with so many other data
quality and data integration components to form different useful use cases.
For an example of one use case of tGenKey,
see Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key
.

In this scenario, the input data flow has four columns:
Firstname, Lastname,
DOB (date of birth), and
PostalCode. This data has problems such as
duplication, first or last names spelled differently or wrongly, different
information for the same customer, etc. This scenario generates a functional
key for each data record using an algorithm that concatenates the first two
characters of the postal code.

Use_Case_tWindowKey.png

Setting up the Job

  1. Drop the following components from the Palette onto the design
    workspace: tFixedFlowInput, tGenKey and tLogRow.

    Use_Case_tWindowKey1.png
  2. Connect all the components together using the
    Main link.

Configuring the data input

  1. Double-click tFixedFlowInput to display the
    Basic settings
    view and define the component properties.

    Use_Case_tWindowKey2.png
  2. Click the […]
    button next to Edit
    Schema
    to open the [Schema] dialog box.

    Use_Case_tWindowKey3.png
  3. Click the plus button to add as many lines as needed
    for the input schema you want to create from
    internal variables.

    In this example, the schema is made of four columns:
    Firstnam,
    Lastname,
    DOB and
    PostalCode.

    Then, click OK to
    close the dialog box.

  4. In the Mode area,
    select the Use Inline
    Table
    option.

    The Value table
    displays as Inline
    Table
    .

    Use_Case_tWindowKey4.png
  5. Click the plus button to add as many line as needed
    and then click in each of the lines to define the
    input data for the four columns.

Configuring key generation

  1. Double-click tGenKey
    to display the Basic
    settings
    view and define the component
    properties.

    Use_Case_tWindowKey6.png

    You can click match_rule_import_icon.png and import blocking keys from
    the match rules created with the VSR algorithm and
    tested in the Profiling perspective of
    Talend Studio and use them in your
    Job. Otherwise, define the blocking key parameters
    as described in the below steps.

  2. Under the Algorithm
    table, click the plus button to add a row in this
    table.

  3. On the column column,
    click the newly added row and select from the list
    the column you want to process using an algorithm.
    In this example, select
    PostalCode.

    Note

    When you select a date column on which to apply an algorithm or a matching algorithm,
    you can decide what to compare in the date format.

    For example, if you want to only compare the year in the date, in the component schema
    set the type of the date column to Date and then enter
    yyyy” in the Date
    Pattern
    field. The component then converts the date format to a string
    according to the pattern defined in the schema before starting a string
    comparison.

  4. On the algorithm
    column, click the newly added row and select from
    the list the algorithm you want to apply to the
    corresponding column. In this example, select
    first N characters of the
    string
    .

  5. Click in the Value
    column and enter the value for the selected
    algorithm, when needed. In this scenario, type in
    2.

    In this example, we want to generate a functional key
    that holds the first two characters of the postal
    code for each of the data rows and we do not want to
    define any extra options on these columns.

    Note

    You can select the Show
    help
    check box to display instructions
    on how to set algorithms/options
    parameters.

    Once you have defined the tGenKey properties, you can display a
    statistical view of these parameters. To do
    so:

  6. Right-click the tGenKey component and select View Key Profile in the
    contextual menu.

    The View Key Profile
    editor displays, allowing you to visualize the
    statistics regarding the number of rows per block
    and to adapt them according to the results you want
    to get.

    Note

    When you are processing a large amount of data
    and when this component is used to partition data
    in order to use them in a matching component (such
    as tRecordMatching or tMatchGroup), it is
    preferable to have a limited number of rows in one
    block. An amount of about 50 rows per block is
    considered optimal, but it depends on the number
    of fields to compare, the total number of rows and
    the time considered acceptable for data
    processing.

Configuring the console output

  1. Double-click the tLogRow
    component to display the Basic settings view.

  2. In the Mode area,
    select Table to
    display the Job execution result in table cells.

Use_Case_tBlockedFuzzyJoin5.png

Executing the Job

  • Save your Job and press F6 to execute it.

    Use_Case_tWindowKey8.png

    All the output columns including the
    T_GEN_KEY column are listed
    in the Run console.
    The functional key for each data record is
    concatenated from the first two characters of the
    corresponding value in the PostalCode
    column.

Scenario 2: Comparing columns and grouping in the output flow duplicate
records that have the same functional key

This second scenario describes a Job that aims at:

  • generating a functional key using one algorithm on one of the
    input columns, DoB as described in
    scenario 1.

  • matching two input columns using the
    Jaro-Winkler algorithm.

  • grouping the output columns by the generated functional key to
    optimize the matching operation and compare only the records
    that have the same blocking value, functional key in this
    scenario. For more information on grouping output columns
    and using blocking values, see tMatchGroup.

Use_Case_tGenKey9.png

Setting up the Job

  1. Drop the following components from the Palette onto the design
    workspace: tRowGenerator, tGenKey and tMatchGroup.

  2. Connect all the components together using the
    Main link.

Configuring the data input

  1. Double-click tRowGenerator to define its schema as
    follows:

    Use_Case_tGenKey10.png

    The tRowGenerator
    component will generate an input data flow that has
    three columns: Firstname,
    Lastname
    , and DoB
    (date of birth). This data may have problems such as
    duplication, first or last names spelled differently
    or wrongly, different information for the same
    customer, etc.

  2. Click OK to validate
    the settings and close the dialog box, and accept to
    propagate the changes when prompted.

Configuring key generation

  1. Double-click tGenKey
    to display the Basic
    settings
    view and define the component
    properties.

    Use_Case_tGenKey11.png

    You can click match_rule_import_icon.png and import blocking keys from
    the match rules created with the VSR algorithm and
    tested in the Profiling perspective of
    Talend Studio and use them in your
    Job. Otherwise, define the blocking key parameters
    as described in the below steps.

  2. Under the Algorithm
    table, click the [+] button to add a row in this
    table.

  3. On the column column,
    click the newly added row and select from the list
    the column you want to process using an algorithm.
    In this example, select
    DoB.

  4. On the algorithm
    column, click the newly added row and select from
    the list the algorithm you want to apply to the
    corresponding column. In this example, select
    substring(a,b).

  5. Click in the value
    column and enter the value for the selected
    algorithm, when needed. In this scenario, type in
    6;10.

    Note

    The substring(a,b)
    algorithm allows you to extract the characters
    from a string, between two specified indices, and
    to return the new substring. First character is at
    index 0. In this scenario,
    for a given DoB
    21-01-1995“, 6;10
    will return only the year of birth, that is to say
    “1995” which is the substring from the 7th to the
    10th character.

    In this example, we want to generate a functional key
    that holds the last four characters of the date of
    birth, which correspond to the year of birth, for
    each of the data rows and we do not want to define
    any extra options on these columns.

    Note

    You can select the Show
    help
    check box to display instructions
    on how to set algorithms/options
    parameters.

    Once you have defined the tGenKey properties, you can display a
    statistical view of these parameters. To do
    so:

  6. Right-click on the tGenKey component and select View Key Profile in the
    contextual menu.

    Use_Case_tGenKey12.png

    The View Key Profile
    editor displays, allowing you to visualize
    statistics regarding the number of blocks and to
    adapt the parameters according to the results you
    want to get.

    Note

    When you are processing a large amount of data
    and when this component is used to partition data
    in order to use them in a matching component (such
    as tRecordMatching or tMatchGroup), it is
    preferable to have a limited number of rows in one
    block. An amount of about 50 rows per block is
    considered optimal, but it depends on the number
    of fields to compare, the total number of rows and
    the time considered acceptable for data
    processing.

    From the key editor, you can:

    • edit the Limit of rows used to calculate the
      statistics.

    • click match_rule_import_icon.png and import blocking keys from
      the Studio repository and use them in your
      Job.

    • edit the input column you want to process
      using an algorithm.

    • edit the parameters of the algorithm you
      want to apply to input columns.

    Note

    Every time you make a modification, you can
    see its implications by clicking the Refresh button which is
    located at the top right part of the
    editor.

  7. Click OK to close the
    View Key Profile
    editor.

Configuring the grouping of the output data

  1. Click the tMatchGroup
    component, and then in its basic settings click the
    Edit schema
    button to view the input and output columns and do
    any modifications in the output schema, if
    needed.

    Use_Case_tGenKey13.png

    In the output schema of this component, there are four
    output standard columns that are read-only. For more
    information, see tMatchGroup properties.

  2. Click OK to close the
    dialog box.

  3. Double-click the tMatchGroup component to display its
    Configuration
    Wizard
    and define the component
    properties.

    Use_Case_tWindowKey7.png

    Note

    If you want to add a fixed output column,
    MATCHING_DISTANCES, which
    gives the details of the distance between each
    column, click the Advanced
    settings
    tab and select the Output distance details
    check box. For more information, see tMatchGroup properties.

  4. In the Key definition
    table, click the plus button to add to the list the
    columns on which you want to do the matching
    operation, FirstName and
    LastName in this
    scenario.

  5. Click in the first and second cells of the Matching Function column
    and select from the list the algorithm(s) to be used
    for the matching operation,
    Jaro-Winkler in this
    example.

  6. Click in the first and second cells of the Weight column and set the
    numerical weights for each of the columns used as
    key attributes.

  7. In the Match
    threshold
    field, enter the match
    probability threshold. Two data records match when
    the probability threshold is above this value.

  8. Click the plus button below the Blocking Selection table to add a line
    in the table, then click in the line and select from
    the list the column you want to use as a blocking
    value, T_GEN_KEY in this
    example.

    Using a blocking value reduces the number of pairs of
    records that needs to be examined. The input data is
    partitioned into exhaustive blocks based on the
    functional key. This will decrease the number of
    pairs to compare, as comparison is restricted to
    record pairs within each block.

  9. Click the Chart
    button in the top right corner of the wizard to
    execute the Job in the defined configuration and
    have the matching results directly in the
    wizard.

    The matching chart gives a global picture about the
    duplicates in the analyzed data. The matching table
    indicates the details of items in each group and
    colors the groups in accordance with their color in
    the matching chart.

Configuring the console output

  1. Double-click the tLogRow
    component to display the Basic settings view.

  2. In the Mode area,
    select Table to
    display the Job execution result in table cells.

Use_Case_tBlockedFuzzyJoin5.png

Executing the Job

  • Save your Job and press F6 to execute it.

    Use_Case_tGenKey14.png

    The output columns include the
    T_GEN_KEY column that holds
    the functional key generated by the tGenKey component.

    You can see that all records that have the same
    functional key are grouped together in different
    blocks “groups”. The identifier for each group is
    listed in the GID column next
    to the corresponding record. The number of records
    in each of the output blocks is listed in the
    GRP_SIZE column and computed
    only on the master record. The
    MASTER column indicates with
    true/false if the corresponding record is a master
    record or not a master record. The
    SCORE column lists the
    calculated distance between the input record and the
    master record according to the
    Jaro-Winkler matching
    algorithm.

    For an example of creating data partitions based on
    different blocking keys and using them with multiple
    tMatchGroup
    components, see tMatchGroup.


Document get from Talend https://help.talend.com
Thank you for watching.
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x