August 17, 2023

tRuleSurvivorship – Docs for ESB 5.x

tRuleSurvivorship

trulesurvivorship_icon32_white.png

Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

tRuleSurvivorship Properties

Component family

Data Quality

 

Function

tRuleSurvivorship receives
records where duplicates, or possible duplicates, are already
estimated and grouped together. Based on user-defined business
rules, it creates one single representation for each duplicates
group using the best-of-breed data. This representation is called a
“survivor”.

Purpose

tRuleSurvivorship creates the
single representation of an entity according to business rules. It
helps to create a master copy of data for MDM.

Basic settings

Schema and Edit
schema

A schema is a row description: it defines the number of fields to be processed and
passed on to the next component. The schema is either Built-in or stored remotely in the
Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

This component provides two read-only columns:

  • SURVIVOR: this column
    is of type Boolean. It
    indicates whether a record is the survivor (true) or not
    (false). There is only one survivor for each group.

  • CONFLICT: when more
    than one record field value complies with a given business rule,
    this column lists the columns concerned.

 

 

Built-in: The schema will be
created and stored locally for this component only. Related topic:
see Talend Studio User Guide.

 

 

Repository: The schema already
exists and is stored in the Repository, hence can be reused in
various projects and job designs. Related topic: see
Talend Studio User Guide.

 

Group identifier

Select the column whose content indicates the required group
identifiers from the input schema.

 

Group size

Select the column whose content indicates the required group size
from the input schema.

 

Rule package name

Type in the name of the rule package you want to create with this
component.

Generate rules and survivorship flow

Once you have defined all of the rules of a rule package or
modified some of them with this component, click the survivorship_rule_icon.png icon to generate this rule package into the
Survivorship Rules node of
Rules Management under
Metadata in the Repository of the Integration perspective of your Studio.

Note

This step is necessary to validate these changes and take them
into account at runtime. If a rule package of the same name
already exists in the Repository, these changes overwrite it once
validated; if you skip this step, the Repository version takes
priority during execution.

 

Rule table

Complete this table to create a complete survivor validation flow.
Basically, each given rule is defined as an execution step, so in
the top-down order within this table, these rules form a sequence
and thus a flow takes shape. The columns of this table are:

Order: From the list, select the
execution order of the rules you are creating so as to define a
survivor validation flow. The types of order may be:

  • Sequential: a
    Sequential rule is
    an execution step of the survivor validation flow. For
    example, the first rule on the top of this Rule table will be the first
    step and from this rule down, the second Sequential rule will be the
    second step.

    The first rule on the top must be a Sequential rule.

  • Multi-condition: a
    Multi-condition
    rule is an additional rule to a given execution step. It
    is always added to the last Sequential rule above it in this table
    and then at this step, both of these two rules become
    necessary to respect. For example, having defined the
    first Sequential rule,
    you define a Multi-condition rule below; then both of
    them will become the rules of the first step.

  • Multi-target: each
    step, once executed, validates a record field value from
    a given Reference column and selects the corresponding value
    as the best from a given Target column. A Multi-target rule
    allows you to add one more Target column to the same step.

    You need to define each Reference column and Target column manually in
    this table.
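The way the three order types combine into steps can be sketched in a few lines of Python. This is only an illustration of the top-down folding described above, not the code the component generates; the `Step` class and `build_steps` function are hypothetical names, and the sample rows are borrowed from the rule table of Scenario 1 on this page.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One execution step of a survivor validation flow (hypothetical model)."""
    name: str
    conditions: list = field(default_factory=list)  # (reference column, function, value)
    targets: list = field(default_factory=list)     # target columns


def build_steps(rows):
    """Fold rule-table rows, read top-down, into execution steps."""
    steps = []
    for order, name, ref, func, value, target in rows:
        if order == "Sequential":
            # A Sequential rule opens a new step.
            steps.append(Step(name, [(ref, func, value)], [target]))
        elif not steps:
            raise ValueError("The first rule must be a Sequential rule")
        elif order == "Multi-condition":
            # Adds one more condition to the last Sequential rule above it.
            steps[-1].conditions.append((ref, func, value))
        elif order == "Multi-target":
            # Adds one more target column to the same step.
            steps[-1].targets.append(target)
        else:
            raise ValueError(f"Unknown order type: {order}")
    return steps


rows = [
    ("Sequential", "5_MostCommonZip", "zip", "Most common", None, "zip"),
    ("Multi-condition", None, "zip", "Match regex", r"\d{5}", None),
    ("Multi-target", None, None, None, None, "state"),
    ("Multi-target", None, None, None, None, "country"),
    ("Sequential", "6_LatestPhone", "date", "Most recent", None, "phone"),
    ("Multi-target", None, None, None, None, "date"),
]
steps = build_steps(rows)
for step in steps:
    print(step.name, step.conditions, step.targets)
```

Six rows collapse into two steps here: the first carries two conditions and three target columns, the second one condition and two target columns.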

Rule Name: Type in the name of
each rule you are creating. This column is only available to the
Sequential rules as they define
the steps of the survivor validation flow.

Reference column: Select the
column to which a given rule applies, from the columns you
have defined in the schema of this component. This column is not
available to the Multi-target rules
as they define only the Target column.

Function: Select the type of
validation operation to be performed on a given Reference column. The available types include:

  • None: no validation
    operation is performed.

  • Most common: it
    validates the most frequent field value in each
    duplicates group.

  • Most recent or
    Most ancient: the
    former validates the latest date value and the latter
    the earliest date value in each duplicates group. The
    relevant reference column must be of the Date type.

  • Longest or Shortest: the former
    validates the longest field value and the latter the
    shortest in each duplicates group.

  • Largest or Smallest: the former
    validates the largest numerical value and the latter the
    smallest numerical value in a duplicates group.

  • Match regex: it
    validates the field when this field complies with the
    regular expression given in the Value column.

  • Expression: it
    validates the field when it complies with the expression
    that you enter in the Value column. The expression value must
    be written with the Drools language.

  • Most Complete: it
    validates the field when the record it belongs to has
    the fewest empty fields.
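As a rough illustration of what the aggregate functions in this list compute, the following Python sketch applies a few of them to the values of one column within a single duplicates group. It is an assumption-laden model rather than the component's implementation: the `FUNCTIONS` mapping and the `validate` helper are hypothetical, the sample values are invented, and tie-breaking between equally good values is not covered.

```python
from collections import Counter
from datetime import date


def most_common(values):
    """Return the most frequent value of the group."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical mapping from function names in the Rule table to Python callables.
FUNCTIONS = {
    "Most common": most_common,
    "Most recent": max,    # latest date
    "Most ancient": min,   # earliest date
    "Longest": lambda values: max(values, key=len),
    "Shortest": lambda values: min(values, key=len),
    "Largest": max,        # largest numerical value
    "Smallest": min,       # smallest numerical value
}


def validate(function, values):
    """Return the value that the named function would validate in one group."""
    return FUNCTIONS[function](values)


print(validate("Most common", ["Atlanta", "Atlanta", "Atlnta"]))
print(validate("Longest", ["12 Main St", "12 Main Street"]))
print(validate("Most recent", [date(2011, 1, 1), date(2012, 5, 3)]))
print(validate("Smallest", [4.0, 5.0, 2.0]))
```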

Value: enter the expression of
interest corresponding to the Match regex
or the Expression function you have selected in the
Function column.

Target column: when a step is
executed, it validates a record field value from a given Reference column and selects the
corresponding value as the best from a given Target column. Select this Target column from the schema columns of this
component.

Ignore blanks: Select the check
boxes which correspond to the names of the columns for which you
want the blank value to be ignored.

Advanced settings

tStatCatcher Statistics

Select this check box to collect log data at the Job and the
component levels.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component requires an input component and an output
component.

As it needs grouped data to process, this component works
directly alongside components such as tMatchGroup, which provide the group
identifier column and the group size column it requires.

It also requires that the input data are sorted by the group
identifier and that the first row of a group contains the group
size.
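A quick pre-check of these input requirements might look like the sketch below. It assumes rows are available as Python dictionaries with `GID` and `GRP_SIZE` keys (the technical names used by tMatchGroup in Scenario 1), and it treats "sorted" as "all rows of a group are contiguous", which is what a single pass needs. The function name and the sample rows are invented for illustration.

```python
from itertools import groupby


def check_grouped_input(rows, gid="GID", size="GRP_SIZE"):
    """Return the number of groups, or raise ValueError if the layout is wrong."""
    seen = set()
    for group_id, members in groupby(rows, key=lambda r: r[gid]):
        members = list(members)
        if group_id in seen:
            # The same group identifier shows up in two separate runs.
            raise ValueError(f"rows of group {group_id!r} are not contiguous")
        seen.add(group_id)
        # Only the first row of each group is expected to carry the size.
        if members[0][size] != len(members):
            raise ValueError(
                f"first row of group {group_id!r} declares size "
                f"{members[0][size]}, found {len(members)} rows"
            )
    return len(seen)


# Invented sample rows; the GID values are placeholders.
rows = [
    {"GID": "g1", "GRP_SIZE": 2, "acctName": "GRIZZARD CO."},
    {"GID": "g1", "GRP_SIZE": 0, "acctName": "GRIZZARD"},
    {"GID": "g2", "GRP_SIZE": 1, "acctName": "OTHER"},
]
print(check_grouped_input(rows))
```

The check deliberately ignores the group size stored on the other rows of a group, since the text above only constrains the first one.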

When you export a Job using tRuleSurvivorship, you need to select the Export dependencies check box in order to
export the generated survivor validation rules together. For further
information about how to export a Job, see Talend Studio User Guide.

Scenario 1: Selecting the best-of-breed data from a group of duplicates to create a
survivor

The Job in this scenario uses five components to group the duplicate data and create
one single representation of these duplicates. This representation is the “survivor” at
the end of the selection process and you can use this survivor, for example, to create a
master copy of data for MDM.

use_case-trulesurvivorship1.png

The components used in this Job are:

  • tFixedFlowInput: it provides the input data
    to be processed by this Job. In the real-world use case, you may use another
    input component of interest to replace tFixedFlowInput for providing the required data.

  • tMatchGroup: it groups the duplicates of the
    input data and gives each group the information about its group ID and group
    size. The technical names of the information are GID and GRP_SIZE respectively
    and they are required by tRuleSurvivorship.

  • tRuleSurvivorship: it creates the
    user-defined survivor validation flow to select the best-of-breed data that
    composes the single representation of each duplicates group.

  • tFilterColumns: it rules out the technical
    columns and outputs the columns that carry the actual information of interest.

  • tLogRow: it presents the result of the Job
    execution.

Dropping and linking the components

  1. Drop tFixedFlowInput, tMatchGroup, tRuleSurvivorship, tFilterColumns and tLogRow
    from the Palette onto the Design
    workspace.

  2. Right-click tFixedFlowInput to open its
    contextual menu and select the Row >
    Main link from this menu to connect
    this component to tMatchGroup.

  3. Do the same to create the Main link from
    tMatchGroup to tRuleSurvivorship, then to tFilterColumns and to tLogRow.

Configuring the process of grouping the input data

Setting up the input records

  1. Double-click tFixedFlowInput to open its
    Component view.

    use_case-trulesurvivorship2.png
  2. Click the three-dot button next to Edit
    schema
    to open the schema editor.

    use_case-trulesurvivorship3.png
  3. Click the plus button nine times to add nine rows and rename these rows
    respectively. In this example, they are: acctName, addr, city, state,
    zip, country, phone, date, credibility. They are the nine columns of the schema of the
    input data.

  4. In the Type column, select the data types
    for the rows of interest. In this example, select Date for the date column
    and Double for the credibility column.

    Note

    Make sure to set the proper data type so that you can later
    define the validation rules easily.

  5. In the Date Pattern column, type in the
    date pattern to reflect the date format of interest. In this scenario, this
    format is yyyyMMdd.

  6. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.

  7. In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data
    of interest.

  8. In the Content field, enter the input
    data to be processed. This data should correspond to the schema you have
    defined and in this example, the contents of the data are:

Grouping the duplicate records

  1. Right-click tMatchGroup to open its
    contextual menu and select Configuration Wizard.

    From the wizard, you can see what your groups look like and you can adjust
    the component settings in order to correctly get the similar matches.

    use_case-trulesurvivorship4-config_wizard.png
  2. Click the plus button under the Key Definition
    table to add one row.

  3. In the Input Key Attribute column of this
    row, select acctName. This way, this
    column becomes the reference used to match the duplicates of the input data.

  4. In the Matching Function column, select
    the Jaro-Winkler matching algorithm.

  5. In the Match threshold field, enter the
    numerical value to indicate at which value two record fields match each
    other. In this example, type in 0.6.

  6. Click Chart to execute this matching rule
    and show the result in this wizard.

    If the input records are not put into one single group, replace 0.6 with a smaller value and click Chart again to check the result until all of the
    four records are in the same group.

    The Job in this scenario puts four similar records into one single
    duplicates group so that tRuleSurvivorship
    is able to create one survivor from them. This simple sample gives you
    a clear picture of how tRuleSurvivorship works along with other components to
    create the best data. However, in a real-world case, you may need to
    process much more data with more complex duplicate situations and thus put the
    data into many more groups.

  7. Click OK to close this Configuration wizard and the Basic settings view of the tMatchGroup component is automatically filled with the
    parameters you have set.

    For further information about the Configuration wizard,
    see Configuration wizard.
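To get a feel for how a match threshold such as 0.6 behaves, the following Python sketch implements the textbook Jaro-Winkler similarity and compares two account names that appear in this scenario. It is an independent illustration: the matching function shipped with tMatchGroup may differ in details such as case handling or prefix scaling, so the scores below should not be read as the values the wizard displays.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    # Find characters of s1 that have an unused equal character nearby in s2.
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a common prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


THRESHOLD = 0.6  # the match threshold entered in step 5
score = jaro_winkler("GRIZZARD", "GRIZZARD CO.")
print(round(score, 4), score >= THRESHOLD)
```

Lowering the threshold, as suggested in step 6, simply lets pairs with a lower score count as matches.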

Defining the survivor validation flow

Having configured and grouped the input data, you need to create the survivor
validation flow using tRuleSurvivorship. To do
this, proceed as follows:

  1. Double-click tRuleSurvivorship to open
    its Component view.

    use_case-trulesurvivorship5.png
  2. Select GID for the Group identifier field and GRP_SIZE for the Group size
    field.

  3. In the Rule package name field, enter the
    name of the rule package you need to create to define the survivor
    validation flow of interest. In this example, this name is org.talend.survivorship.sample.

  4. In the Rule table, click the plus button
    to add as many rows as required and complete them using the corresponding
    rule definitions. In this example, add ten rows and complete them using the
    contents as follows:

    Order            Rule name            Reference column  Function     Value          Target column
    Sequential       "1_LengthAcct"       acctName          Expression   ".length >11"  acctName
    Sequential       "2_LongestAddr"      addr              Longest      n/a            addr
    Sequential       "3_HighCredibility"  credibility       Expression   "> 3"          credibility
    Sequential       "4_MostCommonCity"   city              Most common  n/a            city
    Sequential       "5_MostCommonZip"    zip               Most common  n/a            zip
    Multi-condition  n/a                  zip               Match regex  "\d{5}"        n/a
    Multi-target     n/a                  n/a               n/a          n/a            state
    Multi-target     n/a                  n/a               n/a          n/a            country
    Sequential       "6_LatestPhone"      date              Most recent  n/a            phone
    Multi-target     n/a                  n/a               n/a          n/a            date

    These rules are executed in the top-down order. The Multi-condition rule is one of the conditions of the
    5_MostCommonZip rule, so the
    rule-compliant zip code should be the most common zip code and meanwhile
    have five digits. The zip column is the
    target column of the 5_MostCommonZip rule
    and the two Multi-target rules below it add
    another two target columns, state and
    country, so the zip, the state and the country
    columns will be the source of the best-of-breed data. Thus once a zip code
    is validated, the corresponding record field values from these three columns
    will be selected.

    The same is true of the Sequential rule
    6_LatestPhone. Once a date value is
    validated, the corresponding record field values will be selected from the
    phone and the date columns.

    Note

    In this table, the fields reading n/a indicate that these fields are not available to the
    corresponding Order types or Function types you have selected. In the
    Rule table of the Basic settings view of tRuleSurvivorship, these unavailable fields are greyed
    out. For further information about this rule table, see the properties
    table at the beginning of this tRuleSurvivorship section.

  5. Next to Generate rules and survivorship flow, click the
    survivorship_rule_icon.png icon to generate the rule package with the
    contents you have defined.

    Once done, you can find the generated rule package under Metadata >
    Rules Management > Survivorship Rules in the Repository of your Studio.
    From there, you can open the newly created survivor validation flow of
    this example and read its diagram. For further information, see Talend
    Studio User Guide.

    use_case-trulesurvivorship6-validation_flow.png
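The behaviour described for the 5_MostCommonZip step (one Sequential rule, one Multi-condition rule and two Multi-target rules) can be approximated as follows. This Python sketch is a simplified reading of the explanation above, not the generated Drools rules: the `run_zip_step` function is hypothetical, the sample records are invented because the scenario's input data is not reproduced on this page, and taking the first record that holds the validated zip code is an assumption.

```python
import re
from collections import Counter


def run_zip_step(group):
    """Validate the most common zip if it has five digits, then pick three columns."""
    zip_code, _count = Counter(r["zip"] for r in group).most_common(1)[0]
    # Multi-condition rule: the validated zip must also match \d{5}.
    if not re.fullmatch(r"\d{5}", zip_code):
        return None
    # Assumption: values come from the first record holding the validated zip.
    winner = next(r for r in group if r["zip"] == zip_code)
    # Target column of the Sequential rule plus the two Multi-target columns.
    return {col: winner[col] for col in ("zip", "state", "country")}


# Invented records for one duplicates group.
group = [
    {"zip": "30067", "state": "GA", "country": "US"},
    {"zip": "30067", "state": "GA", "country": "USA"},
    {"zip": "3006", "state": "", "country": "US"},
]
print(run_zip_step(group))
```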

Selecting the columns of interest

The schema of tRuleSurvivorship includes several
technical columns like GID, GRP_SIZE, which are not interesting in this example, so you may need
to use tFilterColumns to rule these technical
columns out and leave the columns carrying actual data to be output. To do this,
proceed as follows:

  1. Double-click tFilterColumns to open its
    Component view.

  2. Click Sync columns to retrieve the schema
    from its preceding component. If a dialog box pops up to prompt the
    propagation, click Yes to accept it.

  3. Click the three-dot button next to Edit schema
    to open the schema editor.

  4. On the tFilterColumns side of this
    editor, select the GID, GRP_SIZE, MASTER
    and SCORE columns and click the red cross
    icon below to remove them.

    use_case-trulesurvivorship7.png
  5. Click OK to validate these changes and
    accept the propagation prompted by the pop-up dialog box.
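In plain code terms, this filtering step amounts to dropping four keys from every row. The Python sketch below shows the idea on one invented row; the `drop_technical` helper is a hypothetical name and the field values are placeholders.

```python
# Technical columns removed in step 4.
TECHNICAL_COLUMNS = {"GID", "GRP_SIZE", "MASTER", "SCORE"}


def drop_technical(row):
    """Return a copy of the row without the technical columns."""
    return {name: value for name, value in row.items()
            if name not in TECHNICAL_COLUMNS}


# Invented row; only the column names come from this scenario.
row = {
    "acctName": "GRIZZARD CO.",
    "credibility": 5.0,
    "GID": "g1",
    "GRP_SIZE": 4,
    "MASTER": True,
    "SCORE": 1.0,
    "SURVIVOR": True,
}
print(drop_technical(row))
```

The SURVIVOR column is kept on purpose, since the result of the Job is read from it.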

Executing the Job

The tLogRow component is used to present the
execution result of the Job. You can configure the presentation mode on its
Component view.

To do this, double-click tLogRow to open the
Component view and in the Mode area, select the Table (print values in
cells of a table) option.

To execute this Job, press F6.

Once done, the Run view is opened automatically,
where you can check the execution result.

use_case-trulesurvivorship8.png

You can read that the last row is the survivor record because its SURVIVOR column indicates true. This record is composed of the best-of-breed data of each
column from the four other rows which are the duplicates of the same group.

The CONFLICT column lists the columns in which
more than one record field value complies with the given validation rules. Take
the credibility column for example: the survivor record has a credibility of 5.0, and the
CONFLICT column indicates that the credibility
of the second record, GRIZZARD, is 4.0, which is also greater than 3, the threshold set in
the rules you have defined. However, as the
credibility 5.0 appears in the first record,
GRIZZARD CO., tRuleSurvivorship selects it as the best-of-breed data.
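The credibility example can be mimicked with a small Python sketch. It assumes, based only on the explanation above, that the first compliant record wins and that a conflict is flagged when several distinct values comply; the function name is hypothetical and the third record is invented to show a non-compliant value.

```python
def apply_threshold_rule(group, column, predicate):
    """Return (selected value, conflict flag) for one rule on one column."""
    compliant = [r[column] for r in group if predicate(r[column])]
    if not compliant:
        return None, False
    # Assumption: a conflict means several distinct values satisfy the rule.
    conflict = len(set(compliant)) > 1
    # Assumption: the value of the first compliant record is kept.
    return compliant[0], conflict


group = [
    {"acctName": "GRIZZARD CO.", "credibility": 5.0},
    {"acctName": "GRIZZARD", "credibility": 4.0},
    {"acctName": "GRIZZARD INC", "credibility": 2.0},  # invented record
]
selected, conflict = apply_threshold_rule(group, "credibility", lambda v: v > 3)
print(selected, conflict)
```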

Scenario 2: Modifying the rule file manually to code the conditions you want to use
to create a survivor

In a Job, the tRuleSurvivorship component generates a
survivorship rule package based on the conditions you define in the Rule table in the
component Basic settings view.

If you want the rule to survive records based on some more advanced criteria, you must
manually code the conditions in the rule using the Drools language.

The Job in this scenario gives an example about how to modify the code in the rule
generated by the component to use specific conditions to create a survivor. Later, you
can use this survivor, for example, to create a master copy of data for MDM.

use_case-trulesurvivorship_modify_rule.png

The components used in this Job are:

  • tFixedFlowInput: it provides the input data
    to be processed by this Job.

  • tRuleSurvivorship: it creates the survivor
    validation flow based on the conditions you code in the rule. This component
    selects the best-of-breed data that composes the single representation of each
    duplicate group.

  • tLogRow: it shows the result of the Job
    execution.

Dropping the components and linking them together

  1. Drop tFixedFlowInput, tRuleSurvivorship and tLogRow from the palette of the studio onto the Design
    workspace.

  2. Right-click tFixedFlowInput and select
    the Row > Main link to connect this component to tRuleSurvivorship.

  3. Do the same to connect tRuleSurvivorship
    to tLogRow using the Row > Main link.

Setting up the input records

  1. Double-click tFixedFlowInput to open its
    Component view.

    use_case-trulesurvivorship_modify_rule2.png
  2. Click the three-dot button next to Edit
    schema
    to open the schema editor.

    use_case-trulesurvivorship_modify_rule3.png
  3. Click the plus button and add five rows.

    Rename these rows respectively as follows: Record_ID, File,
    Acctname, GRP_ID and GRP_SIZE.

    The input data has information about group ID and group size. In a real-life
    scenario, such information can be gathered by the tMatchGroup component as shown in
    scenario 1. tMatchGroup groups duplicates in the input data
    and gives each group a group ID and a group size. These two columns are
    required by tRuleSurvivorship.

  4. In the Type column, select the data types
    for your columns. In this example, set the type to Integer for Record_ID and
    GRP_SIZE, and set it to String for the other columns.

    Note

    Make sure to set the proper data type so that you can define the
    validation rules without error messages.

  5. Click OK to validate these changes and
    accept the propagation when prompted by the pop-up dialog box.

  6. In the Mode area of the Basic settings view, select Use Inline Content (delimited file).

  7. In the Content field, enter the input
    data to be processed.

    This data should correspond to the schema you have defined. In this
    example, the input data is as follows:

  8. Set the row and field separators in the corresponding fields.

Defining the survivor validation flow

  1. Double-click tRuleSurvivorship to open
    its Component view.

    use_case-trulesurvivorship_modify_rule4.png
  2. Select GRP_ID from the Group Identifier list and GRP_SIZE from the Group size
    list.

  3. In the Rule package name field, replace
    the by-default name org.talend.survivorship.sample with a name of your choice,
    if needed.

    The survivor validation flow will be generated and saved under this name
    in the Repository tree view of the
    Integration perspective.

  4. In the Rule table, click the plus button
    to add a row per rule.

    In this example, define one rule and complete it as follows:

    Order       Rule name  Reference column  Function    Value         Target column
    Sequential  "Rule1"    File              Expression  .equals("1")  Acctname

    One rule, "Rule1", will be generated and executed by
    tRuleSurvivorship. This rule validates
    the records whose File value
    complies with the expression you enter in the Value column of the Rule
    table. The component will then select the corresponding value
    as the best-of-breed data from the Acctname
    target column.

  5. Next to Generate rules and survivorship flow, click the
    survivorship_rule_icon.png icon to generate the rule package according to the
    conditions you have defined.

    The rule package is generated and saved under Metadata > Rules Management >
    Survivorship Rules in the Repository tree view of the Integration perspective.

  6. In the Repository tree view, browse to
    the rule file under the Survivorship Rules
    folder and double-click “Rule1” to open it.

    use_case-trulesurvivorship_modify_rule5.png

    As generated, this rule selects the values that come from file 1. However, you
    may want to survive records based on more specific criteria; for example, if
    Acctname has a value in file 1, you
    may want to use that value, or else use the value from file 2 instead. To do
    this, you must modify the code manually in the rule file.

  7. Modify the rule with the following Drools code:

    Warning

    After you modify the rule file, you must not click the survivorship_rule_icon.png icon. Otherwise, your modifications will be
    replaced by the new generation of the rule package.

Executing the Job

  1. Double-click tLogRow to open the
    Component view and in the Mode area, select the Table
    (print values in cells of a table) option.

    The execution result of the Job will be printed in a table.

  2. Press F6 to execute the Job.

    The Run view is opened automatically
    showing the execution results.

    use_case-trulesurvivorship_modify_rule6.png

    You can read that four rows are the survivor records because their
    SURVIVOR column indicates true. In the survivor records, the
    Acctname value is selected from file 1, if the
    value exists. If not, the value is selected from file 2, as you defined in
    the rule. The other rows are the duplicates of the same groups.

    The CONFLICT column shows that no column
    has more than one value compliant with the given validation rules.
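The selection logic of this scenario (take Acctname from file 1 when it has a value, otherwise from file 2) can be expressed outside Drools as the Python sketch below. It only models the outcome described above and is not the rule code itself; the `pick_acctname` function is a hypothetical name and the records are invented, using the column names of this scenario's schema.

```python
from itertools import groupby


def pick_acctname(group):
    """Prefer a non-empty Acctname from file 1, then fall back to file 2."""
    for wanted_file in ("1", "2"):
        for record in group:
            if record["File"] == wanted_file and record["Acctname"]:
                return record["Acctname"]
    return None


# Invented records, already sorted by GRP_ID.
records = [
    {"Record_ID": 1, "File": "1", "Acctname": "AcmeCorp", "GRP_ID": "A", "GRP_SIZE": 2},
    {"Record_ID": 2, "File": "2", "Acctname": "Acme", "GRP_ID": "A", "GRP_SIZE": 0},
    {"Record_ID": 3, "File": "1", "Acctname": "", "GRP_ID": "B", "GRP_SIZE": 2},
    {"Record_ID": 4, "File": "2", "Acctname": "Globex", "GRP_ID": "B", "GRP_SIZE": 0},
]
survivors = {
    group_id: pick_acctname(list(members))
    for group_id, members in groupby(records, key=lambda r: r["GRP_ID"])
}
print(survivors)
```

Group A keeps the file 1 value, while group B falls back to file 2 because its file 1 record has an empty Acctname.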


Document sourced from Talend: https://help.talend.com