tRuleSurvivorship
Creates the single representation of an entity according to business rules and can
create a master copy of data for Master Data Management.
tRuleSurvivorship receives records
in which duplicates, or possible duplicates, have already been identified and
grouped together. Based on user-defined business rules, it creates one
single representation of each duplicates group using the best-of-breed
data. This representation is called a “survivor”.
In local mode, Apache Spark 2.0.0, 2.1.0, 2.3.0 and 2.4.0 are supported.
tRuleSurvivorship Standard properties
These properties are used to configure tRuleSurvivorship running in the Standard Job framework.
The Standard
tRuleSurvivorship component belongs to the Data Quality family.
The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. This component provides two read-only columns: SURVIVOR and CONFLICT.
When a survivor record is created, the SURVIVOR column of this record is set to true, and the CONFLICT column lists the columns for which several field values comply with the validation rules. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Group identifier |
Select the column whose content indicates the required group identifiers from the input flow. |
Group size |
Select the column whose content indicates the required group size from the input flow. |
Rule package name |
Type in the name of the rule package you want to create with this component. |
Generate rules and survivorship flow |
Once you have defined all of the rules of a rule package, or modified some of them, click this
icon to generate the rule package into the Survivorship Rules node of Rules Management. Note:
This step is necessary to validate these changes and take them into account at Job execution. Warning: In a rule package, two rules cannot use the same
name. |
Rule table |
Complete this table to create a complete survivor validation flow. Basically, each rule defined in this table represents one step of this flow.
Order: From the list, select the execution order of the
rules you are creating so as to define a survivor validation flow. The types of order include Sequential, Multi-condition and Multi-target (see the first scenario below).
Rule Name: Type in the name of each rule you are creating.
Reference column: Select the column you need to apply a validation rule to.
Function: Select the type of validation operation to be
performed on a given Reference column. The available types include, for example, Expression, Longest, Most common, Most recent and Match regex.
Value: enter the expression of interest corresponding to the selected Function, when applicable.
Target column: when a step is executed, it validates a value in the Reference column and, if that value passes, selects the corresponding field value from the Target column as best-of-breed data.
Ignore blanks: Select the check boxes which correspond to the columns for which you want blank values to be ignored. |
Define conflict rule |
Select this check box to be able to create rules to resolve conflicts, that is, cases where several field values comply with the validation rules. |
Conflict rule table |
Complete this table to create rules to resolve conflicts. This table is available only when the Define conflict rule check box is selected.
Rule name: Type in the name of each rule you are creating.
Conflicting column: Select the column whose conflicting values you want to resolve when a step is executed.
Function: Select the type of
validation operation to be performed on a given Conflicting column. The available types include those in the Rule table as well as additional conflict-resolution operations.
Value: enter the expression of interest corresponding to the selected Function, when applicable.
Reference column: Select the column you need to use as a reference when resolving the conflict.
Ignore blanks: Select the check boxes which correspond to the columns for which you want blank values to be ignored.
Disable: Select the check box to disable the corresponding rule. |
Advanced settings
tStatCatcher Statistics |
Select this check box to gather the Job processing metadata at the Job level as well as at each component level. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. |
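For example, a minimal way to read this variable after the component has run is shown below. This is a sketch only: the component label tRuleSurvivorship_1 is an assumption, so use the unique name of the component in your own Job.

    // In a component such as tJava placed after the subJob, read the After variable from the globalMap.
    // "tRuleSurvivorship_1" is a hypothetical component label; replace it with the actual one.
    String errorMessage = (String) globalMap.get("tRuleSurvivorship_1_ERROR_MESSAGE");
    if (errorMessage != null) {
        System.err.println("tRuleSurvivorship reported: " + errorMessage);
    }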
Usage
Usage rule |
This component is usually used as an intermediate component, and it requires an input component and an output component. As it needs grouped data to process, this component usually works along with a grouping component such as tMatchGroup. It also requires that the input data are sorted by the group identifier column. When you export a Job using tRuleSurvivorship, you need to select the Export dependencies check box in order to export the generated survivorship rules along with the Job. |
Selecting the best-of-breed data from a group of duplicates to create a
survivor
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
The Job in this scenario groups the duplicate data and creates one single representation of
these duplicates. This representation is the “survivor” at the end of the selection process
and you can use this survivor, for example, to create a master copy of data for MDM.
The components used in this Job are:
-
tFixedFlowInput: it provides the input data
to be processed by this Job. In a real-world use case, you can replace
tFixedFlowInput with another input component that provides the required data. -
tMatchGroup: it groups the duplicates of the
input data and gives each group the information about its group ID and group
size. The technical names of the information are GID and GRP_SIZE respectively
and they are required by tRuleSurvivorship. -
tRuleSurvivorship: it creates the
user-defined survivor validation flow to select the best-of-breed data that
composes the single representation of each duplicates group. -
tFilterColumns: it rules out the technical
columns and outputs the columns that carry the actual information of interest. -
tLogRow: it presents the result of the Job
execution.
Dropping and linking the components
-
Drop tFixedFlowInput, tMatchGroup, tRuleSurvivorship, tFilterColumns and tLogRow
from Palette onto the Design
workspace. -
Right-click tFixedFlowInput to open its
contextual menu and select the Row >
Main link from this menu to connect
this component to tMatchGroup. -
Do the same to create the Main link from
tMatchGroup to tRuleSurvivorship, then to tFilterColumns and to tLogRow.
Configuring the process of grouping the input data
Setting up the input records
-
Double-click tFixedFlowInput to open its
Component view. -
Click the three-dot button next to Edit
schema to open the schema editor. -
Click the plus button nine times to add nine rows and rename these rows
respectively. In this example, they are: acctName, addr, city, state,
zip, country, phone, date, credibility. They are the nine columns of the schema of the
input data. -
In the Type column, select the data types
for the rows of interest. In this example, select Date for the date column
and Double for the credibility column. Note: Be sure to set the proper data types so that you can easily
define the validation rules later. -
In the Date Pattern column, type in the
date pattern that reflects the date format of interest. In this scenario, this
format is yyyyMMdd (a small parsing check is sketched after these steps). -
Click OK to validate these changes and
accept the propagation prompted by the pop-up dialog box. -
In the Mode area of the Basic settings view, select Use Inline Content (delimited file) to enter the input data
of interest. -
In the Content field, enter the input
data to be processed. This data should correspond to the schema you have
defined and in this example, the contents of the data are:
GRIZZARD CO.;110 N MARYLAND AVE;GLENDALE;CA;912066;FR;8185431314;20110101;5
GRIZZARD;110 NORTH MARYLAND AVENUE;GLENDALE;CA;912066;US;9003254892;20110118;4
GRIZZARD INC;110 N. MARYLAND AVENUE;GLENDALE;CA;91206;US;(818) 543-1315;20110103;2
GRIZZARD CO;1480 S COLORADO BOULEVARD;LOS ANGELES;CA;91206;US;(800) 325-4892;20110115;1
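As a quick sanity check of the yyyyMMdd pattern used above, the following standalone Java sketch (not part of the Job, for illustration only) parses the first date value of the sample data:

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class DatePatternCheck {
        public static void main(String[] args) throws Exception {
            // Same pattern as entered in the Date Pattern column of the schema editor.
            SimpleDateFormat format = new SimpleDateFormat("yyyyMMdd");
            Date firstDate = format.parse("20110101"); // date value of the first sample record
            System.out.println(firstDate);
        }
    }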
Grouping the duplicate records
-
Right-click tMatchGroup to open its
contextual menu and select Configuration
Wizard. From the wizard, you can see what your groups look like and you can adjust
the component settings in order to correctly group the similar records. -
Click the plus button under the Key
Definition table to add one row. -
In the Input Key Attribute column of this
row, select acctName. This way, this
column becomes the reference used to match the duplicates of the input data.
-
In the Matching Function column, select
the Jaro-Winkler matching algorithm. -
In the Match threshold field, enter the
numerical value above which two record fields are considered to match each
other. In this example, type in 0.6. -
Click Chart to execute this matching rule
and show the result in this wizard. If the input records are not put into one single group, replace 0.6 with a smaller value and click Chart again to check the result until all of the
four records are in the same group. The Job in this scenario puts the four similar records into one single
duplicates group so that tRuleSurvivorship
is able to create one survivor from them. This simple sample gives you a
clear picture of how tRuleSurvivorship works along with other components to
create the best data. However, in a real-world case, you may need to
process much more data with complex duplicate situations and thus put the
data into many more groups. A small standalone sketch showing how a similarity score compares with this threshold follows these steps. -
Click OK to close this Configuration wizard and the Basic settings view of the tMatchGroup component is automatically filled with the
parameters you have set.For further information about the Configuration
wizard, see Configuration wizard
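For intuition only, the following standalone Java sketch compares two of the sample account names with a Jaro-Winkler score and the 0.6 threshold used above. It relies on the Apache Commons Text library, which is an assumption made for illustration and is not part of the Job; tMatchGroup computes its scores internally.

    import org.apache.commons.text.similarity.JaroWinklerSimilarity;

    public class MatchThresholdCheck {
        public static void main(String[] args) {
            JaroWinklerSimilarity jaroWinkler = new JaroWinklerSimilarity();
            double threshold = 0.6; // same value as the Match threshold in this scenario

            // Two acctName values from the sample records.
            Double score = jaroWinkler.apply("GRIZZARD CO.", "GRIZZARD INC");

            System.out.println("Jaro-Winkler score: " + score);
            System.out.println("Considered a match at threshold 0.6: " + (score >= threshold));
        }
    }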
Defining the survivor validation flow
Having configured and grouped the input data, you need to create the survivor
validation flow using tRuleSurvivorship. To do
this, proceed as follows:
-
Double-click tRuleSurvivorship to open
its Component view. -
Select GID for the Group identifier field and GRP_SIZE for the Group
size field. -
In the Rule package name field, enter the
name of the rule package you need to create to define the survivor
validation flow of interest. In this example, this name is org.talend.survivorship.sample. -
In the Rule table, click the plus button to
add as many rows as required and complete them using the corresponding rule
definitions. In this example, add ten rows and complete them using the table
below:
Order | Rule name | Reference column | Function | Value | Target column
Sequential | "1_LengthAcct" | acctName | Expression | ".length >11" | acctName
Sequential | "2_LongestAddr" | addr | Longest | n/a | addr
Sequential | "3_HighCredibility" | credibility | Expression | "> 3" | credibility
Sequential | "4_MostCommonCity" | city | Most common | n/a | city
Sequential | "5_MostCommonZip" | zip | Most common | n/a | zip
Multi-condition | n/a | zip | Match regex | "\d{5}" | n/a
Multi-target | n/a | n/a | n/a | n/a | state
Multi-target | n/a | n/a | n/a | n/a | country
Sequential | "6_LatestPhone" | date | Most recent | n/a | phone
Multi-target | n/a | n/a | n/a | n/a | date
Do not use special characters in rule names, otherwise the Job may not run
correctly. These rules are executed in top-down order. The Multi-condition rule is one of the conditions of the 5_MostCommonZip rule, so a rule-compliant zip
code must be the most common zip code and, at the same time, have five digits. The
zip column is the target column of the
5_MostCommonZip rule, and the two
Multi-target rules below it add another two
target columns, state and country, so the zip, state and
country columns will be the source of the
best-of-breed data. Thus, once a zip code is validated, the corresponding record
field values from these three columns are selected. The same is true for the Sequential rule
6_LatestPhone: once a date value is
validated, the corresponding record field values are selected from the
phone and date columns. A small standalone sketch of the expression and regular-expression checks used by these rules follows these steps. Note: In this table, the fields reading n/a
indicate that these fields are not available for the corresponding Order types or Function types you have selected. In the Rule table of the Basic
settings view of tRuleSurvivorship, these unavailable fields are greyed out.
For further information about this rule table, see the properties table at
the beginning of this tRuleSurvivorship
section. -
Next to Generate rules and survivorship
flow, click the icon to generate the rule package with the contents you
have defined. Once done, you can find the generated rule package in the Metadata > Rules Management > Survivorship Rules directory of your Studio Repository. From there, you are able to open the newly created
survivor validation flow of this example and read its diagram. For further
information, see the Talend Studio User Guide.
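The following standalone Java sketch (illustration only; the component actually evaluates these conditions through the Drools rules it generates) mirrors two of the checks configured above: the ".length >11" Expression on acctName and the "\d{5}" Match regex on zip.

    import java.util.regex.Pattern;

    public class RuleChecks {
        public static void main(String[] args) {
            // 1_LengthAcct: the Expression value ".length >11" keeps account names longer than 11 characters.
            String acctName = "GRIZZARD CO.";
            System.out.println("1_LengthAcct passes: " + (acctName.length() > 11));

            // Multi-condition on zip: the Match regex value "\d{5}" keeps five-digit zip codes only.
            System.out.println("91206 matches: " + Pattern.matches("\\d{5}", "91206"));   // true
            System.out.println("912066 matches: " + Pattern.matches("\\d{5}", "912066")); // false
        }
    }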
Selecting the columns of interest
The schema of tRuleSurvivorship includes several
technical columns like GID, GRP_SIZE, which are not interesting in this example, so you may need
to use tFilterColumns to rule these technical
columns out and leave the columns carrying actual data to be output. To do this,
proceed as follows:
-
Double-click tFilterColumns to open its
Component view. -
Click Sync columns to retrieve the schema
from its preceding component. If a dialog box pops up to prompt the
propagation, click Yes to accept it. -
Click the three-dot button next to Edit
schema to open the schema editor. -
On the tFilterColumns side of this
editor, select the GID, GRP_SIZE, MASTER
and SCORE columns and click the red cross
icon below to remove them. -
Click OK to validate these changes and
accept the propagation prompted by the pop-up dialog box.
Executing the Job
The tLogRow component is used to present the
execution result of the Job. You can configure the presentation mode on its
Component view.
To do this, double-click tLogRow to open the
Component view and in the Mode area, select the Table (print values in
cells of a table) check box.
To execute this Job, press F6.
Once done, the Run view is opened automatically,
where you can check the execution result.
You can read that the last row is the survivor record because its SURVIVOR column indicates true. This record is composed of the best-of-breed data of each
column from the four other rows, which are the duplicates of the same group.
The CONFLICT column lists the columns carrying
more than one record field value compliant with the given validation rules. Take
the credibility column for example: apart from
the survivor record whose credibility is 5.0, the
CONFLICT column indicates that the credibility
of the second record, GRIZZARD, is 4.0, which is also greater than 3, the threshold set in the rules you have defined. However, as the
credibility 5.0 appears in the first record,
GRIZZARD CO., tRuleSurvivorship selects it as the best-of-breed data.
Modifying the rule file manually to code the conditions you want to use
to create a survivor
This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.
In a Job, the tRuleSurvivorship component generates a
survivorship rule package based on the conditions you define in the Rule table in the component Basic
settings view.
If you want the rule to survive records based on some more advanced criteria, you must
manually code the conditions in the rule using the Drools language.
The Job in this scenario gives an example about how to modify the code in the rule
generated by the component to use specific conditions to create a survivor. Later, you
can use this survivor, for example, to create a master copy of data for MDM.
The components used in this Job are:
-
tFixedFlowInput: it provides the input data
to be processed by this Job. -
tRuleSurvivorship: it creates the survivor
validation flow based on the conditions you code in the rule. This component
selects the best-of-breed data that composes the single representation of each
duplicate group. -
tLogRow: it shows the result of the Job
execution.
Dropping the components and linking them together
-
Drop tFixedFlowInput, tRuleSurvivorship and tLogRow from the palette of the studio onto the Design
workspace. -
Right-click tFixedFlowInput and select
the Row > Main link to connect this component to tRuleSurvivorship. -
Do the same to connect tRuleSurvivorship
to tLogRow using the Row > Main link.
Setting up the input records
-
Double-click tFixedFlowInput to open its
Component view. -
Click the three-dot button next to Edit
schema to open the schema editor. -
Click the plus button and add five rows.
Rename these rows respectively as follows: Record_ID, File,
Acctname, GRP_ID and GRP_SIZE. The input data has information about group ID and group size. In a real-life
scenario, such information can be gathered by the tMatchGroup component as shown in the first scenario: tMatchGroup groups duplicates in the input data
and gives each group a group ID and a group size. These two columns are
required by tRuleSurvivorship. -
In the Type column, select the data types
for your columns. In this example, set the type to Integer for Record_ID and
GRP_SIZE, and set it to String for the other columns. Note: Make sure to set the proper data type so that you can define the
validation rules without error messages. -
Click OK to validate these changes and
accept the propagation when prompted by the pop-up dialog box. - In the Mode area of the Basic settings view, select Use Inline Content (delimited file).
-
In the Content field, enter the input
data to be processed. This data should correspond to the schema you have defined. In this
example, the input data is as follows:
1;2;AcmeFromFile2;1;2
2;1;AcmeFromFile1;1;0
3;1;AAA;2;1
4;2;BBB;3;1
5;1; ;4;2
6;2;NotNull;4;0
- Set the row and field separators in the corresponding fields.
Defining the survivor validation flow
-
Double-click tRuleSurvivorship to open
its Component view. -
Select GRP_ID from the Group Identifier list and GRP_SIZE from the Group
size list. -
In the Rule package name field, replace
the default name org.talend.survivorship.sample with a name of your choice,
if needed. The survivor validation flow will be generated and saved under this name
in the Repository tree view of the
Integration
perspective. -
In the Rule table, click the plus button
to add a row per rule. In this example, define one rule and complete it as follows:
Order | Rule name | Reference column | Function | Value | Target column
Sequential | "Rule1" | File | Expression | .equals("1") | Acctname
One rule, "Rule1", will be generated and executed by
tRuleSurvivorship. This rule validates
the records whose value in the File column
complies with the expression you enter in the Value column of the Rule
table. The component then selects the corresponding value
from the Acctname
target column as the best-of-breed data. -
Next to Generate rules and survivorship
flow, click the icon to generate the rule package according to the
conditions you have defined. The rule package is generated and saved under Metadata > Rules Management > Survivorship Rules in the Repository tree
view of the
Integration
perspective. -
In the Repository tree view, browse to
the rule file under the Survivorship Rules
folder and double-click "Rule1" to open it. As generated, this rule selects the values that come from file 1. However, you
may want to survive records based on more specific criteria: for example, if
Acctname has a value in file 1, use that value; otherwise, use the value from file 2 instead. To do
this, you must manually modify the code in the rule file. -
Modify the rule with the following Drools code:
package org.talend.survivorship.sample

rule "ExistInFile1"
no-loop true
dialect "mvel"
ruleflow-group "Rule1Group"
when
    $input : RecordIn( file.equals("1"), acctname != null, !acctname.trim().equals("") )
then
    System.out.println("ExistInFile1 fired " + $input.record_id);
    dataset.survive( $input.TALEND_INTERNAL_ID, "Acctname" );
    dataset.survive( $input.TALEND_INTERNAL_ID, "File" );
end

rule "NotExistFile1"
no-loop true
dialect "mvel"
ruleflow-group "Rule1Group"
when
    $input : RecordIn( file.equals("2"), acctname != null && !acctname.trim().equals("") )
    (not (exists (RecordIn( file.equals("1") )))
    or exists( RecordIn( file.equals("1"), acctname == null || acctname.trim().equals("") ) ))
then
    System.out.println("NotExistFile1 fired " + $input.record_id);
    dataset.survive( $input.TALEND_INTERNAL_ID, "Acctname" );
    dataset.survive( $input.TALEND_INTERNAL_ID, "File" );
end
Warning:
After you modify the rule file, you must not click the
icon again. Otherwise, your modifications will be
replaced by the newly generated rule package.
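If you also want the modified rules to survive additional columns, you can append further dataset.survive calls following the same pattern. For example, the following line (illustrative only, using the Record_ID column from this scenario's schema) could be added before the end keyword of either rule:

    dataset.survive( $input.TALEND_INTERNAL_ID, "Record_ID" ); // hypothetical extra column to survive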
Executing the Job
-
Double-click tLogRow to open the
Component view and, in the Mode area, select the Table
(print values in cells of a table) option. The execution result of the Job will be printed in a table. -
Press F6 to execute the Job.
The Run view is opened automatically
showing the execution results. You can read that four rows are the survivor records because their
SURVIVOR column indicates true. In the survivor records, the
Acctname value is selected from file 1 if the
value exists; if not, the value is selected from file 2, as you defined in
the rule. The other rows are the duplicates of the same groups. The CONFLICT column shows that no column
has more than one value compliant with the given validation rules.
tRuleSurvivorship properties for Apache Spark Batch
These properties are used to configure tRuleSurvivorship running in the Spark Batch Job framework.
The Spark Batch
tRuleSurvivorship component belongs to the Data Quality family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. This component provides two read-only columns: SURVIVOR and CONFLICT.
When a survivor record is created, the SURVIVOR column of this record is set to true, and the CONFLICT column lists the columns for which several field values comply with the validation rules. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Group identifier |
Select the column whose content indicates the required group identifiers from the input flow. |
Rule package name |
Type in the name of the rule package you want to create with this component. |
Generate rules and survivorship flow |
Once you have defined all of the rules of a rule package, or modified some of them, click this
icon to generate the rule package into the Survivorship Rules node of Rules Management. Note:
This step is necessary to validate these changes and take them into account at Job execution. Warning: In a rule package, two rules cannot use the same
name. |
Rule table |
Complete this table to create a complete survivor validation flow. Basically, each rule defined in this table represents one step of this flow.
Order: From the list, select the execution order of the
rules you are creating so as to define a survivor validation flow. The types of order include Sequential, Multi-condition and Multi-target.
Rule Name: Type in the name of each rule you are creating.
Reference column: Select the column you need to apply a validation rule to.
Function: Select the type of validation operation to be
performed on a given Reference column. The available types include, for example, Expression, Longest, Most common, Most recent and Match regex.
Value: enter the expression of interest corresponding to the selected Function, when applicable.
Target column: when a step is executed, it validates a value in the Reference column and, if that value passes, selects the corresponding field value from the Target column as best-of-breed data.
Ignore blanks: Select the check boxes which correspond to the columns for which you want blank values to be ignored. |
Define conflict rule |
Select this check box to be able to create rules to resolve conflicts, that is, cases where several field values comply with the validation rules. |
Conflict rule table |
Complete this table to create rules to resolve conflicts. This table is available only when the Define conflict rule check box is selected.
Rule name: Type in the name of each rule you are creating.
Conflicting column: Select the column whose conflicting values you want to resolve when a step is executed.
Function: Select the type of
validation operation to be performed on a given Conflicting column. The available types include those in the Rule table as well as additional conflict-resolution operations.
Value: enter the expression of interest corresponding to the selected Function, when applicable.
Reference column: Select the column you need to use as a reference when resolving the conflict.
Ignore blanks: Select the check boxes which correspond to the columns for which you want blank values to be ignored.
Disable: Select the check box to disable the corresponding rule. |
Advanced settings
Set the number of partitions by |
Enter the number of partitions you want to split each group into. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
Creating a clean data set from the suspect pairs labeled by tMatchPredict and the
unique rows computed by tMatchPairing
This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.
The input data of this Job is composed of:
-
The suspect records labeled as duplicates and grouped by
tMatchPredict. You can find an example of how to label suspect
pairs with assigned labels on Talend Help Center (https://help.talend.com). -
The unique rows computed by tMatchPairing.
You can find examples of how to compute unique rows
from source data on Talend Help Center (https://help.talend.com).
This Job is composed of two subJobs:
-
In the first subJob, tRuleSurvivorship processes the
records labeled as duplicates and grouped by
tMatchPredict, to create one single
representation of each duplicates group. -
In the second subJob, tUnite merges the survivors and
the unique rows to create a clean and deduplicated data set to be used with
the tMatchIndex component.
The output file contains clean and deduplicated data. You can index this reference data
set in Elasticsearch using the tMatchIndex component.
Setting up the Job
-
Set up the first subJob:
-
Drop the following components from the Palette onto the design
workspace: tFileInputDelimited, two
tFilterRow components, tRuleSurvivorship and
tFileOutputDelimited. Use the Main link to connect the
components. -
Connect tFileInputDelimited to the first
tFilterRow component. -
Connect the first tFilterRow component to
tRuleSurvivorship. -
Connect tRuleSurvivorship to the second
tFilterRow component. -
Connect the second tFilterRow component to
tFileOutputDelimited.
-
-
Set up the second subJob:
-
Drop the following components from the Palette onto the design
workspace: two tFileInputDelimited components,
tFilterColumns, tUnite
and tFileOutputDelimited.Use the Main link to connect the
components. -
Connect the first tFileInputDelimited component
to tFilterColumns. -
Connect tFilterColumns to
tUnite. -
Connect the second tFileInputDelimited component
to tUnite. -
Connect tUnite to
tFileOutputDelimited.
-
-
Connect the tFileInputDelimited from the first subJob to
the tFileInputDelimited from the second subJob using a Trigger > OnSubjobOk link.
Selecting the Spark mode
Depending on the Spark cluster to be used, select a Spark mode for your Job.
The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.
-
Click Run to open its view and then click the
Spark Configuration tab to display its view
for configuring the Spark connection. -
Select the Use local mode check box to test your Job locally.
In the local mode, the Studio builds the Spark environment in itself on the fly in order to
run the Job in it. Each processor of the local machine is used as a Spark
worker to perform the computations. In this mode, your local file system is used; therefore, deactivate the
configuration components such as tS3Configuration or
tHDFSConfiguration that provide connection
information to a remote file system, if you have placed these components
in your Job.You can launch
your Job without any further configuration. -
Clear the Use local mode check box to display the
list of the available Hadoop distributions and from this list, select
the distribution corresponding to your Spark cluster to be used.This distribution could be:-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn client
-
-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn cluster
-
-
Cloudera Altus
For this distribution, Talend supports:-
Yarn cluster
Your Altus cluster should run on the following Cloud
providers:-
Azure
The support for Altus on Azure is a technical
preview feature. -
AWS
-
-
As a Job relies on Avro to move data among its components, it is recommended to set your
cluster to use Kryo to handle the Avro types. This not only helps avoid
this Avro known issue but also
brings inherent performance gains. The Spark property to be set in your
cluster is:
spark.serializer org.apache.spark.serializer.KryoSerializer
If you cannot find the distribution corresponding to yours from this
drop-down list, this means the distribution you want to connect to is not officially
supported by
Talend
. In this situation, you can select Custom, then select the Spark
version of the cluster to be connected and click the
[+] button to display the dialog box in which you can
alternatively:-
Select Import from existing
version to import an officially supported distribution as base
and then add other required jar files which the base distribution does not
provide. -
Select Import from zip to
import the configuration zip for the custom distribution to be used. This zip
file should contain the libraries of the different Hadoop/Spark elements and the
index file of these libraries.In
Talend
Exchange, members of
Talend
community have shared some ready-for-use configuration zip files
which you can download from this Hadoop configuration
list and directly use them in your connection accordingly. However, because of
the ongoing evolution of the different Hadoop-related projects, you might not be
able to find the configuration zip corresponding to your distribution from this
list; then it is recommended to use the Import from
existing version option to take an existing distribution as base
to add the jars required by your distribution.Note that custom versions are not officially supported by
Talend
.
Talend
and its community provide you with the opportunity to connect to
custom versions from the Studio but cannot guarantee that the configuration of
whichever version you choose will be easy. As such, you should only attempt to
set up such a connection if you have sufficient Hadoop and Spark experience to
handle any issues on your own.
For a step-by-step example about how to connect to a custom
distribution and share this connection, see Hortonworks.
Configuring the connection to the file system to be used by Spark
Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration tab.
-
Double-click tHDFSConfiguration to open its Component view.
Spark uses this component to connect to the HDFS system to which the jar
files dependent on the Job are transferred. -
If you have defined the HDFS connection metadata under the Hadoop
cluster node in Repository, select
Repository from the Property
type drop-down list and then click the
[…] button to select the HDFS connection you have
defined from the Repository content wizard.For further information about setting up a reusable
HDFS connection, search for centralizing HDFS metadata on Talend Help Center
(https://help.talend.com).If you complete this step, you can skip the following steps about configuring
tHDFSConfiguration because all the required fields
should have been filled automatically. -
In the Version area, select
the Hadoop distribution you need to connect to and its version. -
In the NameNode URI field,
enter the location of the machine hosting the NameNode service of the cluster.
If you are using WebHDFS, the location should be
webhdfs://masternode:portnumber; WebHDFS with SSL is not
supported yet. -
In the Username field, enter
the authentication information used to connect to the HDFS system to be used.
Note that the user name must be the same as the one you have entered in the Spark configuration tab.
Creating survivors from the suspect pairs labeled by tMatchPredict
Configuring the input component
-
Double-click tFileInputDelimited to open its
Basic settings view.The input data must be the suspect pairs labeled and grouped by the
tMatchPredict component. -
Click the […] button next to Edit
schema and use the [+] button in the
dialog box to add columns.The input schema must be the same as the suspect pairs outputted by the
tMatchPredict component. -
Click OK in the dialog box and accept to propagate the
changes when prompted. -
In the Folder/File field, set the path to the input
file. -
Set the row and field separators in the corresponding fields and the header and
footer, if any.
Selecting the Spark mode
Depending on the Spark cluster to be used, select a Spark mode for your Job.
The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.
-
Click Run to open its view and then click the
Spark Configuration tab to display its view
for configuring the Spark connection. -
Select the Use local mode check box to test your Job locally.
In the local mode, the Studio builds the Spark environment in itself on the fly in order to
run the Job in it. Each processor of the local machine is used as a Spark
worker to perform the computations. In this mode, your local file system is used; therefore, deactivate the
configuration components such as tS3Configuration or
tHDFSConfiguration that provide connection
information to a remote file system, if you have placed these components
in your Job.You can launch
your Job without any further configuration. -
Clear the Use local mode check box to display the
list of the available Hadoop distributions and from this list, select
the distribution corresponding to your Spark cluster to be used.This distribution could be:-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn client
-
-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn cluster
-
-
Cloudera Altus
For this distribution, Talend supports:-
Yarn cluster
Your Altus cluster should run on the following Cloud
providers:-
Azure
The support for Altus on Azure is a technical
preview feature. -
AWS
-
-
As a Job relies on Avro to move data among its components, it is recommended to set your
cluster to use Kryo to handle the Avro types. This not only helps avoid
this Avro known issue but also
brings inherent performance gains. The Spark property to be set in your
cluster is:
spark.serializer org.apache.spark.serializer.KryoSerializer
If you cannot find the distribution corresponding to yours from this
drop-down list, this means the distribution you want to connect to is not officially
supported by
Talend
. In this situation, you can select Custom, then select the Spark
version of the cluster to be connected and click the
[+] button to display the dialog box in which you can
alternatively:-
Select Import from existing
version to import an officially supported distribution as base
and then add other required jar files which the base distribution does not
provide. -
Select Import from zip to
import the configuration zip for the custom distribution to be used. This zip
file should contain the libraries of the different Hadoop/Spark elements and the
index file of these libraries.In
Talend
Exchange, members of
Talend
community have shared some ready-for-use configuration zip files
which you can download from this Hadoop configuration
list and directly use them in your connection accordingly. However, because of
the ongoing evolution of the different Hadoop-related projects, you might not be
able to find the configuration zip corresponding to your distribution from this
list; then it is recommended to use the Import from
existing version option to take an existing distribution as base
to add the jars required by your distribution.Note that custom versions are not officially supported by
Talend
.
Talend
and its community provide you with the opportunity to connect to
custom versions from the Studio but cannot guarantee that the configuration of
whichever version you choose will be easy. As such, you should only attempt to
set up such a connection if you have sufficient Hadoop and Spark experience to
handle any issues on your own.
For a step-by-step example about how to connect to a custom
distribution and share this connection, see Hortonworks.
Configuring the connection to the file system to be used by Spark
Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration tab.
-
Double-click tHDFSConfiguration to open its Component view.
Spark uses this component to connect to the HDFS system to which the jar
files dependent on the Job are transferred. -
If you have defined the HDFS connection metadata under the Hadoop
cluster node in Repository, select
Repository from the Property
type drop-down list and then click the
[…] button to select the HDFS connection you have
defined from the Repository content wizard.For further information about setting up a reusable
HDFS connection, search for centralizing HDFS metadata on Talend Help Center
(https://help.talend.com).If you complete this step, you can skip the following steps about configuring
tHDFSConfiguration because all the required fields
should have been filled automatically. -
In the Version area, select
the Hadoop distribution you need to connect to and its version. -
In the NameNode URI field,
enter the location of the machine hosting the NameNode service of the cluster.
If you are using WebHDFS, the location should be
webhdfs://masternode:portnumber; WebHDFS with SSL is not
supported yet. -
In the Username field, enter
the authentication information used to connect to the HDFS system to be used.
Note that the user name must be the same as the one you have entered in the Spark configuration tab.
Configuring the filtering process to keep only the suspect pairs labeled as
duplicates
-
Double-click the first tFilterRow component to open its
Basic settings view. -
Click the Sync columns button to retrieve the schema
from the previous component. -
In the Conditions table, add a
condition and fill in the filtering parameters:-
From the Input Column list, select the column
which holds the labels set on the records, LABEL
in this example. -
From the Function list, select
Empty. -
From the Operator list, select ==. -
From the Value list, set the label used to
identify duplicates, "YES" in this example.
-
Defining the survivor validation flow
-
Double-click tRuleSurvivorship to open its
Basic settings view. -
Click the Sync columns button to retrieve the schema
from the previous component. -
From the list, select the column to be used as a Group
Identifier. -
In the Rule package name field, enter the name of the
rule package you need to create to define the survivor validation flow of
interest, org.talend.survivorship.sample in this
example. -
In the Rule table, click the [+]
button to add as many rows as required and complete them using the corresponding
rule definitions. -
Next to Generate rules and
survivorship flow, click the icon to generate the rule package with its contents you
have defined.You can find the generated rule package in the Metadata > Rules Management > Survivorship Rules directory of Talend Studio
Repository. From there, you can open the survivor
validation flow created in this example and read its diagram.
Configuring the filtering process to keep only the survivors
-
Double-click tFilterRow to open its Basic
settings view. -
Click the Sync columns button to retrieve the schema
from the previous component. -
In the Conditions table, add a condition
and fill in the filtering parameters:-
From the Input Column list, select the
SURVIVOR column. -
From the Function list, select
Empty. -
From the Operator list, select ==. -
From the Value list, enter
"true".
-
Configuring the output component for the survivors
-
Double-click the first tFileOutputDelimited component to
display the Basic settings view and define the component
properties. You have already accepted to propagate the schema to the output components
when you defined the input component. -
Clear the Define a storage configuration component check
box to use the local system as your target file system. -
In the Folder field, set the path to the folder which
will hold the output data. -
From the Action list, select the operation for writing
data:-
Select Create when you run the Job for the first
time. -
Select Overwrite to replace the file every time
you run the Job.
-
- Set the row and field separators in the corresponding fields.
Merging the survivors with the unique rows
Configuring the input components
-
Double-click the first tFileInputDelimited to open its
Basic settings view.The input data must be the survivors from the first subJob of this
scenario. -
Click the […] button next to Edit
schema and use the [+] button in the
dialog box to add columns.The input schema must be the same as the survivors outputted in the first
subJob. -
Click OK in the dialog box and accept to propagate the
changes when prompted. -
In the Folder/File field, set the path to the input
files. -
Set the row and field separators in the corresponding fields and the header and
footer, if any. -
Double-click the second tFileInputDelimited to open its
Basic settings view and define its properties. The input data must be the unique rows outputted by
tMatchPairing. The input schema must be the same as that of the survivors outputted in the first
subJob.
Configuring the filtering process to remove unwanted columns
-
Double-click tFilterColumns to open its Basic
settings view. -
Click the […] button next to Edit
schema. -
In the dialog box, select the unwanted columns and click
[x] to remove them from the output schema. In this example, keep the columns corresponding to the list of education
centers: Original_Id, Source,
Site_name and Address.
Configuring the merging process
-
Double-click tUnite to open its Basic
settings view. -
Click […] next to Edit schema
to check that the output schema corresponds to the schema from the input
tFileInputDelimited components. -
Double-click the second tFileOutputDelimited component to
display the Basic settings view and define the component
properties. You have already accepted to propagate the schema to the output components
when you defined the input component. -
Clear the Define a storage configuration component check
box to use the local system as your target file system. -
In the Folder field, set the path to the folder which
will hold the output data. -
From the Action list, select the operation for writing
data:-
Select Create when you run the Job for the first
time. -
Select Overwrite to replace the file every time
you run the Job.
-
- Set the row and field separators in the corresponding fields.
-
Select the Merge results to single file check box, and
in the Merge file path field set the path where to output
the file of the clean and deduplicated data set.
Executing the Job
A single representation for each duplicates group is created and merged with the
unique rows in a single file.
The data set is now clean and deduplicated.
You can use tMatchIndex to index this reference data set in
Elasticsearch for continuous matching purposes.
You can find an example of how to index a reference data set
in Elasticsearch on Talend Help Center (https://help.talend.com).
Converting the Standard Job to a Spark Batch Job
Procedure
-
In the Repository tree view of the
Integration
perspective of
Talend Studio
,
right-click the Job you have created in the earlier scenario to open its contextual
menu and select Edit properties.Then the Edit properties dialog box is
displayed. The Job must be closed before you are able to make any changes in this
dialog box.Note that you can change the Job name as well as the other descriptive information
about the Job from this dialog box. - From the Job Type list, select Big Data Batch.
-
From the Framework list, select Spark. Then a Spark Job using the same name appears under the
Big Data Batch sub-node of the Job Design node.
If you need to create this Spark Job from scratch, you have to right-click the Job Design node or the Big Data
Batch sub-node and select Create Big Data Batch
Job from the contextual menu. Then an empty Job is opened in the
workspace.
Selecting the Spark mode
Depending on the Spark cluster to be used, select a Spark mode for your Job.
The Spark documentation provides an exhaustive list of Spark properties and
their default values at Spark Configuration. A Spark Job designed in the Studio uses
this default configuration except for the properties you explicitly defined in the
Spark Configuration tab or the components
used in your Job.
-
Click Run to open its view and then click the
Spark Configuration tab to display its view
for configuring the Spark connection. -
Select the Use local mode check box to test your Job locally.
In the local mode, the Studio builds the Spark environment in itself on the fly in order to
run the Job in it. Each processor of the local machine is used as a Spark
worker to perform the computations. In this mode, your local file system is used; therefore, deactivate the
configuration components such as tS3Configuration or
tHDFSConfiguration that provide connection
information to a remote file system, if you have placed these components
in your Job.You can launch
your Job without any further configuration. -
Clear the Use local mode check box to display the
list of the available Hadoop distributions and from this list, select
the distribution corresponding to your Spark cluster to be used.This distribution could be:-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn client
-
-
For this distribution, Talend supports:
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Standalone
-
Yarn client
-
Yarn cluster
-
-
For this distribution, Talend supports:
-
Yarn cluster
-
-
Cloudera Altus
For this distribution, Talend supports:-
Yarn cluster
Your Altus cluster should run on the following Cloud
providers:-
Azure
The support for Altus on Azure is a technical
preview feature. -
AWS
-
-
As a Job relies on Avro to move data among its components, it is recommended to set your
cluster to use Kryo to handle the Avro types. This not only helps avoid
this Avro known issue but also
brings inherent performance gains. The Spark property to be set in your
cluster is:
spark.serializer org.apache.spark.serializer.KryoSerializer
If you cannot find the distribution corresponding to yours from this
drop-down list, this means the distribution you want to connect to is not officially
supported by
Talend
. In this situation, you can select Custom, then select the Spark
version of the cluster to be connected and click the
[+] button to display the dialog box in which you can
alternatively:-
Select Import from existing
version to import an officially supported distribution as base
and then add other required jar files which the base distribution does not
provide. -
Select Import from zip to
import the configuration zip for the custom distribution to be used. This zip
file should contain the libraries of the different Hadoop/Spark elements and the
index file of these libraries.In
Talend
Exchange, members of
Talend
community have shared some ready-for-use configuration zip files
which you can download from this Hadoop configuration
list and directly use them in your connection accordingly. However, because of
the ongoing evolution of the different Hadoop-related projects, you might not be
able to find the configuration zip corresponding to your distribution from this
list; then it is recommended to use the Import from
existing version option to take an existing distribution as base
to add the jars required by your distribution.Note that custom versions are not officially supported by
Talend
.
Talend
and its community provide you with the opportunity to connect to
custom versions from the Studio but cannot guarantee that the configuration of
whichever version you choose will be easy. As such, you should only attempt to
set up such a connection if you have sufficient Hadoop and Spark experience to
handle any issues on your own.
For a step-by-step example about how to connect to a custom
distribution and share this connection, see Hortonworks.
Configuring the connection to the file system to be used by Spark
Skip this section if you are using Google Dataproc or HDInsight, as for these two
distributions, this connection is configured in the Spark
configuration tab.
-
Double-click tHDFSConfiguration to open its Component view.
Spark uses this component to connect to the HDFS system to which the jar
files dependent on the Job are transferred. -
If you have defined the HDFS connection metadata under the Hadoop
cluster node in Repository, select
Repository from the Property
type drop-down list and then click the
[…] button to select the HDFS connection you have
defined from the Repository content wizard.For further information about setting up a reusable
HDFS connection, search for centralizing HDFS metadata on Talend Help Center
(https://help.talend.com).If you complete this step, you can skip the following steps about configuring
tHDFSConfiguration because all the required fields
should have been filled automatically. -
In the Version area, select
the Hadoop distribution you need to connect to and its version. -
In the NameNode URI field,
enter the location of the machine hosting the NameNode service of the cluster.
If you are using WebHDFS, the location should be
webhdfs://masternode:portnumber; WebHDFS with SSL is not
supported yet. -
In the Username field, enter
the authentication information used to connect to the HDFS system to be used.
Note that the user name must be the same as the one you have entered in the Spark configuration tab.