Component family
|
Data Quality/Matching
|
This component is deprecated as tGenKey can now be used in both standard and
Map/Reduce Jobs. tGenKeyHadoop will
continue to work in Jobs you import from older releases.
|
Function
|
This component connects to a given Hadoop distribution to process
the data stored in HDFS. It enables you to apply several kinds of
algorithms on each input column and use the computed results to
generate a functional key. These algorithms can be key or optional
algorithms.
Note
The values returned by the key algorithms will be
concatenated according to the column order in the Key composition table.
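For illustration only, here is a minimal Java sketch of that concatenation (not the component's generated code; the sample values and algorithm choices are hypothetical):
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class FunctionalKeySketch {
    public static void main(String[] args) {
        // One input row: the values of the columns listed in the Key composition
        // table, in table order (hypothetical sample data).
        List<String> row = Arrays.asList("Doe", "Paris");
        // One key algorithm per column, in the same order.
        List<UnaryOperator<String>> keyAlgorithms = Arrays.asList(
                v -> v.substring(0, 1),   // "first N characters of the string", N = 1
                v -> v.toUpperCase()      // "exact", after an "upper case" pre-algorithm
        );
        // The functional key is the concatenation of the per-column results.
        StringBuilder functionalKey = new StringBuilder();
        for (int i = 0; i < row.size(); i++) {
            functionalKey.append(keyAlgorithms.get(i).apply(row.get(i)));
        }
        System.out.println(functionalKey); // prints DPARIS
    }
}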
|
Purpose
|
This component generates a functional key from the input columns,
by applying different types of algorithms on each column and
grouping the computed results in one key. It outputs this key
together with the input columns.
Using the generated functional key, this component helps you
narrow down your data filtering or matching results, for example.
|
Basic settings
|
Property type
|
Either Built-in or Repository.
Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.
|
|
Schema and Edit
schema
|
A schema is a row description. It defines the number of fields to
be processed and passed on to the next component. The schema is
either Built-in or stored remotely
in the Repository.
Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.
Click Sync columns to retrieve
the schema from the previous component in the Job.
|
|
|
Built-in: You create and store
the schema locally for this component only. Related topic: see
Talend Studio User
Guide.
|
|
|
Repository: You have already
created and stored the schema in the Repository. You can reuse it in
other projects and job designs. Related topic: see Talend Studio User Guide.
|
|
Link with a tGenKeyHadoop
|
Select this check box to reuse the connection, created by another
tGenKeyHadoop, to a given file
in HDFS. From the Component list,
select the relevant tGenKeyHadoop
component to reuse the Hadoop connection details you already
defined.
|
|
Use an existing connection
|
Select this check box and, in the Component List, click the
HDFS connection component from which you want to reuse the connection details already
defined.
Note
When a Job contains a parent Job and a child Job, the Component
List presents only the connection components in the same Job
level.
|
Version
Note
Unavailable if you use an existing link.
|
Distribution
|
Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:
-
If available in this Distribution drop-down list, the
Microsoft HD Insight option allows you to use a
Microsoft HD Insight cluster. For this purpose, you need to configure the
connections to the WebHCat service, the HD Insight service and the Windows Azure
Storage service of that cluster in the areas that are displayed. A demonstration
video about how to configure this connection is available in the following link:
https://www.youtube.com/watch?v=A3QTT6VsNoM
-
The Custom option allows you to connect to a
cluster different from any of the distributions given in this list, that is to
say, to connect to a cluster not officially supported by Talend.
To connect to a custom distribution, once you have selected Custom, click the button to display the dialog box in which you can
alternatively:
-
Select Import from existing version to import an
officially supported distribution as base and then add other required jar files
which the base distribution does not provide.
-
Select Import from zip to import a custom
distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.
Note
In this dialog box, the active check box must be kept selected so as to import
the jar files pertinent to the connection to be created between the custom
distribution and this component.
For a step-by-step example of how to connect to a custom distribution and
share this connection, see Connecting to a custom Hadoop distribution.
|
|
Hadoop version
|
Select the version of the Hadoop distribution you are using. The available options vary
depending on the component you are using. Along with the evolution of Hadoop, please note
the following changes:
-
If you use Hortonworks Data Platform V2.2, the
configuration files of your cluster might be using environment variables such as
${hdp.version}. If this is your situation, you
need to set the mapreduce.application.framework.path property in the Hadoop properties table of this component with the path value
explicitly pointing to the MapReduce framework archive of your cluster. For
example:
|
mapreduce.application.framework.path=/hdp/apps/2.2.0.0-2041/mapreduce/mapreduce.tar.gz#mr-framework |
-
If you use Hortonworks Data Platform V2.0.0, the
type of the operating system for running the distribution and a Talend
Job must be the same, such as Windows or Linux. Otherwise, you have to use Talend
Jobserver to execute the Job in the same type of operating system in which the
Hortonworks Data Platform V2.0.0 distribution you
are using is run. For further information about Talend Jobserver, see
Talend Installation
and Upgrade Guide.
|
|
NameNode URI
|
Select this check box to indicate the location of the NameNode of the Hadoop cluster to be
used. The NameNode is the master node of a Hadoop cluster. For example, if you
have chosen a machine called masternode as the NameNode
of an Apache Hadoop distribution, the location is hdfs://masternode:portnumber.
For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.
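For instance, assuming the NameNode of an Apache Hadoop cluster listens on port 8020 (a common default, but an assumption here; check the fs.defaultFS property in your cluster's core-site.xml), the value could look like:
hdfs://masternode:8020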
|
|
JobTracker URI
|
Select this check box to indicate the location of the JobTracker service within the Hadoop
cluster to be used. For example, if you have chosen a machine called machine1 as the JobTracker, set its location to machine1:portnumber. A JobTracker is the service that assigns
Map/Reduce tasks to specific nodes in a Hadoop cluster. Note that the term job in
JobTracker does not designate a Talend Job, but rather a Hadoop job,
described as an MR or MapReduce job in Apache’s Hadoop documentation on http://hadoop.apache.org.
If you use YARN in your Hadoop cluster such as Hortonworks Data
Platform V2.0.0 or Cloudera CDH4.3 + (YARN
mode), you need to specify the location of the Resource
Manager instead of the Jobtracker. Then you can continue to set the following
parameters depending on the configuration of the Hadoop cluster to be used (if you leave the
check box of a parameter clear, then at runtime, the configuration for this parameter in
the Hadoop cluster to be used will be ignored):
-
Select the Set resourcemanager scheduler
address check box and enter the Scheduler address in the field
that appears.
-
Allocate proper memory volumes to the Map and
the Reduce computations and the ApplicationMaster of YARN by selecting the Set memory check box in the Advanced settings view.
-
Select the Set jobhistory address check box
and enter the location of the JobHistory server of the Hadoop cluster to be
used. This allows the metrics information of the current Job to be stored in
that JobHistory server.
-
Select the Set staging directory check box
and enter this directory defined in your Hadoop cluster for temporary files
created by running programs. Typically, this directory can be found under the
yarn.app.mapreduce.am.staging-dir
property in the configuration files such as yarn-site.xml or mapred-site.xml of your distribution.
-
Select the Set Hadoop user check box and
enter the user name under which you want to execute the Job. Since a file or a
directory in Hadoop has its specific owner with appropriate read or write
rights, this field allows you to execute the Job directly under the user name
that has the appropriate rights to access the file or directory to be
processed.
-
Select the Use datanode hostname check box to
allow the Job to access datanodes via their hostnames. This actually sets the
dfs.client.use.datanode.hostname property
to true.
For further information about these parameters, see the documentation or
contact the administrator of the Hadoop cluster to be used.
For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial
in Apache’s Hadoop documentation on http://hadoop.apache.org.
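For illustration only, these YARN-related fields could be filled in as follows. The host name and values below are assumptions rather than values read from your cluster (8032, 8030 and 10020 are common Hadoop defaults for the Resource Manager, its scheduler and the JobHistory server):
Resource Manager: machine1:8032
Set resourcemanager scheduler address: machine1:8030
Set jobhistory address: machine1:10020
Set staging directory: /tmp/hadoop-yarn/staging
Set Hadoop user: hdfs
Use datanode hostname: selected (sets dfs.client.use.datanode.hostname=true)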
|
|
User name
|
Enter the user name for Hadoop authentication.
For some Hadoop versions, you also need to enter the name of the
supergroup to which the user belongs in the Group field that is displayed.
|
|
HDFS directory
|
Enter the HDFS directory that holds the data to be processed.
At runtime, this component cleans up any existing data in this
directory, writes the input data into it, and performs the
operations.
If you want to reuse data already stored in HDFS, select the
Use existing HDFS file check box in the Advanced settings view.
|
|
|
Click the import icon to select a match rule from the Studio
repository.
When you click the import icon, a [Match
Rule Selector] wizard is opened to help you import
blocking keys from match rules in the Studio repository and use them
in your Job.
You can only import rules created with the VSR algorithm. For
further information, see Importing match rules from the Studio repository.
|
Algorithm
|
Column
|
Select the column(s) from the main flow on which you want to
define certain algorithms to set the functional key.
Note
When you select a date column on which to apply an algorithm or a matching algorithm,
you can decide what to compare in the date format.
For example, if you want to only compare the year in the date, in the component schema
set the type of the date column to Date and then enter
“yyyy” in the Date
Pattern field. The component then converts the date format to a string
according to the pattern defined in the schema before starting a string
comparison.
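For illustration only, a minimal Java sketch of this conversion (not the component's generated code; the sample date is hypothetical):
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;

public class DatePatternSketch {
    public static void main(String[] args) {
        // Hypothetical sample value for a date column of the schema.
        Calendar c = Calendar.getInstance();
        c.set(2014, Calendar.MARCH, 15);
        Date birthDate = c.getTime();

        // Date Pattern "yyyy" defined in the schema: only the year is kept
        // for the string comparison.
        SimpleDateFormat pattern = new SimpleDateFormat("yyyy");
        String keyPart = pattern.format(birthDate);
        System.out.println(keyPart); // prints 2014
    }
}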
|
|
Pre-Algorithm
|
If required, select the relevant matching algorithm from the
list:
remove diacritical marks: removes
any diacritical mark.
remove diacritical marks and lower
case: removes any diacritical mark and converts to
lower case before generating the code of the column.
remove diacritical marks and upper case:
removes any diacritical mark and converts to upper case
before generating the code of the column.
lower case: converts the field to
lower case before applying the key algorithm.
upper case: converts the field to
upper case before applying the key algorithm.
add left position character:
enables you to add a character to the left of the column.
add right position character:
enables you to add a character to the right of the column.
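As a rough illustration of the remove diacritical marks and lower case option, here is a minimal Java sketch (Talend's own implementation may differ; the sample value is hypothetical):
import java.text.Normalizer;

public class PreAlgorithmSketch {
    public static void main(String[] args) {
        String input = "Gödel"; // hypothetical sample value
        // Decompose accented characters, then strip the combining marks.
        String noMarks = Normalizer.normalize(input, Normalizer.Form.NFD)
                                   .replaceAll("\\p{M}", "");
        // Convert to lower case before the key algorithm is applied.
        System.out.println(noMarks.toLowerCase()); // prints godel
    }
}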
|
|
Value
|
Set the algorithm value, where applicable.
|
|
Algorithm
|
Select the relevant algorithm from the list:
first character of each word:
includes in the functional key the first character of each word in
the column.
N first characters of each word:
includes in the functional key N first characters of each word in
the column.
first N characters of the string:
includes in the functional key N first characters of the
string.
last N characters of the string:
includes in the functional key N last characters of the
string.
first N consonants of the string:
includes in the functional key N first consonants of the
string.
first N vowels of the string:
includes in the functional key N first vowels of the string.
pick characters: includes in the
functional key the characters located at a fixed position
(corresponding to the set digital/range).
exact: includes in the functional
key the full string.
substring(a,b): includes in the
functional key the characters between the specified indexes.
soundex code: generates a code
according to a standard English phonetic algorithm. This code
represents the character string that will be included in the
functional key.
metaphone code: generates a code
according to the character pronunciation. This code represents the
character string that will be included in the functional key.
double-metaphone code: generates a
code according to the character pronunciation using a new version of
the Metaphone phonetic algorithm, which produces more accurate
results than the original algorithm. This code represents the
character string that will be included in the functional key.
fingerPrintkey: generates the
functional key from a string value through the following sequential process:
-
remove leading and trailing whitespace,
-
change all characters to their lowercase
representation,
-
remove all punctuation and control characters,
-
split the string into whitespace-separated
tokens,
-
sort the tokens and remove duplicates,
-
join the tokens back together,
Because the string parts are sorted, the given order
of tokens does not matter. So, Cruise,
Tom and Tom Cruise
both end up with a fingerprint cruise
tom and therefore end up in the same
cluster.
-
normalize extended western characters to their ASCII
representation, for example gödel
to godel.
This reproduces data entry mistakes made when
entering extended characters with an ASCII-only
keyboard. However, this procedure can also lead to false
positives: for example, gödel and
godél would both end up with
godel as their fingerprint, although
they are likely to be different names. So this might
work less effectively for datasets where extended
characters play a substantial differentiation role. A minimal
sketch of this process is shown after this list.
nGramkey: this algorithm is similar to the fingerPrintkey method described above.
But instead of using whitespace separated tokens, it uses n-grams,
where the n can be specified by the user. This
method generates the functional key from a string value through the
following sequential process:
-
change all characters to their lowercase
representation,
-
remove all punctuation and control characters,
-
obtain all the string n-grams,
-
sort the n-grams and remove duplicates,
-
join the sorted n-grams back together,
-
normalize extended western characters to their ASCII
representation, for example gödel
to godel.
For example, the 2-gram fingerprint of
Paris is
arispari and the 1-gram
fingerprint is aiprs.
The delivered implementation of this algorithm is
2-grams.
Note
If the column on which you want to use the nGramkey algorithm can contain data
with only 0 or 1
characters, you must filter this data before generating the
functional key. This way you avoid comparing
records to each other that should not be match
candidates.
colognePhonetic: a soundex
phonetic algorithm optimized for the German language. It encodes a
string into a Cologne phonetic value. This code represents the
character string that will be included in the functional key.
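The fingerPrintkey and nGramkey processes described above can be approximated by the following Java sketch (an illustration only, not Talend's implementation; tokenization details may differ):
import java.text.Normalizer;
import java.util.TreeSet;

public class KeyAlgorithmSketch {

    // normalize extended western characters to their ASCII representation
    static String asciiFold(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    static String fingerPrintKey(String value) {
        String cleaned = value.trim()                        // remove leading and trailing whitespace
                .toLowerCase()                               // lower-case representation
                .replaceAll("[\\p{Punct}\\p{Cntrl}]", "");   // remove punctuation and control characters
        TreeSet<String> tokens = new TreeSet<>();            // sorts the tokens and removes duplicates
        for (String token : cleaned.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return asciiFold(String.join(" ", tokens));          // join the tokens back together
    }

    static String nGramKey(String value, int n) {
        String cleaned = value.toLowerCase()
                .replaceAll("[\\p{Punct}\\p{Cntrl}\\s]", "");
        TreeSet<String> grams = new TreeSet<>();             // sorts the n-grams and removes duplicates
        for (int i = 0; i + n <= cleaned.length(); i++) {
            grams.add(cleaned.substring(i, i + n));
        }
        return asciiFold(String.join("", grams));            // join the sorted n-grams back together
    }

    public static void main(String[] args) {
        System.out.println(fingerPrintKey("Tom Cruise"));    // cruise tom
        System.out.println(fingerPrintKey("Cruise, Tom"));   // cruise tom
        System.out.println(nGramKey("Paris", 2));            // arispari
        System.out.println(nGramKey("Paris", 1));            // aiprs
    }
}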
|
|
Value
|
Set the algorithm value, where applicable.
|
|
Post-Algorithm
|
If required, select the relevant matching algorithm from the
list:
use default value (string): enables
you to choose a string to replace null or empty data.
add left position character:
enables you to add a character to the left of the column.
add right position character:
enables you to add a character to the right of the column.
|
|
Value
|
Set the option value, where applicable.
|
|
Show help
|
Select this check box to display instructions on how to set
algorithms/options parameters.
|
Advanced settings
|
Hadoop Properties
|
Talend Studio uses a default configuration for its engine to perform
operations in a Hadoop distribution. If you need to use a custom configuration in a specific
situation, complete this table with the property or properties to be customized. Then at
runtime, the customized property or properties will override those default ones.
For further information about the properties required by Hadoop and its related systems such
as HDFS and Hive, see the documentation of the Hadoop distribution you
are using or see Apache’s Hadoop documentation on http://hadoop.apache.org/docs and then select the version of the documentation you want.
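For example, entries such as the following (hypothetical values, written here in the same property=value form as above) would override the default replication factor and the memory allocated to each map task:
dfs.replication=1
mapreduce.map.memory.mb=2048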
This table will not be available when you select the Link with a tGenKeyHadoop check
box.
|
|
Keep data in Hadoop
|
Select this check box to keep the data processed by this component
in the HDFS file.
If you leave this check box unselected, the component processes
the data, then retrieves it from the HDFS file and outputs it in the
Job flow.
|
|
Use existing HDFS file
|
Select this check box to enable the component to directly process
the data located in a given directory in HDFS. When this check box
is selected, this component can act as a start component in your
Job.
HDFS file URI: set the URI of the
HDFS file holding the data you want to process.
Field delimiter: set the
character used as a field delimiter in the HDFS file.
If you leave this check box unselected, this component receives
the data flow to be processed and loads it into an HDFS file
specified in the HDFS directory
field.
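For example, with purely hypothetical values:
HDFS file URI: "/user/talend/in/customers"
Field delimiter: ";"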
|
|
tStatCatcher Statistics
|
Select this check box to collect log data at the component
level.
|
Dynamic settings
|
Click the [+] button to add a row
in the table and fill the Code
field with a context variable to choose your HDFS connection
dynamically from multiple connections planned in your Job.
The Dynamic settings table is available only when the
Use an existing connection check box is selected in the
Basic settings view. Once a dynamic parameter is
defined, the Component List box in the Basic settings view becomes unusable.
For more information on Dynamic settings and context
variables, see Talend Studio User Guide.
|
Global Variables
|
ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.
A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio
User Guide.
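For example, the After variable can typically be read from the globalMap in a downstream component such as tJava; the component label tGenKeyHadoop_1 below is an assumption, adjust it to the label in your Job:
// Inside a tJava component, after tGenKeyHadoop_1 has finished executing.
String error = (String) globalMap.get("tGenKeyHadoop_1_ERROR_MESSAGE");
if (error != null) {
    System.err.println("tGenKeyHadoop_1 reported: " + error);
}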
|
Usage
|
This component can be a start or an intermediary component to read
and process given data in HDFS. This component can be used with
other components, such as tMatchGroupHadoop, in order to create a blocking key
for the data stored in HDFS.
|
Limitation/prerequisite
|
You need to use Linux to run the Job using this component.
|