tHBaseInput
Reads data from a given HBase database and extracts the columns of your selection.
HBase is a distributed, column-oriented database that hosts very large,
sparsely populated tables on clusters.
tHBaseInput extracts the columns corresponding to the schema definition, then passes them to the next component via a Main row link.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
- Standard: see tHBaseInput Standard properties.
  The component in this framework is available in all Talend products with Big Data and in Talend Data Fabric.
- MapReduce: see tHBaseInput MapReduce properties (deprecated).
  The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Batch: see tHBaseInput properties for Apache Spark Batch.
  The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
HBase filters
This table presents the HBase filters available in Talend Studio and the parameters required by those filters.
| Filter type | Filter column | Filter family | Filter operation | Filter value | Filter comparator type | Objective |
|---|---|---|---|---|---|---|
| Single Column Value Filter | Yes | Yes | Yes | Yes | Yes | It compares the value of a given column against the value defined in the Filter value column. |
| Family filter |  | Yes | Yes |  | Yes | It returns the columns of the family that meets the filtering condition. |
| Qualifier filter | Yes |  | Yes |  | Yes | It returns the columns whose column qualifiers match the filtering condition. |
| Column prefix filter | Yes | Yes |  |  |  | It returns all columns of which the qualifiers have the defined prefix. |
| Multiple column prefix filter | Yes (multiple prefixes are separated by commas) | Yes |  |  |  | It works the same way as a Column prefix filter but allows specifying multiple prefixes. |
| Column range filter | Yes (the ends of a range are separated by a comma) | Yes |  |  |  | It allows intra-row scanning and returns all matching columns of a scanned row. |
| Row filter |  |  | Yes | Yes | Yes | It filters on row keys and returns all rows that match the filtering condition. |
| Value filter |  |  | Yes | Yes | Yes | It returns only columns that have a specific value. |
The use of the listed HBase filters described above is subject to revisions made by Apache in its Apache HBase project; therefore, to fully understand how to use these HBase filters, we recommend reading Apache's HBase documentation.
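As a point of reference, the sketch below shows what one of these filters, the Single Column Value Filter, corresponds to in the plain HBase Java client API (HBase 1.x-era classes from org.apache.hadoop.hbase.filter). The table name customer and the family1:age column come from the scenario later on this page; the compared value and its string encoding are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleColumnValueFilterSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) {
            // Single Column Value Filter: compare the value of family1:age
            // against a defined value (string-encoded here, by assumption).
            Filter filter = new SingleColumnValueFilter(
                    Bytes.toBytes("family1"), Bytes.toBytes("age"),
                    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("23"));
            Scan scan = new Scan();
            scan.setFilter(filter);
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```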
tHBaseInput Standard properties
These properties are used to configure tHBaseInput running in the Standard Job framework.
The Standard
tHBaseInput component belongs to the Big Data and the Databases NoSQL families.
The component in this framework is available in all Talend products with Big Data
and in Talend Data Fabric.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data is stored centrally.
Repository: Select the repository file where the properties are stored. |
 |
Click this icon to open a database connection wizard and store the database connection parameters you set in the component Basic settings view. For more information about setting up and storing database connection parameters, see the Talend Studio documentation. |
Use an existing connection |
Select this check box and in the Component List click the relevant connection component to reuse the connection details you already defined. |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. Among these options, some require specific configuration. |
HBase version |
Select the version of the Hadoop distribution you are using. The available options vary depending on the component you are using. |
Hadoop version of the distribution |
This list is displayed only when you have selected Custom from the distribution list to connect to a cluster not yet officially supported by Talend Studio. |
Zookeeper quorum |
Type in the name or the URL of the Zookeeper service you use to coordinate the transaction between Talend Studio and your database. |
Zookeeper client port |
Type in the number of the client listening port of the Zookeeper service you are using. |
Use kerberos authentication |
If the database to be used is running with Kerberos security, select this check box, then enter the principal names in the displayed fields. You should be able to find the information in the hbase-site.xml file of the cluster to be used.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. For a sketch of the equivalent client-side configuration, see the example after this table. |
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Set table Namespace mappings |
Enter the string to be used to construct the mapping between an Apache HBase table and a MapR table. For the valid syntax you can use, see http://doc.mapr.com/display/MapR40x/Mapping+Table+Namespace+Between+Apache+HBase+Tables+and+MapR+Tables. |
Table name |
Type in the name of the table from which you need to extract columns. |
Define a row selection |
Select this check box and then, in the Start row and End row fields, enter the corresponding row keys to specify the range of rows you want the current component to extract. Different from the filters you can set with Is by filter, which require loading all records before filtering out the ones to be used, this feature lets you directly select only the rows to be used. |
Mapping |
Complete this table to map the columns of the table to be used with the schema columns you have defined for the data flow to be processed. |
Die on error |
Select the check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows. |
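A minimal sketch of how the connection-related settings above (Zookeeper quorum, Zookeeper client port, Kerberos principals, keytab) map onto the underlying HBase client configuration. The property keys are the standard hbase-site.xml ones; every host name, principal, and path below is a placeholder, not a value from this page:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.security.UserGroupInformation;

public class HBaseConnectionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Zookeeper quorum and Zookeeper client port.
        conf.set("hbase.zookeeper.quorum", "zk-host.example.com");      // placeholder
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        // Use kerberos authentication: the principal names normally come
        // from the hbase-site.xml file of the cluster to be used.
        conf.set("hadoop.security.authentication", "kerberos");
        conf.set("hbase.security.authentication", "kerberos");
        conf.set("hbase.master.kerberos.principal",
                "hbase/_HOST@EXAMPLE.COM");                             // placeholder
        conf.set("hbase.regionserver.kerberos.principal",
                "hbase/_HOST@EXAMPLE.COM");                             // placeholder
        // Use a keytab to authenticate: principal and keytab file path.
        UserGroupInformation.setConfiguration(conf);
        UserGroupInformation.loginUserFromKeytab(
                "user1@EXAMPLE.COM", "/path/to/user1.keytab");          // placeholders
    }
}
```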
Advanced settings
tStatCatcher Statistics |
Select this check box to collect log data at the component level. |
Properties |
If you need to use custom configuration for your database, complete this table with the property or properties to be customized; at runtime, the customized property or properties override the corresponding ones used by Talend Studio. For example, if you need to define the value of the dfs.replication property as 1 for the database configuration, add one row to this table and enter dfs.replication as the property name and 1 as its value. Note:
This table is not available when you are using an existing connection. |
Is by filter |
Select this check box to use filters to perform fine-grained data selection from your database. Once you select it, the Filter table that is used to define the filtering conditions becomes available. This feature leverages filters provided by HBase and is subject to the constraints explained in the Apache HBase documentation; see also the HBase filters table at the beginning of this page. |
Logical operation |
Select the operator you need to use to define the logical relation between filters. The available operators are:
And: every defined filtering condition must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ALL.
Or: at least one of the defined filtering conditions must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ONE. |
Filter |
Click the button under this table to add as many rows as required, each row representing a filter. The parameters you may need to set for a filter are those presented in the HBase filters table above: Filter type, Filter column, Filter family, Filter operation, Filter value and Filter comparator type.
Depending on the Filter type you are using, some of these parameters are unavailable. For a sketch of how the logical operation combines filters, see the example after this table. |
Retrieve timestamps |
Select this check box to load the timestamps of an HBase column into the data flow. |
Global Variables
Global Variables |
NB_LINE: the number of rows read by an input component or transferred to an output component. This is an After variable and it returns an integer.
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio documentation. |
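At runtime these variables sit in the Job's globalMap and are typically read in a tJava component executed after this one; a small sketch, assuming the component is labeled tHBaseInput_1 (the label is an example):

```java
// In a tJava connected after this component (for example with OnSubjobOk).
// Both are After variables, so read them once the component has finished.
Integer nbLine = (Integer) globalMap.get("tHBaseInput_1_NB_LINE");
String errorMessage = (String) globalMap.get("tHBaseInput_1_ERROR_MESSAGE");
System.out.println("Rows read: " + nbLine);
if (errorMessage != null) {
    System.out.println("Last error: " + errorMessage);
}
```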
Usage
Usage rule |
This component is a start component of a Job and always needs an output link. |
Prerequisites |
Before starting, ensure that you have met the Loopback IP prerequisites expected by your database. The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using. |
Exchanging customer data with HBase
This scenario applies only to Talend products with Big Data.
In this scenario, a six-component Job is used to exchange customer data with a given HBase database.
- tHBaseConnection: creates a connection to your HBase database.
- tFixedFlowInput: creates the data to be written into your HBase. In a real use case, this component could be replaced by other input components such as tFileInputDelimited.
- tHBaseOutput: writes the data it receives from the preceding component into your HBase.
- tHBaseInput: extracts the columns of interest from your HBase.
- tLogRow: presents the execution result.
- tHBaseClose: closes the transaction.
To replicate this scenario, proceed as the following sections illustrate.
Before starting the replication, your HBase and Zookeeper services should have been correctly installed and well configured. This scenario explains only how to use the Talend solution to perform data transactions with a given HBase database.
Dropping and linking the components
To do this, proceed as follows:
- Drop tHBaseConnection, tFixedFlowInput, tHBaseOutput, tHBaseInput, tLogRow and tHBaseClose from the Palette onto the design workspace.
- Right-click tHBaseConnection to open its contextual menu and select the Trigger > On Subjob Ok link from this menu to connect this component to tFixedFlowInput.
- Do the same to create the OnSubjobOk link from tFixedFlowInput to tHBaseInput and then to tHBaseClose.
- Right-click tFixedFlowInput and select the Row > Main link to connect this component to tHBaseOutput.
- Do the same to create the Main link from tHBaseInput to tLogRow.
The components to be used in this scenario are all placed and linked. You then need to configure them successively.
Configuring the connection
To configure the connection to your Zookeeper service and thus to the HBase of
interest, proceed as follows:
- On the design workspace of your Studio, double-click the tHBaseConnection component to open its Component view.
- Select Hortonworks Data Platform 1.0 from the HBase version list.
- In the Zookeeper quorum field, type in the name or the URL of the Zookeeper service you are using. In this example, the name of the service in use is hbase.
- In the Zookeeper client port field, type in the number of the client listening port. In this example, it is 2181.
- If the Zookeeper znode parent location has been defined in the Hadoop cluster you are connecting to, select the Set zookeeper znode parent check box and enter the value of this property in the field that is displayed. (These settings map to standard HBase client properties; see the sketch after this list.)
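For reference, these dialog fields correspond to standard HBase client properties. A sketch with the values used in this scenario (the znode parent value is a placeholder; HDP clusters commonly use /hbase-unsecure, but yours may differ):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ScenarioConnectionSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hbase");              // service name in this scenario
        conf.set("hbase.zookeeper.property.clientPort", "2181");  // client port in this scenario
        conf.set("zookeeper.znode.parent", "/hbase-unsecure");    // placeholder value
    }
}
```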
Configuring the process of writing data into the HBase
To do this, proceed as follows:
- On the design workspace, double-click the tFixedFlowInput component to open its Component view.
- In this view, click the three-dot button next to Edit schema to open the schema editor.
- Click the plus button three times to add three rows and, in the Column column, rename them respectively as: id, name and age.
- In the Type column, click each of these rows and, from the drop-down list, select the data type of every row. In this scenario, they are Integer for id and age, and String for name.
- Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
- In the Mode area, select Use Inline Content (delimited file) to display the fields for editing.
- In the Content field, type in the delimited data to be written into the HBase, with fields separated by the semicolon ";". In this example, the data is:

  ```
  1;Albert;23
  2;Alexandre;24
  3;Alfred-Hubert;22
  4;Andre;40
  5;Didier;28
  6;Anthony;35
  7;Artus;32
  8;Catherine;34
  9;Charles;21
  10;Christophe;36
  11;Christian;67
  12;Danniel;54
  13;Elisabeth;58
  14;Emile;32
  15;Gregory;30
  ```
- Double-click tHBaseOutput to open its Component view.
  Note: If this component does not have the same schema as the preceding component, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.
- Select the Use an existing connection check box and then select the connection you have configured earlier. In this example, it is tHBaseConnection_1.
- In the Table name field, type in the name of the table to be created in the HBase. In this example, it is customer.
- In the Action on table field, select the action of interest from the drop-down list. In this scenario, select Drop table if exists and create. This way, if a table named customer already exists in the HBase, it will be disabled and deleted before the new table is created.
- Click the Advanced settings tab to open the corresponding view.
- In the Family parameters table, add two rows by clicking the plus button, rename them as family1 and family2 respectively, and then leave the other columns empty. These two column families will be created in the HBase using the default family performance options.
  Note: The Family parameters table is available only when the action you have selected in the Action on table field is to create a table in HBase. For further information about this Family parameters table, see tHBaseOutput.
- In the Families table of the Basic settings view, enter the family names in the Family name column, each corresponding to the column this family contains. In this example, the id and the age columns belong to family1 and the name column to family2. (For a sketch of the equivalent raw HBase write, see the example after this list.)
  Note: These column families should already exist in the HBase to be connected to; if not, you need to define them in the Family parameters table of the Advanced settings view so that they are created at runtime.
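What this write amounts to in the plain HBase Java client API, shown for the first row of the scenario data (1;Albert;23); a sketch only, with values string-encoded by assumption:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) {
            // One row of the scenario data: 1;Albert;23.
            // As in the Families table: id and age go to family1, name to family2.
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("id"), Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("age"), Bytes.toBytes("23"));
            put.addColumn(Bytes.toBytes("family2"), Bytes.toBytes("name"), Bytes.toBytes("Albert"));
            table.put(put);
        }
    }
}
```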
Configuring the process of extracting data from the HBase
To do this, perform the following operations:
- Double-click tHBaseInput to open its Component view.
- Select the Use an existing connection check box and then select the connection you have configured earlier. In this example, it is tHBaseConnection_1.
- Click the three-dot button next to Edit schema to open the schema editor.
- Click the plus button three times to add three rows and rename them as id, name and age respectively in the Column column. This means that you extract these three columns from the HBase.
- Select the types for each of the three columns. In this example, Integer for id and age, and String for name.
- Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
- In the Table name field, type in the name of the table from which you extract the columns of interest. In this scenario, the table is customer.
- In the Mapping table, the Column column has already been filled in automatically since the schema was defined, so simply enter the name of every family in the Column family column, each corresponding to the column it contains (see the sketch after this list).
- Double-click tHBaseClose to open its Component view.
- In the Component List field, select the connection you need to close. In this example, this connection is tHBaseConnection_1.
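The Mapping table above corresponds, roughly, to declaring which family each column is read from; a sketch in the plain HBase Java client API with the families and columns of this scenario:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("customer"))) {
            // Mapping: id and age are read from family1, name from family2.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("id"));
            scan.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("age"));
            scan.addColumn(Bytes.toBytes("family2"), Bytes.toBytes("name"));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    String name = Bytes.toString(
                            r.getValue(Bytes.toBytes("family2"), Bytes.toBytes("name")));
                    String age = Bytes.toString(
                            r.getValue(Bytes.toBytes("family1"), Bytes.toBytes("age")));
                    System.out.println(name + " / " + age);
                }
            }
        }
    }
}
```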
Executing the Job
To execute this Job, press F6.
Once done, the Run view is opened automatically,
where you can check the execution result.
These columns of interest are extracted and you can process them according to
your needs.
Log in to your HBase database to check the customer table this Job has created.
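If you prefer to check programmatically rather than through the HBase shell, a short sketch using the Admin API (assuming the connection configuration is already in place):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class VerifyTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Check whether the Job created the customer table.
            System.out.println("customer exists: "
                    + admin.tableExists(TableName.valueOf("customer")));
        }
    }
}
```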
tHBaseInput MapReduce properties (deprecated)
These properties are used to configure tHBaseInput running in the MapReduce Job framework.
The MapReduce
tHBaseInput component belongs to the MapReduce and the Databases families.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data is stored centrally.
Repository: Select the repository file where the properties are stored. |
 |
Click this icon to open a database connection wizard and store the database connection parameters you set in the component Basic settings view. For more information about setting up and storing database connection parameters, see the Talend Studio documentation. |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. Among these options, some require specific configuration.
In the Map/Reduce version of this component, the distribution you select must be the same as the one defined for the whole Job in the Hadoop Configuration tab of the Run view. |
HBase version |
Select the version of the Hadoop distribution you are using. The available options vary depending on the component you are using. |
Hadoop version of the distribution |
This list is displayed only when you have selected Custom from the distribution list to connect to a cluster not yet officially supported by Talend Studio. |
Zookeeper quorum |
Type in the name or the URL of the Zookeeper service you use to coordinate the transaction between Talend Studio and your database. |
Zookeeper client port |
Type in the number of the client listening port of the Zookeeper service you are using. |
Use kerberos authentication |
If the database to be used is running with Kerberos security, select this check box, then enter the principal names in the displayed fields. You should be able to find the information in the hbase-site.xml file of the cluster to be used.
If you need to use a Kerberos keytab file to log in, select Use a keytab to authenticate. A keytab file contains pairs of Kerberos principals and encrypted keys. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used. |
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Set table Namespace mappings |
Enter the string to be used to construct the mapping between an Apache HBase table and a MapR table. For the valid syntax you can use, see http://doc.mapr.com/display/MapR40x/Mapping+Table+Namespace+Between+Apache+HBase+Tables+and+MapR+Tables. |
Table name |
Type in the name of the table from which you need to extract columns. |
Mapping |
Complete this table to map the columns of the table to be used with the schema columns you have defined for the data flow to be processed. |
Die on error |
Select the check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows. |
Advanced settings
Properties |
If you need to use custom configuration for your database, complete this table with the property or properties to be customized; at runtime, the customized property or properties override the corresponding ones used by Talend Studio. For example, if you need to define the value of the dfs.replication property as 1 for the database configuration, add one row to this table and enter dfs.replication as the property name and 1 as its value. |
Is by filter |
Select this check box to use filters to perform fine-grained data selection from your database. Once you select it, the Filter table that is used to define the filtering conditions becomes available. This feature leverages filters provided by HBase and is subject to the constraints explained in the Apache HBase documentation; see also the HBase filters table at the beginning of this page. |
Logical operation |
Select the operator you need to use to define the logical relation between filters. The available operators are:
And: every defined filtering condition must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ALL.
Or: at least one of the defined filtering conditions must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ONE. |
Filter |
Click the button under this table to add as many rows as required, each row representing a filter. The parameters you may need to set for a filter are those presented in the HBase filters table at the beginning of this page: Filter type, Filter column, Filter family, Filter operation, Filter value and Filter comparator type.
Depending on the Filter type you are using, some of these parameters are unavailable. |
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio documentation. |
Usage
Usage rule |
In a Talend Map/Reduce Job, this component is used as a start component and requires a transformation component as output link. You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. The Hadoop configuration you use for the whole Job and the Hadoop distribution you use for this component must be the same. For further information about Talend Map/Reduce Jobs, see the Talend Studio documentation. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Prerequisites |
Before starting, ensure that you have met the Loopback IP prerequisites expected by your database. The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using. |
Hadoop Connection |
You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job.
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Map/Reduce version of this component yet.
tHBaseInput properties for Apache Spark Batch
These properties are used to configure tHBaseInput running in the Spark Batch Job framework.
The Spark Batch
tHBaseInput component belongs to the Databases family.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
Basic settings
Storage configuration |
Select the tHBaseConfiguration component from which the Spark system to be used reads the configuration information to connect to HBase. |
Property type |
Either Built-In or Repository. Built-In: No property data is stored centrally.
Repository: Select the repository file where the properties are stored. |
 |
Click this icon to open a database connection wizard and store the database connection parameters you set in the component Basic settings view. For more information about setting up and storing database connection parameters, see the Talend Studio documentation. |
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
 |
Built-In: You create and store the schema locally for this component only. |
 |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Table name |
Type in the name of the table from which you need to extract columns. |
Mapping |
Complete this table to map the columns of the table to be used with the schema columns you have defined for the data flow to be processed. |
Is by filter |
Select this check box to use filters to perform fine-grained data selection from your database. Once you select it, the Filter table that is used to define the filtering conditions becomes available. This feature leverages filters provided by HBase and is subject to the constraints explained in the Apache HBase documentation; see also the HBase filters table at the beginning of this page. |
Logical operation |
Select the operator you need to use to define the logical relation between filters. The available operators are:
And: every defined filtering condition must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ALL.
Or: at least one of the defined filtering conditions must be satisfied. It represents the relationship FilterList.Operator.MUST_PASS_ONE. |
Filter |
Click the button under this table to add as many rows as required, each row representing a filter. The parameters you may need to set for a filter are those presented in the HBase filters table at the beginning of this page: Filter type, Filter column, Filter family, Filter operation, Filter value and Filter comparator type.
Depending on the Filter type you are using, some of these parameters are unavailable. |
Die on HBase error |
Select the check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows. |
Usage
Usage rule |
This component is used as a start component and requires an output link.
This component uses a tHBaseConfiguration component present in the same Job to connect to HBase.
This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.
Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
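For orientation, reading an HBase table into a Spark batch job with the stock Hadoop integration looks roughly like the sketch below (org.apache.hadoop.hbase.mapreduce.TableInputFormat); this is not the code the component generates, and the quorum host is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkHBaseReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("hbase-read"));
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host.example.com"); // placeholder
        conf.set(TableInputFormat.INPUT_TABLE, "customer");
        // Read the HBase table as an RDD of (row key, Result) pairs.
        JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
        System.out.println("Rows: " + rows.count());
        sc.stop();
    }
}
```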
Related scenarios
For a scenario about how to use the same type of component in a Spark Batch Job, see Writing and reading data from MongoDB using a Spark Batch Job.