
The three ELT Hive components are closely related in terms of their operating conditions. They are used to handle Hive database schemas in order to generate Insert statements, including the relevant clauses, which are executed in the output table defined.
Component family
  ELT/Map/Hive

Function
  This component uses the tables provided as input to feed the Hive QL statement it builds.

Purpose
  This component helps to graphically build the Hive QL statement to be executed.
Basic settings

Property type
  Either Built-in or Repository.
  Built-in: No property data is stored centrally.
  Repository: Select the repository file in which the properties are stored.

Use an existing connection
  Select this check box and, in the Component List, click the relevant connection component to reuse the connection details you already defined.
  Note: When a Job contains the parent Job and the child Job, if you need to share an existing connection between the two levels, the connection has to be shared at the parent level so that the child Job can reuse it. For an example of how to share a database connection across Job levels, see Talend Studio User Guide.
ELT Hive Map editor
  The ELT Map editor helps you to define the output schema as well as to graphically build the Hive QL statement to be executed.

Style link
  Select the way in which links are displayed.
  Auto: By default, the links between the input and output schemas are displayed as curves.
  Bezier curve: Links are displayed as Bezier curves.
  Line: Links are displayed as straight lines. This option slightly optimizes performance.
Version

Distribution
  Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. In order to connect to a custom distribution, once you have selected Custom, click the […] button to configure the connection to that distribution.

Hive version
  Select the version of the Hadoop distribution you are using. The available options vary depending on the distribution selected.

Hive server
  Select the Hive server through which you want the Job using this component to execute queries on Hive. This Hive server list is available only when the Hadoop distribution to be used supports HiveServer2. For further information about HiveServer2, see https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2.
Connection mode
  Select a connection mode from the list. The options vary depending on the distribution you are using.

Host
  Database server IP address.

Port
  Listening port number of the DB server.

Database
  Name of the database. According to the documentation of Hive, the only database name to be used with this client is default. For further information, see https://cwiki.apache.org/confluence/display/Hive/HiveClient.

Username and Password
  DB user authentication data. To enter the password, click the […] button next to the password field, enter the password between double quotes in the dialog box that opens, and click OK to save the settings.
Authentication

Use kerberos authentication
  If you are accessing a Hive Metastore running with Kerberos security, select this check box and then enter the relevant parameters in the fields that appear. The values of the following parameters can be found in the hive-site.xml file of the Hive system to be used.
  For the other parameters that are displayed, please consult the Hadoop configuration files they belong to. This check box is available depending on the Hadoop distribution you are connecting to.

Use a keytab to authenticate
  Select the Use a keytab to authenticate check box to log into a Kerberos-enabled system using a given keytab file. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates, but must have the right to read the keytab file being used.
Hadoop properties

Set Jobtracker URI
  Select this check box to indicate the location of the Jobtracker service within the Hadoop cluster to be used. This property is required when the query you want to use is executed in Windows and it is a Select query. If you use YARN in your Hadoop cluster (for example, a Hortonworks Data Platform version that runs YARN), you need to indicate the location of the Resource Manager instead of the Jobtracker.
  For further information about these parameters, see the documentation of the Hadoop cluster you are using or contact its administrator. For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial in Apache's Hadoop documentation.

Set NameNode URI
  Select this check box to indicate the location of the NameNode of the Hadoop cluster to be used. This property is required when the query you want to use is executed in Windows and it is a Select query. For further information about the Hadoop Map/Reduce framework, see the Map/Reduce tutorial in Apache's Hadoop documentation.
Advanced settings

Temporary path
  If you do not want to set the Jobtracker and the NameNode when you execute the query select * from your_table_name, you need to set this temporary path.

Hadoop properties
  Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized; at runtime, the customized properties override the default ones.
  For further information about the properties required by Hadoop and its related systems such as HDFS and Hive, see the documentation of the Hadoop distribution you are using or see Apache's Hadoop documentation.

Hive properties
  Talend Studio uses a default configuration for its engine to perform operations in a Hive database. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized; at runtime, the customized properties override the default ones.

Mapred job map memory mb and Mapred job reduce memory mb
  If the Hadoop distribution to be used is Hortonworks Data Platform V1.2 or Hortonworks Data Platform V1.3, you need to set proper memory allocations for the map and reduce computations to be performed by the Hadoop system. In that situation, enter the values you need in the Mapred job map memory mb and Mapred job reduce memory mb fields, respectively.

Path separator in server
  Leave the default value of the Path separator in server as it is, unless you have changed the separator used by your Hadoop distribution's host machine for its PATH variable.

tStatCatcher Statistics
  Select this check box to collect log data at the component level.
Dynamic settings
  Click the [+] button to add a row in the table and fill the Code field with a context variable to choose your database connection dynamically from multiple connections planned in your Job. The Dynamic settings table is available only when the Use an existing connection check box is selected in the Basic settings view. For more information on Dynamic settings and context variables, see Talend Studio User Guide.
Global Variables
  ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
  A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see Talend Studio User Guide.
Usage
  tELTHiveMap is used along with a tELTHiveInput and a tELTHiveOutput component. If the Studio used to connect to a Hive database is operated on Windows, you must manually create a folder called tmp in the root of the disk where this Studio is installed.
  Note: The ELT components do not handle actual data flow but only schema information.
Prerequisites
  The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio. For further information about how to install a Hadoop distribution, see the manuals corresponding to the distribution you are using.
The ELT components do not handle any data as such, but only the table schema information that is used to build the Hive QL query to execute. Therefore, the only connection required between these components is a simple link.
Note
The output name you give to this link when creating it should always be the exact name of the table to be accessed, as this name is used in the generated Hive QL statement.
Related topic: see Talend Studio User
Guide.
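For instance, assuming the link to an input component is named customer, the generated statement refers to that table by this exact name, roughly along these lines (an illustrative sketch, not the literal output of the Studio):

SELECT customer.id, customer.name
FROM customer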
In the ELT Mapper, you can select specific columns from input schemas and include
them in the output schema.
- As you would do in the regular Map editor, simply drag and drop the content from the input schema to the output table defined.
- Use the Ctrl and Shift keys for multiple selection of contiguous or non-contiguous table columns.
You can implement explicit joins to retrieve various data from different tables (an illustrative sketch of the resulting statement follows this list).
- Select the Explicit join check box for the relevant column, and select a type of join from the Join list.
- Possible joins include: Join (Inner Join), Left Outer Join, Right Outer Join, Full Outer Join and Left Semi Join.
- By default, the Join option is selected.
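As a sketch only (the table and column names are the ones used later in this scenario, and the exact statement generated by the Studio may differ), a Left Outer Join declared on the idState column would translate into something like:

SELECT customer.id, customer.name, state_city.state
FROM customer
LEFT OUTER JOIN state_city ON (customer.idState = state_city.id)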
You can also create Alias tables to retrieve various data from the same table (see the sketch after this list).
- In the Input area, click the Add alias button to create an alias.
- Define the table to base the alias on.
- Type in a new name for the alias table, preferably not the same as the main table.
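As an illustrative sketch (the alias name cust_bis is hypothetical), an alias lets the generated query read the same table twice, for example:

SELECT customer.id, cust_bis.name
FROM customer
JOIN customer cust_bis ON (customer.id2 = cust_bis.id)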
You can also restrict the Select statement based on a WHERE clause and/or other clauses such as GROUP BY or ORDER BY, by clicking the Add filter row button at the top of the output table in the map editor.
To add a restriction based on a WHERE clause, click the Add filter row button and select Add a WHERE clause from the popup menu.
To add a restriction based on GROUP BY, ORDER BY and so on, click the Add filter row button and select Add an other(GROUP…) clause from the popup menu. An illustrative sketch of such a filtered statement follows.
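For illustration only (the filter shown here is hypothetical and not part of the scenario below), combining a WHERE clause and a GROUP BY clause would make the generated query resemble:

SELECT customer.idState, count(customer.id)
FROM customer
WHERE customer.regTime IS NOT NULL
GROUP BY customer.idState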

Make sure that all input components are linked correctly to the ELT Map component to
be able to implement all inclusions, joins and clauses.
This scenario uses a four-component Job to join the columns selected from two Hive
tables and write them into another Hive table.

- Create the Hive table you want to write data in. In this scenario, this table is named agg_result, and you can create it using the following statement in tHiveRow:

  create table agg_result (id int, name string, address string, sum1 string, postal string, state string, capital string, mostpopulouscity string) partitioned by (type string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/agg_result'

  In this statement, '/user/ychen/hive/table/agg_result' is the directory used in this scenario to store the created table in HDFS. You need to replace it with the directory you want to use in your environment. For further information about tHiveRow, see tHiveRow.
- Create two input Hive tables containing the columns you want to join and aggregate into the output Hive table, agg_result. The statements to be used are:

  create table customer (id int, name string, address string, idState int, id2 int, regTime string, registerTime string, sum1 string, sum2 string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/customer'

  and

  create table state_city (id int, postal string, state string, capital int, mostpopulouscity string) row format delimited fields terminated by ';' location '/user/ychen/hive/table/state_city'
- Use tHiveRow to load data into the two input tables, customer and state_city. The statements to be used are:

  "LOAD DATA LOCAL INPATH 'C:/tmp/customer.csv' OVERWRITE INTO TABLE customer"

  and

  "LOAD DATA LOCAL INPATH 'C:/tmp/State_City.csv' OVERWRITE INTO TABLE state_city"

  The two files, customer.csv and State_City.csv, are local files we created for this scenario. You need to create your own files to provide data to the input Hive tables. The data schema of each file should be identical with that of its corresponding table. You can use tRowGenerator and tFileOutputDelimited to create these two files easily. For further information about these two components, see tRowGenerator and tFileOutputDelimited. For further information about the Hive query language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual.
- In the Integration perspective of Talend Studio, create an empty Job from the Job Designs node in the Repository tree view. For further information about how to create a Job, see Talend Studio User Guide.
- Drop two tELTHiveInput components, a tELTHiveMap and a tELTHiveOutput onto the workspace.
- Connect them using Row > Main links. Each time you connect two components, a wizard pops up to prompt you to name the link you are creating. This name must be the same as that of the Hive table you want the active component to process. In this scenario, the input tables the two tELTHiveInput components handle are customer and state_city, and the output table tELTHiveOutput handles is agg_result.
- Double-click the tELTHiveInput component using the customer link to open its Component view.
- Click the […] button next to Edit schema to open the schema editor.
- Click the [+] button as many times as required to add columns and rename them to replicate the schema of the customer table we created earlier in Hive. To set up this schema, you can also use the customer schema you retrieved and stored in the Repository. For further information about how to set up a connection to Hive and retrieve and store the schema in the Repository, see Talend Studio User Guide.
- In the Default table name field, enter the name of the input table, customer, to be processed by this component.
- Double-click the other tELTHiveInput component using the state_city link to open its Component view.
- Click the […] button next to Edit schema to open the schema editor.
- Click the [+] button as many times as required to add columns and rename them to replicate the schema of the state_city table we created earlier in Hive.
- In the Default table name field, enter the name of the input table, state_city, to be processed by this component.
Configuring the connection to Hive
- Click tELTHiveMap, then click Component to open its Component view.
- In the Version area, select the Hadoop distribution you are using and the Hive version.
- In the Connection mode list, select the connection mode you want to use. If your distribution is HortonWorks, this mode is Embedded only.
- In the Host field and the Port field, enter the authentication information for the component to connect to Hive. For example, the host is talend-hdp-all and the port is 9083.
- Select the Set Jobtracker URI check box and enter the location of the Jobtracker. For example, talend-hdp-all:50300.
- Select the Set NameNode URI check box and enter the location of the NameNode. For example, hdfs://talend-hdp-all:8020.
Mapping the schemas
- Click ELT Hive Map Editor to map the schemas.
- On the input side (the left side in the figure), click the Add alias button to add the table to be used.
- In the pop-up window, select the customer table, then click OK.
- Repeat the operations to select the state_city table.
- Drag and drop the idState column from the customer table onto the id column of the state_city table. An inner join is thus created automatically.
- On the output side (the right side in the figure), the agg_result table is empty at first. Click the [+] button at the bottom of this side to add as many columns as required and rename them to replicate the schema of the agg_result table you created earlier in Hive.
  Note: The type column is the partition column of the agg_result table and should not be replicated in this schema. For further information about the partition column of a Hive table, see the Hive manual.
- From the customer table, drop id, name, address and sum1 to the corresponding columns in the agg_result table.
- From the state_city table, drop postal, state, capital and mostpopulouscity to the corresponding columns in the agg_result table.
- Click OK to validate these changes.
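At this stage, the mapping corresponds to a query along the following lines (an illustrative sketch only; the exact statement generated by the Studio may differ):

SELECT customer.id, customer.name, customer.address, customer.sum1,
       state_city.postal, state_city.state, state_city.capital,
       state_city.mostpopulouscity
FROM customer
JOIN state_city ON (customer.idState = state_city.id)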
- Double-click tELTHiveOutput to open its Component view.
- If this component does not have the same schema as the preceding component, a warning icon appears. In this case, click the Sync columns button to retrieve the schema from the preceding component; once done, the warning icon disappears.
- In the Default table name field, enter the name of the output table you want to write data in. In this example, it is agg_result.
- In the Field partition table, click the [+] button to add one row. This allows you to write data in the partition column of the agg_result table. This partition column was defined by the partitioned by (type string) clause of the Create statement presented earlier. The partition column is type, which describes the type of a customer.
- In Partition column, enter type without any quotation marks, and in Partition value, enter prospective in single quotation marks.
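With this configuration, the statement executed against agg_result would, as a rough sketch (not the literal statement produced by the Studio), take the form of an INSERT into the prospective partition wrapping the SELECT built in the mapping step:

INSERT INTO TABLE agg_result PARTITION (type='prospective')
SELECT customer.id, customer.name, customer.address, customer.sum1,
       state_city.postal, state_city.state, state_city.capital,
       state_city.mostpopulouscity
FROM customer
JOIN state_city ON (customer.idState = state_city.id)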
- Press F6 to run this Job.
Once done, verify agg_result in Hive using, for example:

  select * from agg_result;

This figure presents only a part of the table. You can see that the selected input columns are aggregated and written into the agg_result table and that the partition column is filled with the value prospective.
For a related scenario using a subquery, see Scenario: Mapping data using a subquery.