
Warning: This component will be available in the Palette of Talend Studio on the condition that you have subscribed to one of the Talend solutions with Big Data.
Component family: Big Data / HCatalog

Function: This component allows you to manage the data stored in HCatalog.

Purpose: The tHCatalogOperation component creates, drops and otherwise manages HCatalog managed databases, tables and partitions.
Basic settings

Property type: Either Built-in or Repository.
- Built-in: No property data stored centrally.
- Repository: Select the repository file where the properties are stored.

Version

Distribution: Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. In order to connect to a custom distribution, once you have selected Custom, click the [...] button to display the dialog box in which you can import the configuration of that custom distribution.

HCatalog version: Select the version of the Hadoop distribution you are using. The available options vary depending on the distribution you have selected.
Templeton Configuration

Templeton hostname: Fill this field with the URL of the Templeton Webservice. A quick connectivity check against this service is sketched at the end of the Basic settings section below.
Note: Templeton is a webservice API for HCatalog. It facilitates access to HCatalog and the underlying Hadoop services.

Templeton port: Fill this field with the port of the URL of the Templeton Webservice. By default, this value is 50111.
Note: Templeton is a webservice API for HCatalog. It facilitates access to HCatalog and the underlying Hadoop services.

Use kerberos authentication: If you are accessing the Hadoop cluster running with Kerberos security, select this check box. This check box is available or not depending on the Hadoop distribution you are connecting to.

Use a keytab to authenticate: Select the Use a keytab to authenticate check box to log in to a Kerberos-enabled system using a given keytab file. Note that the user that executes a keytab-enabled Job is not necessarily the one a principal designates but must have the right to read the keytab file being used.
Operation on: Select an object from the list for the DB operation, as follows:
- Database: the HCatalog managed database in HDFS.
- Table: the HCatalog managed table in HDFS.
- Partition: the partition you specify.

Operation: Select an action from the list for the DB operation. For further information about the available actions, see the HCatalog documentation referenced in the Usage section.

Create the table only if it doesn't exist already: Select this check box to avoid creating a duplicate table when you create a table.
Note: This check box is enabled only when you have selected a Create action from the Operation list.
HCatalog Configuration

Database: Fill this field with the name of the database in which the object to be operated on is stored.

Table: Fill this field to operate on one or multiple tables in a database.
Note: This field is enabled only when you select Table from the Operation on list.

Partition: Fill this field to specify one or more partitions for the partition operation.
Note: This field is enabled only when you select Partition from the Operation on list. For further information about the partition string format, see the HCatalog documentation.

Username: Fill this field with the username for the DB authentication.

Database location: Fill this field with the location of the database file in HDFS.
Note: This field is enabled only when you select Database from the Operation on list.

Database description: The description for the database to be created.
Note: This field is enabled only when you select Database from the Operation on list.
Create an external table: Select this check box to create an external table in an alternative path, defined in the Set HDFS location field of the Advanced settings view.
Note: This check box is enabled only when you select Table from the Operation on list and Create, Drop and create, or Drop if exist and create from the Operation list.

Format: Select a file format from the list to specify the format of the table:
- TEXTFILE: plain text files.
- RCFILE: Record Columnar files.
Note: RCFILE is only available from Hive 0.6.0 onwards.

Set partitions: Select this check box to set the partition schema by clicking the Edit schema button next to it.
Note: This check box is enabled only when you select Table from the Operation on list and Create, Drop and create, or Drop if exist and create from the Operation list.
- Built-in: The schema will be created and stored locally for this component only.
- Repository: The schema already exists and is stored in the Repository, hence can be reused.

Set the user group to use: Select this check box to specify the user group.
Note: This check box is enabled only when you select a Drop action from the Operation list.
Option: Select a clause when you drop a database.
Note: This list is enabled only when you select Database from the Operation on list and a Drop action from the Operation list.

Set the permissions to use: Select this check box to specify the permissions needed by the DB operation.
Note: This check box is enabled only when you select a Drop action from the Operation list.

Set File location: Enter the directory in which partitioned data is stored.
Note: This check box is enabled only when you select Partition from the Operation on list and a Create action from the Operation list.

Die on error: This check box is cleared by default, meaning to skip the row on error and to complete the process for error-free rows.
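Because this component talks to HCatalog through the Templeton (WebHCat) REST service, the Templeton hostname and Templeton port values can be verified outside the Studio. The following is a minimal sketch, not part of the Studio workflow: it assumes WebHCat is reachable at 192.168.0.131 on the default port 50111 (the values used in the scenario below) and calls the standard /templeton/v1/status endpoint.

# Minimal connectivity check for the Templeton/WebHCat service.
# Host and port are assumptions taken from the scenario below;
# replace them with your own values.
import json
import urllib.request

TEMPLETON_HOST = "192.168.0.131"   # value of Templeton hostname
TEMPLETON_PORT = 50111             # default Templeton port

url = "http://%s:%d/templeton/v1/status" % (TEMPLETON_HOST, TEMPLETON_PORT)
with urllib.request.urlopen(url, timeout=10) as resp:
    status = json.load(resp)

# WebHCat answers {"status": "ok", "version": "v1"} when the service is up.
print(status)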
Advanced settings

Comment: Fill this field with the comment for the table you want to create.
Note: This field is enabled only when you select Table from the Operation on list and a Create action from the Operation list.

Set HDFS location: Select this check box to specify an HDFS location to which the table is saved.
Note: This check box is enabled only when you select Table from the Operation on list and a Create action from the Operation list.

Set row format(terminated by): Select this check box to use and define the row formats when you create a table (the DDL clause these options produce is sketched after this Advanced settings section):
- Field: Select this check box to set the character that separates fields.
- Collection Item: Select this check box to set the character that separates collection items.
- Map Key: Select this check box to set the character that separates map keys.
- Line: Select this check box to set the character that terminates lines.
Note: This check box is enabled only when you select Table from the Operation on list and a Create action from the Operation list.

Properties: Click [+] to add one or more lines to define table or database properties.
Note: This table is enabled only when you select Database/Table from the Operation on list and a Create action from the Operation list.
Retrieve the HCatalog logs: Select this check box to retrieve log files generated during HCatalog operations.

Standard Output Folder: Browse to, or enter the directory where the log files are stored.
Note: This field is enabled only when you selected the Retrieve the HCatalog logs check box.

Error Output Folder: Browse to, or enter the directory where the error log files are stored.
Note: This field is enabled only when you selected the Retrieve the HCatalog logs check box.

tStatCatcher Statistics: Select this check box to gather the Job processing metadata at the Job level as well as at each component level.
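To make the Set row format(terminated by) and Properties options concrete, the sketch below assembles the HiveQL clauses they roughly correspond to in a CREATE TABLE statement. The table name, column list, separator characters and property key are illustrative assumptions, not values taken from this document.

# Illustrative only: the HiveQL that the row-format and properties
# options roughly translate to. All names and separators are assumed.
ddl = """
CREATE TABLE example_table (name STRING, country STRING, age INT)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','            -- Field check box
  COLLECTION ITEMS TERMINATED BY '|'  -- Collection Item check box
  MAP KEYS TERMINATED BY ':'          -- Map Key check box
  LINES TERMINATED BY '\\n'           -- Line check box
STORED AS TEXTFILE
TBLPROPERTIES ('creator' = 'talend')  -- Properties table
"""
print(ddl)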
Global Variables

ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. This variable functions only if the Die on error check box is cleared.

A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio User Guide.
Usage

This component is commonly used in a single-component subjob. HCatalog is built on top of the Hive metastore to provide a read and write interface for Pig and MapReduce, so that these systems can use the metadata of Hive to easily read and write data in HDFS. For further information, see Apache's documentation about HCatalog, for example, https://cwiki.apache.org/confluence/display/HCATALOG/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands.
Prerequisites

The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio. For further information about how to install a Hadoop distribution, see the manuals corresponding to the distribution you are using.
Log4j

The activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User Guide. For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html.
Limitation

A limitation applies when Use kerberos authentication is selected.
Scenario: Managing HCatalog tables on Hortonworks Data Platform

This scenario describes a six-component Job that includes the common operations for HCatalog table management on Hortonworks Data Platform. Sub-sections in this scenario cover DB operations including:

- Creating a table in the database in HDFS;
- Writing data to the HCatalog managed table;
- Writing data to the partitioned table using tHCatalogLoad;
- Reading data from the HCatalog managed table;
- Outputting the data read from the table in HDFS.

Note: Knowledge of Hive Data Definition Language and HCatalog Data Definition Language is required. For further information about Hive Data Definition Language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL. For further information about HCatalog Data Definition Language, see https://cwiki.apache.org/confluence/display/HCATALOG/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands.
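Under the hood, the HCatalog components issue such DDL through the Templeton (WebHCat) REST API. As a hedged illustration of the DDL vocabulary the note above refers to, the sketch below posts a statement to the standard /templeton/v1/ddl endpoint. The host and port match this scenario; the user name "hdp" is an assumption based on the /user/hdp HDFS paths used below. The steps that follow set up the same kind of operations graphically in the Studio.

# Sketch: run a Hive/HCatalog DDL statement through Templeton (WebHCat).
# The user name "hdp" is an assumption, not a value given in this scenario.
import json
import urllib.parse
import urllib.request

def run_ddl(statement, host="192.168.0.131", port=50111, user="hdp"):
    url = "http://%s:%d/templeton/v1/ddl" % (host, port)
    data = urllib.parse.urlencode({"user.name": user, "exec": statement})
    req = urllib.request.Request(url, data=data.encode("ascii"))
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# List the tables of the "talend" database used in this scenario.
print(run_ddl("use talend; show tables;"))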
- Drop the following components from the Palette to the design workspace: tHCatalogOperation, tHCatalogLoad, tHCatalogInput, tHCatalogOutput, tFixedFlowInput, and tFileOutputDelimited.
- Right-click tHCatalogOperation to connect it to the tFixedFlowInput component using a Trigger > OnSubjobOk connection.
- Right-click tFixedFlowInput to connect it to tHCatalogOutput using a Row > Main connection.
- Right-click tFixedFlowInput to connect it to tHCatalogLoad using a Trigger > OnSubjobOk connection.
- Right-click tHCatalogLoad to connect it to the tHCatalogInput component using a Trigger > OnSubjobOk connection.
- Right-click tHCatalogInput to connect it to tFileOutputDelimited using a Row > Main connection.
- Double-click tHCatalogOperation to open its Basic settings view.
- Click Edit schema to define the schema for the table to be created.
- Click [+] to add at least one column to the schema and click OK when you finish setting the schema. In this scenario, the columns added to the schema are: name, country and age.
- Fill the Templeton hostname field with the URL of the Templeton webservice you are using. In this scenario, fill this field with "192.168.0.131".
- Fill the Templeton port field with the port for the Templeton hostname. By default, the value for this field is "50111".
- Select Table from the Operation on list and Drop if exist and create from the Operation list to create a table in HDFS.
- Fill the Database field with an existing database name in HDFS. In this scenario, the database name is "talend".
- Fill the Table field with the name of the table to be created. In this scenario, the table name is "Customer".
- Fill the Username field with the username for the DB authentication.
- Select the Set the user group to use check box to specify the user group. The default user group is "root"; specify the value for this field according to your actual setup.
- Select the Set the permissions to use check box to specify the user permission. The default value for this field is "rwxrwxr-x".
- Select the Set partitions check box to enable the partition schema.
- Click the Edit schema button next to the Set partitions check box to define the partition schema.
- Click [+] to add one column to the schema and click OK when you finish setting the schema. In this scenario, the column added to the partition schema is: match_age.
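For reference, these tHCatalogOperation settings (Drop if exist and create on table Customer in database talend, partitioned by match_age) amount to roughly the DDL below. This is a sketch: the column types are assumptions, since the scenario only names the columns, and the user name "hdp" is again assumed.

# Roughly the DDL that the tHCatalogOperation settings above amount to,
# posted to the same Templeton "ddl" endpoint. Types are assumed.
import urllib.parse
import urllib.request

ddl = (
    "use talend; "
    "drop table if exists Customer; "
    "create table Customer (name string, country string, age int) "
    "partitioned by (match_age int) stored as textfile;"
)
data = urllib.parse.urlencode({"user.name": "hdp", "exec": ddl})
req = urllib.request.Request("http://192.168.0.131:50111/templeton/v1/ddl",
                             data=data.encode("ascii"))
print(urllib.request.urlopen(req, timeout=30).read().decode())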
- Double-click tFixedFlowInput to open its Basic settings view.
- Click Edit schema to define the same schema as the one you defined in tHCatalogOperation.
- Fill the Number of rows field with the integer 8.
- Select Use Inline Table in the Mode area.
- Click [+] to add new lines in the inline table.
- Double-click tHCatalogOutput to open its Basic settings view.
- Click Sync columns to retrieve the schema defined in the preceding component.
- Fill the NameNode URI field with the URI of the NameNode. In this scenario, this URI is "192.168.0.131".
- Fill the File name field with the HDFS location of the file you write data to. In this scenario, the file location is "/user/hdp/Customer/Customer.csv".
- Select Overwrite from the Action list.
- Fill the Templeton hostname field with the URL of the Templeton webservice you are using. In this scenario, fill this field with "192.168.0.131".
- Fill the Templeton port field with the port for the Templeton hostname. By default, the value for this field is "50111".
- Fill the Database, Table, and Username fields with the same values you specified in tHCatalogOperation.
- Fill the Partition field with "match_age=27".
- Fill the File location field with the HDFS location to which the table will be saved. In this example, use "hdfs://192.168.0.131:8020/user/hdp/Customer".
- Double-click tHCatalogLoad to open its Basic settings view.
- Fill the Partition field with "match_age=26".
- Configure the remaining settings in the same way as in tHCatalogOperation.
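After the load, you can check outside the Studio that both partitions exist. A minimal sketch against the same WebHCat ddl endpoint, with the same assumed user name:

# Verify the partitions of the Customer table created in this scenario.
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({"user.name": "hdp",
                               "exec": "use talend; show partitions Customer;"})
req = urllib.request.Request("http://192.168.0.131:50111/templeton/v1/ddl",
                             data=data.encode("ascii"))
# Expect match_age=26 and match_age=27 in the response body.
print(urllib.request.urlopen(req, timeout=30).read().decode())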
- Double-click tHCatalogInput to open its Basic settings view.
- Click Edit schema to define the schema of the table to be read from the database.
- Click [+] to add at least one column to the schema. In this scenario, the columns added to the schema are age and name.
- Fill the Partition field with "match_age=26".
- Configure the remaining settings in the same way as in tHCatalogOperation.
- Double-click tLogRow to open its Basic settings view.
- Click Sync columns to retrieve the schema defined in the preceding component.
- Select Table from the Mode area.

Press Ctrl+S to save your Job and F6 to execute it.

The data of the restricted table read from HDFS is displayed in the console.

Type http://talend-hdp:50075/browseDirectory.jsp?dir=/user/hdp/Customer&namenodeInfoPort=50070 into the address bar of your browser to view the table you created.

Click the Customer.csv link to view the content of the table you created.
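Instead of the browser, the file can also be fetched over WebHDFS, assuming that service is enabled on the NameNode web port (50070, as in the browse URL above). This is a sketch; the user name "hdp" remains an assumption.

# Read Customer.csv through WebHDFS; OPEN redirects to a datanode,
# and urllib follows the redirect automatically.
import urllib.request

url = ("http://192.168.0.131:50070/webhdfs/v1"
       "/user/hdp/Customer/Customer.csv?op=OPEN&user.name=hdp")
print(urllib.request.urlopen(url, timeout=30).read().decode())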
