tHCatalogOperation
Prepares the HCatalog-managed database/table/partition to be processed.
tHCatalogOperation manages the data
stored in an HCatalog-managed Hive database/table/partition.
tHCatalogOperation Standard properties
These properties are used to configure tHCatalogOperation running in the Standard Job framework.
The Standard
tHCatalogOperation component belongs to the Big Data family.
The component in this framework is available in all Talend products with Big Data
and in Talend Data Fabric.
Basic settings
Property type |
Either Built-in or Repository.
Built-in: No property data stored centrally.
Repository: Select the repository file where the properties are stored. |
Distribution |
Select the cluster you are using from the drop-down list. The options in the
list vary depending on the component you are using. Among these options, the following ones require specific configuration:
|
HCatalog version |
Select the version of the Hadoop distribution you are using. The available options vary depending on the component you are using. |
Templeton hostname |
Fill this field with the URL of the Templeton webservice.
Note: Templeton is a webservice API for HCatalog. It has been renamed to WebHCat by the Apache community. |
Templeton port |
Fill this field with the port of the URL of the Templeton webservice. By default, the value for this field is 50111.
Note: Templeton is a webservice API for HCatalog. It has been renamed to WebHCat by the Apache community. |
Use kerberos authentication |
If you are accessing a Hadoop cluster running
with Kerberos security, select this check box, then enter the Kerberos principal name for the NameNode in the field displayed. This enables you to use your user name to authenticate against the credentials stored in Kerberos.
This check box is available depending on the Hadoop distribution you are using. |
Use a keytab to authenticate |
Select the Use a keytab to authenticate check box to log into a Kerberos-enabled system using a given keytab file. Note that the user who executes a keytab-enabled Job is not necessarily the one the principal designates, but must have the right to read the keytab file being used. |
Operation on |
Select an object from the list for the DB operation:
Database: the HCatalog-managed database.
Table: the HCatalog-managed table.
Partition: the partition of an HCatalog-managed table. |
Operation |
Select an action from the list for the DB operation. For the most common actions, the equivalent Hive DDL is sketched after this properties list. |
Create the table only it doesn’t exist already |
Select this check box to avoid creating a duplicate table when you
create a table.
Note: This check box is enabled only when you have selected Create from the Operation list. |
Database |
Fill this field with the name of the database in which the table to be processed is stored. |
Table |
Fill this field to operate on one or multiple tables in a database
or on a specified HDFS location.
Note: This field is enabled only when you select Table from the Operation on list. |
Partition |
Fill this field to specify one or more partitions for the partition operation. If you are reading a non-partitioned table, leave this field empty.
Note: This field is enabled only when you select Partition from the Operation on list. |
Username |
Fill this field with the username for the DB authentication. |
Database location |
Fill this field with the location of the database file in HDFS.
Note:
This field is enabled only when you select Database from the Operation on list. |
Database description |
The description for the database to be created.
Note:
This field is enabled only when you select Database from the Operation on list. |
Create an external table |
Select this check box to create an external table in an alternative
path defined in the Set HDFS location field in the Advanced settings view. For further information about creating external tables, see https://cwiki.apache.org/Hive/.
Note: This check box is enabled only when you select Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Format |
Select a file format from the list to specify the format of the data:
TEXTFILE: Plain text.
RCFILE: Record Columnar files.
For further information about RCFILE, see https://cwiki.apache.org/confluence/display/Hive/RCFile.
Note: RCFILE is only available as of Hive 0.6.0. |
Set partitions |
Select this check box to set the partition schema by clicking the Edit schema button next to this check box.
Note: This check box is enabled only when you select Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
 |
Built-in: The schema will be created and stored locally for this component only. |
 |
Repository: The schema already exists and is stored in the Repository; it can be reused in various projects and Job designs. |
Set the user group to use |
Select this check box to specify the user group.
Note:
This check box is enabled only when you select one of the Drop actions from the Operation list. |
Option |
Select a clause when you drop a database.
Note:
This list is enabled only when you select Database from the Operation on list and one of the Drop actions from the Operation list. |
Set the permissions to use |
Select this check box to specify the permissions needed by the
operation you select from the Operation list. Note:
This check box is enabled only when you select one of the Drop actions from the Operation list. |
Set File location |
Enter the directory in which partitioned data is stored.
Note:
This check box is enabled only when you select Partition from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Die on error |
This check box is cleared by default, meaning to skip the rows on error and to complete the process for error-free rows. |
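For orientation, the database and table operations configured with the settings above are close to plain Hive/HCatalog DDL. The following is a minimal sketch only: the component issues these operations through the Templeton/WebHCat service rather than through the Hive shell, and the database, table, and location names are illustrative (the column and partition names are borrowed from the scenario later in this page).
-- Create a database at an explicit HDFS location
-- (Operation on: Database, Operation: Create, plus Database location and Database description).
CREATE DATABASE IF NOT EXISTS sample_db
COMMENT 'example database'
LOCATION '/user/hdp/sample_db';
-- Drop and re-create a partitioned external table
-- (Operation on: Table, Operation: Drop if exist and create,
--  with Create an external table, Set partitions, and Format: TEXTFILE).
DROP TABLE IF EXISTS sample_db.sample_table;
CREATE EXTERNAL TABLE sample_db.sample_table (
  name STRING,
  country STRING,
  age INT
)
PARTITIONED BY (match_age INT)
STORED AS TEXTFILE
LOCATION '/user/hdp/sample_table';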
Advanced settings
Comment |
Fill this field with the comment for the table you want to create.
Note:
This field is enabled only when you select Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Set HDFS location |
Select this check box to specify an HDFS location to which the
table you want to create is saved. Deselect it to save the table in the warehouse directory defined by the key hive.metastore.warehouse.dir in the Hive configuration file hive-site.xml.
Note: This check box is enabled only when you select Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Set row format(terminated by) |
Select this check box to use and define the row formats when you create a table:
Field: Select this check box to use Field as the row format.
Collection Item: Select this check box to use Collection Item as the row format.
Map Key: Select this check box to use Map Key as the row format.
Line: Select this check box to
use Line as the row format. The default value for this field is " ". You can also specify a customized char in this field.
The corresponding Hive DDL clauses are sketched after these Advanced settings.
Note: This check box is enabled only when you select Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Properties |
Click [+] to add one or more
lines to define table properties. The table properties allow you to tag the table definition with your own metadata key/value pairs. Values in both the Key and Value columns must be enclosed in double quotation marks.
Note: This table is enabled only when you select Database/Table from the Operation on list and Create/Drop and create/Drop if exist and create from the Operation list. |
Retrieve the HCatalog logs | Select this check box to retrieve log files generated during HCatalog operations. |
Standard Output Folder |
Browse to, or enter the directory where the log files are stored.
Note: This field is enabled only when you select the Retrieve the HCatalog logs check box. |
Error Output Folder |
Browse to, or enter the directory where the error log files are stored.
Note:
This field is enabled only when you select the Retrieve the HCatalog logs check box. |
tStatCatcher Statistics |
Select this check box to gather the Job processing metadata at the Job level as well as at each component level. |
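As a reference for the Advanced settings above, the Comment, Set HDFS location, Set row format(terminated by) and Properties options map onto the following Hive DDL clauses. This is a sketch under assumed names, delimiters, and property values, not the exact statement the component generates.
-- Comment, row delimiters, storage location and table properties
-- expressed as Hive DDL clauses (illustrative values only).
CREATE TABLE IF NOT EXISTS sample_db.delimited_table (
  name STRING,
  age INT
)
COMMENT 'table created for illustration'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ';'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY '#'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hdp/delimited_table'
TBLPROPERTIES ('creator'='talend_user', 'purpose'='demo');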
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.
A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component.
To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.
For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is commonly used in a single-component subJob.
HCatalog is built on top of the Hive metastore to provide a read and write interface for Pig and MapReduce, so that these systems can use the metadata of Hive to read and write data in HDFS more easily.
For further information, see the Apache documentation about HCatalog: https://cwiki.apache.org/confluence/display/Hive/HCatalog. |
Prerequisites |
The Hadoop distribution must be properly installed, so as to guarantee the interaction with Talend Studio.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using. |
Limitation |
When Use kerberos authentication is selected, the component cannot work with IBM JVM. |
Managing HCatalog tables on Hortonworks Data Platform
This scenario applies only to Talend products with Big Data.
This scenario describes a six-component Job that includes the common
operations for HCatalog table management on Hortonworks Data Platform. The sub-sections
in this scenario cover DB operations including:
-
Creating a table in the database in HDFS;
-
Writing data to the HCatalog managed table;
-
Writing data to the partitioned table using tHCatalogLoad;
-
Reading data from the HCatalog managed table;
-
Outputting the data read from the table in HDFS.
Knowledge of Hive Data Definition Language and HCatalog Data
Definition Language is required. For further information about Hive Data
Definition Language, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL.
For further information about HCatalog Data Definition Language, see https://cwiki.apache.org/confluence/display/HCATALOG/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands.
Setting up the Job
-
Drop the following components from the Palette to the design workspace: tHCatalogOperation, tHCatalogLoad, tHCatalogInput, tHCatalogOutput, tFixedFlowInput, and tFileOutputDelimited.
-
Right-click tHCatalogOperation to connect it to the tFixedFlowInput component using a Trigger > OnSubjobOk connection.
-
Right-click tFixedFlowInput to connect it to tHCatalogOutput using a Row > Main connection.
-
Right-click tFixedFlowInput to connect it to tHCatalogLoad using a Trigger > OnSubjobOk connection.
-
Right-click tHCatalogLoad to connect it to the tHCatalogInput component using a Trigger > OnSubjobOk connection.
-
Right-click tHCatalogInput to connect it to tFileOutputDelimited using a Row > Main connection.
Creating a table in HDFS
-
Double-click tHCatalogOperation to open its Basic settings view.
-
Click Edit schema to define the schema for the table to be created.
-
Click [+] to add at least one column to the schema and click OK when you finish setting the schema. In this scenario, the columns added to the schema are: name, country and age.
-
Fill the Templeton hostname field with the URL of the Templeton webservice you are using. In this scenario, fill this field with "192.168.0.131".
-
Fill the Templeton port field with the port for Templeton hostname. By default, the value for this field is "50111".
-
Select Table from the Operation on list and Drop if exist and create from the Operation list to create a table in HDFS.
-
Fill the Database field with an existing database name in HDFS. In this scenario, the database name is "talend".
-
Fill the Table field with the name of the table to be created. In this scenario, the table name is "Customer".
-
Fill the Username field with the username for the DB authentication.
-
Select the Set the user group to use check box to specify the user group. The default user group is "root"; set the value for this field according to your actual situation.
-
Select the Set the permissions to use check box to specify the user permission. The default value for this field is "rwxrwxr-x".
-
Select the Set partitions check box to enable the partition schema.
-
Click the Edit schema button next to the Set partitions check box to define the partition schema.
-
Click [+] to add one column to the schema and click OK when you finish setting the schema. In this scenario, the column added to the partition schema is: match_age. The Hive DDL roughly equivalent to the table configured in this section is sketched below.
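For cross-checking, the settings in this section amount to roughly the following Hive DDL. This is a sketch only: the component performs the operation through WebHCat, and the user group and permission settings have no direct DDL equivalent here.
-- Equivalent of Drop if exist and create on table Customer in database talend,
-- with the match_age column defined in the partition schema.
USE talend;
DROP TABLE IF EXISTS Customer;
CREATE TABLE Customer (
  name STRING,
  country STRING,
  age INT
)
PARTITIONED BY (match_age INT);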
Writing data to the existing table
-
Double-click tFixedFlowInput to open its Basic settings view.
-
Click Edit schema to define the same schema as the one you defined in tHCatalogOperation.
-
Fill the Number of rows field with the integer 8.
-
Select Use Inline Table in the Mode area.
-
Click [+] to add new lines in the inline table.
-
Double-click tHCatalogOutput to open its Basic settings view.
-
Click Sync columns to retrieve the schema defined in the preceding component.
-
Fill the NameNode URI field with the URI to the NameNode. If you are using WebHDFS, the location should be webhdfs://masternode:portnumber; WebHDFS with SSL is not supported yet.
-
Fill the File name field with the HDFS location of the file you write data to. In this scenario, the file location is "/user/hdp/Customer/Customer.csv".
-
Select Overwrite from the Action list.
-
Fill the Templeton hostname field with the URL of the Templeton webservice you are using. In this scenario, fill this field with "192.168.0.131".
-
Fill the Templeton port field with the port for Templeton hostname. By default, the value for this field is "50111".
-
Fill the Database field, the Table field, and the Username field with the same values you specified in tHCatalogOperation.
-
Fill the Partition field with "match_age=27".
-
Fill the File location field with the HDFS location to which the table will be saved. In this example, use "hdfs://192.168.0.131:8020/user/hdp/Customer". A DDL sketch of the partition registered by this step is shown below.
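Once this subJob runs, the rows written to /user/hdp/Customer/Customer.csv are registered under the match_age=27 partition of the Customer table. Conceptually this is close to the following DDL; it is only a sketch, since tHCatalogOutput registers the partition itself and you would not normally run this by hand.
-- Illustrative registration of the partition written by tHCatalogOutput.
USE talend;
ALTER TABLE Customer ADD IF NOT EXISTS PARTITION (match_age=27)
LOCATION 'hdfs://192.168.0.131:8020/user/hdp/Customer';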
Writing data to the partitioned table using tHCatalogLoad
-
Double-click tHCatalogLoad to open its Basic settings view.
-
Fill the Partition field with "match_age=26".
-
Do the rest of the settings in the same way as configuring tHCatalogOperation. A quick way to verify both partitions afterwards is sketched below.
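After this subJob has also run, the Customer table should expose two partitions. If you want to double-check from a Hive shell, a statement like the following does it (the listed values assume the partitions created in this scenario).
USE talend;
-- Expected to list the partitions created in this scenario,
-- for example: match_age=26 and match_age=27.
SHOW PARTITIONS Customer;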
Reading data from the table in HDFS
-
Double-click tHCatalogInput to open its Basic settings view.
-
Click Edit schema to define the schema of the table to be read from the database.
-
Click [+] to add at least one column to the schema. In this scenario, the columns added to the schema are age and name.
-
Fill the Partition field with "match_age=26".
-
Do the rest of the settings in the same way as configuring tHCatalogOperation. The query this read roughly corresponds to is sketched below.
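The read performed by tHCatalogInput in this section corresponds roughly to a partition-filtered query such as the one below (sketch only; the component reads through HCatalog rather than by running this statement).
-- Read the age and name columns from the match_age=26 partition only.
USE talend;
SELECT age, name
FROM Customer
WHERE match_age = 26;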
Outputting the data read from the table in HDFS to the console
-
Double-click tLogRow to open its Basic settings view.
-
Click Sync columns to retrieve the schema defined in the preceding component.
-
Select Table from the Mode area.
Job execution
Press Ctrl+S to save your Job and F6 to execute it.
The data of the restricted table read from HDFS is displayed on the console.
Type http://talend-hdp:50075/browseDirectory.jsp?dir=/user/hdp/Customer&namenodeInfoPort=50070
into the address bar of your browser to view the table you created:
Click the Customer.csv link to view the content
of the table you created.