tHiveOutput
Connects to a given Hive database and writes the data it receives into a given Hive
table or a directory in HDFS.
When ACID is enabled on the Hive side, a Spark Job can neither delete nor update a Hive table, and unless the data has been compacted, the Job cannot correctly read aggregated data from a Hive table either. This is a known limitation tracked in the Spark bug tracking system: https://issues.apache.org/jira/browse/SPARK-15348.
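As a hedged illustration of this limitation, the following PySpark sketch (plain Spark code, not the Java code a Talend Job generates) shows the kind of Hive read such a Job performs; the sales.orders table and the status column are hypothetical, and the compaction statement shown in the comment must be run on the Hive side, not through Spark.

    # Minimal PySpark sketch (not Talend-generated code); "sales.orders"
    # and "status" are hypothetical names.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive_read_sketch")
        .enableHiveSupport()  # lets Spark resolve tables in the Hive metastore
        .getOrCreate()
    )

    # If sales.orders is a transactional (ACID) table, this read returns
    # correct aggregates only after the table has been compacted on the
    # Hive side, for example by running in Hive (not in Spark):
    #   ALTER TABLE sales.orders COMPACT 'major';
    df = spark.table("sales.orders")
    df.groupBy("status").count().show()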
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- Spark Batch: see tHiveOutput properties for Apache Spark Batch. The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Streaming: see tHiveOutput properties for Apache Spark Streaming. This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
tHiveOutput properties for Apache Spark Batch
These properties are used to configure tHiveOutput running in the Spark Batch Job framework.
The Spark Batch tHiveOutput component belongs to the Databases family.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
Basic settings
Hive storage configuration | Select the tHiveConfiguration component from which you want Spark to use the configuration details to connect to Hive. |
HDFS Storage configuration | Select the tHDFSConfiguration component from which you want Spark to use the configuration details to connect to a given HDFS system and transfer the dependent jar files to this HDFS system. |
Schema and Edit Schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
 | Built-In: You create and store the schema locally for this component only. |
 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Output source | Select the type of the output data you want tHiveOutput to change: a Hive table or a directory in HDFS. |
Save mode | Select the type of changes you want to make regarding the target Hive table, for example whether to append data to it or overwrite its existing content (see the sketch following this table). |
Enable Hive partitions | Select the Enable Hive partitions check box and, in the Partition keys table, define the partitions for the Hive table you are creating or changing. |
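To make the Save mode and partition options concrete, here is a hedged PySpark sketch of the underlying Spark write they correspond to; this is not the code a Talend Job generates, and the table and column names are hypothetical.

    # Hedged PySpark sketch of the Spark write behind "Save mode" and
    # "Enable Hive partitions"; sales.staging_orders, sales.orders and
    # country are hypothetical names.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.table("sales.staging_orders")

    (
        df.write
        .mode("append")          # roughly what "Save mode" selects
        .partitionBy("country")  # roughly what "Enable Hive partitions" sets up
        .saveAsTable("sales.orders")
    )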
Advanced settings
Sort columns alphabetically | Select this check box to sort the schema columns in alphabetical order. If you leave this check box clear, the columns keep the order defined in the schema editor. |
Use Timestamp format for Date type | Select this check box to output the dates, hours, minutes and seconds contained in your Date-type data; if you clear it, only years, months and days are output (see the sketch following this table). The format used by Deltalake is yyyy-MM-dd HH:mm:ss. |
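A hedged PySpark sketch of the difference this option makes in the written values; the event_time column is hypothetical, and date_format is used here only to make the two output shapes visible.

    # Hedged sketch: timestamp output vs. date-only output.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import current_timestamp, date_format

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1).select(current_timestamp().alias("event_time"))

    # Check box selected: hours, minutes and seconds are kept.
    df.select(date_format("event_time", "yyyy-MM-dd HH:mm:ss")).show(truncate=False)

    # Check box cleared: only years, months and days are kept.
    df.select(date_format("event_time", "yyyy-MM-dd")).show(truncate=False)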
Usage
Usage rule |
This component is used as an end component and requires an input link. This component should use a tHiveConfiguration component present in the same Job to connect to Hive. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis. |
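Outside Talend, the closest plain-Spark equivalent of pointing a Job at such a transfer directory is the YARN staging directory setting; a minimal hedged sketch, assuming a YARN cluster and a hypothetical HDFS path.

    # Hedged sketch: staging directory for files shipped with a
    # YARN-submitted Spark application; the HDFS path is hypothetical.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("yarn")
        .config("spark.yarn.stagingDir", "hdfs:///user/talend/spark_staging")
        .getOrCreate()
    )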
Related scenarios
For a scenario about how to use the same type of component in a Spark Batch Job, see Writing and reading data from MongoDB using a Spark Batch Job.
tHiveOutput properties for Apache Spark Streaming
These properties are used to configure tHiveOutput running in the Spark Streaming Job framework.
The Spark Streaming tHiveOutput component belongs to the Databases family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Hive storage configuration | Select the tHiveConfiguration component from which you want Spark to use the configuration details to connect to Hive. |
HDFS Storage configuration | Select the tHDFSConfiguration component from which you want Spark to use the configuration details to connect to a given HDFS system and transfer the dependent jar files to this HDFS system. |
Schema and Edit Schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
 | Built-In: You create and store the schema locally for this component only. |
 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Output source | Select the type of the output data you want tHiveOutput to change: a Hive table or a directory in HDFS. |
Save mode | Select the type of changes you want to make regarding the target Hive table, for example whether to append data to it or overwrite its existing content (see the sketch following this table). |
Enable Hive partitions | Select the Enable Hive partitions check box and, in the Partition keys table, define the partitions for the Hive table you are creating or changing. |
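To ground this in Spark terms, here is a hedged Structured Streaming sketch of appending each micro-batch to a Hive table, roughly what this component does in a Spark Streaming Job; this is not Talend-generated code, and the rate source and the sales.events table are hypothetical stand-ins.

    # Hedged sketch: writing each streaming micro-batch to a Hive table.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical streaming source; a real Job would read from Kafka, etc.
    stream_df = spark.readStream.format("rate").load()

    def write_batch(batch_df, batch_id):
        # Each micro-batch is appended to the target table.
        batch_df.write.mode("append").saveAsTable("sales.events")

    query = stream_df.writeStream.foreachBatch(write_batch).start()
    query.awaitTermination()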
Advanced settings
Sort columns alphabetically | Select this check box to sort the schema columns in alphabetical order. If you leave this check box clear, the columns keep the order defined in the schema editor. |
Use Timestamp format for Date type | Select this check box to output the dates, hours, minutes and seconds contained in your Date-type data; if you clear it, only years, months and days are output. The format used by Deltalake is yyyy-MM-dd HH:mm:ss. |
Usage
Usage rule |
This component is used as an end component and requires an input link. This component should use a tHiveConfiguration component present in the same Job to connect to Hive. This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis. |
Related scenarios
For a scenario about how to use the same type of component in a Spark Streaming Job, see
Reading and writing data in MongoDB using a Spark Streaming Job.