tDeltaLakeOutput
Writes records in the Delta Lake layer of your Data Lake system in the Parquet format.
Delta Lake is an open source storage layer that brings ACID (Atomicity,
Consistency, Isolation, Durability) transactions and scalable metadata handling to
Data Lakes, and unifies streaming and batch data processing. In concrete terms,
data stored in Delta Lake takes the shape of versioned Parquet files accompanied by their
transaction logs.
For further information, see the Delta Lake documentation at https://docs.delta.io/latest/index.html.
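To make that layout concrete, here is a minimal PySpark sketch written outside of Talend Studio; it assumes the delta-spark package is installed, and the path /tmp/delta/customers and the columns id and name are hypothetical. Writing a DataFrame in the delta format produces Parquet data files plus a _delta_log directory whose JSON commit files version the table.

    # Minimal sketch, assuming Spark with the delta-spark package is available.
    import os
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = (SparkSession.builder
               .appName("delta-layout-demo")
               .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    path = "/tmp/delta/customers"  # hypothetical target folder
    df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
    df.write.format("delta").mode("overwrite").save(path)

    # On disk: Parquet data files plus a _delta_log directory holding the
    # JSON transaction log that records every version of the table.
    print(sorted(os.listdir(path)))                              # part-*.parquet, _delta_log
    print(sorted(os.listdir(os.path.join(path, "_delta_log"))))  # 00000000000000000000.json, ...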
tDeltaLakeOutput properties for Apache Spark Batch
These properties are used to configure tDeltaLakeOutput running in the Spark Batch Job framework.
The Spark Batch
tDeltaLakeOutput component belongs to the Technical family.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
Basic settings
Define a storage configuration component |
Select the configuration component to be used to provide the configuration information for the connection to the target file system, such as HDFS. If you leave this check box clear, the target file system is the local system. The configuration component to be used must be present in the same Job. |
Property type |
Either Built-In or Repository. |
Built-In: No property data stored centrally. |
Repository: Select the repository file where the properties are stored. The properties are stored centrally under the Hadoop Cluster node of the Repository tree. The fields that come after are pre-filled in using the fetched data. For further information about the Hadoop Cluster node, see the related Talend documentation. |
Schema and Edit Schema |
A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. |
Spark automatically infers data types for the columns in a Parquet schema. |
Built-In: You create and store the schema locally for this component only. |
Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs. |
Folder/File |
Browse to, or enter the path pointing to the data to be used in the file system. The button for browsing does not work with the Spark Local mode; if you are using the other Spark modes, ensure that you have correctly configured the connection in a configuration component in the same Job, such as tHDFSConfiguration. |
Action |
Select the operation for writing data to the file system to which the
configuration component in your Job provides the connection information.
Delta Lake systematically creates slight differences between the upload time of a file and the metadata timestamp of this file. Bear these differences in mind when you need to filter data on time (see the sketch after this table). |
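Because of those slight differences, the commit timestamps that Delta Lake records in its transaction log are the safer reference when filtering by time. As an illustrative sketch outside of Talend Studio, reusing the spark session and path variables from the sketch above (the timestamp literal is hypothetical), the table history and version- or timestamp-based reads expose those commit timestamps directly:

    from delta.tables import DeltaTable

    # Commit timestamps come from the transaction log, not from the file
    # system modification time of the underlying Parquet files.
    (DeltaTable.forPath(spark, path)
        .history()
        .select("version", "timestamp", "operation")
        .show(truncate=False))

    # Time travel: read an earlier state of the table by version number ...
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

    # ... or by a commit timestamp (hypothetical value; it must fall within
    # the range of existing commits).
    old = (spark.read.format("delta")
           .option("timestampAsOf", "2024-01-01 00:00:00")
           .load(path))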
Advanced settings
Define column partitions | Select this check box and complete the table that is displayed using columns from the schema of the incoming data. The values of the selected columns are used as keys to partition your data, as shown in the sketch after this table. |
Sort columns alphabetically | Select this check box to sort the schema columns in alphabetical order. If you leave this check box clear, these columns keep the order defined in the schema editor. |
Use Timestamp format for Date type |
Select this check box to output the dates, hours, minutes and seconds contained in your Date-type data. If you clear this check box, only years, months and days are output. The format used by Delta Lake is yyyy-MM-dd HH:mm:ss. |
Merge Schema | The schema of your datasets often evolves through time. Select this check box to merge the schemas of the incoming data and the existing data when their schemas are different. If you leave this check box clear and the two schemas are different, the write operation is rejected by Delta Lake schema enforcement. See the sketch after this table. |
Overwrite Schema |
The schema of your datasets often evolves through time. Select this check box to replace the schema of the existing data with the schema of the incoming data when you overwrite the data. If you leave this check box and the Merge Schema check box clear, incoming data whose schema differs from the existing data is rejected. |
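These advanced settings correspond to standard options of the Delta Lake writer. The sketch below, again outside of Talend Studio and reusing the spark session and df DataFrame from the first sketch (the part_path folder and the extra country column are hypothetical), shows column partitioning, schema merging on append, and schema overwriting; note that overwriteSchema only takes effect together with an overwrite.

    from pyspark.sql.functions import lit

    part_path = "/tmp/delta/customers_partitioned"  # hypothetical folder

    # Column partitions: one sub-directory per distinct value of the key column.
    df.write.format("delta").mode("overwrite").partitionBy("name").save(part_path)

    # Merge Schema: append rows carrying an extra column; without mergeSchema
    # the append would be rejected by Delta Lake schema enforcement.
    df_extra = df.withColumn("country", lit("fr"))
    (df_extra.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save(part_path))

    # Overwrite Schema: replace the stored schema (and partitioning) entirely;
    # this option only applies when the data itself is overwritten.
    (df_extra.select("id", "country").write.format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .save(part_path))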
Usage
Usage rule |
This component is used as an end component and requires an input link. Delta Lake systematically creates slight differences between the upload time of a file and the metadata timestamp of this file. Bear these differences in mind when you need to filter data. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |