July 30, 2023

tRedshiftConfiguration – Docs for ESB 7.x

tRedshiftConfiguration

Reuses the connection configuration to a Redshift database in the same
Job.

tRedshiftConfiguration provides the
connection information to a given Redshift database for the Redshift
related components used in the same Spark Job. The Spark cluster to be
used reads this configuration to eventually connect to Redshift.

Depending on the Talend product you are using, this component can be used in
one or both of the following Job frameworks: Spark Batch and Spark Streaming.

tRedshiftConfiguration properties for Apache Spark Batch

These properties are used to configure tRedshiftConfiguration running in the Spark Batch Job framework.

The Spark Batch
tRedshiftConfiguration component belongs to the Storage and the Databases families.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the
properties are stored.

Host

Enter the endpoint of the database you need to connect to in
Redshift.

Port

Enter the port number of the database you need to connect to in
Redshift.

The related information can be found in the Cluster Database
Properties area in the Web console of your Redshift.

For further information, see Managing clusters console.

Username and Password

Enter the authentication information to the Redshift database you
need to connect to.

To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings.

Database

Enter the name of the database you need to connect to in
Redshift.

The related information can be found in the Cluster Database
Properties area in the Web console of your Redshift.

For further information, see Managing clusters console.

The S3 bucket and the Redshift database to be used must be in the same region
on Amazon. This helps avoid the S3ServiceException errors known to Amazon. For
further information about these errors, see S3ServiceException Errors.
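
The same-region requirement can be checked before running the Job. The following is a minimal sketch, not part of the component: it assumes the Host value follows the usual provisioned-cluster endpoint shape `<cluster>.<id>.<region>.redshift.amazonaws.com`, and that you already know the region of your bucket. The function names and the sample endpoint are hypothetical.

```python
from typing import Optional

# Assumed endpoint suffix for provisioned Redshift clusters.
_REDSHIFT_SUFFIX = ["redshift", "amazonaws", "com"]


def region_from_redshift_endpoint(endpoint: str) -> Optional[str]:
    """Return the region label found in a Redshift endpoint, or None."""
    # Drop an optional ":port" suffix, then split the host name into labels.
    host = endpoint.split(":", 1)[0].lower()
    labels = host.split(".")
    # The region is the label just before "redshift.amazonaws.com".
    if len(labels) > len(_REDSHIFT_SUFFIX) and labels[-3:] == _REDSHIFT_SUFFIX:
        return labels[-4]
    return None


def same_region(endpoint: str, bucket_region: str) -> bool:
    """True when the endpoint's region matches the given bucket region."""
    return region_from_redshift_endpoint(endpoint) == bucket_region.lower()


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    host = "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com"
    print(same_region(host, "us-west-2"))
```

Endpoints that do not match the assumed shape return None, so the comparison fails rather than guessing a region.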

Schema

Enter the name of the database schema to be used in Redshift. The
default schema is called PUBLIC.

A schema in terms of Redshift is similar to an operating system
directory. For further information about a Redshift schema, see Schemas.

Additional JDBC Parameters

Specify additional JDBC properties for the connection you are creating. The
properties are separated by an ampersand (&) and each property is a key-value
pair. For example, ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory,
which means the connection will be created using SSL.
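
To see how such a parameter string breaks down into key-value pairs, here is a minimal sketch. It only illustrates the format described above; how the component itself reads the field is not shown in this documentation, and the function name is hypothetical.

```python
from typing import Dict


def parse_jdbc_parameters(raw: str) -> Dict[str, str]:
    """Split an ampersand-separated 'key=value' string into a dictionary."""
    params: Dict[str, str] = {}
    for pair in raw.split("&"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing or doubled ampersand
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"not a key=value pair: {pair!r}")
        params[key.strip()] = value.strip()
    return params


if __name__ == "__main__":
    raw = "ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory"
    for key, value in parse_jdbc_parameters(raw).items():
        print(key, "=", value)
```

A check like this catches a missing "=" or a stray separator before the value is pasted into the field.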

S3 configuration

Select the tS3Configuration component from which you want Spark to use the
configuration details to connect to S3.

You need to drop the tS3Configuration component to be used alongside
tRedshiftConfiguration in the same Job so that this tS3Configuration is
displayed on the S3 configuration list.

S3 temp path

Enter the location in S3 in which the data to be
transferred from or to Redshift is temporarily stored.

This path is independent of the temporary path you need to set in the
Basic settings tab of tS3Configuration.

Usage

Usage rule

This component is used with no need to be connected to other
components.

You need to drop tRedshiftConfiguration alongside the other Redshift related Subjobs to be run
in the same Job so that the configuration is used by the whole Job at runtime.

Since Redshift uses S3 to store temporary data, you need
to drop a tS3Configuration
component alongside tRedshiftConfiguration in the same Job so that the S3 configuration
is used by the whole Job at runtime.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend
data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a
given Spark cluster for the whole Job. In addition, since the Job expects its
dependent jar files for execution, you must specify the directory in the file
system to which these jar files are transferred so that Spark can access
these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage configuration area in the
      Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is written in the Qubole HDFS
      system and destroyed once you shut down your cluster.

    • When using on-premise distributions, use the configuration component
      corresponding to the file system your cluster is using. Typically, this
      system is HDFS, so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

For a scenario about how to use the same type of component in a Spark Batch Job, see Writing and reading data from MongoDB using a Spark Batch Job.

tRedshiftConfiguration properties for Apache Spark Streaming

These properties are used to configure tRedshiftConfiguration running in the Spark Streaming Job framework.

The Spark Streaming
tRedshiftConfiguration component belongs to the Storage and the Databases families.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Property type

Either Built-In or Repository.

Built-In: No property data stored centrally.

Repository: Select the repository file where the
properties are stored.

Host

Enter the endpoint of the database you need to connect to in
Redshift.

Port

Enter the port number of the database you need to connect to in
Redshift.

The related information can be found in the Cluster Database
Properties area in the Web console of your Redshift.

For further information, see Managing clusters console.

Username and Password

Enter the authentication information to the Redshift database you
need to connect to.

To enter the password, click the […] button next to the
password field, and then in the pop-up dialog box enter the password between double quotes
and click OK to save the settings.

Database

Enter the name of the database you need to connect to in
Redshift.

The related information can be found in the Cluster Database
Properties area in the Web console of your Redshift.

For further information, see Managing clusters console.

Schema

Enter the name of the database schema to be used in Redshift. The
default schema is called PUBLIC.

A schema in terms of Redshift is similar to an operating system
directory. For further information about a Redshift schema, see Schemas.

Additional JDBC Parameters

Specify additional JDBC properties for the connection you are creating. The
properties are separated by an ampersand (&) and each property is a key-value
pair. For example, ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory,
which means the connection will be created using SSL.

S3 configuration

Select the tS3Configuration component from which you want Spark to use the
configuration details to connect to S3.

You need to drop the tS3Configuration component to be used alongside
tRedshiftConfiguration in the same Job so that this tS3Configuration is
displayed on the S3 configuration list.

S3 temp path

Enter the location in S3 in which the data to be
transferred from or to Redshift is temporarily stored.

This path is independent of the temporary path you need to set in the
Basic settings tab of tS3Configuration.

Advanced settings

Connection pool

In this area, you configure, for each Spark executor, the connection pool used to control
the number of connections that stay open simultaneously. The default values given to the
following connection pool parameters are good enough for most use cases.

  • Max total number of connections: enter the maximum number
    of connections (idle or active) that are allowed to stay open simultaneously.

    The default number is 8. If you enter -1, you allow an unlimited number of
    connections to be open at the same time.

  • Max waiting time (ms): enter the maximum time the connection pool may
    take to respond to a request for a connection. By default, it is -1,
    that is to say, infinite.

  • Min number of idle connections: enter the minimum number
    of idle connections (connections not used) maintained in the connection pool.

  • Max number of idle connections: enter the maximum number
    of idle connections (connections not used) maintained in the connection pool.
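
The four pool parameters can be pictured as a small settings record. The sketch below is an illustrative model only, not the component's implementation: it uses the two defaults stated above (8 and -1) and leaves the idle limits as required values, since their defaults are not given here. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolSettings:
    """Illustrative model of the connection pool parameters."""
    min_idle: int            # Min number of idle connections
    max_idle: int            # Max number of idle connections
    max_total: int = 8       # Max total number of connections (documented default)
    max_wait_ms: int = -1    # Max waiting time in ms (documented default)

    def allows_new_connection(self, open_count: int) -> bool:
        """-1 means unlimited; otherwise stay below max_total."""
        return self.max_total == -1 or open_count < self.max_total

    def waits_forever(self) -> bool:
        """-1 means a request for a connection never times out."""
        return self.max_wait_ms == -1


if __name__ == "__main__":
    # The idle limits below are arbitrary example values.
    settings = PoolSettings(min_idle=0, max_idle=8)
    print(settings.allows_new_connection(open_count=7))
    print(settings.allows_new_connection(open_count=8))
```

Because the pool is configured for each Spark executor, the total number of open connections across the cluster scales with the number of executors.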

Evict connections

Select this check box to define criteria to destroy connections in the connection pool. The
following fields are displayed once you have selected it.

  • Time between two eviction runs: enter the time interval
    (in milliseconds) at the end of which the component checks the status of the connections and
    destroys the idle ones.

  • Min idle time for a connection to be eligible to eviction: enter the
    time interval (in milliseconds) at the end of which the idle
    connections are destroyed.

  • Soft min idle time for a connection to be eligible to eviction: this
    parameter works the same way as Min idle time for a connection to be
    eligible to eviction but it keeps the minimum number of idle
    connections, the number you define in the Min number of idle
    connections field.
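
The difference between the two idle-time criteria is easier to see on a small example. The following sketch is a simplified model of one eviction run based only on the descriptions above; it is not the component's code. The function name is hypothetical, and passing None to disable a criterion is a convention of this sketch.

```python
from typing import List, Optional


def select_evictable(
    idle_ms: List[int],
    min_idle_time_ms: Optional[int],
    soft_min_idle_time_ms: Optional[int],
    min_idle: int,
) -> List[int]:
    """Return the indexes of idle connections one eviction run would destroy.

    idle_ms[i] is how long connection i has been idle, in milliseconds.
    """
    remaining = len(idle_ms)
    evicted: List[int] = []
    # Examine the connections that have been idle the longest first.
    order = sorted(range(len(idle_ms)), key=lambda i: idle_ms[i], reverse=True)
    for index in order:
        age = idle_ms[index]
        # Hard criterion: idle long enough, regardless of how many remain.
        hard_hit = min_idle_time_ms is not None and age >= min_idle_time_ms
        # Soft criterion: idle long enough AND more than min_idle remain.
        soft_hit = (
            soft_min_idle_time_ms is not None
            and age >= soft_min_idle_time_ms
            and remaining > min_idle
        )
        if hard_hit or soft_hit:
            evicted.append(index)
            remaining -= 1
    return sorted(evicted)


if __name__ == "__main__":
    idle = [90_000, 10_000, 70_000, 65_000]
    print(select_evictable(idle, 60_000, None, min_idle=2))
    print(select_evictable(idle, None, 60_000, min_idle=2))
```

With the hard criterion, all three connections idle for at least 60 seconds are selected. With the soft criterion and a minimum of two idle connections, only two are selected, so two idle connections stay in the pool.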

Usage

Usage rule

This component is used with no need to be connected to other components.

You need to drop tRedshiftConfiguration alongside the other Redshift related Subjobs to be run
in the same Job so that the configuration is used by the whole Job at runtime.

Since Redshift uses S3 to store temporary data, you need
to drop a tS3Configuration
component alongside tRedshiftConfiguration in the same Job so that the S3 configuration
is used by the whole Job at runtime.

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend
data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a
given Spark cluster for the whole Job. In addition, since the Job expects its
dependent jar files for execution, you must specify the directory in the file
system to which these jar files are transferred so that Spark can access
these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage configuration area in the
      Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is written in the Qubole HDFS
      system and destroyed once you shut down your cluster.

    • When using on-premise distributions, use the configuration component
      corresponding to the file system your cluster is using. Typically, this
      system is HDFS, so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

For a scenario about how to use the same type of component in a Spark Streaming Job, see
Reading and writing data in MongoDB using a Spark Streaming Job.


Document obtained from Talend: https://help.talend.com