
tSparkConnection

Warning

This component is available in the Palette of Talend Studio only if you
have subscribed to one of the Talend solutions with Big Data.

tSparkConnection properties

Component family

Big Data / Spark

 

Function

tSparkConnection creates a
connection to a given Spark environment, such as a Spark-enabled
Hadoop cluster.

Purpose

tSparkConnection creates a Spark
connection that the other Spark components can reuse within the same
Job.

Basic settings

Spark mode

Select the type of the Spark environment you need to connect to.

  • Local: the Studio
    builds the Spark environment within itself at runtime to run
    the Job locally within the Studio. With this mode, each
    processor of the local machine is used as a Spark worker
    to perform the computations.

    Note that this local machine is the machine on which the
    Job is actually run.

  • Standalone: the
    Studio connects to a Spark-enabled cluster to run the
    Job from this cluster.

  • Yarn client: the
    Studio runs the Spark driver to orchestrate how the Job
    should be performed, and then sends the orchestration to
    the Yarn service of a given Hadoop cluster so that the
    Resource Manager of this Yarn service requests execution
    resources accordingly.
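
For reference, these modes correspond to the master setting that a Spark
application passes to its configuration. The following is a minimal sketch,
assuming a plain Spark Java application rather than the code the Studio
actually generates; the master URI and host name are hypothetical.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkModeSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("spark_mode_sketch")
                    // Local mode: one Spark worker per processor of the local machine.
                    .setMaster("local[*]");
                    // Standalone mode would use the Spark Master URI instead,
                    // for example "spark://master-host:7077"; Yarn client mode
                    // would use "yarn-client" (Spark 1.x syntax).
            JavaSparkContext sc = new JavaSparkContext(conf);
            sc.stop();
        }
    }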

Version

Distribution

Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:

  • If available in this Distribution drop-down list, the
    Microsoft HD Insight option allows you to use a
    Microsoft HD Insight cluster. For this purpose, you need to configure the
    connections to the WebHCat service, the HD Insight service and the Windows Azure
    Storage service of that cluster in the areas that are displayed. A demonstration
    video about how to configure this connection is available at the following link:
    https://www.youtube.com/watch?v=A3QTT6VsNoM

  • The Custom option allows you to connect to a
    cluster different from any of the distributions given in this list, that is to
    say, to connect to a cluster not officially supported by Talend.

To connect to a custom distribution, once you have selected Custom, click the [...] button to display the dialog box in which you can
alternatively:

  1. Select Import from existing version to import an
    officially supported distribution as base and then add other required jar files
    which the base distribution does not provide.

  2. Select Import from zip to import a custom
    distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.

    Note

    In this dialog box, the active check box must be kept selected so that the jar
    files pertinent to the connection to be created between the custom distribution
    and this component are imported.

    For a step-by-step example about how to connect to a custom distribution and
    share this connection, see Connecting to a custom Hadoop distribution.

 

Hadoop version

Select the version of the Hadoop distribution you are using along with Spark.

 

Spark host

Enter the URI of the Spark Master of the Hadoop cluster to be used.

This field is available only when you are using the Standalone mode.

Spark home

Enter the location of the Spark executable installed in the Hadoop cluster to be
used.

This field is available only when you are using the Standalone mode.
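
As an illustration only, a Standalone connection typically uses values of the
following form; the host, port, and path are hypothetical and depend on your
cluster:

    Spark host: spark://master-host:7077
    Spark home: /usr/lib/spark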

 

Hadoop configuration

This field is available only when you are using the Yarn client mode. From this field, you
need to browse to the local jar file that contains the configuration of the Yarn service to
be used.

The configuration files that must be present in this jar file are:

  • core-site.xml

  • hdfs-site.xml

  • mapred-site.xml

  • yarn-site.xml

At runtime, the jar file is added to the Studio and becomes available in the
drop-down list of jar files that you can access, for example, from this Hadoop configuration field. Therefore, when you create a second
Spark connection using the tSparkConnection component, you
can select this jar file directly from this jar file list.
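
For context, these four files are the standard Hadoop client-side configuration
files from which a Hadoop Configuration object resolves the cluster addresses.
A minimal sketch, assuming the files are available on the classpath (for
example, extracted from such a jar):

    import org.apache.hadoop.conf.Configuration;

    public class HadoopConfSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.addResource("core-site.xml");   // default file system (fs.defaultFS)
            conf.addResource("hdfs-site.xml");   // HDFS client settings
            conf.addResource("mapred-site.xml"); // MapReduce framework settings
            conf.addResource("yarn-site.xml");   // Resource Manager address
            System.out.println(conf.get("yarn.resourcemanager.address"));
        }
    }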

 

Execute this Job as a streaming application

Select this check box to make the Job continuously run within a
specified time frame in order to process data in real time.

If you leave this check box cleared, the Job switches to batch
mode and processes data in batches. In this mode, the Job no
longer runs continuously: it runs only when you start it and
stops once the execution is done.

 

Define the driver hostname or IP
address

Enter the host name or the IP address of the machine on which the
Job is to be run. This allows the Spark master and its workers to
recognize this machine and thus locate the Job and its driver.

Using this feature, you are actually defining the
spark.driver.host property. For further information
about this property, see Apache’s documentation about Spark.

Streaming

Batch size

This parameter is required by the streaming mode and thus is
available only when you have selected the Execute this Job as a streaming application check
box.

Enter the time interval at the end of which the Job reviews the
source data to identify changes and processes the detected
changes.

 

Define a streaming timeout

Enter the time frame at the end of which the streaming Job
automatically stops running.

Before the timeout, you can manually stop the Job by killing it
when you need to.
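
To relate these two parameters to what a Spark Streaming program does, here is
a minimal sketch in plain Spark Java code (not the code the Studio generates);
the 2-second batch interval, the 60-second timeout, and the localhost:9999
source are arbitrary examples:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class StreamingSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("streaming_sketch").setMaster("local[*]");
            // Batch size: the interval at the end of which new source data is processed.
            JavaStreamingContext jssc =
                    new JavaStreamingContext(conf, Durations.seconds(2));
            jssc.socketTextStream("localhost", 9999).print();
            jssc.start();
            // Streaming timeout: stop automatically after the given time frame.
            jssc.awaitTerminationOrTimeout(60000L);
            jssc.stop();
        }
    }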

Advanced settings

Advanced properties

Add any Spark properties you need to use to override their default
counterparts used by the Studio.

When you are using the Standalone
Spark mode, you need to set the spark.driver.host parameter to let the Spark cluster
recognize the machine hosting the Studio and fetch the Job you are
creating from there. For example, if the IP address of the machine
hosting the Studio is 192.168.1.16, then you need to put spark.driver.host in the Property column and 192.168.1.16 in the Value column.
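
Expressed as plain Spark configuration code, this property amounts to the
following sketch; the master URI is hypothetical and 192.168.1.16 is the
example address from above:

    import org.apache.spark.SparkConf;

    public class DriverHostSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("driver_host_sketch")
                    .setMaster("spark://master-host:7077")     // hypothetical Standalone master
                    .set("spark.driver.host", "192.168.1.16"); // machine hosting the Studio
            System.out.println(conf.get("spark.driver.host"));
        }
    }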

 

tStatCatcher Statistics

Select this check box to collect log data at the component
level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

SPARK_CONTEXT: the connection configuration to the Spark
service being used. This is an After variable and it returns an object.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.
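
In the Java code that the Studio generates, After variables are read from the
Job's globalMap. A minimal sketch of retrieving these variables, for example
in a tJava component; the instance name tSparkConnection_1 is hypothetical and
must match the actual component label in your Job:

    // Inside a tJava component (globalMap is provided by the generated Job code).
    Object sparkContext = globalMap.get("tSparkConnection_1_SPARK_CONTEXT");
    String errorMessage = (String) globalMap.get("tSparkConnection_1_ERROR_MESSAGE");
    System.out.println("Spark context available: " + (sparkContext != null));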

Usage

This component is used standalone to establish a Spark connection
that the other Spark Subjobs reuse to proceed with more
sophisticated operations.

Prerequisites

When you use the Standalone or the Yarn client mode, you need to connect to a given Hadoop cluster.

The Hadoop distribution must be properly installed so as to guarantee the interaction
with Talend Studio. The following list presents MapR-related information as an
example.

  • Ensure that you have installed the MapR client on the machine where the Studio is,
    and added the MapR client library to the PATH variable of that machine. According
    to MapR's documentation, the library or libraries of a MapR client corresponding to
    each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native.
    For example, the library for Windows is \lib\native\MapRClient.dll in the MapR
    client jar file. For further information, see the following link from MapR:
    http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.

    Without adding the specified library or libraries, you may encounter the following
    error: no MapRClient in java.library.path.

  • Set the -Djava.library.path argument, for example, in the Job Run VM arguments area
    of the Run/Debug view in the [Preferences] dialog box (see the example below). This
    argument provides the Studio with the path to the native library of that MapR
    client. This allows subscription-based users to make full use of the Data viewer
    to view locally in the Studio the data stored in MapR. For further information
    about how to set this argument, see the section describing how to view data in
    Talend Big Data Getting Started Guide.
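
For illustration, the argument takes the following form; substitute the actual
lib\native directory of your MapR client for the placeholder path:

    -Djava.library.path="MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native"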

For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.

Limitations

It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to coordinate a Spark Job and a
non-Spark Job, or to use the tHDFSPut component along with the Spark components in the
same Job.

Related scenario

For a related scenario, see Scenario: streaming Twitter feed of a given Twitter account.

