Component family
|
Big Data / Spark
|
|
Function
|
tSparkConnection creates a
connection to a given Spark environment, such as a Spark-enabled
Hadoop cluster.
|
Purpose
|
tSparkConnection creates a Spark
connection that the other Spark components can reuse within the same
Job.
|
Basic settings
|
Spark mode
|
Select the type of the Spark environment you need to connect to.
-
Local: the Studio
builds the Spark environment in itself at runtime to run
the Job locally within the Studio. In this mode, each
processor of the local machine is used as a Spark worker
to perform the computations.
Note that this local machine is the machine on which the
Job is actually run.
-
Standalone: the
Studio connects to a Spark-enabled cluster to run the
Job from this cluster.
-
Yarn client: the
Studio runs the Spark driver to orchestrate how the Job
should be performed and then sends the orchestration to
the Yarn service of a given Hadoop cluster so that the
Resource Manager of this Yarn service requests execution
resources accordingly.
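As a rough illustration, these three modes correspond to the standard Spark master settings. The following minimal Java sketch shows that mapping; master-host is an assumed host name, and the code the Studio actually generates may differ.
import org.apache.spark.SparkConf;

public class SparkModeSketch {
    public static void main(String[] args) {
        // Local mode: every processor of the local machine acts as a Spark worker.
        SparkConf local = new SparkConf().setAppName("MyJob").setMaster("local[*]");

        // Standalone mode: connect to the Spark master of a cluster.
        // master-host and port 7077 are assumed values.
        SparkConf standalone = new SparkConf().setAppName("MyJob").setMaster("spark://master-host:7077");

        // Yarn client mode: the driver runs on the local machine and the YARN
        // Resource Manager allocates the execution resources.
        // "yarn-client" is the Spark 1.x value; later Spark versions use "yarn" with client deploy mode.
        SparkConf yarnClient = new SparkConf().setAppName("MyJob").setMaster("yarn-client");
    }
}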
|
Version
|
Distribution
|
Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:
-
If available in this Distribution drop-down list, the
Microsoft HD Insight option allows you to use a
Microsoft HD Insight cluster. For this purpose, you need to configure the
connections to the WebHCat service, the HD Insight service and the Windows Azure
Storage service of that cluster in the areas that are displayed. A demonstration
video about how to configure this connection is available at the following link:
https://www.youtube.com/watch?v=A3QTT6VsNoM
-
The Custom option allows you to connect to a
cluster different from any of the distributions given in this list, that is to
say, to connect to a cluster not officially supported by Talend.
To connect to a custom distribution, once you have selected Custom, click the button to display the dialog box in which you can
alternatively:
-
Select Import from existing version to import an
officially supported distribution as base and then add other required jar files
which the base distribution does not provide.
-
Select Import from zip to import a custom
distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.
Note
In this dialog box, the active check box must be kept selected so as to import
the jar files pertinent to the connection to be created between the custom
distribution and this component.
For a step-by-step example about how to connect to a custom distribution and
share this connection, see Connecting to a custom Hadoop distribution.
|
|
Hadoop version
|
Select the version of the Hadoop distribution you are using along with Spark.
|
|
Spark host
|
Enter the URI of the Spark Master of the Hadoop cluster to be used.
This field is available only when you are using the Standalone mode.
|
|
Spark home
|
Enter the location of the Spark executable installed in the Hadoop cluster to be
used.
This field is available only when you are using the Standalone mode.
|
|
Hadoop configuration
|
This field is available only when you are using the Yarn client mode. From this field, you
need to browse to the local jar file that contains the configuration of the Yarn service to
be used.
The configuration files that must be present in this jar file are:
-
core-site.xml
-
hdfs-site.xml
-
mapred-site.xml
-
yarn-site.xml
At runtime, the jar file is added to the Studio and becomes available in the
drop-down list of jar files that you can access, for example, from this Hadoop configuration field. Therefore, when you create a second
Spark connection using the tSparkConnection component, you
can select this jar file directly from this jar file list.
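For example, assuming the four XML files are in the current directory, such a jar file can be built with the standard jar tool: jar cf yarn-conf.jar core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml (the name yarn-conf.jar is an arbitrary example).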
|
|
Execute this Job as a streaming application
|
Select this check box to make the Job run continuously within a
specified time frame in order to process data in real time.
If you leave this check box cleared, the Job runs in batch
mode to process data in batches. In this mode, the Job no longer
runs continuously; it runs only when you start it and
stops once the execution is done.
|
|
Define the driver hostname or IP
address
|
Enter the host name or the IP address of the machine on which the
Job is to be run. This allows the Spark master and its workers to
locate this machine and thus find the Job and its driver.
Using this feature, you are actually defining the
spark.driver.host property. For further information
about this property, see Apache’s documentation about Spark.
|
Streaming
|
Batch size
|
This parameter is required by the streaming mode and thus is
available only when you have selected the Execute this Job as a streaming application check
box.
Enter the time interval at the end of which the Job reviews the
source data to identify changes and processes the detected
changes.
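For illustration, a minimal Java sketch of what this batch size corresponds to in Spark Streaming terms, assuming a 5-second interval and a local master; the code the Studio generates may differ.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// The batch size maps to the batch duration of the streaming context:
// data collected during each 5-second interval is processed as one micro-batch.
SparkConf conf = new SparkConf().setAppName("StreamingJob").setMaster("local[*]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));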
|
|
Define a streaming timeout
|
Enter the time frame at the end of which the streaming Job
automatically stops running.
Before the timeout, you can manually stop the Job by killing it
when you need to.
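Continuing the JavaStreamingContext sketch above (jssc), the timeout behaves like Spark Streaming's termination timeout; the 10-minute value is an assumption.
jssc.start();
// Stop automatically once 10 minutes (expressed in milliseconds) have elapsed.
jssc.awaitTerminationOrTimeout(10 * 60 * 1000L);
// Also stop the underlying SparkContext and let in-flight batches finish gracefully.
jssc.stop(true, true);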
|
Advanced settings
|
Advanced properties
|
Add any Spark properties you need to use to override their default
counterparts used by the Studio.
When you are using the Standalone
Spark mode, you need to set the spark.driver.host parameter to let the Spark cluster
recognize the machine hosting the Studio and fetch the Job you are
creating from there. For example, if the IP address of the machine
hosting the Studio is 192.168.1.16, then you need to put spark.driver.host in the Property column and 192.168.1.16 in the Value column.
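For illustration, a minimal Java sketch of the effect of this property, reusing the example IP address above; master-host is an assumed host name and does not come from the Studio.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// spark.driver.host lets the Standalone cluster reach back to the machine hosting the Studio.
SparkConf conf = new SparkConf()
        .setAppName("MyJob")
        .setMaster("spark://master-host:7077")
        .set("spark.driver.host", "192.168.1.16");
JavaSparkContext sc = new JavaSparkContext(conf);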
|
|
tStatCatcher Statistics
|
Select this check box to collect log data at the component
level.
|
Global Variables
|
ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.
SPARK_CONTEXT: the connection configuration to the Spark
service being used. This is an After variable and it returns an object.
A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.
To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio
User Guide.
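For illustration, a sketch of how these variables are typically read from a tJava component placed after this component; the instance name tSparkConnection_1 is an assumption.
// After tSparkConnection_1 has executed, its After variables are available in globalMap.
String errorMessage = (String) globalMap.get("tSparkConnection_1_ERROR_MESSAGE");
Object sparkContext = globalMap.get("tSparkConnection_1_SPARK_CONTEXT");
System.out.println("Connection error (if any): " + errorMessage);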
|
Usage
|
This component is used standalone to establish a Spark connection
that the other Spark Subjobs reuse to carry out more
sophisticated operations.
|
Prerequisites
|
When you use the Standalone or the Yarn client mode, you need to connect to a given Hadoop cluster.
The Hadoop distribution must be properly installed to guarantee its interaction
with Talend Studio. The following list presents MapR-related information as an
example.
-
Ensure that you have installed the MapR client on the machine where the Studio is,
and added the MapR client library to the PATH variable of that machine. According
to MapR's documentation, the library or libraries of a MapR client corresponding to
each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native.
For example, the library for Windows is lib\native\MapRClient.dll in the MapR
client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.
Without adding the specified library or libraries, you may encounter the following
error: no MapRClient in java.library.path.
-
Set the -Djava.library.path argument, for example, in the Job Run VM arguments area
of the Run/Debug view in the [Preferences] dialog box. This argument provides the Studio with the
path to the native library of that MapR client. This allows subscription-based
users to make full use of the Data viewer to view,
locally in the Studio, the data stored in MapR. For further information about how to
set this argument, see the section describing how to view data in the Talend Big Data Getting Started Guide.
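For example, the argument typically takes the form -Djava.library.path=MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native, where MAPR_INSTALL and VERSION stand for the actual installation path and version of your MapR client.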
For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.
|
Limitations
|
It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.
|