Component family
|
Big Data / Spark
|
|
Function
|
tSparkConnection creates a
connection to a given Spark environment, such as a Spark-enabled
Hadoop cluster.
|
Purpose
|
tSparkConnection creates a Spark
connection that the other Spark components can reuse within the same
Job.
|
Basic settings
|
Spark mode
|
Select the type of the Spark environment you need to connect to.
-
Local: the Studio
builds the Spark environment in itself at runtime to run
the Job locally within the Studio. In this mode, each
processor of the local machine is used as a Spark worker
to perform the computations.
Note that this local machine is the machine on which the
Job is actually run.
-
Standalone: the
Studio connects to a Spark-enabled cluster to run the
Job from this cluster.
-
Yarn client: the
Studio runs the Spark driver to orchestrate how the Job
should be performed and then sends the orchestration to
the Yarn service of a given Hadoop cluster so that the
Resource Manager of this Yarn service requests execution
resources accordingly.
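As a rough illustration, these three modes correspond to the standard Spark master settings. The following minimal Java sketch shows that mapping; master-host is an assumed host name, and the code the Studio actually generates may differ.
import org.apache.spark.SparkConf;

public class SparkModeSketch {
    public static void main(String[] args) {
        // Local mode: every processor of the local machine acts as a Spark worker.
        SparkConf local = new SparkConf().setAppName("MyJob").setMaster("local[*]");

        // Standalone mode: connect to the Spark master of a cluster.
        // master-host and port 7077 are assumed values.
        SparkConf standalone = new SparkConf().setAppName("MyJob").setMaster("spark://master-host:7077");

        // Yarn client mode: the driver runs on the local machine and the YARN
        // Resource Manager allocates the execution resources.
        // "yarn-client" is the Spark 1.x value; later Spark versions use "yarn" with client deploy mode.
        SparkConf yarnClient = new SparkConf().setAppName("MyJob").setMaster("yarn-client");
    }
}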
|
Version
|
Distribution
|
Select the cluster you are using from the drop-down list. The options in the list vary
depending on the component you are using. Among these options, the following ones require
specific configuration:
-
If available in this Distribution drop-down list, the
Microsoft HD Insight option allows you to use a
Microsoft HD Insight cluster. For this purpose, you need to configure the
connections to the WebHCat service, the HD Insight service and the Windows Azure
Storage service of that cluster in the areas that are displayed. A demonstration
video about how to configure this connection is available at the following link:
https://www.youtube.com/watch?v=A3QTT6VsNoM
-
The Custom option allows you to connect to a
cluster different from any of the distributions given in this list, that is to
say, to connect to a cluster not officially supported by Talend.
To connect to a custom distribution, once you have selected Custom, click the button to display the dialog box in which you can
alternatively:
-
Select Import from existing version to import an
officially supported distribution as base and then add other required jar files
which the base distribution does not provide.
-
Select Import from zip to import a custom
distribution zip that, for example, you can download from http://www.talendforge.org/exchange/index.php.
Note
In this dialog box, the active check box must be kept selected so as to import
the jar files pertinent to the connection to be created between the custom
distribution and this component.
For a step-by-step example about how to connect to a custom distribution and
share this connection, see Connecting to a custom Hadoop distribution.
|
|
Hadoop version
|
Select the version of the Hadoop distribution you are using along with Spark.
|
|
Spark host
|
Enter the URI of the Spark Master of the Hadoop cluster to be used.
This field is available only when you are using the Standalone mode.
|
|
Spark home
|
Enter the location of the Spark executable installed in the Hadoop cluster to be
used.
This field is available only when you are using the Standalone mode.
|
|
Hadoop configuration
|
This field is available only when you are using the Yarn client mode. From this field, you
need to browse to the local jar file that contains the configuration of the Yarn service to
be used.
The configuration files that must be present in this jar file are:
-
core-site.xml
-
hdfs-site.xml
-
mapred-site.xml
-
yarn-site.xml
At runtime, the jar file is added to the Studio and becomes available in the
drop-down list of jar files that you can access, for example, from this Hadoop configuration field. Therefore, when you create a second
Spark connection using the tSparkConnection component, you
can select this jar file directly from this jar file list.
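For example, assuming the four XML files are in the current directory, such a jar file can be built with the standard jar tool: jar cf yarn-conf.jar core-site.xml hdfs-site.xml mapred-site.xml yarn-site.xml (the name yarn-conf.jar is an arbitrary example).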
|
|
Execute this Job as a streaming application
|
Select this check box to make the Job run continuously within a
specified time frame in order to process data in real time.
If you leave this check box cleared, the Job runs in batch
mode to process data in batches. In this mode, the Job no longer
runs continuously; it runs only when you start it and
stops once the execution is done.
|
|
Define the driver hostname or IP
address
|
Enter the host name or the IP address of the machine on which the
Job is to be run. This allows the Spark master and its workers to
locate this machine and thus find the Job and its driver.
Using this feature, you are actually defining the
spark.driver.host property. For further information
about this property, see Apache’s documentation about Spark.
|
Streaming
|
Batch size
|
This parameter is required by the streaming mode and thus is
available only when you have selected the Execute this Job as a streaming application check
box.
Enter the time interval at the end of which the Job reviews the
source data to identify changes and processes the detected
changes.
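For illustration, a minimal Java sketch of what this batch size corresponds to in Spark Streaming terms, assuming a 5-second interval and a local master; the code the Studio generates may differ.
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// The batch size maps to the batch duration of the streaming context:
// data collected during each 5-second interval is processed as one micro-batch.
SparkConf conf = new SparkConf().setAppName("StreamingJob").setMaster("local[*]");
JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));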
|
|
Define a streaming timeout
|
Enter the time frame at the end of which the streaming Job
automatically stops running.
Before the timeout, you can manually stop the Job by killing it
when you need to.
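Continuing the JavaStreamingContext sketch above (jssc), the timeout behaves like Spark Streaming's termination timeout; the 10-minute value is an assumption.
jssc.start();
// Stop automatically once 10 minutes (expressed in milliseconds) have elapsed.
jssc.awaitTerminationOrTimeout(10 * 60 * 1000L);
// Also stop the underlying SparkContext and let in-flight batches finish gracefully.
jssc.stop(true, true);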
|
Advanced settings
|
Advanced properties
|
Add any Spark properties you need to use to override their default
counterparts used by the Studio.
When you are using the Standalone
Spark mode, you need to set the spark.driver.host parameter to let the Spark cluster
recognize the machine hosting the Studio and fetch the Job you are
creating from there. For example, if the IP address of the machine
hosting the Studio is 192.168.1.16, then you need to put spark.driver.host in the Property column and 192.168.1.16 in the Value column.
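For illustration, a minimal Java sketch of the effect of this property, reusing the example IP address above; master-host is an assumed host name and does not come from the Studio.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// spark.driver.host lets the Standalone cluster reach back to the machine hosting the Studio.
SparkConf conf = new SparkConf()
        .setAppName("MyJob")
        .setMaster("spark://master-host:7077")
        .set("spark.driver.host", "192.168.1.16");
JavaSparkContext sc = new JavaSparkContext(conf);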
|
|
tStatCatcher Statistics
|
Select this check box to collect log data at the component
level.
|
Global Variables
|
ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.
SPARK_CONTEXT: the connection configuration to the Spark
service being used. This is an After variable and it returns an object.
A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.
To fill in a field or expression with a variable, press Ctrl +
Space to access the variable list and choose the variable to use from it.
For further information about variables, see Talend Studio
User Guide.
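For illustration, a sketch of how these variables are typically read from a tJava component placed after this component; the instance name tSparkConnection_1 is an assumption.
// After tSparkConnection_1 has executed, its After variables are available in globalMap.
String errorMessage = (String) globalMap.get("tSparkConnection_1_ERROR_MESSAGE");
Object sparkContext = globalMap.get("tSparkConnection_1_SPARK_CONTEXT");
System.out.println("Connection error (if any): " + errorMessage);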
|
Usage
|
This component is used standalone to establish a Spark connection
that the other Spark Subjobs reuse to carry out more
sophisticated operations.
|
Prerequisites
|
When you use the Standalone or the Yarn client mode, you need to connect to a given Hadoop cluster.
The Hadoop distribution must be properly installed to guarantee its interaction
with Talend Studio. The following list presents MapR-related information as an
example.
-
Ensure that you have installed the MapR client on the machine where the Studio is,
and added the MapR client library to the PATH variable of that machine. According
to MapR's documentation, the library or libraries of a MapR client corresponding to
each OS version can be found under MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native.
For example, the library for Windows is lib\native\MapRClient.dll in the MapR
client jar file. For further information, see the following link from MapR: http://www.mapr.com/blog/basic-notes-on-configuring-eclipse-as-a-hadoop-development-environment-for-mapr.
Without adding the specified library or libraries, you may encounter the following
error: no MapRClient in java.library.path.
-
Set the -Djava.library.path argument, for example, in the Job Run VM arguments area
of the Run/Debug view in the [Preferences] dialog box. This argument provides the Studio with the
path to the native library of that MapR client. This allows subscription-based
users to make full use of the Data viewer to view,
locally in the Studio, the data stored in MapR. For further information about how to
set this argument, see the section describing how to view data in the Talend Big Data Getting Started Guide.
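For example, the argument typically takes the form -Djava.library.path=MAPR_INSTALL\hadoop\hadoop-VERSION\lib\native, where MAPR_INSTALL and VERSION stand for the actual installation path and version of your MapR client.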
For further information about how to install a Hadoop distribution, see the manuals
corresponding to the Hadoop distribution you are using.
|
Limitations
|
It is strongly recommended to use this component in a Spark-only Job, that is to say, to
design and run a Spark Job separately from non-Spark components or Jobs. For example, it
is not recommended to use the tRunJob component to
coordinate a Spark Job and a non-Spark Job, or to use the tHDFSPut component along with the Spark components in the same Job.
|