tAzureFSConfiguration
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- Spark Batch: see tAzureFSConfiguration properties for Apache Spark Batch. The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Streaming: see tAzureFSConfiguration properties for Apache Spark Streaming. This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
tAzureFSConfiguration properties for Apache Spark Batch
These properties are used to configure tAzureFSConfiguration running in the Spark Batch Job framework.
The Spark Batch
tAzureFSConfiguration component belongs to the Storage family.
The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.
Basic settings
Azure FileSystem |
Select the file system to be used; the parameters to be defined are then displayed accordingly. This component is designed to store your actual user data or business data in the selected file system. |
When you use this component with Azure Blob Storage:
Blob storage account |
Enter the name of the storage account you need to access. A storage account name can be found in the Storage accounts dashboard of the Microsoft Azure portal. Ensure that the administrator of the system has granted you the appropriate access permissions to this storage account. |
Account key |
Enter the key associated with the storage account you need to access. Two keys are available for each storage account and by default, either of them can be used for this access. |
Container |
Enter the name of the blob container to be used. |
When you use this component with Azure Data Lake Storage Gen1:
Data Lake Storage account |
Enter the name of the Data Lake Storage account you need to access. Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account. |
Client ID and Client key |
In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key generated upon the registration of the application that the current Job you are developing uses to access Azure Data Lake Storage. Ensure that the application to be used has appropriate permissions to access Azure Data Lake. You can check this on the Required permissions view of this application on Azure. |
Token endpoint |
In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal. |
When you use this component with Azure Data Lake Storage Gen2:
Data Lake Storage account |
Enter the name of the Data Lake Storage account you need to access. Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account. |
Account key |
Enter the key associated with the storage account you need to access. Two keys are available for each storage account and by default, either of them can be used for this access. |
File system |
In this field, enter the name of the ADLS Gen2 file system to be used. An ADLS Gen2 file system is hierarchical and so compatible with HDFS. |
Create remote file system during initialization |
If the ADLS Gen2 file system to be used does not exist, select this check box to create it on the fly. |
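For reference only, the following minimal PySpark sketch (not generated by the Studio; <storage_account>, <key> and <file_system> are placeholders) shows the Hadoop configuration that the ADLS Gen2 account-key settings above correspond to, based on the Spark properties listed later on this page for Databricks:
    from pyspark.sql import SparkSession

    # Assumes the hadoop-azure (ABFS) connector is available on the classpath.
    spark = (SparkSession.builder
             .appName("adls-gen2-account-key-sketch")
             # Same property as the Databricks setting shown later on this page.
             .config("spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net", "<key>")
             # Equivalent of the "Create remote file system during initialization" check box.
             .config("spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization", "true")
             .getOrCreate())

    # An ADLS Gen2 file system is then addressed with the abfss:// scheme, for example:
    # abfss://<file_system>@<storage_account>.dfs.core.windows.net/some/folder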
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is used standalone in a subJob to provide the file system connection configuration to the whole Job. Only one tAzureFSConfiguration is allowed per Job.
tAzureFSConfiguration does not support SSL access to the target storage, and the output files of Spark cannot be appended to existing files. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
tAzureFSConfiguration properties for Apache Spark Streaming
These properties are used to configure tAzureFSConfiguration running in the Spark Streaming Job framework.
The Spark Streaming
tAzureFSConfiguration component belongs to the Storage family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Azure FileSystem |
Select the file system to be used; the parameters to be defined are then displayed accordingly. This component is designed to store your actual user data or business data in the selected file system. |
When you use this component with Azure Blob Storage:
Blob storage account |
Enter the name of the storage account you need to access. A storage account name can be found in the Storage accounts dashboard of the Microsoft Azure portal. Ensure that the administrator of the system has granted you the appropriate access permissions to this storage account. |
Account key |
Enter the key associated with the storage account you need to access. Two keys are available for each storage account and by default, either of them can be used for this access. |
Container |
Enter the name of the blob container to be used. |
When you use this component with Azure Data Lake Storage Gen1:
Data Lake Storage account |
Enter the name of the Data Lake Storage account you need to access. Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account. |
Client ID and Client key |
In the Client ID and the Client key fields, enter, respectively, the authentication ID and the authentication key generated upon the registration of the application that the current Job you are developing uses to access Azure Data Lake Storage. Ensure that the application to be used has appropriate permissions to access Azure Data Lake. You can check this on the Required permissions view of this application on Azure. |
Token endpoint |
In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you can obtain from the Endpoints list accessible on the App registrations page on your Azure portal. |
When you use this component with Azure Data Lake Storage Gen2:
Data Lake Storage account |
Enter the name of the Data Lake Storage account you need to access. Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account. |
Account key |
Enter the key associated with the storage account you need to access. Two keys are available for each storage account and by default, either of them can be used for this access. |
File system |
In this field, enter the name of the ADLS Gen2 file system to be used. An ADLS Gen2 file system is hierarchical and so compatible with HDFS. |
Create remote file system during initialization |
If the ADLS Gen2 file system to be used does not exist, select this check box to create it on the fly. |
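For reference only, the following minimal PySpark sketch (not generated by the Studio; <storage_account>, <key> and <container> are placeholders) shows the Hadoop configuration that the Azure Blob Storage settings above correspond to, based on the Spark properties listed later on this page for Databricks:
    from pyspark.sql import SparkSession

    # Assumes the hadoop-azure connector is available on the classpath.
    spark = (SparkSession.builder
             .appName("azure-blob-account-key-sketch")
             # Same property as the Databricks setting shown later on this page.
             .config("spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net", "<key>")
             .getOrCreate())

    # A blob container is then addressed with the wasbs:// scheme, for example:
    # wasbs://<container>@<storage_account>.blob.core.windows.net/some/folder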
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string. A Flow variable functions during the execution of a component while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it. For further information about variables, see the Talend Studio User Guide. |
Usage
Usage rule |
This component is used standalone in a subJob to provide the file system connection configuration to the whole Job. Only one tAzureFSConfiguration is allowed per Job. tAzureFSConfiguration does not support SSL access to the target storage, and the output files of Spark cannot be appended to existing files. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.
This connection is effective on a per-Job basis. |
Writing and reading data from Azure Data Lake Storage using Spark (Azure Databricks)
In this scenario, you create a Spark Batch Job using
tAzureFSConfiguration and the Parquet components to write data on
Azure Data Lake Storage and then read the data from Azure.
This scenario applies only to subscription-based Talend products with Big
Data.

The sample data reads as follows:
01;ychen
This data contains a user name and the ID number distributed to this user.
Note that the sample data is created for demonstration purposes only.
Design the data flow of the Job working with Azure and Databricks
-
In the
Integration
perspective of the Studio, create an empty
Spark Batch Job from the Job Designs node in
the Repository tree view. -
In the workspace, enter the name of the component to be used and select this
component from the list that appears. In this scenario, the components are
tAzureFSConfiguration, tFixedFlowInput, tFileOutputParquet,
tFileInputParquet and tLogRow. The tFixedFlowInput component is used to load the sample data into the data flow. In real-world practice, you could use the File input components, as well as the processing components, to design a more sophisticated process to prepare your data for processing. - Connect tFixedFlowInput to tFileOutputParquet using the Row > Main link.
- Connect tFileInputParquet to tLogRow using the Row > Main link.
- Connect tFixedFlowInput to tFileInputParquet using the Trigger > OnSubjobOk link.
-
Leave tAzureFSConfiguration alone without any
connection.
Grant your application access to your ADLS Gen2 filesystem
An Azure subscription is required.
-
Create your Azure Data Lake Storage Gen2 account if you do not have it yet. For more details, see Create an Azure Data Lake Storage Gen2 account from the Azure documentation.
-
Create an Azure Active Directory application on your Azure portal. For more
details about how to do this, see the “Create an Azure Active Directory
application” section in Azure documentation: Use portal to create an Azure Active Directory
application. -
Obtain the application ID, object ID and the client secret of the application
to be used from the portal.-
On the list of the registered applications, click the application you
created and registered in the previous step to display its information
blade. -
Click Overview to open its blade, and from the
top section of the blade, copy the Object ID and
the application ID displayed as Application (client)
ID. Keep them somewhere safe for later use. -
Click Certificates & secrets to open its
blade and then create the authentication key (client secret) to be used
on this blade in the Client secrets
section.
-
Go back to the Overview blade of the application to be
used, click Endpoints on the top of this blade, copy the
value of OAuth 2.0 token endpoint (v1) from the endpoint
list that appears and keep it somewhere safe for later use. -
Set the read and write permissions to the ADLS Gen2 filesystem to be used for
the service principal of your application. It is very likely that the administrator of your Azure system has included
your account and your applications in the group that has access to a given ADLS
Gen2 storage account and a given ADLS Gen2 filesystem. In this case, ask your
administrator to ensure that you have the proper access and then ignore this
step.-
Start your Microsoft Azure Storage Explorer and find your ADLS Gen2
storage account on the Storage Accounts
list. If you have not installed Microsoft Azure Storage Explorer, you can
download it from the Microsoft Azure official site. -
Expand this account and the Blob Containers node
under it; then click the ADLS Gen2 hierarchical filesystem to be used
under this node. The filesystem in this image is for demonstration purposes only.
Create the filesystem to be used under the Blob
Containers node in your Microsoft Azure Storage
Explorer, if you do not have one yet. -
On the blade that is opened, click Manage Access
to open its wizard. -
At the bottom of this wizard, add the object ID of your application to
the Add user or group field and click
Add. -
Select the object ID just added from the Users and
groups list and select all the permissions for
Access and
Default. -
Click Save to validate these changes and close
this wizard.
Adding Azure-specific properties to access the Azure storage system from Databricks
Add the Azure-specific properties to the Spark configuration of your Databricks
cluster so that your cluster can access Azure Storage.
You need to do this only when you want your Talend
Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with
Databricks.
-
Ensure that your Spark cluster in Databricks has been properly created and is
running, and that its version is supported by the Studio. If you use Azure Data Lake Storage Gen2, only Databricks 5.4 is supported. For further information, see Create Databricks workspace from
Azure documentation. - You have an Azure account.
- The Azure Blob Storage or Azure Data Lake Storage service to be used has been
properly created and you have the appropriate permissions to access it. For
further information about Azure Storage, see Azure Storage tutorials from Azure
documentation.
-
On the Configuration tab of your Databricks cluster
page, scroll down to the Spark tab at the bottom of the
page. -
Click Edit to make the fields on this page
editable. -
In this Spark tab, enter the Spark properties regarding the credentials to be used to access your Azure Storage system.
Azure Blob Storage: when you need to use Azure Blob Storage with Azure Databricks, add the following Spark properties:
- The parameter to provide the account key:
  spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net <key>
  Ensure that the account to be used has the appropriate read/write rights and permissions.
- If you need to append data to an existing file, add this parameter:
  spark.hadoop.fs.azure.enable.append.support true
Azure Data Lake Storage (Gen1): when you need to use Azure Data Lake Storage Gen1 with Databricks, add the following Spark properties, one per line:
  spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
  spark.hadoop.dfs.adls.oauth2.client.id <your_app_id>
  spark.hadoop.dfs.adls.oauth2.credential <your_authentication_key>
  spark.hadoop.dfs.adls.oauth2.refresh.url https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token
Azure Data Lake Storage (Gen2): when you need to use Azure Data Lake Storage Gen2 with Databricks, add the following Spark properties, one per line:
- The parameter to provide the account key:
  spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net <key>
  This key is associated with the storage account to be used. You can find it in the Access keys blade of this storage account. Two keys are available for each account and by default, either of them can be used for this access. Ensure that the account to be used has the appropriate read/write rights and permissions.
- If the ADLS file system to be used does not exist yet, add the following parameter:
  spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization true
For further information about how to find your application ID and authentication
key, see Get application ID and authentication
key from the Azure documentation. In the same documentation, you can
also find details about how to find your tenant ID at Get tenant ID. -
-
If you need to run Spark Streaming Jobs with Databricks, in the same
Spark tab, add the following property to define a
default Spark serializer. If you do not plan to run Spark Streaming Jobs, you
can ignore this step:
  spark.serializer org.apache.spark.serializer.KryoSerializer
- Restart your Spark cluster.
-
In the Spark UI tab of your Databricks cluster page,
click Environment to display the list of properties and
verify that each of the properties you added in the previous steps is present on
that list.
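As an optional check, the following sketch (assuming a Python notebook attached to the same cluster; <storage_account> is a placeholder) reads one of the properties back from the Hadoop configuration to confirm it was applied; note that the spark.hadoop. prefix is stripped when the property reaches the Hadoop configuration:
    # _jsc is an internal PySpark handle to the Java SparkContext.
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
    key = hadoop_conf.get("fs.azure.account.key.<storage_account>.dfs.core.windows.net")
    print("account key is set" if key else "account key is missing")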
Defining the Azure Databricks connection parameters for Spark Jobs
Complete the Databricks connection configuration in the Spark configuration tab of the Run view of your Job.
This configuration is effective on a per-Job basis.
- When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
- When running a Spark Batch Job, you can send more than one Job to run in parallel on the same Databricks cluster only if you have selected the Do not restart the cluster when submitting check box; otherwise, since each run automatically restarts the cluster, the Jobs launched in parallel interrupt each other and cause execution failures.
-
From the Cloud provider drop-down list, select
Azure. -
Enter the basic connection information to Databricks.
Standalone
-
In the Endpoint
field, enter the URL address of your Azure Databricks workspace.
This URL can be found in the Overview blade
of your Databricks workspace page on your Azure portal. For example,
this URL could look like https://westeurope.azuredatabricks.net. -
In the Cluster ID
field, enter the ID of the Databricks cluster to be used. This ID is
the value of the
spark.databricks.clusterUsageTags.clusterId
property of your Spark cluster. You can find this property on the
properties list in the Environment tab in the
Spark UI view of your cluster. You can also easily find this ID from
the URL of your Databricks cluster. It is present immediately after
cluster/ in this URL. -
Click the […] button
next to the Token field to enter the
authentication token generated for your Databricks user account. You
can generate or find this token on the User
settings page of your Databricks workspace. For
further information, see Token management from the
Azure documentation. -
In the DBFS dependencies
folder field, enter the directory that is used to
store your Job related dependencies on Databricks Filesystem at
runtime, putting a slash (/) at the end of this directory. For
example, enter /jars/ to store the dependencies
in a folder named jars. This folder is created
on the fly if it does not already exist. -
Poll interval when retrieving Job status (in
ms): enter, without the quotation marks, the time
interval (in milliseconds) at the end of which you want the Studio
to ask Spark for the status of your Job. For example, this status
could be Pending or Running. The default value is 300000, meaning 300 seconds. This interval is recommended by Databricks to correctly retrieve the Job status. -
Use
transient cluster: you can select this check box to
leverage the transient Databricks clusters. The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.
- Autoscale: select or clear this check box to define
the number of workers to be used by your transient cluster.- If you select this check box,
autoscaling is enabled. Then define the minimum number
of workers in Min
workers and the maximum number of
workers in Max
workers. Your transient cluster is
scaled up and down within this scope based on its
workload. According to the Databricks
documentation, autoscaling works best with
Databricks runtime versions 3.0 or onwards. - If you clear this check box, autoscaling
is deactivated. Then define the number of workers a
transient cluster is expected to have. This number does
not include the Spark driver node.
- Node type
and Driver node type:
select the node types for the workers and the Spark driver node.
These types determine the capacity of your nodes and their
pricing by Databricks. For details about
these node types and the Databricks Units they use, see
Supported Instance
Types from the Databricks documentation. - Elastic
disk: select this check box to enable your
transient cluster to automatically scale up its disk space when
its Spark workers are running low on disk space. For more details about this elastic disk
feature, search for the section about autoscaling local
storage from your Databricks documentation. - SSH public
key: if an SSH access has been set up for your
cluster, enter the public key of the generated SSH key pair.
This public key is automatically added to each node of your
transient cluster. If no SSH access has been set up, ignore this
field. For further information about SSH
access to your cluster, see SSH access to
clusters from the Databricks
documentation. - Configure cluster
log: select this check box to define where to
store your Spark logs for the long term. This storage system could
be S3 or DBFS.
- Do not restart the cluster
when submitting: select this check box to prevent
the Studio from restarting the cluster when it submits your Jobs. However, if you make changes in your Jobs, clear this check box so that the Studio restarts your cluster to take these changes
into account.
-
If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.
For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing .
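For context, the following is a minimal sketch of the checkpointing operation this check box enables, written as a standalone PySpark Streaming program with an assumed DBFS checkpoint directory:
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="checkpointing-sketch")
    ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

    # Directory where Spark stores checkpoint data (metadata and generated RDDs);
    # "dbfs:/tmp/checkpoints" is an assumed example path.
    ssc.checkpoint("dbfs:/tmp/checkpoints")

    # Define the DStreams of the application, then call ssc.start() and ssc.awaitTermination().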
Configuring the connection to the Azure Data Lake Storage service to be used by Spark
-
Double-click tAzureFSConfiguration to open its Component view.
Spark uses this component to connect to the Azure Data Lake Storage system to which your Job writes the actual business data.
- From the Azure FileSystem drop-down list, select Azure Datalake Storage to use Data Lake Storage as the target system to be used.
-
In the Datalake storage account field, enter the name of the Data Lake Storage account you need to access.
Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account.
-
In the
Client ID and the Client
key fields, enter, respectively, the authentication
ID and the authentication key generated upon the registration of the
application that the current Job you are developing uses to access
Azure Data Lake Storage. Ensure that the application to be used has appropriate
permissions to access Azure Data Lake. You can check this on the
Required permissions view of this application on Azure. For further
information, see Azure documentation Assign the Azure AD application to
the Azure Data Lake Storage account file or folder. This application must be the one to
which you assigned permissions to access your Azure Data Lake Storage in
the previous step. -
In the
Token endpoint field, copy-paste the
OAuth 2.0 token endpoint that you can obtain from the
Endpoints list accessible on the
App registrations page on your Azure
portal.
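For reference only, the following minimal PySpark sketch (with the same placeholders as the Databricks Spark properties earlier on this page) shows how the Client ID, Client key and Token endpoint values map to the ADLS Gen1 OAuth properties:
    from pyspark.sql import SparkSession

    # Placeholders: <your_app_id>, <your_authentication_key>, <your_app_TENANT-ID>.
    spark = (SparkSession.builder
             .appName("adls-gen1-oauth-sketch")
             .config("spark.hadoop.dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
             .config("spark.hadoop.dfs.adls.oauth2.client.id", "<your_app_id>")  # Client ID
             .config("spark.hadoop.dfs.adls.oauth2.credential", "<your_authentication_key>")  # Client key
             .config("spark.hadoop.dfs.adls.oauth2.refresh.url",
                     "https://login.microsoftonline.com/<your_app_TENANT-ID>/oauth2/token")  # Token endpoint
             .getOrCreate())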
Write the sample data to Azure Data Lake Storage
-
Double-click the tFixedFlowInput component to
open its Component view. - Click the […] button next to Edit schema to open the schema editor.
-
Click the [+] button to add the schema
columns as shown in this image. -
Click OK to validate these changes and accept
the propagation prompted by the pop-up dialog box. -
In the Mode area, select the Use Inline
Content radio button and paste the previously mentioned sample data
into the Content field that is displayed. -
In the Field separator field, enter a
semicolon (;). -
Double-click the tFileOutputParquet component to
open its Component view. -
Select the Define a storage configuration component
check box and then select the tAzureFSConfiguration
component you configured in the previous steps. -
Click Sync columns to ensure that
tFileOutputParquet has the same schema as
tFixedFlowInput. -
In the Folder/File field, enter the name of the Data
Lake storage folder to be used to store the sample data. -
From the Action drop-down list, select
Create if the folder to be used does not exist yet on
Azure Data Lake Storage; if this folder already exists, select Overwrite.
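As a point of comparison, the following minimal PySpark sketch (with a hypothetical adl:// path and assumed column names, since the schema image is not reproduced here) performs the same write that tFileOutputParquet does once tAzureFSConfiguration provides the connection:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-parquet-sketch").getOrCreate()

    # The sample data of this scenario: an ID and a user name.
    df = spark.createDataFrame([("01", "ychen")], ["id", "name"])  # column names are assumed

    # Hypothetical target folder sample_user in an ADLS Gen1 account.
    df.write.mode("overwrite").parquet(
        "adl://<your_adls_account>.azuredatalakestore.net/sample_user")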
Reading the sample data from Azure Data Lake Storage
-
Double-click tFileInputParquet to open its
Component view. -
Select the Define a storage configuration component
check box and then select the tAzureFSConfiguration
component you configured in the previous steps. -
Click the […] button next to Edit
schema to open the schema editor. -
Click the [+] button to add the schema columns for
output as shown in this image. -
Click OK to validate these changes and accept the
propagation prompted by the pop-up dialog box. -
In the Folder/File field, enter the name of the folder
from which you need to read data. In this scenario, it is
sample_user. -
Double-click tLogRow to open its
Component view and select the Table radio button to present the result in a table. - Press F6 to run this Job.
Once the Job is run, you can monitor its execution in the web UI of your Databricks cluster and then check the execution log of your Job.
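Similarly, a minimal PySpark sketch (same hypothetical adl:// path and assumed column names) of the read performed by tFileInputParquet and the table display done by tLogRow:
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet-sketch").getOrCreate()

    # Read back the sample_user folder written in the previous section.
    df = spark.read.parquet(
        "adl://<your_adls_account>.azuredatalakestore.net/sample_user")

    df.show()  # prints the rows as a table, similar to tLogRow in Table mode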
