
tAzureFSConfiguration

Provides authentication information for Spark to connect to a given Azure file system.

Depending on the Talend product you are using, this component can be used in one
or both of the following Job frameworks: Spark Batch and Spark Streaming.

tAzureFSConfiguration properties for Apache Spark Batch

These properties are used to configure tAzureFSConfiguration running in the Spark Batch Job framework.

The Spark Batch
tAzureFSConfiguration component belongs to the Storage family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Azure FileSystem

Select the file system to be used. The parameters to be defined are then
displayed accordingly.

This component is designed to store your actual user data or business data in
a Data Lake Storage system; it is not compatible with a Data Lake Storage
that is defined as primary storage in HDInsight. For this reason, if you use
this component with HDInsight, always set Blob storage, not Data Lake Storage,
as the primary storage when you launch your HDInsight cluster.

When you use this component with Azure Blob Storage:

Blob storage account

Enter the name of the storage account you need to access. A storage account
name can be found in the Storage accounts dashboard of the Microsoft Azure Storage
system to be used. Ensure that the administrator of the system has granted you the
appropriate access permissions to this storage account.

Account key

Enter the key associated with the storage account you need to access. Two
keys are available for each account and by default, either of them can be used for
this access.

Container

Enter the name of the blob
container you need to use.
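
For reference, a minimal Java sketch of how these Blob storage settings typically map to Hadoop-level configuration when Spark accesses the storage. The account name, key, and container below are placeholders, not values from this documentation, and the hadoop-azure libraries are assumed to be on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class AzureBlobConfigSketch {
        public static void main(String[] args) {
            // Illustrative only: the Blob storage account and Account key typically
            // translate into a Hadoop property of this form (placeholder values).
            Configuration conf = new Configuration();
            conf.set("fs.azure.account.key.mystorageaccount.blob.core.windows.net",
                     "my-account-key");

            // The Container then appears in the wasbs:// URI used to address the data:
            String path = "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/user/data";
            System.out.println(path);
        }
    }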

When you use this component with Azure Data Lake Storage Gen1:

Data Lake Storage account

Enter the name of the Data Lake Storage account you need to access. Ensure that
the administrator of the system has granted you the appropriate access
permissions to this account.

Client ID and Client key

In the Client ID and the Client key fields, enter, respectively, the
authentication ID and the authentication key generated upon the registration of
the application that the current Job you are developing uses to access Azure
Data Lake Storage.

Ensure that the application to be used has appropriate permissions to access
Azure Data Lake. You can check this on the Required permissions view of this
application on Azure. For further information, see the Azure documentation
Assign the Azure AD application to the Azure Data Lake Storage account file or
folder.

Token endpoint

In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you
can obtain from the Endpoints list accessible on the App registrations page on
your Azure portal.
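
As an illustration, the Client ID, Client key, and Token endpoint correspond to OAuth 2.0 settings of the following kind at the Hadoop level. This is a hedged sketch with placeholder values, not the component's generated code:

    import org.apache.hadoop.conf.Configuration;

    public class AdlsGen1OAuthSketch {
        public static void main(String[] args) {
            // Illustrative placeholders only.
            Configuration conf = new Configuration();
            conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential");
            conf.set("fs.adl.oauth2.client.id", "<application-client-id>");
            conf.set("fs.adl.oauth2.credential", "<client-key>");
            conf.set("fs.adl.oauth2.refresh.url",
                     "https://login.microsoftonline.com/<tenant-id>/oauth2/token");

            // Data is then addressed through an adl:// URI such as:
            String path = "adl://<datalake-account>.azuredatalakestore.net/user/data";
            System.out.println(path);
        }
    }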


When you use this component with Azure Data Lake Storage Gen2:

Data Lake Storage account

Enter the name of the Data Lake Storage account you need to access. Ensure that
the administrator of the system has granted you the appropriate access
permissions to this account.

Account key

Enter the key associated with the storage account you need to access. Two
keys are available for each account and by default, either of them can be used for
this access.

File system

In this field, enter the name of the ADLS Gen2 file system to be used.

An ADLS Gen2 file system is hierarchical and so compatible with HDFS.

Create remote file system during initialization

If the ADLS Gen2 file system to be used does not exist, select this check
box to create it on the fly.
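
As an illustration, the Account key, File system, and Create remote file system during initialization settings correspond to Hadoop-level (ABFS) configuration of the following kind. This is a hedged sketch with placeholder values only:

    import org.apache.hadoop.conf.Configuration;

    public class AdlsGen2ConfigSketch {
        public static void main(String[] args) {
            // Illustrative placeholders only.
            Configuration conf = new Configuration();
            conf.set("fs.azure.account.key.<storage-account>.dfs.core.windows.net",
                     "<account-key>");
            // Counterpart of the "Create remote file system during initialization" check box:
            conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true");

            // Data is then addressed through an abfss:// URI such as:
            String path = "abfss://<filesystem>@<storage-account>.dfs.core.windows.net/user/data";
            System.out.println(path);
        }
    }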

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error
occurs. This is an After variable and it returns a string. This variable
functions only if the Die on error check box is cleared, if the component has
such a check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.
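
For example, a tJava component placed after this one could read the variable as follows. The instance name tAzureFSConfiguration_1 is an assumption and depends on your Job:

    // Inside a tJava component (Talend-generated Java code); ERROR_MESSAGE is an
    // After variable, so read it once the component has finished executing.
    String errorMessage = (String) globalMap.get("tAzureFSConfiguration_1_ERROR_MESSAGE");
    if (errorMessage != null) {
        System.err.println("tAzureFSConfiguration error: " + errorMessage);
    }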

Usage

Usage rule

This component is used standalone in a subJob to provide
connection configuration to your Azure file system for the whole Job.

Only one tAzureFSConfiguration is allowed per Job.

tAzureFSConfiguration does not support SSL access to
Google Cloud Dataproc V1.1.

The output files of Spark cannot be
merged into one file on Azure Data Lake Storage, because this function is not
supported by Azure Data Lake Storage. In addition, this function has been deprecated
in the latest Hadoop
API.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a
given Spark cluster for the whole Job. In addition, since the Job expects its
dependent jar files for execution, you must specify the directory in the file
system to which these jar files are transferred so that Spark can access these
files:

  • When using HDInsight, specify the blob to be used for Job
    deployment in the Windows Azure Storage configuration
    area in the Spark configuration tab.

  • When using Altus, specify the S3 bucket or the Azure
    Data Lake Storage for Job deployment in the Spark
    configuration tab.

This connection is effective on a per-Job basis.

tAzureFSConfiguration properties for Apache Spark Streaming

These properties are used to configure tAzureFSConfiguration running in the Spark Streaming Job framework.

The Spark Streaming
tAzureFSConfiguration component belongs to the Storage family.

This component is available in Talend Real-Time Big Data Platform and Talend Data Fabric.

Basic settings

Azure FileSystem

Select the file system to be used. The parameters to be defined are then
displayed accordingly.

This component is designed to store your actual user data or business data in
a Data Lake Storage system; it is not compatible with a Data Lake Storage
that is defined as primary storage in HDInsight. For this reason, if you use
this component with HDInsight, always set Blob storage, not Data Lake Storage,
as the primary storage when you launch your HDInsight cluster.

When you use this component with Azure Blob Storage:

Blob storage account

Enter the name of the storage account you need to access. A storage account
name can be found in the Storage accounts dashboard of the Microsoft Azure Storage
system to be used. Ensure that the administrator of the system has granted you the
appropriate access permissions to this storage account.

Account key

Enter the key associated with the storage account you need to access. Two
keys are available for each account and by default, either of them can be used for
this access.

Container

Enter the name of the blob
container you need to use.

When you use this component with Azure Data Lake Storage Gen1:

Data Lake Storage account

Enter the name of the Data Lake Storage account you need to access. Ensure that
the administrator of the system has granted you the appropriate access
permissions to this account.

Client ID and Client key

In the Client ID and the Client key fields, enter, respectively, the
authentication ID and the authentication key generated upon the registration of
the application that the current Job you are developing uses to access Azure
Data Lake Storage.

Ensure that the application to be used has appropriate permissions to access
Azure Data Lake. You can check this on the Required permissions view of this
application on Azure. For further information, see the Azure documentation
Assign the Azure AD application to the Azure Data Lake Storage account file or
folder.

Token endpoint

In the Token endpoint field, copy-paste the OAuth 2.0 token endpoint that you
can obtain from the Endpoints list accessible on the App registrations page on
your Azure portal.


When you use this component with Azure Data Lake Storage Gen2:

Data Lake Storage account

Enter the name of the Data Lake Storage account you need to access. Ensure that
the administrator of the system has granted you the appropriate access
permissions to this account.

Account key

Enter the key associated with the storage account you need to access. Two
keys are available for each account and by default, either of them can be used for
this access.

File system

In this field, enter the name of the ADLS Gen2 file system to be used.

An ADLS Gen2 file system is hierarchical and so compatible with HDFS.

Create remote file system during initialization

If the ADLS Gen2 file system to be used does not exist, select this check
box to create it on the fly.

Global Variables

Global Variables

ERROR_MESSAGE: the error message generated by the component when an error
occurs. This is an After variable and it returns a string. This variable
functions only if the Die on error check box is cleared, if the component has
such a check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill in a field or expression with a variable, press Ctrl + Space to access
the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio User Guide.

Usage

Usage rule

This component is used standalone in a
subJob to provide connection configuration to your Azure file system for
the whole Job.

Only one tAzureFSConfiguration is allowed per Job.

tAzureFSConfiguration does not support SSL access to
Google Cloud Dataproc V1.1.

The output files of Spark cannot be
merged into one file on Azure Data Lake Storage, because this function is not
supported by Azure Data Lake Storage. In addition, this function has been deprecated
in the latest Hadoop
API.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a
given Spark cluster for the whole Job. In addition, since the Job expects its
dependent jar files for execution, you must specify the directory in the file
system to which these jar files are transferred so that Spark can access these
files:

  • When using HDInsight, specify the blob to be used for Job
    deployment in the Windows Azure Storage configuration
    area in the Spark configuration tab.

  • When using Altus, specify the S3 bucket or the Azure
    Data Lake Storage for Job deployment in the Spark
    configuration tab.

This connection is effective on a per-Job basis.

Writing and reading data from Azure Data Lake Storage using Spark (Azure Databricks)

In this scenario, you create a Spark Batch Job using
tAzureFSConfiguration and the Parquet components to write data on
Azure Data Lake Storage and then read the data from Azure.

This scenario applies only to subscription-based Talend products with Big Data.

tAzureFSConfiguration_1.png
The sample data contains a user name and the ID number distributed to this user.

Note that the sample data is created for demonstration purposes only.

Design the data flow of the Job working with Azure and Databricks

  1. In the
    Integration
    perspective of the Studio, create an empty
    Spark Batch Job from the Job Designs node in
    the Repository tree view.
  2. In the workspace, enter the name of the component to be used and select this
    component from the list that appears. In this scenario, the components are
    tAzureFSConfiguration, tFixedFlowInput, tFileOutputParquet,
    tFileInputParquet and tLogRow.

    The tFixedFlowInput component is used to load the
    sample data into the data flow. In real-world practice, you could use File
    input components, along with processing components, to design a more
    sophisticated process to prepare your data.
  3. Connect tFixedFlowInput to tFileOutputParquet using the Row > Main link.
  4. Connect tFileInputParquet to tLogRow using the Row > Main link.
  5. Connect tFixedFlowInput to tFileInputParquet using the Trigger > OnSubjobOk link.
  6. Leave tAzureFSConfiguration alone without any
    connection.

Grant your application access to your ADLS Gen2

An Azure subscription is required.

  1. Create your Azure Data Lake Storage Gen2 account if you do not have it
    yet.

  2. Create an Azure Active Directory application on your Azure portal. For more
    details about how to do this, see the “Create an Azure Active Directory
    application” section in the Azure documentation: Use portal to create an
    Azure Active Directory application.
  3. Obtain the application ID, object ID and the client secret of the application
    to be used from the portal.

    1. On the list of the registered applications, click the application you
      created and registered in the previous step to display its information
      blade.
    2. Click Overview to open its blade, and from the
      top section of the blade, copy the Object ID and
      the application ID displayed as Application (client)
      ID. Keep them somewhere safe for later use.
    3. Click Certificates & secrets to open its
      blade and then create the authentication key (client secret) to be used
      on this blade in the Client secrets
      section.
  4. Go back to the Overview blade of the application to be
    used, click Endpoints at the top of this blade, copy the
    value of OAuth 2.0 token endpoint (v1) from the endpoint
    list that appears, and keep it somewhere safe for later use.
  5. Set the read and write permissions to the ADLS Gen2 filesystem to be used for
    the service principal of your application.

    It is very likely that the administrator of your Azure system has included
    your account and your applications in the group that has access to a given ADLS
    Gen2 storage account and a given ADLS Gen2 filesystem. In this case, ask your
    administrator to ensure that you have the proper access and then ignore this
    step.
    1. Start your Microsoft Azure Storage Explorer and find your ADLS Gen2
      storage account on the Storage Accounts
      list.

      If you have not installed Microsoft Azure Storage Explorer, you can
      download it from the Microsoft Azure official site.
    2. Expand this account and the Blob Containers node
      under it; then click the ADLS Gen2 hierarchical filesystem to be used
      under this node.

      tAzureFSConfiguration_2.png

      The filesystem in this image is for demonstration purposes only.
      Create the filesystem to be used under the Blob
      Containers
      node in your Microsoft Azure Storage
      Explorer, if you do not have one yet.

    3. On the blade that is opened, click Manage Access
      to open its wizard.
    4. At the bottom of this wizard, add the object ID of your application to
      the Add user or group field and click
      Add.
    5. Select the object ID just added from the Users and
      groups list and select all the permissions for
      Access and Default.
    6. Click Save to validate these changes and close
      this wizard.

Adding Azure specific properties to access the Azure storage system from Databricks

Add the Azure specific properties to the Spark configuration of your Databricks
cluster so that your cluster can access Azure Storage.

You need to do this only when you want your Talend
Jobs for Apache Spark to use Azure Blob Storage or Azure Data Lake Storage with
Databricks.

  • Ensure that your Spark cluster in Databricks has been properly created, is
    running, and that its version is supported by the Studio. If you use Azure
    Data Lake Storage Gen2, only Databricks 5.4 is supported.

    For further information, see Create Databricks workspace from
    the Azure documentation.

  • You have an Azure account.
  • The Azure Blob Storage or Azure Data Lake Storage service to be used has been
    properly created and you have the appropriate permissions to access it. For
    further information about Azure Storage, see Azure Storage tutorials from Azure
    documentation.
  1. On the Configuration tab of your Databricks cluster
    page, scroll down to the Spark tab at the bottom of the
    page.

    tAzureFSConfiguration_3.png

  2. Click Edit to make the fields on this page
    editable.
  3. In this Spark tab, enter the Spark properties regarding
    the credentials to be used to access your Azure Storage system; a sketch of
    the properties involved is given after this list.

    Azure Blob Storage

    When you need to use Azure Blob Storage with Azure Databricks, add the
    following Spark properties:

    • The parameter that provides the account key.

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If you need to append data to an existing file, the parameter that enables append support.

    Azure Data Lake Storage (Gen 1)

    When you need to use Azure Data Lake Storage Gen1 with Databricks, add the
    following Spark properties, one per line.

    Azure Data Lake Storage (Gen 2)

    When you need to use Azure Data Lake Storage Gen2 with Databricks,
    add the following Spark properties, one per line:

    • The parameter that provides the account key.

      This key is associated with the storage account to be used.
      You can find it in the Access keys
      blade of this storage account. Two keys are available for
      each account and by default, either of them can be used for
      this access.

      Ensure that the account to be used has the appropriate read/write rights and permissions.

    • If the ADLS file system to be used does not exist yet, the parameter that
      creates it during initialization.

    For further information about how to find your application ID and authentication
    key, see Get application ID and authentication key from the Azure
    documentation. In the same documentation, you can also find details about how
    to find your tenant ID at Get tenant ID.
  4. If you need to run Spark Streaming Jobs with Databricks, in the same
    Spark tab, add the property that defines a default Spark serializer (also
    included in the sketch after this list). If you do not plan to run Spark
    Streaming Jobs, you can ignore this step.

  5. Restart your Spark cluster.
  6. In the Spark UI tab of your Databricks cluster page,
    click Environment to display the list of properties and
    verify that each of the properties you added in the previous steps is present on
    that list.
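
The exact property lines are not reproduced on this page. As a hedged sketch only, the properties involved are typically of the following kind; in the Databricks Spark tab they are entered as space-separated key value lines rather than Java code, and every value in angle brackets is a placeholder:

    import org.apache.spark.SparkConf;

    public class DatabricksAzurePropertiesSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                // Azure Blob Storage: account key (plus append support if needed)
                .set("spark.hadoop.fs.azure.account.key.<storage_account>.blob.core.windows.net",
                     "<account_key>")
                // Azure Data Lake Storage Gen1: OAuth 2.0 service principal credentials
                .set("spark.hadoop.dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
                .set("spark.hadoop.dfs.adls.oauth2.client.id", "<application_id>")
                .set("spark.hadoop.dfs.adls.oauth2.credential", "<authentication_key>")
                .set("spark.hadoop.dfs.adls.oauth2.refresh.url",
                     "https://login.microsoftonline.com/<tenant_id>/oauth2/token")
                // Azure Data Lake Storage Gen2: account key and on-the-fly filesystem creation
                .set("spark.hadoop.fs.azure.account.key.<storage_account>.dfs.core.windows.net",
                     "<account_key>")
                .set("spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization", "true")
                // Default serializer needed for Spark Streaming Jobs (step 4)
                .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
            System.out.println(conf.toDebugString());
        }
    }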

Defining the Azure Databricks connection parameters for Spark Jobs

Complete the Databricks connection configuration in the Spark
configuration
tab of the Run view of your Job.
This configuration is effective on a per-Job basis.

  1. When running a Spark Streaming Job, only one Job is allowed to run on the same Databricks cluster at a time.
  2. When running a Spark Batch Job, you can send more than one Job to run in parallel
    on the same Databricks cluster only if you have selected the Do not restart the cluster
    when submitting check box; otherwise, since each run automatically restarts the
    cluster, Jobs launched in parallel interrupt each other and cause execution failures.
  1. From the Cloud provider drop-down list, select
    Azure.
  2. Enter the basic connection information to Databricks.

    Standalone

    • In the Endpoint
      field, enter the URL address of your Azure Databricks workspace.
      This URL can be found in the Overview blade
      of your Databricks workspace page on your Azure portal. For example,
      this URL could look like https://westeurope.azuredatabricks.net.

    • In the Cluster ID
      field, enter the ID of the Databricks cluster to be used. This ID is
      the value of the
      spark.databricks.clusterUsageTags.clusterId
      property of your Spark cluster. You can find this property on the
      properties list in the Environment tab in the
      Spark UI view of your cluster.

      You can also easily find this ID from
      the URL of your Databricks cluster. It is present immediately after
      cluster/ in this URL.

    • Click the […] button
      next to the Token field to enter the
      authentication token generated for your Databricks user account. You
      can generate or find this token on the User
      settings
      page of your Databricks workspace. For
      further information, see Token management from the
      Azure documentation.

    • In the DBFS dependencies folder field, enter the directory that is used to
      store your Job related dependencies on Databricks Filesystem at
      runtime, putting a slash (/) at the end of this directory. For
      example, enter /jars/ to store the dependencies
      in a folder named jars. This folder is created
      on the fly if it does not exist.

    • Poll interval when retrieving Job status (in ms): enter, without the
      quotation marks, the time interval (in milliseconds) at the end of which
      you want the Studio to ask Spark for the status of your Job. For example,
      this status could be Pending or Running.

      The default value is 300000, that is, 300 seconds (5 minutes). This
      interval is recommended by Databricks to correctly retrieve the Job
      status.

    • Use transient cluster: you can select this check box to
      leverage the transient Databricks clusters.

      The custom properties you defined in the Advanced properties table are automatically taken into account by the transient clusters at runtime.

      1. Autoscale: select or clear this check box to define
        the number of workers to be used by your transient cluster.

        1. If you select this check box,
          autoscaling is enabled. Then define the minimum number
          of workers in Min workers and the maximum number of
          workers in Max workers. Your transient cluster is
          scaled up and down within this range based on its
          workload.

          According to the Databricks
          documentation, autoscaling works best with
          Databricks runtime versions 3.0 or later.

        2. If you clear this check box, autoscaling
          is deactivated. Then define the number of workers a
          transient cluster is expected to have. This number does
          not include the Spark driver node.
      2. Node type and Driver node type:
        select the node types for the workers and the Spark driver node.
        These types determine the capacity of your nodes and their
        pricing by Databricks.

        For details about these node types and the Databricks Units they use,
        see Supported Instance Types from the Databricks documentation.

      3. Elastic disk: select this check box to enable your
        transient cluster to automatically scale up its disk space when
        its Spark workers are running low on disk space.

        For more details about this elastic disk feature, search for the
        section about autoscaling local storage in your Databricks
        documentation.

      4. SSH public key: if an SSH access has been set up for your
        cluster, enter the public key of the generated SSH key pair.
        This public key is automatically added to each node of your
        transient cluster. If no SSH access has been set up, ignore this
        field.

        For further information about SSH access to your cluster, see
        SSH access to clusters from the Databricks documentation.

      5. Configure cluster log: select this check box to define where to
        store your Spark logs for the long term. This storage system could
        be S3 or DBFS.
    • Do not restart the cluster when submitting: select this check box to
      prevent the Studio from restarting the cluster when it submits your
      Jobs. However, if you make changes in your Jobs, clear this check
      box so that the Studio restarts your cluster to take these changes
      into account.

If you need the Job to be resilient to failure, select the Activate checkpointing check box to enable the
Spark checkpointing operation. In the field that is displayed, enter the
directory in which Spark stores, in the file system of the cluster, the context
data of the computations such as the metadata and the generated RDDs of this
computation.

For further information about the Spark checkpointing operation, see http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing.
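
To illustrate what this option enables at the Spark level, the following hedged sketch shows a checkpoint directory being set on a Java streaming context; the application name and directory are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class CheckpointingSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("checkpointing-sketch")
                    .setMaster("local[2]");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            // Spark periodically saves the metadata and generated RDDs of the
            // computation to this directory so the Job can recover from failures.
            jssc.checkpoint("/tmp/spark-checkpoints");
            jssc.stop();
        }
    }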

Configuring the connection to the Azure Data Lake Storage service to be used by Spark

  1. Double-click tAzureFSConfiguration to open its Component view.

    Spark uses this component to connect to the Azure Data Lake Storage system to which your Job writes the actual business data.
  2. From the Azure FileSystem drop-down list, select Azure Datalake Storage to use Data Lake Storage as the target system.
  3. In the Datalake storage account field, enter the name of the Data Lake Storage account you need to access.

    Ensure that the administrator of the system has granted your Azure account the appropriate access permissions to this Data Lake Storage account.
  4. In the Client ID and the Client key fields, enter, respectively, the
    authentication ID and the authentication key generated upon the
    registration of the application that the current Job you are developing
    uses to access Azure Data Lake Storage.

    Ensure that the application to be used has appropriate
    permissions to access Azure Data Lake. You can check this on the
    Required permissions view of this application on Azure. For further
    information, see the Azure documentation Assign the Azure AD application to
    the Azure Data Lake Storage account file or folder.

    This application must be the one to
    which you assigned permissions to access your Azure Data Lake Storage in
    the previous step.

  5. In the Token endpoint field, copy-paste the
    OAuth 2.0 token endpoint that you can obtain from the
    Endpoints list accessible on the
    App registrations page on your Azure
    portal.

Writing the sample data to Azure Data Lake Storage

  1. Double-click the tFixedFlowInput component to
    open its Component view.

    tAzureFSConfiguration_4.png

  2. Click the […] button next to Edit schema to open the schema editor.
  3. Click the [+] button to add the schema
    columns as shown in this image.

    tAzureFSConfiguration_5.png

  4. Click OK to validate these changes and accept
    the propagation prompted by the pop-up dialog box.
  5. In the Mode area, select the Use Inline
    Content
    radio button and paste the previously mentioned sample data
    into the Content field that is displayed.
  6. In the Field separator field, enter a
    semicolon (;).
  7. Double-click the tFileOutputParquet component to
    open its Component view.

    tAzureFSConfiguration_6.png
  8. Select the Define a storage configuration component
    check box and then select the tAzureFSConfiguration
    component you configured in the previous steps.
  9. Click Sync columns to ensure that
    tFileOutputParquet has the same schema as
    tFixedFlowInput.
  10. In the Folder/File field, enter the name of the Data
    Lake storage folder to be used to store the sample data.
  11. From the Action drop-down list, select
    Create if the folder to be used does not exist yet on
    Azure Data Lake Storage; if this folder already exists, select Overwrite.
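
For reference, what tFixedFlowInput and tFileOutputParquet do here corresponds roughly to the following Spark operations. This is a hedged Java sketch: the column names (id, name), the input file, and the adl:// path are assumptions for illustration, since the actual schema and folder come from your own configuration, and the Azure connection is assumed to be configured as in the previous steps:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class WriteParquetSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("write-parquet-sketch")
                    .getOrCreate();

            // Stand-in for the inline sample data loaded by tFixedFlowInput
            // (semicolon-separated records); the file path is a placeholder.
            Dataset<Row> users = spark.read()
                    .option("sep", ";")
                    .csv("/tmp/sample_user.csv")
                    .toDF("id", "name");

            // Rough equivalent of tFileOutputParquet with Action set to Overwrite,
            // writing to the configured Data Lake Storage folder.
            users.write()
                    .mode(SaveMode.Overwrite)
                    .parquet("adl://<datalake-account>.azuredatalakestore.net/sample_user");
        }
    }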

Reading the sample data from Azure Data Lake Storage

  1. Double-click tFileInputParquet to open its
    Component view.

    tAzureFSConfiguration_7.png

  2. Select the Define a storage configuration component
    check box and then select the tAzureFSConfiguration
    component you configured in the previous steps.
  3. Click the […] button next to Edit
    schema
    to open the schema editor.
  4. Click the [+] button to add the schema columns for
    output as shown in this image.

    tAzureFSConfiguration_8.png

  5. Click OK to validate these changes and accept the
    propagation prompted by the pop-up dialog box.
  6. In the Folder/File field, enter the name of the folder
    from which you need to read data. In this scenario, it is
    sample_user.
  7. Double-click tLogRow to open its
    Component view and select the Table radio button to present the result in a table.
  8. Press F6 to run this Job.
Once done, you can find your Job on the Job page on the Web UI
of your Databricks cluster and then check the execution log of your Job.

tAzureFSConfiguration_9.png

