July 30, 2023

tDataUnmasking properties for Apache Spark Batch – Docs for ESB 7.x

These properties are used to configure tDataUnmasking running
in the Spark Batch Job framework.

The Spark Batch tDataUnmasking component belongs to the Data Quality family.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected in the
Job.

Click Edit schema to make changes to the schema. If the current schema is of the
Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

The output schema of this component contains one read-only column, ORIGINAL_MARK.
This column indicates, with true or false, whether the record is an original record
or a substitute record, respectively.
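
As a minimal sketch of how a downstream step might read this column, the plain-Python
snippet below splits rows on ORIGINAL_MARK. The row contents, the value column and the
split_by_original_mark helper are invented for illustration; only the ORIGINAL_MARK name
and its true/false meaning come from the description above.

```python
# Hypothetical rows; only the ORIGINAL_MARK column name comes from the component.
rows = [
    {"value": "row-a", "ORIGINAL_MARK": True},   # original record
    {"value": "row-b", "ORIGINAL_MARK": False},  # substitute record
]


def split_by_original_mark(rows):
    """Return (originals, substitutes) according to the ORIGINAL_MARK flag."""
    originals = [r for r in rows if r["ORIGINAL_MARK"]]
    substitutes = [r for r in rows if not r["ORIGINAL_MARK"]]
    return originals, substitutes


originals, substitutes = split_by_original_mark(rows)
```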

 

Built-In: You create and store the schema locally for this component
only.

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Modifications

Define in the table what fields to unmask and how to unmask them:

Input Column: Select the column from the input
flow that contains the data to be unmasked.

These modifications are based on the function you select in the Function
column.

Category: Select a category of unmasking functions
from the list.

Function: Select the function that will unmask
data.

The functions you can select from the Function list depend on the data
type of the input column.

Method: From this list, select the Format-Preserving Encryption (FPE) algorithm that
was used to mask the data: FF1 with AES or FF1 with SHA-2.

The FF1 with AES method is based on the Advanced Encryption Standard in CBC mode. The
FF1 with SHA-2 method depends on the secure hash function HMAC-256.

Java 8u161 is the minimum required version to use the FF1 with AES method. To use
this FPE method with Java versions earlier than 8u161, download the Java Cryptography
Extension (JCE) unlimited strength jurisdiction policy files from the Oracle website.

To unmask data, the FF1 with AES and
FF1 with SHA-2 methods require the password
specified in the Password for FF1 methods field when the data was masked
with the tDataMasking component.

When using the Replace all, Replace characters between two positions,
Replace n first digits and Replace n last digits functions with FPE methods, you
can select an alphabet.

Select the alphabet used to mask data with the
tDataMasking component.

Extra Parameter: This field is used by some of the functions and is disabled when not
applicable. When applicable, enter a number or a letter to determine the behavior of
the function you have selected.

Keep format: This option applies only to Strings. Select this check box to keep the
input format when using the Bank Account Unmasking, Credit Card Unmasking,
Phone Unmasking and SSN Unmasking categories. That is to say, if there are spaces,
dots ('.'), hyphens ('-') or slashes ('/') in the input, those characters are kept in
the output. If you select this check box when using Phone Unmasking functions, the
characters from the input that are not numbers are copied to the output as is.
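
The Keep format behavior can be sketched in plain Python as follows. This is only an
illustration of the idea, following the Phone Unmasking description in which every
non-numeric character is copied to the output as is. The names apply_keeping_format and
toy_unmask_digits are hypothetical, and the digit shift is a stand-in for the real
unmasking function, which this sketch does not reproduce.

```python
ASCII_DIGITS = "0123456789"


def apply_keeping_format(value, transform_digits):
    """Transform only the digits of value; every other character stays in place."""
    digits = "".join(ch for ch in value if ch in ASCII_DIGITS)
    transformed = transform_digits(digits)
    if len(transformed) != len(digits):
        raise ValueError("the digit transform must preserve the number of digits")
    replacement = iter(transformed)
    # Walk the input again and swap each digit for the next transformed digit.
    return "".join(next(replacement) if ch in ASCII_DIGITS else ch for ch in value)


def toy_unmask_digits(digits):
    """Stand-in transform (NOT FF1): shift every digit back by one, modulo 10."""
    return "".join(str((int(d) - 1) % 10) for d in digits)


restored = apply_keeping_format("(234) 567-8901", toy_unmask_digits)
print(restored)  # (123) 456-7890
```

Separating the digit transform from the layout handling keeps the separators untouched
regardless of which unmasking function is applied.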

Advanced settings

Password for FF1 methods

To unmask data, the FF1 with AES and FF1 with SHA-2 methods require the
password specified in the Password for FF1 methods field when the data
was masked with the tDataMasking component.

Use tweaks

If tweaks were generated while encrypting the data, select this check box. When
selected, the Column containing tweaks list is displayed. A tweak makes it possible to
decrypt all the data of a record.

Column containing tweaks

Available when the Use tweaks check box is selected. Select the column
that contains the tweaks. If you do not see it, make sure you have declared in the
input component the tweaks generated by the masking component.
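
To illustrate why the same password and the same tweak are needed to recover the data,
here is a toy format-preserving cipher in Python. It is a simplified Feistel construction
keyed with HMAC-SHA-256; it is not the FF1 algorithm used by the component, it does not
reflect how the component derives its key from the password, and it must not be used to
protect real data. All names (toy_mask, toy_unmask, the sample password and tweak) are
hypothetical.

```python
import hashlib
import hmac

ROUNDS = 10  # even, so both halves end up back in their original positions


def _prf(key, tweak, round_no, half):
    """Keyed pseudo-random integer derived from the tweak, the round and one half."""
    msg = tweak + b"|" + bytes([round_no]) + b"|" + half.encode("ascii")
    return int.from_bytes(hmac.new(key, msg, hashlib.sha256).digest(), "big")


def _check(digits):
    if len(digits) < 2 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a string of at least two ASCII digits")


def toy_mask(digits, password, tweak=b""):
    """Encrypt a digit string into another digit string of the same length."""
    _check(digits)
    key = password.encode("utf-8")  # toy key handling, an assumption of this sketch
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in range(ROUNDS):
        m = u if i % 2 == 0 else v
        c = (int(a) + _prf(key, tweak, i, b)) % 10**m
        a, b = b, str(c).zfill(m)
    return a + b


def toy_unmask(digits, password, tweak=b""):
    """Invert toy_mask; this only works with the same password and tweak."""
    _check(digits)
    key = password.encode("utf-8")
    u = len(digits) // 2
    v = len(digits) - u
    a, b = digits[:u], digits[u:]
    for i in reversed(range(ROUNDS)):
        m = u if i % 2 == 0 else v
        c = (int(b) - _prf(key, tweak, i, a)) % 10**m
        a, b = str(c).zfill(m), a
    return a + b


masked = toy_mask("4111111111111111", "s3cret", b"row-42")
restored = toy_unmask(masked, "s3cret", b"row-42")  # equals the original input
```

Calling toy_unmask with another password or tweak still yields a digit string of the
same length, but in practice not the original value.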

Output the original row

Select this check box to output masked data rows in addition to the
original data. Having both data rows can be useful in debug or test
processes.

Should null input return null

This check box is selected by default. When selected, the component
outputs null when input values are null. Otherwise,
it returns the default value when the input is null, that is, an empty
string for string values, 0 for numeric values and
the current date for date values.

Should empty input return empty

When this check box is selected, the component returns the input values
if they are empty. Otherwise, the selected functions are applied to the
input data.
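
The combined effect of the two check boxes above can be sketched as a small decision
function. This is a hedged illustration, not the component's code: handle_input,
default_for and the kind labels are hypothetical names, and unmask stands for whichever
function is selected in the Modifications table. The default state of the empty-input
check box is not stated here, so the sketch requires it explicitly.

```python
from datetime import date


def default_for(kind):
    """Default returned for a null input when null is not passed through."""
    if kind == "string":
        return ""
    if kind == "numeric":
        return 0
    if kind == "date":
        return date.today()
    raise ValueError(f"unknown kind: {kind}")


def handle_input(value, kind, unmask, *, null_returns_null=True, empty_returns_empty):
    """Apply the null and empty rules before calling the unmasking function."""
    if value is None:
        return None if null_returns_null else default_for(kind)
    if value == "" and empty_returns_empty:
        return value  # empty input is returned unchanged
    return unmask(value)  # otherwise the selected function is applied
```

For example, handle_input(None, "numeric", f, null_returns_null=False,
empty_returns_empty=True) returns 0 for any function f, because the function is never
called on a null input.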

Send invalid data to “Invalid” output flow

This check box is selected by default.

  • Selected: When the data can be unmasked, they are sent to the
    main flow. Otherwise, the data are sent to the “Invalid” output flow.
  • Cleared: The data are sent to the main flow.
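
The routing described by this check box can be sketched as follows. The function names
are hypothetical, and the validity test is a toy stand-in that raises ValueError. What
the main flow contains for a row that cannot be unmasked when the check box is cleared
is not specified above; passing the row through unchanged is an assumption of this
sketch.

```python
def route(rows, unmask, send_invalid_to_invalid_flow=True):
    """Return (main_flow, invalid_flow) for the given rows."""
    main_flow, invalid_flow = [], []
    for row in rows:
        try:
            main_flow.append(unmask(row))
        except ValueError:
            if send_invalid_to_invalid_flow:
                invalid_flow.append(row)
            else:
                # Assumption: the row is passed to the main flow unchanged.
                main_flow.append(row)
    return main_flow, invalid_flow


def toy_unmask(value):
    """Toy stand-in: only digit strings count as unmaskable; digits are reversed."""
    if not value.isdigit():
        raise ValueError("cannot be unmasked in this toy example")
    return value[::-1]


selected = route(["123", "abc"], toy_unmask)
cleared = route(["123", "abc"], toy_unmask, send_invalid_to_invalid_flow=False)
```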

The data are considered invalid when:

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job
level as well as at each component level.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend data
integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the Google Storage staging
      bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the
      Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for
      Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write your actual
      business data in the S3 system with Qubole. Without tS3Configuration, this
      business data is written in the Qubole HDFS system and destroyed once you
      shut down your cluster.

    • When using on-premise distributions, use the configuration component
      corresponding to the file system your cluster is using. Typically, this
      system is HDFS, so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Parent topic: tDataUnmasking

Document get from Talend https://help.talend.com