July 30, 2023

tPatternUnmasking properties for Apache Spark Streaming – Docs for ESB 7.x

These properties are used to configure tPatternUnmasking running in the Spark Streaming Job framework.

The Spark Streaming tPatternUnmasking component belongs to the Data Quality family.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected in the Job.

Click Edit schema to make changes to the schema. If the current schema is of the Repository type, three options are available:

  • View schema: choose this option to view the schema only.

  • Change to built-in property: choose this option to change the schema to Built-in for local changes.

  • Update repository connection: choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content window.

The output schema of this component contains one read-only column, ORIGINAL_MARK. This column indicates, by true or false, whether the record is an original record or an unmasked record, respectively.

Built-In: You create and store the schema locally for this component only.

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Modifications

Define in the table what fields to unmask and how to
unmask them:

Use the same settings for the Field type, Values, Path, Range and Date Range columns as the ones used for
masking the input data with the tPatternMasking component.

Column to unmask:
Select the column from the input flow that contains the data to be
unmasked.

Each column is processed sequentially, meaning that data
unmasking operations will be performed on the data from the first
column, the second column, and so on.

In a column, each data field is a fixed-length field,
except the last data field.

For fixed-length fields, each value must contain the
same number of characters, for example: "30001,30002,30003" or "FR,EN".

In a column, the last Enumeration or Enumeration from file data field is a variable-length field.

For variable-length fields, each value might not always
contain the same number of characters, for example: "30001,300023,30003" or "FR,ENG".
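The difference between fixed-length and variable-length value lists can be checked before configuring the table. The following is a minimal sketch, not part of the component: the function name `is_fixed_length` is hypothetical, and it only compares character counts of the comma-separated values.

```python
from typing import List


def split_values(values: str) -> List[str]:
    """Split a comma-separated enumeration such as "FR,EN" into its values."""
    return values.split(",")


def is_fixed_length(values: str) -> bool:
    """Return True when every value has the same number of characters."""
    lengths = {len(value) for value in split_values(values)}
    return len(lengths) == 1


# The examples from the text above: the first list is fixed length,
# the second is variable length because "300023" has six characters.
print(is_fixed_length("30001,30002,30003"))
print(is_fixed_length("30001,300023,30003"))
```

A list that fails this check would only be usable as the last data field of a column.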

Field type: Select the field type the data belongs to.

  • Interval: When selected, set a range of
    numeric values used for masking purposes in the Range field, using the
    following syntax: "<min>,<max>".

    The number of unmasked characters from the
    input data corresponds to the number of characters of the
    maximum value.

    For example, "1,999" will be interpreted as "001,999", which means that
    three characters from the input data will be masked by a
    value randomly selected from the defined range of
    values.

  • Enumeration: When selected, enter a
    comma-separated list of values to be used for masking data
    in the Values field,
    using the following syntax: "value1,value2,value3".

  • Enumeration from file: When
    selected, set the path to the CSV file containing a list of
    values used for masking data in the Path field. The file must contain one value per
    row and each value must be unique.

  • Date pattern (YYYYMMDD): When selected, set a range of
    years in the Date Range field, using the following syntax:
    "<min_year>,<max_year>".

    Years can only have four digits, for example:
    "1900,2100".

    The input dates to be masked must follow the
    YYYYMMDD pattern, for example: 20180101.

    For example, if the input date is 20180101 and the value in
    the Date Range is "1900,2100", 19221221 could be the output date.
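The Interval rule above, where "1,999" is read as "001,999", can be illustrated as follows. This is a sketch under the assumption that the field width is simply the number of digits of the maximum value. The helper names `parse_interval` and `padded_bounds` are hypothetical and are not part of the component.

```python
def parse_interval(range_setting: str):
    """Parse a "<min>,<max>" setting and return (min, max, width).

    width is the number of characters of the maximum value, which is the
    number of input characters covered by this field.
    """
    low_text, high_text = range_setting.split(",")
    low, high = int(low_text), int(high_text)
    if low > high:
        raise ValueError("the minimum must not exceed the maximum")
    return low, high, len(str(high))


def padded_bounds(range_setting: str) -> str:
    """Return the setting with both bounds zero-padded to the same width."""
    low, high, width = parse_interval(range_setting)
    return f"{low:0{width}d},{high:0{width}d}"


print(padded_bounds("1,999"))  # prints 001,999
```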

In the Values, Path, Range and Date Range columns, values must be enclosed in double
quotes.

When the input data is invalid, meaning that a value does not match the pattern defined in
the component, the generated value is null.
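As an illustration of the rule above for the Date pattern (YYYYMMDD) field type, the sketch below returns None for a value that does not match the pattern. It is not the component's implementation: the function name `check_date` is hypothetical, and treating a year outside the Date Range as invalid is an assumption.

```python
from datetime import datetime
from typing import Optional


def check_date(value: str, date_range: str) -> Optional[str]:
    """Return value when it is a valid YYYYMMDD date, otherwise None.

    date_range uses the "<min_year>,<max_year>" syntax. Rejecting years
    outside that range is an assumption made for this sketch.
    """
    min_year, max_year = (int(year) for year in date_range.split(","))
    if len(value) != 8 or not value.isdigit():
        return None
    try:
        parsed = datetime.strptime(value, "%Y%m%d")
    except ValueError:
        # For example month 13 or day 32.
        return None
    if not min_year <= parsed.year <= max_year:
        return None
    return value


print(check_date("20180101", "1900,2100"))    # valid date, returned as is
print(check_date("2018-01-01", "1900,2100"))  # wrong pattern, None
```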

Advanced settings

Method

From this list, select the Format-Preserving Encryption
(FPE) algorithm that was used to mask data, FF1 with AES or FF1 with SHA-2:

The FF1 with AES method is based on the Advanced Encryption Standard in CBC mode. The
FF1 with SHA-2 method depends on the secure hash function HMAC-256.

Java 8u161 is the minimum required version to use the
FF1 with AES method. To use this FPE method with Java versions earlier than 8u161,
download the Java Cryptography Extension (JCE) unlimited strength
jurisdiction policy files from the Oracle website.

Password for FF1 methods

To unmask data, the FF1 with AES
and FF1 with SHA-2 methods require the password
specified in the Password for FF1 methods field
when the data was masked with the tPatternMasking component.

Use tweaks

If tweaks were generated while encrypting
the data, select this check box. When selected, the Column containing tweaks list is displayed. A tweak allows
all the data of a record to be decrypted.

Column containing tweaks

Available when the Use tweaks check box is selected. Select the column
that contains the tweaks. If you do not see it, make sure you have declared in the
input component the tweaks generated by the masking component.

Seed for random generator

Set a random number if you want to generate
the same sample of substitute data in each execution of the Job. The seed is not set by
default.

If you do not set the seed, the component
creates a new random seed for each Job execution. Repeating the execution with a
different seed will result in a different sample being generated.
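The effect of a fixed seed can be shown with any seeded pseudo-random generator. The sketch below uses Python's standard `random` module as an analogy only; it does not reproduce the component's generator, and the function name `sample_values` is hypothetical.

```python
import random


def sample_values(values, count, seed=None):
    """Draw count values at random; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)  # seed=None means a fresh random seed each time
    return [rng.choice(values) for _ in range(count)]


codes = ["30001", "30002", "30003"]

# Two runs with the same seed produce the same sample.
print(sample_values(codes, 5, seed=12345) == sample_values(codes, 5, seed=12345))
```

Without a seed, each call creates a new generator state, so repeated runs are not expected to match.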

Encoding

Select the encoding from the list or select Custom and define it manually. If you select Custom and leave the field empty, the supported
encodings depend on the JVM that you are using. This field is compulsory for the file
encoding.

When you set Field type to Enumeration from file,
define the file path in Path (CSV File).
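Before pointing the component at a values file, it can help to confirm that the file can be read with the chosen encoding and that it contains one unique value per row. The following is a minimal sketch outside of Talend: the function name `load_enumeration` is hypothetical, and skipping blank rows is an assumption of this sketch.

```python
import codecs


def load_enumeration(path, encoding="UTF-8"):
    """Read one value per row and fail if the file contains duplicates."""
    codecs.lookup(encoding)  # raises LookupError for an unknown encoding name
    with open(path, encoding=encoding, newline="") as handle:
        values = [line.rstrip("\r\n") for line in handle]
    values = [value for value in values if value]  # assumption: skip blank rows
    if len(set(values)) != len(values):
        raise ValueError("each value in the file must be unique")
    return values
```

For example, `load_enumeration("countries.csv", "UTF-8")` would return the list of values from a hypothetical file named countries.csv, or raise an error if a value appears twice.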

Output the original row?

Select this check box to output original data rows in addition to the
substitute data. Outputting both the original and substitute data can be useful in debug
or test processes.

Should Null input return NULL?

This check box is selected by
default. When selected, the component outputs null when
input values are null. Otherwise, the component returns
the default value when the input is null, that is an
empty string for string values, 0 for numeric values
and the current date for date values.

If the input is null, the Generate Sequence function will not return null,
even if the check box is selected.
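The null-handling rule described for this check box can be summarized as a small decision function. This is a sketch of the documented behavior only: the function name `handle_null` and the type labels "string", "numeric" and "date" are hypothetical, and the Generate Sequence exception is not modeled.

```python
from datetime import date


def handle_null(value, kind, null_returns_null=True):
    """Return the value to output for a possibly null input.

    kind is one of "string", "numeric" or "date" (labels chosen for this sketch).
    """
    if value is not None:
        return value
    if null_returns_null:
        # Check box selected (the default): null input gives null output.
        return None
    # Check box cleared: fall back to the default value for the type.
    defaults = {"string": "", "numeric": 0, "date": date.today()}
    return defaults[kind]


print(handle_null(None, "string"))                                  # None
print(repr(handle_null(None, "string", null_returns_null=False)))   # ''
```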

Should EMPTY input return EMPTY?

When this check box is selected, empty values are left unchanged in
the output data. Otherwise, the selected functions are applied to the input
data.

Send invalid data to “Invalid” output flow

This check box is selected by default.

  • Selected: When the data can be unmasked, they are sent to the
    main flow. Otherwise, the data are sent to the “Invalid” output flow.
  • Cleared: The data are sent to the main flow.

Invalid data are any values that do not match the pattern.
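The routing rule for this check box can be sketched as follows. This only models which flow a record goes to, not the values the component outputs. The function name `route` and the `is_valid` callback are hypothetical stand-ins for the component's pattern check.

```python
def route(records, is_valid, send_invalid_to_invalid_flow=True):
    """Split records into (main, invalid) flows.

    is_valid is a caller-supplied function that says whether a record
    matches the pattern (a stand-in for the component's own check).
    """
    main, invalid = [], []
    for record in records:
        if is_valid(record) or not send_invalid_to_invalid_flow:
            main.append(record)
        else:
            invalid.append(record)
    return main, invalid


allowed = {"FR", "EN"}
main_flow, invalid_flow = route(["FR", "XX", "EN"], lambda r: r in allowed)
print(main_flow)     # ['FR', 'EN']
print(invalid_flow)  # ['XX']
```

With the check box cleared, the same call with `send_invalid_to_invalid_flow=False` places all three records in the main flow.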

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to,
appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark configuration tab.

    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.

    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS, so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Parent topic: tPatternUnmasking

Document sourced from Talend: https://help.talend.com