August 15, 2023

tDataMasking – Docs for ESB 6.x

tDataMasking

Hides original data with random characters or figures to protect the actual data
while having a functional substitute for occasions when it is not advisable to show
sensitive real data.

The data still looks real and consistent and remains usable for purposes such as
testing and training. Masking is most often needed for data that contains
Personally Identifiable Information (PII) or Sensitive Personal Data (SPD).
For further information, see Function behavior in common PII.

tDataMasking reads a data set row by
row and creates a structurally similar but inauthentic version of the
data after having applied specific functions on data fields. It
generates one row for each input row.

Depending on the Talend solution you
are using, this component can be used in one, some or all of the following Job
frameworks:

Function behavior in common PII

What is sensitive data

The definition of sensitive data is broad and may differ from one country or
organization to another. Basically, sensitive data can be
personal information or business information, and includes anything that poses a
risk to the person or company in question.

Globally, credit and debit card data, for example, is considered to be sensitive. An
employee's salary details and any information that can be used to identify or locate
a person can also be considered sensitive data. A non-exhaustive list of personal
sensitive data may include: first and last names, email addresses, addresses,
Social Security Number (SSN), credit card numbers, bank account numbers, race,
gender, date of birth, salary and geolocation combined with time.

For further information about personal sensitive data, check Personally Identifiable Information.

Also, business sensitive data may include trade secrets, acquisition plans,
financial data and customer information, among other possibilities.

Functions and common PII

There are several functions in the tDataMasking component which vary according to the type of the data
column.

It is advisable to use the functions predefined in the component with
columns that hold personal information, such as first and last names, email addresses,
addresses, SSN, credit card numbers, bank account numbers, race, gender, date of birth
and salary.

Functions that are not self-explanatory are explained in the table
below:

Function

Description

Set to null

This function returns null. It requires no
parameter.

Date Variance

This function applies only to Date values. It takes a
parameter, which must be a number and represents a number of
days. It modifies the input date by adding or subtracting a random number of
days bounded by the parameter.

For example, if the input date is
15-02-1992 and the parameter is
10, then the generated date is randomly selected
between 05-02-1992 (15 – 10) and
25-02-1992 (15 + 10).

If the input date is null, then the function returns the current date.

If the given parameter is 0 or
null, or if it is not a number, the parameter
is replaced by 31. For example, if the input date is
05-11-2016, the generated date is randomly
selected between 05-10-2016 (31 days before the input date)
and 06-12-2016 (31 days after the input date).
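
The behavior described for Date Variance can be sketched in Java as follows. This is only an illustration, not the component's code: the class and method names are hypothetical, and the sketch assumes that both bounds are inclusive and that a negative parameter is treated like its absolute value, neither of which is stated above.

```java
import java.time.LocalDate;
import java.util.Random;

class DateVarianceSketch {
    // Shifts the date by a random number of days in [-maxDays, +maxDays].
    // Assumption: both bounds are inclusive; the text does not say.
    static LocalDate vary(LocalDate input, int maxDays, Random rnd) {
        if (input == null) {
            return LocalDate.now();              // null input: current date
        }
        // 0 falls back to 31; taking the absolute value is an assumption.
        int bound = (maxDays == 0) ? 31 : Math.abs(maxDays);
        int shift = rnd.nextInt(2 * bound + 1) - bound;
        return input.plusDays(shift);
    }

    public static void main(String[] args) {
        LocalDate input = LocalDate.of(1992, 2, 15);
        // Prints a date between 1992-02-05 and 1992-02-25.
        System.out.println(vary(input, 10, new Random(12345678L)));
    }
}
```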

Keep year and set day and month to 01/01

This function applies only to Date values. It requires no
parameter. It sets the month and day of the input date to January 1 but
does not change the year.

For example, if the input date is
15-02-1992, the function returns
01-01-1992. If the input date is null, the function
returns January 1 of the current year, for example
01-01-2017.

Generate Account Number

This function generates a valid French bank account number.
It requires no parameter and only applies on String values.

A French IBAN is a 27-character code. The characters are
randomly generated but must satisfy checksum algorithms: the last two digits
of the IBAN, known as the “clef RIB”, are computed with one algorithm, and the third and
fourth characters of the IBAN (the check digits) are computed with another.
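
As an illustration of what "valid" means for the IBAN check digits, the sketch below implements the standard IBAN mod-97 check. The text above does not name the algorithms the component uses, so treat this as an assumption; the class name is hypothetical and the sample value is a commonly published example IBAN, not component output. The "clef RIB" check is not covered here.

```java
import java.math.BigInteger;

class IbanCheckSketch {
    // Standard IBAN check: move the first four characters to the end,
    // map letters to numbers (A=10 ... Z=35), and test value mod 97 == 1.
    static boolean isValidIban(String iban) {
        String compact = iban.replace(" ", "").toUpperCase();
        if (compact.length() < 5) {
            return false;
        }
        String rearranged = compact.substring(4) + compact.substring(0, 4);
        StringBuilder digits = new StringBuilder();
        for (char c : rearranged.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digits.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                digits.append(c - 'A' + 10);   // appended as a decimal number
            } else {
                return false;
            }
        }
        BigInteger value = new BigInteger(digits.toString());
        return value.mod(BigInteger.valueOf(97)).intValue() == 1;
    }

    public static void main(String[] args) {
        System.out.println(isValidIban("FR14 2004 1010 0505 0001 3M02 606")); // true
        System.out.println(isValidIban("FR14 2004 1010 0505 0001 3M02 607")); // false
    }
}
```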

Generate Account Number and keep original country

This function works like Generate Account Number, but it generates a valid
bank account number for the original country.

If the input is a correct IBAN, the function generates
an IBAN from the same country as the input, taking into account the
IBAN format, which differs from one country to another. If the input
is a correct American account number, the function keeps the first nine
digits and randomly replaces the others.

Generate Credit Card

This function generates a valid credit card number. It
requires no parameter and can be applied on String or Long values.

Three types of credit card numbers can be generated:
Visa, MasterCard or American Express. One of these types is chosen at random,
and the generated number passes the algorithms used to detect
invalid credit card numbers.
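
The text does not say which validity algorithms are meant. The most widely used check for card numbers is the Luhn checksum, so the sketch below assumes that is at least one of them. The class name is hypothetical and the sample value is a well-known test number, not component output.

```java
class LuhnSketch {
    // Luhn checksum: starting from the right, double every second digit,
    // subtract 9 from results above 9, and require the total to end in 0.
    static boolean passesLuhn(String number) {
        int sum = 0;
        boolean doubleIt = false;
        for (int i = number.length() - 1; i >= 0; i--) {
            char c = number.charAt(i);
            if (c < '0' || c > '9') {
                return false;                  // digits only in this sketch
            }
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return number.length() > 0 && sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("4111111111111111")); // true
        System.out.println(passesLuhn("4111111111111112")); // false
    }
}
```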

Generate credit card and keep original bank

This function works like Generate Credit Card, but it generates a valid
credit card number for the original bank.

If the input is a correct Visa, MasterCard or American
Express credit card number, the function generates a credit card number from
the same company and keeps the prefix. Otherwise, the function has the same
behavior as Generate Credit Card.

Generate from Pattern

This function is applied only on Strings and it requires a
parameter.

It generates a value that matches the pattern given as
parameter. The pattern must follow the below rules:

– the A character is replaced by a random
upper case letter.

– the a character is replaced by a random
lower case letter.

– the 9 figure is replaced by a random
digit.

– all other characters are kept as they are.

You can generate several strings with the same argument
(value) by using \1 in the pattern.

For example, if the given pattern is Aaaaa.Aaaaa99\1,@gmail.com, the function generates something
like Dsdf.Ksknt12@gmail.com. The @gmail.com value will be
kept unchanged.

This function does not work correctly if a comma ‘,’ is used
in the pattern.
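
The three substitution rules (A, a and 9) can be sketched in Java as shown below. This is an illustrative sketch with hypothetical names, not the component's implementation, and it deliberately leaves out the \1 back-reference handling, whose exact semantics are not detailed above.

```java
import java.util.Random;

class PatternSketch {
    // Replaces A, a and 9 with a random upper-case letter, lower-case
    // letter and digit respectively; every other character is kept.
    static String generate(String pattern, Random rnd) {
        StringBuilder out = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            switch (c) {
                case 'A':
                    out.append((char) ('A' + rnd.nextInt(26)));
                    break;
                case 'a':
                    out.append((char) ('a' + rnd.nextInt(26)));
                    break;
                case '9':
                    out.append((char) ('0' + rnd.nextInt(10)));
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints something shaped like "Dsdfk.Ksknt12".
        System.out.println(generate("Aaaaa.Aaaaa99", new Random()));
    }
}
```

Note that with these rules a literal such as @gmail.com cannot be written directly in the pattern, because its letter a would itself be replaced; this is presumably why the example above passes it as a separate value.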

Generate Phone Number

This function is applied only on Strings and requires no
parameter.

It generates a random phone number from different countries
(France, Germany, Japan, UK and US).

Generate Social Security Number (SSN)

This function is only used on Strings and requires no
parameter. It generates a correct random SSN for different countries (China,
France, Germany, India, Japan, UK and US).

Generate unique SSN

This function is only used on Strings and requires no
parameter. It generates a correct unique random SSN related to the input for
different countries (China, France, Germany, India, Japan, UK and US).

That is to say, if there are duplicates in the input data,
you will get the same duplicates in the generated SSNs. In the same way, if
there are no duplicates in the input data, there will be no duplicates in
the generated SSNs.

If the input is null or in a wrong format, the function
returns an empty String.

Generate Sequence

Note:

This function is not supported in the Spark version
of the component.

This function can be applied on everything that is not a date
(Integer, Long, Strings and so on). It requires a parameter that must be a
number. This function returns the parameter, and, for each row, will
increase this number by 1. If the parameter is not a
number, it is set to 0.

Generate Uuid

This function is only applied on Strings and requires no
parameter. It replaces the input value by a randomly generated UUID.

This function uses the UUID.randomUUID() method provided by Java. No seed is
used, so if you run the Job twice, the generated UUIDs
will be different.
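
A minimal Java sketch of this behavior, with a hypothetical class and method name: the input value is ignored and each call returns a fresh random UUID.

```java
import java.util.UUID;

class UuidSketch {
    // The input is discarded; UUID.randomUUID() takes no seed, so two
    // calls (or two runs) give different values.
    static String mask(String ignoredInput) {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(mask("customer-42"));
        System.out.println(mask("customer-42")); // differs from the line above
    }
}
```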

Generate value between two values

This function generates a value randomly chosen between two
values you give as argument. The argument must be a string holding the
two bounds, min and max, separated by a comma.

This function can be applied to any type of field. However,
if the field is a date, the bounds must also be dates and they must have the
same format as in the schema, dd-MM-yyyy for example. Otherwise, the bounds
must be integers.

If the input is of Date type, the function returns the
current date if the parameter is not in the right format. Otherwise, it
returns an empty string for string values and 0 for numeric
values.

Keep characters between two positions

This function can be used on Strings and requires two
parameters separated by a comma.

The two parameters represent the positions of two characters
in the input. The function returns a new String that contains only those
characters and everything in between.

If the input is null or if the parameter is in a wrong
format, the function returns an empty String. If the lower bound is lower
than 1, it is set to 1, and if the higher bound is greater than the
length of the string, it is set to this length. The two parameters can be
given in any order: if the argument is 4, 2, it is
replaced by 2, 4. For example, if the input is
Steven and the argument is 4, 2, the result is tev.

Remove Characters between two positions

This function has the same behavior as Keep characters between two positions, except that
it removes the selected characters instead of keeping them.

Replace characters between two positions

This function has the same behavior as Keep characters between two positions, except that
it replaces the selected characters instead of keeping them.

When using Replace characters between two positions, you can enter a third
parameter, which is the character used for replacing the elements in the input.
If you do not enter a third parameter, each character is replaced with a
randomly selected character.

For example, if the input is Steven
and the argument is 2, 4, X, the result will be
SXXXen.
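
The three position-based functions share the same bound handling, which can be sketched in Java as follows. The names are hypothetical, positions are 1-based as in the examples above, and the behavior when both positions fall outside the string is an assumption.

```java
class PositionSketch {
    // Orders the two 1-based positions and clamps them to the string.
    static int[] bounds(String input, int a, int b) {
        int lo = Math.max(1, Math.min(a, b));
        int hi = Math.min(input.length(), Math.max(a, b));
        return new int[] {lo, hi};
    }

    static String keep(String input, int a, int b) {
        if (input == null) return "";
        int[] r = bounds(input, a, b);
        return r[0] > r[1] ? "" : input.substring(r[0] - 1, r[1]);
    }

    static String remove(String input, int a, int b) {
        if (input == null) return "";
        int[] r = bounds(input, a, b);
        if (r[0] > r[1]) return input;         // nothing to remove (assumption)
        return input.substring(0, r[0] - 1) + input.substring(r[1]);
    }

    static String replace(String input, int a, int b, char with) {
        if (input == null) return "";
        int[] r = bounds(input, a, b);
        if (r[0] > r[1]) return input;         // nothing to replace (assumption)
        StringBuilder sb = new StringBuilder(input);
        for (int i = r[0] - 1; i < r[1]; i++) {
            sb.setCharAt(i, with);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(keep("Steven", 4, 2));         // tev
        System.out.println(remove("Steven", 2, 4));       // Sen
        System.out.println(replace("Steven", 2, 4, 'X')); // SXXXen
    }
}
```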

Keep n first digits and replace following ones

This function is used on Strings, Integers and Long values
and requires a number as a parameter.

If the parameter is n, the function
keeps the first n digits of the input and then replaces all
the digits that follow by other digits. Anything that is not a digit is
not changed. A null input makes the function return an empty string or
0.

If the parameter is bigger than the input length, no
modifications are applied.
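
A minimal Java sketch of this rule for String inputs, with hypothetical names. It counts digits only, so separators such as hyphens do not count toward n. Unlike the description above, this sketch does not prevent a random digit from happening to equal the original one.

```java
import java.util.Random;

class KeepFirstDigitsSketch {
    // Keeps the first n digits, replaces later digits with random digits,
    // and copies every non-digit character unchanged.
    static String mask(String input, int n, Random rnd) {
        if (input == null) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        int digitsSeen = 0;
        for (char c : input.toCharArray()) {
            if (c >= '0' && c <= '9') {
                digitsSeen++;
                out.append(digitsSeen <= n ? c : (char) ('0' + rnd.nextInt(10)));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints a value shaped like "01-dd-dd" where d is a random digit.
        System.out.println(mask("01-23-45", 2, new Random()));
    }
}
```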

Keep n last digits and replace previous ones

This function is the counterpart of Keep n first digits and replace following ones.

Mask Address

This function can only be used on String values. It replaces
digits by other digits and everything else by X.

Moreover, there is a list of key words that will not be
transformed: Rue, rue, r., strasse, Strasse, Street, street, St.,
St, Straße, Strada, Rua, Calle, Ave., avenue, Av., Allée, allée, alle,
Avenue, Avenida, Bvd., Bd., Boulevard, boulevard, Blv., Viale, Avenida,
Bulevar, Route, route, road, Road, Rd., Chemin, Way, Cour, Court, Ct.,
Place, place, Pl., Square, Impasse, Alle, Driveway, Auffahrt, Viale,
Esplanade, Esplanade, Promenade, Lungomare, Esplanada, Esplanada,
Faubourg, faubourg, Suburb, Vorort, Periferia, Subúrbio, Suburbio, Via,
Via, industrial, area, zone, industrielle, Périphérique, Peripheral,
Voie, voie, Track, Gleis, Carreggiata, Caminho, Pista, Forum, STREET,
RUE, ST., AVENUE, BOULEVARD, BLV., BD, ROAD, ROUTE, RD., RTE, WAY,
CHEMIN, COURT, CT., SQUARE, DRIVEWAY, ALLEE, DR., ESPLANADE, SUBURB,
BANLIEUE, VIA, PERIPHERAL, PERIPHERIQUE, TRACK, VOIE, FORUM, INDUSTRIAL,
AREA, ZONE, INDUSTRIELLE.

You can give a parameter, which can be either a list of key
words to be added to the above list (separated by commas) or a
path to a file containing the words.

Mask email full domain by character

This function can only be used on Strings. It replaces
everything after the @ character by the character you enter as parameter, or
by a series of X if you do not enter a parameter.

If you enter as a parameter something illegal like a string,
a list, multiple characters, a digit, etc. the full email domain will be
masked by a series of X by default.

For example, if the initial email is
example@talend.com and the given parameter is
B, the generated email looks like
example@BBBBBB.BBB.

Mask email full domain with consistent items

This function can only be used on Strings. It replaces
everything after the @ character randomly by one domain of the list given as
parameter (can also be a path to a file containing the domains you want to
use). If you do not enter a parameter, everything after the @ character is removed.

For example, if the initial email is
example@talend.com and the given parameter is
google.com, yahoo.fr, hotmail.com, the function
chooses randomly a domain from the list and outputs
example@google.com, example@yahoo.fr
or example@hotmail.com.

Mask email left part of domain by character

This function can only be used on Strings. It replaces the
part of the domain before the dot by the character you enter as parameter,
or by a series of X if you do not enter a parameter.

If you enter as a parameter something illegal like a string,
a list, multiple characters, a digit, etc. the full email domain will be
masked by a series of X by default.

For example, if the initial email is
example@talend.com and the given parameter is
B, the generated email looks like
example@BBBBBB.com.

Mask email left part of domain with consistent items

This function can only be used on Strings. It replaces the
part of the domain before the dot randomly by a domain name of the list
given as parameter (can also be a path to a file containing the domain names
you want to use). If you do not enter a parameter, the part of the domain
before the dot is removed.

For example, if the initial email is
example@talend.com and the given parameter is
google, yahoo.co, hotmail, the function chooses
randomly a domain from the list and outputs
example@google.com,
example@yahoo.co.com or
example@hotmail.com.

Mask email local part by character

This function can only be used on Strings. It replaces
everything before the @ character by the character you enter as parameter,
or by a series of X if you do not enter a parameter.

If you enter as a parameter something illegal like a string,
a list, multiple characters, a digit, etc. the local part of the email will
be masked by a series of X by default.

For example, if the initial email is
example@talend.com and the given parameter is
B, the generated email looks like
BBBBBBB@talend.com.
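
The "by character" email functions can be sketched in Java as shown below for the local part and the full domain. The names are hypothetical, and the rule used to decide whether a parameter is legal (exactly one letter) is an assumption drawn from the list of illegal values above.

```java
class EmailMaskSketch {
    // Falls back to X unless the parameter is exactly one letter
    // (assumption based on the "illegal" examples in the text).
    static char maskChar(String param) {
        boolean ok = param != null && param.length() == 1
                && Character.isLetter(param.charAt(0));
        return ok ? param.charAt(0) : 'X';
    }

    static String maskLocalPart(String email, String param) {
        int at = email.indexOf('@');
        if (at < 0) return email;              // not an email: left untouched here
        char m = maskChar(param);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < at; i++) {
            sb.append(m);
        }
        return sb.append(email.substring(at)).toString();
    }

    static String maskFullDomain(String email, String param) {
        int at = email.indexOf('@');
        if (at < 0) return email;
        char m = maskChar(param);
        StringBuilder sb = new StringBuilder(email.substring(0, at + 1));
        for (int i = at + 1; i < email.length(); i++) {
            char c = email.charAt(i);
            sb.append(c == '.' ? '.' : m);     // dots are kept, as in the example
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskLocalPart("example@talend.com", "B"));  // BBBBBBB@talend.com
        System.out.println(maskFullDomain("example@talend.com", "B")); // example@BBBBBB.BBB
        System.out.println(maskFullDomain("example@talend.com", "12")); // example@XXXXXX.XXX
    }
}
```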

Mask email local part with consistent items

This function can only be used on Strings. It replaces
everything before the @ character randomly by one value of the list given as
parameter (can also be a path to a file containing the words you want to
use). If you do not enter a parameter, everything before the @ character is
removed.

For example, if the initial email is
example@talend.com and the given parameter is
jdoe, jsmith, pnewman, the function chooses
randomly a value from the list and outputs jdoe@talend.com
or jsmith@talend.com or
pnewman@talend.com.

Numeric Variance

This function applies only to numerical types (Integer,
Long, Float and Double).

It takes a parameter that must be a number and represents
a percentage of variation. The function modifies the input
value by a random rate between the negative and the positive value of the parameter.
For example, if the input is 100 and the parameter is
10, the generated value is randomly
selected between 90 (100 – 10%) and
110 (100 + 10%). If the input is null, the
function returns 0. If the given parameter is
0, it is replaced by
10.
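
The computation can be sketched in Java as follows. The names are hypothetical, and treating a negative parameter like its absolute value is an assumption.

```java
import java.util.Random;

class NumericVarianceSketch {
    // Multiplies the input by (1 + r), where r is drawn uniformly
    // between -percent% and +percent%.
    static double vary(Double input, int percent, Random rnd) {
        if (input == null) {
            return 0.0;                        // null input: 0
        }
        int p = (percent == 0) ? 10 : Math.abs(percent); // 0 falls back to 10
        double rate = (rnd.nextDouble() * 2 - 1) * p / 100.0;
        return input * (1 + rate);
    }

    public static void main(String[] args) {
        // Prints a value between 90 and 110.
        System.out.println(vary(100.0, 10, new Random()));
    }
}
```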

Replace all

This function can be used on Strings and requires a
character as a parameter. If you do not enter a parameter, each character is
replaced with a randomly selected character.

If the parameter is X, the function
replaces all the characters of the input by X. A null
input makes the function return an empty string.

Replace all digits

This function can be used on Strings and requires a
character as a parameter. If you do not enter a parameter, each digit is
replaced with a randomly selected digit.

Anything that is not a digit is not changed. A null
input makes the function return an empty string.

Replace all letters

This function can be used on Strings and requires a
character as a parameter. If you do not enter a parameter, each letter is
replaced with a randomly selected character.

Anything that is not a letter is not changed. A null
input makes the function return an empty string.

Replace by consistent items from input list (or file)

This function modifies the input value by randomly selecting
one of the values given as parameter. The values must be stored in a String
and separated by commas, for example (“item1, item2, item3, etc.”). It uses
the hashCode() function provided by Java
to choose an element from the list.

It is applied to Strings or numerical types and it ensures
that two similar inputs have the same output. It returns an empty String or
0 if no parameter is given.
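
One way to obtain this consistency from hashCode() is sketched below in Java. It is an assumption about the general approach, not the component's actual code (which may, for instance, also involve the seed); the names are hypothetical.

```java
class ConsistentPickSketch {
    // Maps the input's hash code onto an index of the list, so equal
    // inputs always select the same item.
    static String pick(String input, String list) {
        if (input == null || list == null || list.trim().isEmpty()) {
            return "";                         // no parameter: empty String
        }
        String[] items = list.trim().split("\\s*,\\s*");
        // floorMod keeps the index non-negative for negative hash codes.
        return items[Math.floorMod(input.hashCode(), items.length)];
    }

    public static void main(String[] args) {
        String list = "item1, item2, item3";
        System.out.println(pick("Alice", list));
        System.out.println(pick("Alice", list)); // same item as the line above
    }
}
```

Because many inputs share each item, two different inputs can receive the same output, which is the source of the duplicates discussed next.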

For example, you could use this function to generate SSNs.
However, this function may generate duplicates even though there are no
duplicates in the input data. To prevent this from happening, use Generate Unique SSN.

When using Replace by consistent items from input list (or file), the probability of
generating duplicates can be calculated using the following formulas:

  • P = 1 if K < N, or

  • P = 1 - (K × (K-1) × (K-2) × ... × (K-N+1)) / K^N otherwise,

where P is the probability of generating
duplicates, N the input data size and K is
the size of the input list given as a parameter.

Using this approach, it is possible to calculate the
probability to find a pair sharing the same value within a group.

For example, the probability that, in a group of
n people, two people have the same birthday is the
following:

  • 2.7% in a group of 5 people,

  • 41.1% in a group of 20 people,

  • 100% in a group of 367 people, since there are
    366 possible birthdays, including February 29.
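
The formula can be evaluated with a short Java sketch such as the one below (hypothetical names). The product is computed factor by factor to avoid overflowing K^N. The birthday figures are reproduced here with K = 365, which is an assumption; the list above also counts February 29.

```java
class DuplicateProbabilitySketch {
    // P = 1 - (K * (K-1) * ... * (K-N+1)) / K^N, or 1 when K < N.
    static double duplicateProbability(int n, int k) {
        if (k < n) {
            return 1.0;
        }
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            allDistinct *= (double) (k - i) / k;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.printf("%.1f%%%n", 100 * duplicateProbability(5, 365));
        System.out.printf("%.1f%%%n", 100 * duplicateProbability(20, 365));
        System.out.printf("%.1f%%%n", 100 * duplicateProbability(367, 366));
    }
}
```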

Replace by item from input list (or file)

This function has the same behavior as Replace by consistent item from input list, but it randomly selects
the value from the list (or file), so two identical inputs can produce different outputs.

Replace n first characters

If the parameter is n, the function
replaces the first n characters of the input and
keeps all the characters that follow. A null input makes the function
return an empty string.

If the parameter is bigger than the input length, all the
characters are replaced.

You can enter a second parameter which is the replacement
character.

For example, if the input is Steven
and the argument is 2, X, the result will be
XXeven.

Replace n last characters

This function is the counterpart of Replace n first characters.

tDataMasking Standard properties

These properties are used to configure tDataMasking running in the Standard Job framework.

The Standard tDataMasking component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column,
ORIGINAL_MARK. This column indicates whether the record is an original
record (true) or a substitute record (false).

 

Built-In: You create and store the
schema locally for this component only. Related topic: see
Talend Studio User Guide.

 

Repository: You have already created
the schema and stored it in the Repository. You can reuse it in various projects and
Job designs. Related topic: see
Talend Studio User Guide.

Modifications

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for
which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what
modification to do in order to generate similar substitutional data. For example, you can
decide to have similar values through replacing or adding letters or numbers, replacing
values with synonyms from an index file or deleting values by setting the function to
null.

The Function list will vary according to the column type.
For further information about function behavior, see Function behavior in common PII.

For example, a column of a Long type will have a Numeric variance
option in the list while a column of a String
type will not have such a function. Also, the Function list
for a Date column is date-specific: it allows you to decide the type of
modification you want to do on date values.

Extra Parameter: This field is used by some of the
functions; it is disabled when not applicable. When applicable, enter a number or a
letter to decide the behavior of the function you have selected.

Keep format: this option applies only to Strings.
Select this check box to keep the input format when using the Generate
unique SSN number, Generate account number and keep
original country and Generate credit card number and keep
original bank functions. That is to say, if there are spaces, dots (.), hyphens
(-) or slashes (/) in the input, the output will have the same characters.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each
execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different
sample being generated. Keep this field empty if you want to generate a different sample
each time you execute the Job.
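
The effect of the seed can be illustrated with java.util.Random. Whether the component uses this exact class is an assumption, and the class and method names below are hypothetical, but the principle is the same: a fixed seed reproduces the same sequence of random values, while no seed gives a different sequence on each run.

```java
import java.util.Arrays;
import java.util.Random;

class SeedSketch {
    // Draws n random digits; a null seed stands for an empty seed field.
    static int[] sample(Long seed, int n) {
        Random rnd = (seed == null) ? new Random() : new Random(seed);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = rnd.nextInt(10);
        }
        return out;
    }

    public static void main(String[] args) {
        // The first two lines are identical; the third varies between runs.
        System.out.println(Arrays.toString(sample(12345678L, 8)));
        System.out.println(Arrays.toString(sample(12345678L, 8)));
        System.out.println(Arrays.toString(sample(null, 8)));
    }
}
```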

Output the original row

Select this check box to output original data rows in addition to the substitute data.
Having both data rows can be useful in debug or test processes.

Should null input return null

This check box is selected by default. When selected, the component outputs
null when input values are null. Otherwise, it returns the default
value when the input is null, that is an empty string for string values,
0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence
function. If the input is null, this function will not return null,
even if the check box is selected.

Should empty input return empty

When this check box is selected, the component returns the input values if they are empty.
Otherwise, the selected functions are applied to the input data.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component is an intermediary step. It requires input and
output flows.

Scenario: Altering data values to restrict the use of actual sensitive data

This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.

With the tDataMasking component, you can replace
sensitive information such as credit card or social security numbers with realistic values,
allowing production data to be safely used for purposes such as testing and training.

This scenario describes a Job which uses:

  • the tFixedFlowInput component to generate
    personal data including credit card numbers,

  • the tDataMasking component to hide specific
    original data with random characters or figures,

  • the tFileOutputExcel component to output the
    substitute data set.

use_case-tdatamasking.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataMasking and
    tFileOutputExcel.
  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its
    Basic settings view in the Component tab.

    use_case-tdatamasking2.png

  2. Create the schema through the Edit Schema
    button.

    use_case-tdatamasking3.png

    In the open dialog box, click the [+] button
    and add the columns that will hold the initial input data.
  3. Click OK.
  4. In the Number of rows field, enter
    1.
  5. In the Mode area, select the Use Inline Content option.
  6. In the Content table, enter the customer data
    you want to replace with realistic values, for example:

Replacing actual data with realistic values

  1. Double-click tDataMasking to display the
    Basic settings view and define the component
    properties.

    use_case-tdatamasking4.png

  2. If required, click Sync columns to retrieve
    the schema defined in the input component.
  3. Click the Edit schema button to open the
    schema dialog box.

    tDataMasking proposes one predefined
    read-only column as shown in the capture below.
    use_case-tdatamasking5.png

    This column indicates whether the output record is an original
    record (true) or a substitute record (false).
  4. Move any of the input columns to the output schema if you want to show them in
    the results, click OK and accept to propagate
    the changes.
  5. In the Modifications table, click the
    [+] button to add four rows, and
    then:

    • in the Input Column, select the
      columns whose content you want to substitute,

    • in the Function column, select from
      the predefined list the function you want to use to generate the
      substitute data,

    • in the Parameter column, enter a
      value, a pattern or a path to be used by the function to substitute
      data.

    The Job will generate inauthentic credit card numbers, replace the first three
    letters of first names, replace last names with names from a local file and
    finally replace the part before the @ sign in email addresses by a series of
    X.
  6. Click the Advanced settings tab and select
    the Output the original row check box.

    The Job will add the original data rows to the substitute data.

Configuring the output component and executing the Job

  1. Double-click the tFileOutputExcel component
    to display the Basic settings view and define
    the component properties.

    use_case-tdatamasking6.png

  2. Set the destination file name as well as the sheet name and then select the
    Define all columns auto size check box.
  3. Save your Job and press F6 to execute
    it.

    The tDataMasking component substitutes data
    in the selected columns and writes the result in an output file.
  4. Right-click the output component and select Data Viewer
    to display the original and substituted data.

    use_case-tdatamasking7.png

    tDataMasking outputs original and substitute
    rows marked respectively with true and false in the
    ORIGINAL_MARK column. It generates inauthentic credit
    card numbers, replaces the first three letters of first names, replaces last
    names with names from a local file and finally replaces the part before the @
    sign in email addresses by the names defined in the component basic
    settings.
    Sensitive personal information in the input data has been “hidden” but data
    keeps looking real and consistent. The substitute data is still usable for
    purposes other than production.

tDataMasking properties for Apache Spark Batch

These properties are used to configure tDataMasking running in the Spark Batch Job framework.

The Spark Batch tDataMasking component belongs to the Data Quality family.

The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column,
ORIGINAL_MARK. This column indicates whether the record is an original
record (true) or a substitute record (false).

 

Built-In: You create and store the
schema locally for this component only. Related topic: see
Talend Studio User Guide.

 

Repository: You have already created
the schema and stored it in the Repository. You can reuse it in various projects and
Job designs. Related topic: see
Talend Studio User Guide.

Modifications

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for
which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what
modification to do in order to generate similar substitutional data. For example, you can
decide to have similar values through replacing or adding letters or numbers, replacing
values with synonyms from an index file or deleting values by setting the function to
null.

The Function list will vary according to the column type.
For further information about function behavior, see Function behavior in common PII.

For example, a column of a Long type will have a Numeric variance
option in the list while a column of a String
type will not have such a function. Also, the Function list
for a Date column is date-specific: it allows you to decide the type of
modification you want to do on date values.

Extra Parameter: This field is used by some of the
functions; it is disabled when not applicable. When applicable, enter a number or a
letter to decide the behavior of the function you have selected.

Keep format: this option applies only to Strings.
Select this check box to keep the input format when using the Generate
unique SSN number, Generate account number and keep
original country and Generate credit card number and keep
original bank functions. That is to say, if there are spaces, dots (.), hyphens
(-) or slashes (/) in the input, the output will have the same characters.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each
execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different
sample being generated. Keep this field empty if you want to generate a different sample
each time you execute the Job.
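The relationship between the seed and the generated sample can be illustrated with
java.util.Random. This is a sketch under assumptions: the document does not state which
generator the component uses, and the class SeedSketch and its substitute method are
hypothetical. A null seed stands in for leaving the field empty.

```java
import java.util.Random;

// Illustrative sketch only; names are hypothetical and this is not Talend code.
final class SeedSketch {

    // Generates a substitute string of upper-case letters.
    // A null seed models an empty "Seed for random generator" field.
    static String substitute(Long seed, int length) {
        Random random = (seed == null) ? new Random() : new Random(seed);
        StringBuilder out = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            out.append((char) ('A' + random.nextInt(26)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Same seed: the two "executions" produce the same sample.
        String first = substitute(12345678L, 8);
        String second = substitute(12345678L, 8);
        System.out.println(first.equals(second)); // prints "true"

        // No seed: the sample generally differs from one execution to the next.
        System.out.println(substitute(null, 8));
    }
}
```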

Output the original row

Select this check box to output original data rows in addition to the substitute data.
Having both data rows can be useful in debug or test processes.

Should null input return null

This check box is selected by default. When selected, the component outputs
null when input values are null. Otherwise, it returns the default
value when the input is null, that is an empty string for string values,
0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input
is null, this function will not return null, even if the box is checked.

Should empty input return empty

When this check box is selected, the component returns the input values if they are empty.
Otherwise, the selected functions are applied to the input data.

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only
when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

You need to use the Spark Configuration tab in
the Run view to define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode: when using Google Dataproc, specify a bucket in the Google Storage staging
    bucket field in the Spark configuration tab; when using other distributions, use a
    tHDFSConfiguration component to specify the directory.

  • Standalone mode: you need to choose the configuration component depending on the file
    system you are using, such as tHDFSConfiguration or tS3Configuration.

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component yet.

tDataMasking properties for Apache Spark Streaming

These properties are used to configure tDataMasking running in the Spark Streaming Job framework.

The Spark Streaming tDataMasking component belongs to the Data Quality family.

The component in this framework is available only if you have subscribed to Talend
Real-time Big Data Platform or Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields (columns) to
be processed and passed on to the next component. The schema is either Built-In or stored remotely in the Repository.

Click Sync columns to retrieve the schema from
the previous component connected in the Job.

Click Edit schema to make changes to the schema.
If the current schema is of the Repository type, three
options are available:

  • View schema: choose this option to view the
    schema only.

  • Change to built-in property: choose this
    option to change the schema to Built-in for
    local changes.

  • Update repository connection: choose this
    option to change the schema stored in the repository and decide whether to propagate
    the changes to all the Jobs upon completion. If you just want to propagate the
    changes to the current Job, you can select No
    upon completion and choose this schema metadata again in the [Repository Content] window.

The output schema of this component contains one read-only column, ORIGINAL_MARK.
This column indicates, with true or false, whether the record is an original record
or a substitute record, respectively.

 

Built-In: You create and store the schema locally for this component only.
Related topic: see Talend Studio User Guide.

 

Repository: You have already created the schema and stored it in the Repository. You can
reuse it in various projects and Job designs. Related topic: see Talend Studio User Guide.

Modifications

Define in the table what fields to change and how to change them:

Input Column: Select the column from the input flow for
which you want to generate similar data by modifying its values.

These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.

Function: Select the function that will decide what
modification to do in order to generate similar substitutional data. For example, you can
decide to have similar values through replacing or adding letters or numbers, replacing
values with synonyms from an index file or deleting values by setting the function to
null.

The Function list will vary according to the column type.
For further information about function behavior, see Function behavior in common PII.

For example, a column of a Long type will have a Numeric variance option in the list,
while a column of a String type will not have such a function. Also, the Function list
for a Date column is date-specific: it allows you to decide the type of modification
you want to do on date values.

Extra Parameter: This field is used by some of the functions and is disabled when not
applicable. When applicable, enter a number or a letter to decide the behavior of the
function you have selected.

Keep format: This option applies only to Strings.
Select this check box to keep the input format when using the Generate unique SSN
number, Generate account number and keep original country, and Generate credit card
number and keep original bank functions. That is to say, if there are spaces, dots (‘.’),
hyphens (‘-’) or slashes (‘/’) in the input, the output will have the same characters.

Advanced settings

Seed for random generator

Set a random number if you want to generate the same sample of substitute data in each
execution of the Job. This field is set to 12345678 by default.

Repeating the execution with a different value for this field will result in a different
sample being generated. Keep this field empty if you want to generate a different sample
each time you execute the Job.

Output the original row

Select this check box to output original data rows in addition to the substitute data.
Having both data rows can be useful in debug or test processes.
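How this option interacts with the ORIGINAL_MARK column can be sketched as follows. The
sketch is hypothetical throughout: the Row class, the placeholder mask function and the
ordering of original and substitute rows are assumptions made for illustration, not
documented behavior of the component.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; names and row ordering are assumptions, not Talend code.
final class OriginalRowSketch {

    // One output row: a single masked or original value plus the ORIGINAL_MARK flag.
    static final class Row {
        final String value;
        final boolean originalMark; // true = original record, false = substitute

        Row(String value, boolean originalMark) {
            this.value = value;
            this.originalMark = originalMark;
        }
    }

    // Placeholder masking function: replaces letters and digits with 'X'.
    static String mask(String value) {
        return value.replaceAll("[A-Za-z0-9]", "X");
    }

    // Emits one substitute row per input row, preceded by the original row
    // when outputOriginalRow is set.
    static List<Row> process(List<String> inputs, boolean outputOriginalRow) {
        List<Row> out = new ArrayList<>();
        for (String input : inputs) {
            if (outputOriginalRow) {
                out.add(new Row(input, true));
            }
            out.add(new Row(mask(input), false));
        }
        return out;
    }

    public static void main(String[] args) {
        for (Row row : process(Arrays.asList("alice", "bob"), true)) {
            System.out.println(row.originalMark + " | " + row.value);
        }
    }
}
```

Filtering the output on ORIGINAL_MARK then separates the two kinds of rows, which is one
way to compare originals and substitutes side by side when debugging.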

Should null input return null

This check box is selected by default. When selected, the component outputs
null when input values are null. Otherwise, it returns the default
value when the input is null, that is an empty string for string values,
0 for numeric values and the current date for date values.

This parameter does not have an effect on the Generate Sequence function. If the input
is null, this function will not return null, even if the box is checked.

Should empty input return empty

When this check box is selected, the component returns the input values if they are empty.
Otherwise, the selected functions are applied to the input data.
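The two check boxes above can be read as guards applied before the masking function. The
following Java sketch models that reading for String values only. The class
NullEmptySketch and its parameters are hypothetical, and the Generate Sequence exception
described above is not modeled.

```java
import java.util.function.UnaryOperator;

// Illustrative sketch only; names are hypothetical and this is not Talend code.
final class NullEmptySketch {

    // nullReturnsNull   models "Should null input return null"
    // emptyReturnsEmpty models "Should empty input return empty"
    static String maskString(String input,
                             boolean nullReturnsNull,
                             boolean emptyReturnsEmpty,
                             UnaryOperator<String> function) {
        if (input == null) {
            // Default value for String columns is the empty string.
            return nullReturnsNull ? null : "";
        }
        if (input.isEmpty() && emptyReturnsEmpty) {
            return input; // empty input is passed through untouched
        }
        return function.apply(input);
    }

    public static void main(String[] args) {
        UnaryOperator<String> stars = s -> "***"; // placeholder masking function
        System.out.println(maskString(null, true, true, stars));  // prints "null"
        System.out.println(maskString("", true, false, stars));   // prints "***"
        System.out.println(maskString("abc", true, true, stars)); // prints "***"
    }
}
```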

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the
Run view to define the connection to a given Spark cluster
for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections describing
how to create, convert and configure a Talend Spark Streaming Job in the Talend Open
Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents
only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection

You need to use the Spark Configuration tab in
the Run view to define the connection to a given
Spark cluster for the whole Job. In addition, since the Job expects its dependent jar
files for execution, you must specify the directory in the file system to which these
jar files are transferred so that Spark can access these files:

  • Yarn mode: when using Google Dataproc, specify a bucket in the Google Storage staging
    bucket field in the Spark configuration tab; when using other distributions, use a
    tHDFSConfiguration component to specify the directory.

  • Standalone mode: you need to choose the configuration component depending on the file
    system you are using, such as tHDFSConfiguration or tS3Configuration.

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component yet.


Document sourced from Talend: https://help.talend.com