July 30, 2023

tDataMasking – Docs for ESB 7.x

tDataMasking

Hides original data with random characters or figures to protect the actual data
while having a functional substitute for occasions when it is not advisable to show
sensitive real data.

tDataMasking reads a data set row by row
and creates a structurally similar but inauthentic version of the data after having applied
specific functions on data fields. It generates one row for each input row.

You will be able to use the functional substitute for purposes such as
testing and training. When manipulating Personally Identifiable Information (PII) or
Sensitive Personal Data (SPD), you might want to protect and mask this data.

The definition of sensitive data is broad and may differ from one country
to the other or from one organization to the other. Basically, sensitive data can be
personal information or business information which includes anything that poses a risk to
the person or company in question.

Globally, credit and debit card data, for example, is considered
to be sensitive. Personal sensitive data is any piece of information that can be used to
identify or locate a person. A non-exhaustive list of personal sensitive data may include: first and
last names, email addresses, addresses, Social Security Number (SSN), credit card numbers,
bank account numbers, race, gender, date of birth, salary and geolocation combined with
time.

For further information about personal sensitive data, see
Personally Identifiable Information.

Also, business sensitive data may include trade secrets,
acquisition plans, financial data and customer information, among other possibilities.

In local mode, Apache Spark 1.6.0 and later versions are supported.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:

Data masking capabilities

Masking functions in the tDataMasking component are consistent,
bijective and/or random functions, and they can check that the input data is in a valid
format.

Random data masking

Random masking consists of masking an input value with a randomly generated
value.

When there are multiple occurrences of the same value in the input dataset, it can be
masked with different values.

Different values from the input dataset can be masked with the same value.

For example, the following diagram shows an example of how the
tDataMasking component can mask data randomly:

  • The A value is masked with D
    when it first appears in the input dataset.
  • The B and C values are masked
    with E.
  • The A value is masked with F
    when it appears in the input dataset for the second time.

tDataMasking_1.png

Random data masking examples

The following table shows examples of generated masked values using the Replace n first chars function:

Input values Extra Parameter Examples of masked values
newuser@domain.com “4” ohsbser@domain.com
admin@company.com “4” lneen@company.com
newuser@domain.com “4” qzmaser@domain.com

The following table shows examples of generated masked values using the Generate from pattern function:

Input values Extra Parameter Examples of masked values
newuser@domain.com “aaaaaa” rxvsas
admin@company.com “aaaaaa” bbwpba
newuser@domain.com “a9aaa9” r8daw1

The following table shows examples of generated masked values for the Generate French SSN number function:

Input values Examples of masked values
190049418437621 2590459222147 22
271083561478941 1900846274448 17
190049418437621 2730364078284 70
117029 1750694861914 69

Consistent data masking

When the same value appears twice in the input data, consistent masking functions
output the same masked value in the same Job execution.

However, two different input values can be masked with the same value in the output.

For example, the following diagram shows an example of how the
tDataMasking component can mask data consistently:

  • The A value is masked with D,
    regardless of the number of occurrences in the input dataset.
  • The B and C values are masked
    with E.

tDataMasking_2.png
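The behavior described above can be sketched with a small cache. This is only an illustration, not the component's implementation: the function name `make_consistent_masker` and the choice of random lowercase letters as masked output are assumptions.

```python
import random
import string


def make_consistent_masker(rng):
    """Return a masking function that remembers the value it gave each input."""
    cache = {}

    def mask(value):
        # Reuse the earlier result so repeated inputs get the same masked value.
        if value not in cache:
            cache[value] = "".join(rng.choice(string.ascii_lowercase) for _ in value)
        return cache[value]

    return mask


mask = make_consistent_masker(random.Random())
first = mask("newuser@domain.com")
second = mask("newuser@domain.com")
print(first == second)  # True: same input, same masked value within one run
```

Because the cache lives only as long as the masker, the sketch is consistent within one run but not between runs, which matches the distinction the document draws between consistent and repeatable masking.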

Consistent data masking examples

The following table shows examples of generated masked values using the Mask email
left part of domain with consistent items function:

Input values Extra Parameter Examples of masked values
newuser@domain.com “talend,value,newcompany” newuser@newcompany.com
admin@company.com “talend,value,newcompany” admin@value.com
newuser@domain.com “talend,value,newcompany” newuser@newcompany.com
user@company.com “talend,value,newcompany” user@value.com
user@domain.com “talend,value,newcompany” user@newcompany.com

Bijective data masking

Bijective masking functions have the following characteristics:

  • They are consistent masking functions.
  • They are injective, meaning that they output two different masked values for two
    different input values.
  • They check that the input data is in a valid format. If the input value is
    valid, bijective masking functions output a valid value. If the input value is
    not valid, they output an invalid value or replace values with
    null, depending on the masking function used.

For example, the following diagram shows an example of how the
tDataMasking component can mask data bijectively:

  • The A value is masked with D,
    regardless of the number of occurrences in the input dataset.
  • The B value is masked with
    E.
  • The C value is masked with
    F.

tDataMasking_3.png
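The three diagrams can be compared with two small checks. The sketch below is illustrative only: the helper names `is_consistent` and `is_injective` are hypothetical, and the value pairs are taken from the A, B, C examples in the text.

```python
def is_consistent(pairs):
    """True if every input value is always masked with the same output value."""
    seen = {}
    for original, masked in pairs:
        if seen.setdefault(original, masked) != masked:
            return False
    return True


def is_injective(pairs):
    """True if two different input values never share a masked value."""
    owner = {}
    for original, masked in pairs:
        if owner.setdefault(masked, original) != original:
            return False
    return True


# (input value, masked value) pairs from the three diagrams.
random_run = [("A", "D"), ("B", "E"), ("C", "E"), ("A", "F")]
consistent_run = [("A", "D"), ("B", "E"), ("C", "E"), ("A", "D")]
bijective_run = [("A", "D"), ("B", "E"), ("C", "F"), ("A", "D")]

for name, run in [("random", random_run), ("consistent", consistent_run), ("bijective", bijective_run)]:
    print(name, is_consistent(run), is_injective(run))
```

Only the bijective run passes both checks; the consistent run fails the injectivity check because B and C share the masked value E.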

Bijective data masking examples

The following table shows examples of generated masked values using the Mask French SSN number function:

Input values Example of masked values
190049418437621 289052428331901
271083561478941 234112758889352
190049418437621 289052428331901
117029 null

Repeatable data masking

To produce repeatable masked values between Job executions, define a seed or a password
in the Advanced settings of the component.

For a given combination of input and seed values, the same masked value is produced.

When using Format-Preserving Encryption methods, the same masked value is produced for a
given combination of an input value and a password.
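One way to picture repeatability is to derive the masked value from the input and the seed alone, with no random state. The sketch below uses an HMAC for that purpose; it is an assumption made for illustration and not the algorithm used by the component. The function name `repeatable_mask` is hypothetical.

```python
import hashlib
import hmac
import string


def repeatable_mask(value, seed):
    """Derive one lowercase letter per input character from the seed and the input."""
    letters = []
    for position in range(len(value)):
        # The message includes the position so that each character gets its own digest.
        message = f"{position}:{value}".encode("utf-8")
        digest = hmac.new(seed.encode("utf-8"), message, hashlib.sha256).digest()
        letters.append(string.ascii_lowercase[digest[0] % 26])
    return "".join(letters)


# Two separate calls stand in for two Job executions with the same seed.
print(repeatable_mask("newuser", "my-seed") == repeatable_mask("newuser", "my-seed"))  # True
```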

Data masking functions in the masking components

There are several functions in the masking components which vary
according to the data type of the column.

It is advisable to use the functions predefined in the component with
columns that contain personally identifiable information, such as first and last names,
email addresses, addresses, SSNs, credit card numbers, bank account numbers, genders,
dates of birth and salaries.

Format-preserving encryption in the masking components

The component uses Format-Preserving Encryption (FPE)
methods to generate masked output values in the same format as the input values.

Note: Java 8u161 is the minimum required version to use the FF1 with AES method.
To use this FPE method with Java versions earlier than 8u161, download the
Java Cryptography Extension (JCE) unlimited strength jurisdiction policy files
from the Oracle website.

The FPE methods are based on a National Institute of
Standards and Technology (NIST) standard:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

Important: The FPE methods encrypt data to perform pseudonymization. These methods are
weaker than classical encryption algorithms. If you want to keep the data
format, use the masking components. Otherwise, use the tDataEncrypt component, which
provides stronger encryption.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

You can use tweaks so that the masking is no longer bijective, which makes the
encryption stronger. A unique tweak is generated for each record and applies to all
data of a record. The tweaks change at each Job execution. You can unmask the data by
using the tDataUnmasking component and the corresponding tweaks.

Format-preserving encryption in the tDataMasking component

When using the FF1 with AES and FF1 with SHA-2 methods, input values must contain a
minimum number of characters to be masked. Otherwise, the function returns null.

For example, you want to mask S426A789QQ using the Keep first n digits and replace
following ones function with the following parameters:

  • FF1 with AES or FF1 with SHA-2
  • The Digits alphabet
  • “2” as an extra-parameter

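The input value contains six digits. Because you decided to keep the first two digits,
only four digits remain to be masked, which is fewer than the minimum required for the
Digits alphabet. As a result, the function returns null.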

The minimum number of characters required in the input values varies depending on the
selected Alphabet.

When selecting Best guess, the number varies depending on the alphabets represented in
the input values.

Alphabet Minimum number of characters to mask
Alphanumeric 4
Digits 6
Latin extended 3
Hiragana 4
Katakana 3
Kanji 2
Hangul 2
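The S426A789QQ example can be checked against the table with a few lines of code. This is a sketch under stated assumptions: the names `MIN_CHARS`, `digits_left_to_mask` and `can_mask_digits_with_fpe` are hypothetical, and only the Digits case from the example is modeled.

```python
# Minimum number of characters to mask, copied from the table above.
MIN_CHARS = {
    "Alphanumeric": 4,
    "Digits": 6,
    "Latin extended": 3,
    "Hiragana": 4,
    "Katakana": 3,
    "Kanji": 2,
    "Hangul": 2,
}


def digits_left_to_mask(value, kept_digits):
    """Count the digits that remain to be masked once the first ones are kept."""
    total = sum(ch in "0123456789" for ch in value)
    return max(total - kept_digits, 0)


def can_mask_digits_with_fpe(value, kept_digits):
    """True if enough digits remain for the Digits alphabet; otherwise expect null."""
    return digits_left_to_mask(value, kept_digits) >= MIN_CHARS["Digits"]


print(digits_left_to_mask("S426A789QQ", 2))       # 4
print(can_mask_digits_with_fpe("S426A789QQ", 2))  # False, so the function returns null
```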

Alphabets

When using the Replace all, Replace characters between two positions,
Replace n first digits and Replace n last digits functions
with FPE methods, you can select an alphabet.

Characters that belong to the
selected alphabet are masked with characters from the same alphabet.

When selecting the Best guess alphabet, masked values contain characters from all
character types represented in the input values. Best guess is the default alphabet.

Any unrecognized character is
copied to the output as is.

The following alphabets are
supported:

Each entry gives the character type, its Unicode range (version 11.0) and, where they
can be displayed, the corresponding characters.

Alphanumeric
  • Latin numbers: [0030-0039], [0-9]
  • Latin lower-cased letters: [0061-007A], [a-z]
  • Latin upper-cased letters: [0041-005A], [A-Z]

Digits
  • Latin numbers: [0030-0039], [0-9]

Latin extended
  • Latin numbers: [0030-0039], [0-9]
  • Latin lower-cased letters: [0061-007A], [a-z]
  • Latin extended lower-cased letters: [00DF-00F6] [00F8-00FF], [ß-ö] [ø-ÿ]
  • Latin upper-cased letters: [0041-005A], [A-Z]
  • Latin extended upper-cased letters: [00C0-00D6] [00D8-00DE], [À-Ö] [Ø-Þ]

Hiragana
  • Hiragana: [3041-3096] 30FC 309D 309E, [ぁ-ゖ] ー ゝ ゞ

Katakana
  • Half-width Katakana: [FF66-FF9D] (see https://www.unicode.org/charts/PDF/UFF00.pdf), [ヲ-ン]
  • Full-width Katakana: [30A1-30FA] 30FC 30FD 30FE, [ァ-ヺ] ー ヽ ヾ
  • Phonetic extension: [31F0-31FF], [ㇰ-ㇿ]

Kanji
  • Kanji: [4E00-9FEF], [一-鿯]
  • CJK Extension A: [3400-4DB5], [㐀-䶵]
  • CJK Extension B: [20000-2A6D6]
  • CJK Extension C: [2A700-2B734]
  • CJK Extension D: [2B740-2B81D]
  • CJK Extension E: [2B820-2CEA1]
  • CJK Extension F: [2CEB0-2EBE0]
  • CJK Compatibility Ideographs: [F900-FA6D] [FA70-FAD9], the first range being [豈-舘]
  • CJK Compatibility Ideographs Supplement: [2F800-2FA1D]
  • KangXi Radicals: [2F00-2FD5], [⼀-⿕]
  • CJK Radicals Supplement: [2E80-2E99] [2E9B-2EF3], [⺀-⺙] [⺛-⻳]
  • CJK Symbols and Punctuation: [3005-3005] [3007-3007] [3021-3029] [3038-303B],
    [々-々] [〇-〇] [〡-〩] [〸-〻]

Hangul
  • Hangul: [AC00-D7AF], [가-힯]

Character handling functions

Function Random masking Consistent masking Format-preserving encryption Input data validation
Replace all Yes Yes Yes No
Replace n first chars Yes Yes Yes No
Replace n last chars Yes Yes Yes No
Replace characters between two positions Yes Yes Yes No
Replace all letters Yes Yes Yes No
Replace all digits Yes Yes Yes No
Keep n first digits and replace following ones Yes Yes Yes No
Keep n last digits and replace previous ones Yes Yes Yes No
Keep characters between two positions Yes Yes Yes No
Remove n first chars N/A N/A N/A N/A
Remove n last chars N/A N/A N/A N/A
Remove characters between two positions N/A N/A N/A N/A

Replace all

This function masks all characters from the input values.

This function can be used on Strings.

When using the FF1 with AES and FF1 with SHA-2
methods, the input values must contain at least two characters to mask. Otherwise, the
function returns null.

Option Description
Method The Randomly method randomly
selects a character. As a result, two identical input values can be masked with
different output values.

When the same value appears
twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter

The optional extra parameter must be a character.

In the first two examples, all characters are replaced with the character
defined as an extra parameter.

In the third example, the masked value contains
characters from all alphabets represented in the input value.

Input value Method Alphabet Extra parameter Example of a masked value
Jack Randomly   “a” aaaa
S1000D Randomly   “4” 444444
S1000D FF1 with SHA-2 Best guess   2MTW72

Replace n first chars

This function masks the first n characters, while the following ones
remain as is.

Option Description
Method The Randomly method randomly
selects a character. As a result, two identical input values can be masked with
different output values.

When the same value appears twice in
the input data, the Consistently
method ensures that the function outputs the same masked value in the same
Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires an extra parameter.

The extra parameter must be a number.
This is the number of characters to be masked.

You can enter a second extra parameter, which is the replacement
character.

In the following examples, the first two characters from the input values
are masked.

In the first example, the replacement character is not defined. The first
two characters are masked with random characters.

In the second example, the
first two characters are masked with the defined character.

Input value Method Extra parameter Example of a masked value
Jack Randomly “2” Pvck
S1000D Randomly “2,s” ss000D

Replace n last chars

This function masks the last n characters, while the previous ones
remain as is.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires an extra parameter.

The extra parameter must be a number.
This is the number of characters to be masked.

You can enter a second extra parameter, which is the replacement
character.

In the following examples, the last two characters from the input values
are masked.

In the first example, the replacement character is not defined. The last
two characters are masked with random characters.

In the second example, the
last two characters are masked with the defined character.

Input value Method Extra parameter Example of a masked value
Jack Randomly “2” Jadq
S1000D Randomly “2,s” S100ss

Replace characters between two positions

This function masks all characters included in the defined interval, while the ones
outside the interval are copied to the output as is.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires two extra parameters.

The extra parameters must be numbers, which are the start and the end values of
the interval.

You can enter a third extra parameter,
which is the replacement character.

In the first example, the first three characters are masked with the defined
character.

In the second example, the replacement character is not defined. The second, third
and fourth characters are masked with random characters.

Input value Method Extra parameter Example of a masked value
Jack Randomly “1,3,p” pppk
S1000D Randomly “2,4” S0640D
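The two examples can be reproduced with a short sketch. It assumes, from the examples only, that positions are 1-based and inclusive and that a random replacement keeps the kind of character (digit, lower case, upper case). The names `random_like` and `replace_between` are hypothetical and this is not the component's code. Replace n first chars and Replace n last chars can be seen as the same operation with the interval anchored at the start or the end of the value.

```python
import random
import string


def random_like(ch, rng):
    """Pick a random character of the same kind as ch (assumed behavior)."""
    if ch in string.digits:
        return rng.choice(string.digits)
    if ch in string.ascii_lowercase:
        return rng.choice(string.ascii_lowercase)
    if ch in string.ascii_uppercase:
        return rng.choice(string.ascii_uppercase)
    return ch  # anything else is copied as is


def replace_between(value, start, end, replacement=None, rng=None):
    """Mask positions start..end (1-based, inclusive); keep the rest unchanged."""
    rng = rng or random.Random()
    out = []
    for position, ch in enumerate(value, start=1):
        if start <= position <= end:
            out.append(replacement if replacement is not None else random_like(ch, rng))
        else:
            out.append(ch)
    return "".join(out)


print(replace_between("Jack", 1, 3, "p"))  # pppk
print(replace_between("S1000D", 2, 4))     # "S", three random digits, then "0D"
```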

Replace all letters

This function masks all letters from the input values.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter

The optional extra parameter is the replacement
character.

In the first example, the replacement character is not defined. All letters are
masked with random characters.

In the second example, all letters are masked with the defined character.

Input value Method Extra parameter Example of a masked value
Jack Randomly “” Zvxn
S1000D Randomly “q” q1000q

Replace all digits

This function masks all digits from the input values.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Alphabet Digits is the only alphabet available with the
FF1 with AES and FF1 with SHA-2 methods.
Extra parameter

The optional extra parameter is the replacement
character.

In the first example, the replacement character is not defined, so digits would be masked
with random characters. Because the input value contains no digits, it is copied to the
output as is.

In the second example, all digits are masked with the defined character.

In the third example, all digits are masked with the defined digit.

Input value Method Extra parameter Example of a masked value
Jack Randomly “” Jack
S1000D Randomly “q” SqqqqD
S1000D Randomly “8” S8888D
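The Replace all digits examples, and the Replace all letters examples from the previous section, follow the same pattern: test each character, then either copy it or replace it. The sketch below is an illustration with hypothetical function names; treating an empty extra parameter as "not defined" and preserving letter case on random replacement are assumptions drawn from the example tables.

```python
import random
import string


def replace_all_digits(value, replacement="", rng=None):
    """Mask every digit; other characters are copied as is."""
    rng = rng or random.Random()
    return "".join(
        (replacement or rng.choice(string.digits)) if ch in string.digits else ch
        for ch in value
    )


def replace_all_letters(value, replacement="", rng=None):
    """Mask every letter; other characters are copied as is."""
    rng = rng or random.Random()

    def random_letter(ch):
        # Keep the case of the original letter (assumption based on the examples).
        pool = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
        return rng.choice(pool)

    return "".join(
        (replacement or random_letter(ch)) if ch.isalpha() else ch
        for ch in value
    )


print(replace_all_digits("S1000D", "q"))   # SqqqqD
print(replace_all_digits("S1000D", "8"))   # S8888D
print(replace_all_letters("S1000D", "q"))  # q1000q
```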

Keep n first digits and replace following ones

This function keeps the first n digits as is and replaces subsequent
ones with digits. Non-digit characters remain as is.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires an extra parameter.

The extra parameter is the number of
digits to keep.

In the first example, the input value does not contain any digits, so it is
copied as is to the output.

In the second example, the first two digits are copied to the output as is. The
following ones are masked with random digits.

Input value Extra parameter Example of a masked value
Jack “2” Jack
S1000D “2” S1023D
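A minimal sketch of this function, following the examples where “2” leaves the first two digits unchanged. The name `keep_n_first_digits` is hypothetical and only the random method is modeled; it is not the component's implementation.

```python
import random
import string


def keep_n_first_digits(value, n, rng=None):
    """Copy the first n digits, replace later digits with random digits."""
    rng = rng or random.Random()
    seen = 0
    out = []
    for ch in value:
        if ch in string.digits:
            seen += 1
            out.append(ch if seen <= n else rng.choice(string.digits))
        else:
            out.append(ch)  # non-digit characters remain as is
    return "".join(out)


print(keep_n_first_digits("Jack", 2))    # Jack
print(keep_n_first_digits("S1000D", 2))  # "S10", two random digits, then "D"
```

The Keep n last digits variant can be obtained by applying the same logic to the reversed value and reversing the result.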

Keep n last digits and replace previous ones

This function keeps the last n digits as is and replaces previous ones
with digits. Non-digit characters remain as is.

Option Description
Method The Randomly method
randomly selects a character. As a result, two identical input values can be
masked with different output values.

When the same value
appears twice in the input data, the Consistently method ensures that the function outputs the
same masked value in the same Job execution.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires an extra parameter.

The extra parameter is the number of
digits to keep.

In the first example, the input value does not contain any digits, so it is
copied as is to the output.

In the second example, the last two digits are copied to the output as is. The
previous ones are masked with random digits.

Input value Extra parameter Example of a masked value
Jack “2” Jack
S1000D “2” S8900D

Keep characters between two positions

This function keeps all characters included in the defined interval, while the ones
outside the interval are removed.

Option Description
Extra parameter This function requires two extra parameters.

The extra parameters must be numbers, which are the start
and the end values of the interval.

In the first example, the first three characters are kept, while the other ones are
removed.

In the second example, the second, third and fourth characters are kept, while the other ones
are removed.

Input value Extra parameter Example of a masked value
Jack “1,3” Jac
S1000D “2,4” 100

Remove characters between two positions

This function removes all characters included in the defined interval, while the ones
outside the interval are copied to the output as is.

Option Description
Extra parameter This function requires two extra parameters.

The
extra parameters must be numbers, which are the start and the end values of
the interval.

In the first example, the first three characters are removed, while the other ones
are kept.

In the second example, the second, third and fourth characters are removed, while the
other ones are kept.

Input value Extra parameter Example of a masked value
Jack “1,3” k
S1000D “2,4” S0D
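Both this function and Keep characters between two positions reduce to string slicing once the positions are read as 1-based and inclusive, which is what the example tables suggest. The function names below are hypothetical.

```python
def keep_between(value, start, end):
    """Keep characters at positions start..end (1-based, inclusive)."""
    return value[start - 1:end]


def remove_between(value, start, end):
    """Remove characters at positions start..end (1-based, inclusive)."""
    return value[:start - 1] + value[end:]


print(keep_between("Jack", 1, 3))      # Jac
print(keep_between("S1000D", 2, 4))    # 100
print(remove_between("Jack", 1, 3))    # k
print(remove_between("S1000D", 2, 4))  # S0D
```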

Remove n first chars

This function removes the first n characters, while subsequent ones
are copied to the output as is.

Option Description
Extra parameter This function requires an extra parameter.

The extra parameter is the number of
characters to be removed.

In the first example, the first two characters are removed.

In the second example, the first four characters are removed.

Input value Extra parameter Example of a masked value
Jack “2” ck
S1000D “4” 0D

Remove n last chars

This function removes the last n characters, while previous ones are copied to
the output as is.

Option Description
Extra parameter This function requires an extra parameter.

The extra parameter is the number of
characters to be removed.

In the first example, the last two characters are removed.

In the second example, the last four characters are removed.

Input value Extra parameter Example of a masked value
Jack “2” Ja
S1000D “4” S1

Date handling functions

You can mask dates.

Function Random masking Consistent masking Format-preserving encryption Input data validation Note
Date variance Yes No No Yes You can use the tPatternMasking component to mask dates in a bijective manner. However, the variation in days is not guaranteed.
Keep year and set day and month to 01/01 No Yes No Yes

Date variance

This function varies the input date by the number of days
specified as an extra parameter.

If the input date is null, then the function returns the current date.

Option Description
Extra parameter This function requires an extra parameter.

The extra
parameter must be a number of days.

If the extra
parameter is 0 or null or if it is not a number, then the
parameter is replaced with 31.

For example, if the input date is 05-11-2016, then the generated date is randomly selected
between 05-10-2016 (31 days before the
input date) and 06-12-2016 (31 days after
the input date).

In the first example, the extra parameter is “0”, so the function replaces this value with 31. The generated date, 07-07-2018, is randomly selected between 01-06-2018 (31 days before the input date) and 02-08-2018 (31 days after the input date).

In the second example, the extra parameter is “4”. The generated date, 01-07-2018, is randomly selected between 28-06-2018 (4 days before the input date) and 06-07-2018 (4 days after the input date).

Input value Extra parameter Example of a masked value
02-07-2018 “0” 07-07-2018
02-07-2018 “4” 01-07-2018
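The rules above (default of 31 days, current date for a null input) can be sketched as follows. The names `parse_days` and `date_variance` are hypothetical, and two details are assumptions: the bounds are treated as inclusive, and a negative parameter is replaced by its absolute value.

```python
import random
from datetime import date, timedelta

DEFAULT_DAYS = 31


def parse_days(extra_parameter):
    """Return the number of days, falling back to 31 for 0, null or non-numeric input."""
    try:
        days = int(extra_parameter)
    except (TypeError, ValueError):
        return DEFAULT_DAYS
    return abs(days) or DEFAULT_DAYS


def date_variance(input_date, extra_parameter, rng=None, today=None):
    """Shift the date by a random number of days within +/- the parameter."""
    rng = rng or random.Random()
    if input_date is None:
        return today or date.today()
    span = parse_days(extra_parameter)
    return input_date + timedelta(days=rng.randint(-span, span))


masked = date_variance(date(2018, 7, 2), "4")
print(abs((masked - date(2018, 7, 2)).days) <= 4)  # True
```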

Keep year and set day and month to 01/01

This function sets the day and month of the input date to January 1 but does not
change the year.

If the input date is null, the function returns January 1
of the current year, for example 01-01-2019.

This function requires no extra
parameter.

In the following example, the function returns January 1 of the year of the input date.

Input value Example of a masked value
24-12-2019 01-01-2019
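As a sketch, the function amounts to rebuilding the date from its year. The name `keep_year` and the optional `today` argument (used only to make the null case testable) are assumptions.

```python
from datetime import date


def keep_year(input_date, today=None):
    """Return January 1 of the input year, or of the current year for a null input."""
    if input_date is None:
        input_date = today or date.today()
    return date(input_date.year, 1, 1)


print(keep_year(date(2019, 12, 24)))  # 2019-01-01
```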

Number handling functions

You can mask numbers.

Function Random masking Consistent masking Format-preserving encryption Input data validation Note
Generate value between two values Yes No No No To mask values in a bijective manner, you can use the tPatternMasking component.
Numeric variance Yes No No Yes

Generate value between two values

This function generates a number randomly chosen between the
user-defined minimum and maximum values.

Option Description
Extra parameter This function requires an extra parameter.

The minimum and maximum values are specified as an extra parameter, by
a comma-separated list of two integers, for example: "1,10".

If the
user-defined minimum and maximum values do not use the right format, the
function returns the following masked values:

  • If the data type of the input column is String, the
    function returns an empty string.
  • If the data type of the input column is a numeric
    data type, the function returns 0.

The masked value has been randomly selected within the minimum value (50) and the
maximum value (99) defined as extra parameters.

Input value Extra parameter Example of a masked value
24 “50,99” 93
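The parameter handling can be sketched as below. The name `generate_between` is hypothetical. Returning a string for String input and swapping the bounds when they are given in reverse order are assumptions, not documented behavior.

```python
import random


def generate_between(input_value, extra_parameter, rng=None):
    """Return a random integer between the two bounds given as "min,max"."""
    rng = rng or random.Random()
    try:
        low_text, high_text = extra_parameter.split(",")
        low, high = int(low_text), int(high_text)
    except (AttributeError, ValueError):
        # Wrong format: empty string for String columns, 0 for numeric columns.
        return "" if isinstance(input_value, str) else 0
    if low > high:
        low, high = high, low  # assumption: tolerate reversed bounds
    result = rng.randint(low, high)
    return str(result) if isinstance(input_value, str) else result


print(50 <= generate_between(24, "50,99") <= 99)  # True
print(generate_between(24, "oops"))               # 0
```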

Numeric variance

This function varies the input numeric value, based on the percentage
specified as an extra parameter.

This function applies only to numeric data types: Integer, Long, Float and Double.

Option Description
Extra parameter This function requires an extra parameter.

The extra parameter must be a number that represents a percentage
of modification. The function varies the input value by a random
percentage chosen between the parameter and its opposite.

For example, if the input is 100 and the parameter is 10, then the generated value will be a
randomly selected value between 90 (100 –
10%) and 110 (100 + 10%).

If the extra parameter is 0, it will be replaced with 10.

If the input is null, then the
function will return 0.

In the following example, the masked value has been randomly selected
between 5 (10 – 50%) and 15 (10 + 50%).

Input value Extra parameter Example of a masked value
10 “50” 7
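A minimal sketch of the rules above, with a hypothetical function name `numeric_variance`. Rounding the result for integer inputs is an assumption suggested by the example, where 10 becomes 7.

```python
import random


def numeric_variance(value, percent, rng=None):
    """Vary the value by a random percentage between -percent and +percent."""
    rng = rng or random.Random()
    if value is None:
        return 0
    if percent == 0:
        percent = 10  # a parameter of 0 is replaced with 10
    rate = rng.uniform(-abs(percent), abs(percent))
    varied = value * (1 + rate / 100)
    # Keep integer inputs as integers (assumption based on the example).
    return round(varied) if isinstance(value, int) else varied


print(5 <= numeric_variance(10, 50) <= 15)  # True
```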

Bank account generation functions

You can generate bank account numbers.

To mask bank account numbers by keeping the original country and using the Format-Preserving
Encryption, use the Bank account masking function.

Function Random generation Consistent generation Bijective generation Input data validation
Generate account number Yes No No No
Generate account number and keep original country Yes No No No

Generate account number

This function generates a valid French bank account number.

This function only applies on String values.

This function requires no extra
parameter.

A French IBAN number is a 27-character code. The digits are randomly generated, then
completed by algorithms: the last two digits of the IBAN, known as the “clef RIB”,
and the third and fourth characters of the IBAN (the check digits) are each
computed through an algorithm.

In the following example, the masked value is a French IBAN number, regardless of the
input value.

Input value | Example of a masked value
A26 | FR76 3000 6000 0112 3456 7890 189
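
The two computed parts of a French IBAN can be sketched in Java as follows. The sketch uses the standard published formulas (the RIB key formula and the ISO 13616 mod-97 check). That the component uses exactly these formulas is an assumption, and the class and method names are hypothetical. For simplicity, the account number is assumed to contain digits only.

```java
import java.math.BigInteger;

// Hypothetical sketch: the computed parts of a French IBAN.
final class FrenchIbanSketch {
    /** "Clef RIB" for an all-digit bank code (5), branch code (5) and account number (11). */
    static int clefRib(String bank, String branch, String account) {
        long b = Long.parseLong(bank);
        long g = Long.parseLong(branch);
        long c = Long.parseLong(account);
        return (int) (97 - ((89 * b + 15 * g + 3 * c) % 97));
    }

    /** ISO 13616 check digits (characters 3 and 4 of the IBAN) for a country code and a BBAN. */
    static String ibanCheckDigits(String country, String bban) {
        // Move the country code to the end and use "00" as placeholder check digits.
        String rearranged = bban + country + "00";
        StringBuilder numeric = new StringBuilder();
        for (char ch : rearranged.toCharArray()) {
            // Digits map to 0-9 and letters to 10-35 (A = 10 ... Z = 35).
            numeric.append(Character.getNumericValue(ch));
        }
        int remainder = new BigInteger(numeric.toString())
                .mod(BigInteger.valueOf(97)).intValue();
        return String.format("%02d", 98 - remainder);
    }
}
```

Applied to the masked value of the example above, `clefRib("30006", "00001", "12345678901")` gives 89 and `ibanCheckDigits("FR", "30006000011234567890189")` gives 76, which are the last two digits and the check digits of FR76 3000 6000 0112 3456 7890 189.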

Generate account number and keep original country

This function generates a valid bank account number for the
original country.

If the input is a correct IBAN number, the function generates an IBAN number
from the same country as the input value. The function takes into account the IBAN
format, which differs from one country to another.

If the input value is a correct US account number, the function keeps the first
nine digits and randomly masks the other digits.

If the input value is not a correct account number, the function generates
a valid French IBAN number.

In the first example, the input value is not a correct account number, so the
masked value is a valid French IBAN number.

In the second example, the input value is a correct US account number, so the masked
value is also a correct US account number that keeps the first nine digits.

Input value | Example of a masked value
1234567890 | FR76 3000 1007 9412 3456 7890 185
091000019 6564833713 | 091000019 3602742991

Credit card generation functions

You can generate credit card numbers.

To mask credit card numbers by using the Format-Preserving Encryption, use the Credit Card masking functions.

Function | Random generation | Consistent generation | Bijective generation | Input data validation
Generate credit card | Yes | No | No | No
Generate credit card and keep original bank | Yes | No | No | No

Generate credit card

This function generates a valid credit card number.

This function requires no extra
parameter.

This function applies on String or Long values.

Three types of credit card can be generated:

  • Visa
  • MasterCard
  • American Express

One type is randomly chosen and a credit card number is randomly generated. Then,
the generated credit card number passes algorithms that detect false credit card
numbers.

In the following example, the masked value is a valid Visa credit card number,
regardless of the input value.

Input value Example of a masked value
A26 4346065537027896

Generate credit card and keep original bank

If the input value is a correct Visa, MasterCard or American Express
credit card number, this function generates a credit card number from the same company
and keeps the prefix.

This function applies on String or Long values.

This function requires no extra
parameter.

The generated credit card number passes algorithms that detect false
credit card numbers.

In the following example, the input value is a valid American Express credit card
number. The masked value is also a valid American Express credit card number.

Input value Example of a masked value
346992550391727 348482709815527

Data generation functions

You can generate output data different from the input data.

Function | Random generation | Consistent generation | Bijective generation | Input data validation
Generate from pattern | Yes | No | No | No
Generate UUID | Yes | No | No | No
Generate sequence | Yes | No | No | No
Generate from file/list | Yes | Yes | No | No

Generate from pattern

This function generates a value based on a user-defined
pattern.

This function is applied only on Strings.

Option Description
Extra parameter This function requires an extra parameter.

The extra parameter is a pattern that
follows these rules:

  • A is replaced with a
    random Latin uppercase letter.
  • a is replaced with a
    random Latin lowercase letter.
  • 9 is replaced with a
    random digit.
  • H is replaced with a
    random Hiragana character.
  • K is replaced with a
    random full-width Katakana character.
  • k is replaced with a
    random half-width Katakana character.
  • C is replaced with a
    random Kanji character.
  • G is replaced with a
    random Hangul character.

All other characters are copied to the generated value
as is.

For more information about the supported
character types and the related Unicode ranges, see Data masking functions in the masking components.

You can also use
numbered backreferences (\<number>) using the following syntax: <pattern>\<number>,<group1>,<groupN>.

  • <pattern>
    corresponds to the pattern to be used for generating the output
    value.
  • \<number>
    is a numbered backreference. <number> identifies the position of the group
    placed after the “,” character.
  • <group1>,<groupN> are comma-separated
    groups of characters. Each group is treated as a single unit. If a
    backreference calls a group, it is added as is in the generated
    value.

If you want to copy a character used in patterns
(A, a, 9, H, h, K, k, C, G) as is in the generated value, use a
backreference.

This function does not work correctly if a comma ‘,’ is used in the
pattern.

In the following example:

  • a characters are replaced with random
    Latin lowercase letters.
  • s characters are not masked in the
    generated output.
  • \2 calls the group placed after the
    second “,” character, which is @talend.com.

Input value | Extra parameter | Example of a masked value
A26 | “aaaass\2,@gmail.com,@talend.com” | hjdfss@talend.com

In the following example:

  • \3 calls the group placed after the
    third “,” character, which is a.
  • 9 characters are masked with random
    digits.

Input value | Extra parameter | Example of a masked value
A26 | “\39999,D,Z,a” | a4825
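
The pattern rules and the backreference syntax can be sketched in Java as follows. This is a simplified illustration with hypothetical names. It handles only A, a and 9, so the Hiragana, Katakana, Kanji and Hangul placeholders are copied as is, unlike in the component. It also assumes that a backreference number is a single digit, as in the two examples above.

```java
import java.util.Random;

// Hypothetical sketch: generate a value from "<pattern>,<group1>,<groupN>".
final class PatternSketch {
    static String generate(String extraParameter, Random rng) {
        // The first element is the pattern; the following ones are the groups.
        String[] parts = extraParameter.split(",");
        String pattern = parts[0];
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            boolean backreference = c == '\\' && i + 1 < pattern.length()
                    && Character.isDigit(pattern.charAt(i + 1));
            if (backreference) {
                // \N copies group N as is (assumption: N is one digit).
                int group = pattern.charAt(i + 1) - '0';
                if (group >= 1 && group < parts.length) out.append(parts[group]);
                i++; // skip the digit that was just consumed
            } else if (c == 'A') {
                out.append((char) ('A' + rng.nextInt(26)));
            } else if (c == 'a') {
                out.append((char) ('a' + rng.nextInt(26)));
            } else if (c == '9') {
                out.append((char) ('0' + rng.nextInt(10)));
            } else {
                out.append(c); // all other characters are copied as is
            }
        }
        return out.toString();
    }
}
```

With the extra parameter of the first example, the result is four random lowercase letters followed by ss@talend.com. Because the parameter is split on commas, a comma inside the pattern itself would be read as a group separator, which is consistent with the limitation noted above.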

Generate UUID

This function masks the input value with a randomly generated
universally unique identifier (UUID).

This function uses the UUID.randomUUID() method
provided by Java. This Java method does not use a seed, meaning that if you run the job
twice, the function generates different UUIDs.

This function is applied on Strings.

This function requires no extra
parameter.

In the following example, the masked value is a randomly generated UUID.

Input value Example of a masked value
A26 28e92000-aafa-4ec3-bd56-240f192a4a8c
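
A minimal Java sketch of this behavior is shown below. The class name is hypothetical; `UUID.randomUUID()` is the standard Java method mentioned above.

```java
import java.util.UUID;

// Hypothetical sketch: replace any input with a random UUID.
final class UuidSketch {
    static String mask(String input) {
        // The input value is ignored. No seed is involved, so two runs
        // produce different identifiers.
        return UUID.randomUUID().toString();
    }
}
```

Calling `UuidSketch.mask("A26")` twice returns two different 36-character strings.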

Generate sequence

This function returns the extra parameter for the first row and increases
this number by 1 for each following row.

This function can be applied to all data types (Integer, Long, String, etc.)
except Date.

Note: This function is not supported in the Spark version of the component.

Option Description
Extra parameter This function requires an extra parameter.

The extra parameter must be a number.

If the extra
parameter is not a number, it is set to 0.

In the following example, the generated sequence starts with the number
set as an extra parameter.

Input value | Extra parameter | Example of a masked value
21 | “0” | 0
A48 | “0” | 1
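
The sequence behavior can be sketched in Java as follows. The class name is hypothetical, and the use of a long counter is an assumption.

```java
// Hypothetical sketch: a counter that starts at the extra parameter.
final class SequenceSketch {
    private long next;

    SequenceSketch(String extraParameter) {
        long start;
        try {
            start = Long.parseLong(extraParameter.trim());
        } catch (NumberFormatException | NullPointerException e) {
            start = 0; // a parameter that is not a number is set to 0
        }
        this.next = start;
    }

    /** Returns the current number and moves to the next one; the input is ignored. */
    long mask(Object ignoredInput) {
        return next++;
    }
}
```

With the extra parameter “0”, the first call to `mask` returns 0 and the second returns 1, as in the example above.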

Generate from file/list

This function randomly replaces the input value with one of the
user-defined values.

This function is applied to Strings or numerical data types.

Option Description
Method The Randomly method randomly selects the value
from the list (or file). As a result, two identical input values can be masked
with different output values.

The Consistently
method ensures that two identical input values are masked with the same output
value.

When using the
Consistently method, the probability of
generating duplicates can be calculated using the following formulas:

  • P = 1 if K < N, or
  • P = 1 - (K*(K-1)*(K-2)*…*(K-N+1)) / K^N otherwise,

where P is the probability of generating duplicates, N is the input data size and
K is the size of the input list given as a parameter.

Using this approach, it is possible to
calculate the probability to find a pair sharing the same value within a
group.

For example, the probability that, in a group
of n people, two people have the same birthday is
the following:

  • 2.7% in a group of 5 people,
  • 41.1% in a group of 20 people,
  • 100% in a group of 367 people, since there are 366
    possible birthdays, including February 29.
Extra parameter This function requires an extra parameter.

The extra parameter can be:

  • a comma-separated list of at least two values; or
  • a path to a file containing the values.

The values must be stored in a String and
separated by commas, for example: "item1, item2,
item3, etc.". This function uses the hashCode() method provided by Java to choose
an element from the list.

If you use the Apache Spark Batch or the Apache Spark Streaming
version of the component, enter the prefix before the file path:

  • prefix://file path, even if you run the Job in local mode,
    or
  • hdfs://hdpnameservice1/file path if the file is on a
    cluster.

Paths to folders are not supported.

If the extra parameter is not set, the function returns an empty String or
0.

In the following example, the masked value is one of the values set as extra
parameters.

Input value Method Extra parameter Examples of a masked value
21 Randomly “help,documentation” help
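
The duplicate probability given for the Consistently method can be computed with a short Java sketch. The class and method names are hypothetical; the formula is the one stated above, evaluated factor by factor to avoid overflow.

```java
// Hypothetical sketch: probability of at least one duplicate among N values
// drawn from a list of K elements.
final class DuplicateProbabilitySketch {
    static double probability(int n, int k) {
        // More inputs than list elements: a duplicate is certain.
        if (k < n) return 1.0;
        // K*(K-1)*...*(K-N+1) / K^N, computed as a product of N ratios.
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            allDistinct *= (double) (k - i) / k;
        }
        return 1.0 - allDistinct;
    }
}
```

With K = 366 possible birthdays, `probability(5, 366)` is about 0.027 and `probability(20, 366)` is about 0.411, in line with the percentages quoted above, and `probability(367, 366)` is exactly 1.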

Phone number generation functions

You can generate French, German, Japanese, UK and US phone
numbers.

To mask phone numbers by using the Format-Preserving Encryption, use the Phone masking functions.

Function | Random generation | Consistent generation | Bijective generation | Input data validation
Generate French phone number | Yes | No | No | No
Generate German phone number | Yes | No | No | No
Generate Japanese phone number | Yes | No | No | No
Generate UK phone number | Yes | No | No | No
Generate US phone number | Yes | No | No | No

Generate French phone number

This function generates a valid French phone number, regardless of
the input value.

This function only applies on Strings.

This function requires no extra
parameter.

Input value | Example of a masked value
A26 | +33 307066271

Generate German phone number

This function generates a valid German phone number, regardless of
the input value.

This function only applies on Strings.

This function requires no extra
parameter.

Input value | Example of a masked value
A26 | 030 30748511

Generate Japanese phone number

This function generates a valid Japanese phone number, regardless
of the input value.

This function only applies on Strings.

This function requires no extra
parameter.

Input value Example of a masked value
A26 03-2419-1781

Generate UK phone number

This function generates a valid UK phone number, regardless of the
input value.

This function only applies on Strings.

This function requires no extra
parameter.

Input value | Example of a masked value
A26 | 020 3705 5907

Generate US phone number

This function generates a valid US phone number, regardless of the
input value.

This function only applies on Strings.

This function requires no extra
parameter.

Input value Example of a masked value
A26 527-708-5526

Social Security Number (SSN) generation functions

You can generate Social Security Numbers.

To mask SSNs by using the Format-Preserving Encryption, use the Social Security Number (SSN) masking functions.

Function | Random generation | Consistent generation | Bijective generation | Input data validation
Generate French SSN number | Yes | No | No | No
Generate German SSN number | Yes | No | No | No
Generate Japanese SSN number | Yes | No | No | No
Generate UK SSN number | Yes | No | No | No
Generate US SSN number | Yes | No | No | No
Generate Chinese SSN number | Yes | No | No | No
Generate Indian SSN number | Yes | No | No | No

Generate French SSN number

This function generates a valid French social security number,
regardless of the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 2760774865895 37

Generate German SSN number

This function generates a valid German social security number,
regardless of the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 96918234144

Generate Japanese SSN number

This function generates a valid Japanese social security number,
regardless of the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 680917875625

Generate UK SSN number

This function generates a valid UK social security number, regardless of
the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 BY 15 61 20 D

Generate US SSN number

This function generates a valid US social security number, regardless of
the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 437-02-2223

Generate Chinese SSN number

This function generates a valid Chinese social security number,
regardless of the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 653024204001080102

Generate Indian SSN number

This function generates a valid Indian social security number,
regardless of the input value.

This function only applies on Strings.

Input value Example of a masked value
A26 142543864863

Bank account masking function

You can mask IBAN and US bank account numbers.

Function | Random masking | Consistent masking | Format-preserving encryption | Input data validation
Mask account number and keep original country | No | No | Yes | Yes

This function applies on String values.

Two methods are available: FF1 with AES
and FF1 with SHA-2. This function requires no
alphabet and no extra parameter.

If the input is a valid IBAN number, the function masks it with an IBAN number
from the same country. The function takes into account the IBAN format, which
differs from one country to another.

If the input is a valid US account number, the function masks all digits.

If the input is neither a valid IBAN number nor a valid US account number and there is:

  • no “Invalid” output flow, the function returns null in the main flow.
  • an “Invalid” output flow, the corresponding data are sent to the
    “Invalid” output flow.

The following table describes the validity of the bank account numbers and the logic
applied to them.

Bank account number | Logic | Valid if…
IBAN number | The whole string is verified. The first two characters correspond to the country ISO code. They remain the same. The first two digits are generated through a checksum algorithm. For French and Monegasque IBAN numbers, the last two digits, known as the “clef RIB”, are generated through an algorithm. | The format and the checksum are valid. The “clef RIB” is valid (applicable to French and Monegasque IBAN numbers).
US | The first nine digits are verified. |

To verify whether the format of an IBAN number is valid or not, you can refer
to this IBAN registry.

In the following example, the Keep format check box is selected to preserve the
spaces from the input value.

Input value | Method | Example of masked value
SV43ACAT00000000000000123123 | FF1 with SHA-2 | SV53FAGI78247154681080694193
FR49 2867 2609 7580 N16P 4ZFM V39 | FF1 with AES | null (Cause: Invalid IBAN number)
159 753 321 16 | FF1 with SHA-2 | 607 503 340 92
4556156203746391 | FF1 with AES | null (Cause: Invalid bank account number)
RO49 AAaA 1b31 1000 9344 0000 | FF1 with SHA-2 | null (Cause: Lowercase letters)
ST23000200000289355710148 | FF1 with AES | ST30061989350589302375875

The given outputs are valid bank account numbers.

Address masking functions

You can mask addresses.

This function only applies on Strings.

Function | Random masking | Consistent masking | Format-preserving encryption | Input data validation
Address masking | Yes | No | No | No

This function masks digits with other digits and other
characters with X.

The following case-insensitive keywords will not be masked in
the output: ALLEE, ALLEY, ALLÉE, AREA, AUFFAHRT, AV, AV., AVDA, AVE,
AVE., AVENIDA, AVENUE, BACKROAD, BANLIEUE, BD, BD., BLV, BLV., BLVD, BOULEVARD,
BREVE, BULEVAR, BVD, BVD., BYWAY, CALLE, CAMINHO, CAMINO, CARREFOUR, CARREGGIATA,
CARRETERA, CHAUSSEE, CHAUSSÉE, CHEMIN, CITE, CITÉ, CORTO, COUR, COURT, CRT, CT, CT.,
CURTO, DR, DR., DRIVE, DRIVEWAY, ESD, ESPLANADA, ESPLANADE, ESTRADA, FAUBOURG,
FORUM, FREEWAY, GLEIS, HIGHWAY, HWY, IMPASSE, INDUSTRIAL, INDUSTRIALE, INDUSTRIELLE,
KURZ, LANE, LUNGOMARE, MANEIRA, MODO, PARKWAY, PARVIS, PASSAGE, PASSERELLE,
PERIFERIA, PERIFERICO, PERIFÉRICO, PERIPHERAL, PERIPHERIQUE, PIAZZA, PISTA, PL, PL.,
PLACE, PLATZ, PLAZA, PONT, PORTE, PROMENADE, PERIPHERIQUE, PÉRIPHÉRIQUE, QUADRADO,
QUAI, R, R., RD, RD., ROAD, ROUTE, RTE, RUA, RUE, SQUARE, ST, ST., STD, STR, STRADA,
STRASSE, STREET, SUBURB, SUBURBIO, SUBÚRBIO, TERRASSE, TRACK, UBER, VIA, VIALE,
VILLA, VLE, VOIE, VORORT, VÍA, WAY, WEG, ZONA, ZONE, ÁREA, ÜBER.

Note: This list is not exhaustive.

Option Description
Extra parameter The optional extra parameter can be:

  • a comma-separated list of at least two keywords
  • a path to a file containing keywords

Those keywords are added to the default list and will not be masked in
the output.

In the first example, the extra parameter is not set. The word “venelle”
is not part of the list of keywords. As a result, this word is masked in the
output.

In the second example, “venelle”
is added to the list of keywords. As a result, this word is not masked in the
output.

Input value | Extra parameter | Example of a masked value
3 venelle Artémis | “” | 5 XXXXXXX XXXXXXX
3 venelle Artémis | “venelle,enceinte” | 6 venelle XXXXXXX
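
The address masking rule can be sketched in Java as follows. The names are hypothetical. The sketch assumes that words are separated by single spaces and that the keyword set is given in uppercase; the caller supplies the default keywords plus any keywords from the extra parameter.

```java
import java.util.Locale;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: mask an address word by word.
final class AddressMaskSketch {
    static String mask(String address, Set<String> upperCaseKeywords, Random rng) {
        String[] tokens = address.split(" ");
        StringBuilder out = new StringBuilder();
        for (int t = 0; t < tokens.length; t++) {
            if (t > 0) out.append(' ');
            String token = tokens[t];
            // Keywords are matched case-insensitively and copied as is.
            if (upperCaseKeywords.contains(token.toUpperCase(Locale.ROOT))) {
                out.append(token);
                continue;
            }
            for (char c : token.toCharArray()) {
                // Digits become random digits; every other character becomes X.
                out.append(Character.isDigit(c) ? (char) ('0' + rng.nextInt(10)) : 'X');
            }
        }
        return out.toString();
    }
}
```

With a keyword set that contains VENELLE, the input “3 venelle Artémis” is masked as one random digit, the word venelle and XXXXXXX. Without that keyword, venelle is also replaced with XXXXXXX.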

Email masking functions

You can mask email addresses.

Function | Random masking | Consistent masking | Format-preserving encryption | Input data validation
Mask email full domain by character | Yes | Yes | No | Yes
Mask email left part of domain by character | Yes | Yes | No | Yes
Mask email local part by character | Yes | Yes | No | Yes

Mask email local part

This function masks all characters before the @ character. Two
methods are available: By character and From a list of values.

This function only applies on Strings.

This function requires an extra parameter.

Option Description
Method When using the By character method, this function masks what comes before
the @ character with a character.

When using the From a list of values method, this function
masks what comes before the @ character with one of the values from the
specified list.

Extra parameter This function requires an extra parameter.

When using the By character method, the extra parameter must be a character.
If you specify an invalid extra parameter, like a string, a list, multiple
characters or a digit, all characters before the @ character will be masked
with X characters by default.

When using the From a list of values method, the
extra parameter can be a comma-separated list of values or a path to a file
containing a list of values. If you do not specify an extra parameter, all
characters before the @ character are removed.

In the first example, all characters before the @ character are masked
with the user-defined character.

In the second example, all characters before the @
character are masked with one of the values from the user-defined
list.

Input value | Method | Extra parameter | Example of masked value
johnsmith@company.com | By character | “p” | ppppppppp@company.com
johnsmith@company.com | From a list of values | “z,x,c,h” | xxxxxxxxx@company.com
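
The By character method for the local part can be sketched in Java as follows. The names are hypothetical, and returning an input without an @ character unchanged is an assumption, since that case is not described here.

```java
// Hypothetical sketch: mask the local part of an email address by character.
final class EmailMaskSketch {
    /** A valid extra parameter is exactly one character that is not a digit. */
    static boolean isValidMaskChar(String extraParameter) {
        return extraParameter != null
                && extraParameter.length() == 1
                && !Character.isDigit(extraParameter.charAt(0));
    }

    static String maskLocalPart(String email, String extraParameter) {
        int at = email.indexOf('@');
        if (at < 0) return email; // assumption: no @ character, nothing to mask
        // An invalid extra parameter falls back to X.
        char mask = isValidMaskChar(extraParameter) ? extraParameter.charAt(0) : 'X';
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < at; i++) {
            out.append(mask);
        }
        return out.append(email.substring(at)).toString();
    }
}
```

For example, `maskLocalPart("johnsmith@company.com", "p")` returns ppppppppp@company.com, while an invalid parameter such as "7" or "ab" yields XXXXXXXXX@company.com.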

Mask email full domain

This function masks what comes after the @ character. Two methods
are available: By character and From a list of values.

This function only applies on Strings.

Option Description
Method When using the By character method, this function masks what comes after
the @ character with a character.

When using the From a list of values method, this function
masks what comes after the @ character with one of the values from the
specified list.

Extra parameter This function requires an extra parameter.

When using the By character method, the extra parameter must be a character.
If you specify an invalid extra parameter, like a string, a list, multiple
characters or a digit, all characters after the @ character will be masked
with X characters by default.

When using the From a list of values method, the
extra parameter can be a comma-separated list of domains or a path to a file
containing a list of domains. If you do not specify an extra parameter, all
characters after the @ character are removed.

In the following example, all characters after the @ character are masked
with one of the values from the user-defined list.

Input value | Method | Extra parameter | Example of a masked value
johnsmith@company.com | From a list of values | “newtalend.com,newcompany.org” | johnsmith@newtalend.com

Mask email left part of domain

This function masks what comes between the @ character and the dot in
email addresses. Two methods are available: By character and From a list of values.

This function only applies on Strings.

Option Description
Method When using the By character method, this function masks what comes between
the @ character and the dot with a character.

When using the From a list of values method,
this function masks what comes between the @ character and the dot with one
of the values from the specified list.

Extra parameter This function requires an extra parameter.

When using the By character method, the extra parameter must be a character.
If you specify an invalid extra parameter, like a string, a list, multiple
characters or a digit, all characters between the @ character and the dot
will be masked with X characters by default.

When using the From a list of values method, the extra parameter can be a
comma-separated list of domains or a path to a file containing a list of domains.
If you do not specify an extra parameter, all characters between the @ character
and the dot are removed.

In the following example, all characters between the @ character and the
dot are masked with one of the values from the user-defined list.

Input value | Method | Extra parameter | Example of a masked value
johnsmith@company.com | From a list of values | “newtalend,talendforge” | johnsmith@newtalend.com

Credit Card masking functions

You can mask valid credit card numbers.

Function | Random masking | Consistent masking | Format-preserving encryption | Input data validation
Mask Credit Card and keep bank | No | No | Yes | Yes
Mask Credit Card | No | No | Yes | Yes

These functions:

  • apply on String values,
  • support all credit card types and
  • keep the original format of the credit card number. For example, if the input has 13
    digits, the output has 13 digits.

A credit card number is considered invalid when it does not satisfy the Luhn algorithm.
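
The Luhn check can be sketched in Java as follows. The class and method names are hypothetical, and ignoring spaces before the check is an assumption based on the Keep format examples in this section.

```java
// Hypothetical sketch: Luhn validation and check digit computation.
final class LuhnSketch {
    /** Returns true when the number (spaces ignored) satisfies the Luhn algorithm. */
    static boolean isValid(String cardNumber) {
        String digits = cardNumber.replace(" ", "");
        if (!digits.matches("[0-9]+")) return false;
        return luhnSum(digits, false) % 10 == 0;
    }

    /** Computes the last digit that makes payload + digit pass the Luhn check. */
    static int checkDigit(String payload) {
        // The rightmost payload digit is doubled, because the check digit
        // will occupy the undoubled last position.
        return (10 - luhnSum(payload, true) % 10) % 10;
    }

    /** Sums the digits from right to left, doubling every second one. */
    private static int luhnSum(String digits, boolean doubleRightmost) {
        int sum = 0;
        boolean doubleIt = doubleRightmost;
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) d -= 9; // same as adding the two digits of the product
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum;
    }
}
```

For example, `isValid("4570 5624 6978 6793")` returns true and `isValid("0123 4567 8987 6543 210")` returns false, and `checkDigit("434606553702789")` returns 6.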

If the input is an invalid credit card number and there is:

  • no “Invalid” output flow, the function returns null in the main flow.
  • an “Invalid” output flow, the corresponding data are sent to the “Invalid” output
    flow.

Mask Credit Card and keep bank

This function masks valid credit card numbers and keeps the Bank
Identification Number/Issuer Identification Number (BIN/IIN).

The output value is a valid credit card number.

This function:

  • keeps the first six digits,
  • masks the other digits and
  • generates the last digit using the Luhn algorithm.

Two methods are available: FF1 with AES and FF1 with
SHA-2. This function requires no alphabet and no extra parameter.

In the following example, the Keep format check box is selected to
preserve the spaces from the input value.

Credit card number | Method | Example of masked value
4570 5624 6978 6793 | FF1 with AES | 4570 5678 2786 4430
374140537770721 | FF1 with AES | 374140100455098
5168690988613241 | FF1 with SHA-2 | 5168699616108078
5158495805899854 | FF1 with SHA-2 | 5158494455420285
0123 4567 8987 6543 210 | FF1 with AES | null

Mask Credit Card

This function masks valid credit card numbers. The output value is a valid credit
card number.

This function:

  • masks all digits and
  • generates the last digit using the Luhn algorithm.

Two methods are available: FF1 with AES and FF1 with
SHA-2. This function requires no alphabet and no extra parameter.

In the following example, the Keep format check box is selected to
preserve the spaces from the input value.

Credit card number | Method | Example of masked value
4570 5624 6978 6793 | FF1 with AES | 4931 3744 4754 2072
374140537770721 | FF1 with AES | 749381687018333
5168690988613241 | FF1 with SHA-2 | 4138106541683084
5158495805899854 | FF1 with SHA-2 | 9641013768742255
0123 4567 8987 6543 210 | FF1 with AES | null

Phone masking functions

You can mask French, German, Japanese, UK and US phone
numbers.

Function | Random masking | Consistent masking | Format-preserving encryption | Input data validation
Mask French phone number | No | No | Yes | Yes
Mask German phone number | No | No | Yes | Yes
Mask Japanese phone number | No | No | Yes | Yes
Mask UK phone number | No | No | Yes | Yes
Mask US phone number | No | No | Yes | Yes

Mask French phone number

This function generates a unique random French phone number related to the
input.

This function masks the last six digits. Input values that contain at least six digits
are regarded as valid phone numbers.

If the input value is not valid, the function returns null.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: As the FF1 algorithms are stronger masking methods, it is recommended to use
them rather than the Basic method.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the Keep format check box is selected to preserve the
spaces from the input value.

Input value Method Example of masked value
02 40 99 90 99 FF1 with AES 02 40 89 78 01

Mask German phone number

This function generates a unique random German phone number related to the
input.

This function masks the last eight digits. Input values that contain at least eight
digits are regarded as valid phone numbers.

If the input value is not valid, the function returns null.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: As the FF1 algorithms are stronger masking methods, it is recommended to use
them rather than the Basic method.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the Keep format check box is selected to preserve the
dash from the input value.

Input value Method Example of masked value
636-48018 FF1 with AES 389-54922

Mask Japanese phone number

This function generates a unique random Japanese phone number related to the
input.

This function masks the last seven digits. Input values that contain at least seven
digits are regarded as valid phone numbers.

If the input value is not valid, the function returns null.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: As the FF1 algorithms are stronger masking methods, it is recommended to use
them rather than the Basic method.

This function can encrypt the output masked
values in the same format as the input values, using Format-Preserving Encryption (FPE)
methods:

  • FF1 with AES relies on the
    Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the
    secure hash function HMAC-256.

The FPE methods are bijective methods, except when
using tweaks.

The FF1 with AES and FF1 with SHA-2 methods
require a password to generate encrypted and repeatable masked values. Those FPE methods
do not use a seed.

You can specify this password in the
password for FF1 method field, from the
Advanced Settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the Keep format check box is selected to preserve the
dashes from the input value.

Input value Method Example of masked value
052-2451-4455 FF1 with AES 052-2970-7735

Mask UK phone number

This function generates a unique random UK phone number related to the
input.

This function masks the last seven digits. Input values that contain at least seven
digits are regarded as valid phone numbers.

If the input value is not valid, the function returns null.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.
Input value Method Example of masked value
02071231234 FF1 with AES 02074444306

Mask US phone number

This function generates a unique random US phone number related to the
input.

This function masks the last six digits. Input values that contain at least six digits
are regarded as valid phone numbers.

If the input value is not valid, the function returns null.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the Keep format check box is selected to preserve
the dash from the input value.

Input value Method Example of masked value
636-48018 FF1 with AES 389-54922

Social Security Number (SSN) masking functions

You can mask French, German, Japanese, UK, US, Chinese and Indian Social
Security Numbers.

Function Random masking Consistent masking Format-preserving encryption Input data validation
Mask French SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask German SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask Japanese SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask UK SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask US SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask Chinese SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png
Mask Indian SSN number tDataMasking_16.png tDataMasking_16.png tDataMasking_13.png tDataMasking_13.png

Mask French SSN number

This function generates a unique random French social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
171125612301521 FF1 with AES 113056322612896

Mask German SSN number

This function generates a unique random German social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
12123456123 FF1 with AES 04538250629

Mask Japanese SSN number

This function generates a unique random Japanese social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
123456789012 FF1 with AES 283950101162

Mask UK SSN number

This function generates a unique random UK social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
PP132459A FF1 with AES PC916049A

Mask US SSN number

This function generates a unique random US social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
153654862 FF1 with AES 828521191
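
The function returns null for invalid input, but the validity rules the component applies are not listed here. As an illustration only, the Python sketch below checks the publicly known structural rules for US SSNs (nine digits; area not 000, 666, or 900 to 999; group not 00; serial not 0000). The function name `is_plausible_us_ssn` is hypothetical, and tDataMasking may apply different checks.

```python
from typing import Optional


def is_plausible_us_ssn(value: Optional[str]) -> bool:
    """Structural check only; it cannot tell whether an SSN was ever issued."""
    if value is None:
        return False
    # Assumption: dashes and spaces are tolerated as separators.
    digits = value.replace("-", "").replace(" ", "")
    if len(digits) != 9 or not (digits.isascii() and digits.isdigit()):
        return False
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return False
    return group != "00" and serial != "0000"
```

Both the input value and the masked value from the example table above pass this check.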

Mask Chinese SSN number

This function generates a unique random Chinese social security
number related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
130503196704010012 FF1 with AES 510304190708135114

Mask Indian SSN number

This function generates a unique random Indian social security number
related to the input.

This function only applies on Strings.

If there are duplicates in the input data, you
will get the same duplicates in the masked values. In the same way, if there are no
duplicates in the input data, there will be no duplicates in the masked values.

If the input value is not valid, the function returns null.

Option Description
Method The default Basic method uses a proprietary
algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

This function can encrypt the output masked values in the same format as the
input values, using Format-Preserving Encryption (FPE) methods:

  • FF1 with AES relies on the Advanced Encryption Standard in CBC mode.
  • FF1 with SHA-2 relies on the secure hash function HMAC-256.

The FPE methods are bijective methods, except when using tweaks.

The FF1 with AES and FF1 with SHA-2 methods require a password to generate
encrypted and repeatable masked values. Those FPE methods do not use a seed.

You can specify this password in the Password for FF1 methods field, in the
Advanced settings of the component.

Extra parameter This function requires no extra
parameter.

In the following example, the input value is a valid SSN number. The masked value is
also a valid SSN number.

Input value Method Examples of masked value
186034828209 FF1 with AES 203307371407

Set to null

You can nullify values from the input data.

This function returns null.

Option Description
Method Not applicable
Extra parameter This function requires no extra
parameter.

In the following examples, the input values are nullified.

Input value Examples of masked value
Arthur null
09-05-2019 null

tDataMasking Standard properties

These properties are used to configure tDataMasking running in the Standard Job framework.

The Standard
tDataMasking component belongs to the Data Quality family.

The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected
in the Job.

Click Edit schema to make changes to the schema. If the current schema is of
the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

The output schema of this component contains read-only
columns:

  • TWEAK: Is generated when the
    Use tweaks with FF1 Encryption check box is
    selected. This column contains the tweak necessary to decrypt the data.
  • ORIGINAL_MARK: Identifies by true or false if the record
    is an original record or a substitute record respectively.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Modifications

Define in the table what fields to change and how to change
them:

Input Column: Select the column from
the input flow that contains the data to be masked.

These
modifications are based on the function you select in the Function column.

Category: select a category of masking functions from the list.

  • Character Handling
  • Data Handling
  • Number Handling
  • Bank Account Generation
  • Data Generation
  • Phone Number generation
  • SSN Generation
  • Bank Account Masking
  • Address Masking
  • Email Masking
  • Credit Card Masking
  • Phone Masking
  • SSN Masking
  • Set to null

Function: Select the function that
will hide or obfuscate the original data with substitutes. For example, you can replace
digits or letters with the substitute of your choice, replace values with synonyms from
an index file or nullify values.

The functions you can
select from the Function list depend
on the data type of the input column.

For example, if the column type is Long, you can use the Numeric variance
function. If the column type is String, the Numeric variance function is not
available. Also, the Function list for a Date column is date-specific: it lets
you decide the type of modification to apply to date values.

Method: Select the Basic method or one of the FF1 Format-Preserving Encryption
(FPE) algorithms, FF1 with AES or FF1 with SHA-2:

The Basic method is the default algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

The FF1 with AES method is based on the Advanced Encryption Standard in CBC
mode. The FF1 with SHA-2 method depends on the secure hash function HMAC-256.

Note: Java 8u161 is the minimum required version to use the FF1 with AES
method. To use this FPE method with Java versions earlier than 8u161, download
the Java Cryptography Extension (JCE) unlimited strength jurisdiction policy
files from the Oracle website.

The FF1 with AES and FF1 with SHA-2 methods require a password to be specified
in the Password for FF1 methods field of the Advanced settings to generate
unique masked values.

The Method list is only available for functions that use Format-Preserving
Encryption algorithms.

When using the Replace all, Replace characters between two positions, Replace n
first digits and Replace n last digits functions with FPE methods, you can
select an alphabet.

Characters that belong to the selected alphabet are masked with characters of
the same character type within the selected alphabet.

When selecting the Best guess alphabet, masked values contain characters from
all alphabets represented in the input values. Best guess is the default
alphabet.

Any unrecognized character is copied to the output as is.
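
The alphabet rule above (each character is replaced by one of the same character type, and unrecognized characters are copied as is) can be sketched in Python as follows. This is a simplified illustration, not the component's algorithm: it uses random substitution instead of FPE, and the `CHARACTER_TYPES` tuple and the `mask_same_type` function are assumptions modelling a basic Latin alphabet.

```python
import random
import string

# Assumption: a Latin alphabet split into three character types.
CHARACTER_TYPES = (string.ascii_lowercase, string.ascii_uppercase, string.digits)


def mask_same_type(value: str, rng: random.Random) -> str:
    """Replace each recognized character with one of the same type."""
    out = []
    for ch in value:
        for chars in CHARACTER_TYPES:
            if ch in chars:
                out.append(rng.choice(chars))
                break
        else:
            # Unrecognized character (punctuation, accents, spaces): copied as is.
            out.append(ch)
    return "".join(out)
```

For an input such as `Ab1-é 9z`, the dash, the accented letter and the space stay in place, while the other positions keep their case or digit type.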

Extra Parameter: This field is used by some of the functions and is disabled
when not applicable. When applicable, enter a number or a letter to determine
the behavior of the function you have selected.

Keep format: This option only applies to Strings. Select this check box to keep
the input format when using the Generate account number and keep original
country, Generate credit card number and keep original bank, Bank Account
Masking, Credit Card Masking, Phone Masking and SSN Masking functions or
categories. That is to say, if there are spaces, dots (‘.’), hyphens (‘-’) or
slashes (‘/’) in the input, those characters are kept in the output. If you
select this check box when using Phone Masking functions, the characters from
the input that are not digits are copied to the output as is.

Advanced settings

Password for FF1 methods

Set the password
required for the FF1 with AES and FF1 with SHA-2 methods to generate unique masked
values. If the password is not set, a random password is created at each Job execution.
When using the FF1 with AES and FF1 with SHA-2 methods and a password, the seed from
the Seed for random generator field is not
used.
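
The password behavior described above can be sketched in Python: a fixed password gives repeatable masked values, while an unset password is replaced by a random one at each execution. How the component actually derives its key is not documented here, so the PBKDF2 call, the fixed salt and the `ff1_key_from_password` name are assumptions made for illustration.

```python
import hashlib
import secrets
from typing import Optional


def ff1_key_from_password(password: Optional[str]) -> bytes:
    """Derive a 32-byte key; an unset password yields a new key on every call."""
    if not password:
        # No password set: a random one is created, so masked values
        # are not repeatable from one execution to the next.
        password = secrets.token_urlsafe(16)
    # Assumption: PBKDF2-HMAC-SHA256 with a fixed salt, so that the same
    # password always produces the same key.
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), b"illustrative-salt", 100_000, dklen=32
    )
```

Note that no seed appears anywhere in this derivation, which matches the statement that the seed is not used by the FF1 methods when a password is set.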

Use tweaks with FF1 Encryption

Select this check box to use tweaks. A unique tweak is generated for each
record and applies to all data of a record.

If bijective masking is necessary, do not use this functionality. For more
information about tweaks, see the data masking functions.
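
To illustrate why the tweak must be kept to decrypt the data, the Python sketch below mixes a per-record tweak into a toy keyed Feistel permutation over digit strings. It is not FF1 and not the component's code: `encrypt_digits`, `decrypt_digits` and the 8-byte tweak size are assumptions. With a different tweak per record, the same input value generally maps to different masked values in different records, so the data set no longer has a single one-to-one mapping between input and masked values.

```python
import hashlib
import hmac
import secrets


def _round(key: bytes, tweak: bytes, i: int, data: str, modulus: int) -> int:
    """Keyed round value that also depends on the record's tweak."""
    msg = tweak + bytes([i]) + data.encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def encrypt_digits(digits: str, key: bytes, tweak: bytes, rounds: int = 8) -> str:
    """Toy Feistel encryption of a digit string (length >= 2, even rounds)."""
    half = len(digits) // 2
    a, b = digits[:half], digits[half:]
    for i in range(rounds):
        m = len(a)
        c = (int(a) + _round(key, tweak, i, b, 10 ** m)) % (10 ** m)
        a, b = b, str(c).zfill(m)
    return a + b


def decrypt_digits(digits: str, key: bytes, tweak: bytes, rounds: int = 8) -> str:
    """Inverse of encrypt_digits; requires the same key and the same tweak."""
    half = len(digits) // 2
    a, b = digits[:half], digits[half:]
    for i in reversed(range(rounds)):
        m = len(b)
        c = (int(b) - _round(key, tweak, i, a, 10 ** m)) % (10 ** m)
        a, b = str(c).zfill(m), a
    return a + b


def new_record_tweak() -> bytes:
    """One random tweak per record, to be stored alongside the masked row."""
    return secrets.token_bytes(8)
```

In this sketch, the value returned by `new_record_tweak` plays the role of the TWEAK output column: without it, `decrypt_digits` cannot recover the original value.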

Seed for random generator

Set a random number if you want to generate
the same sample of substitute data in each execution of the Job. The seed is not set by
default.

If you do not set the seed, the component
creates a new random seed for each Job execution. Repeating the execution with a
different seed will result in a different sample being generated.
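
The effect of the seed can be shown with a minimal Python sketch using the standard `random` module. The `generate_sample` function is hypothetical and only mirrors the documented behavior: a fixed seed reproduces the same substitute sample at each run, and an unset seed gives a new sample each time.

```python
import random
from typing import List, Optional


def generate_sample(n: int, seed: Optional[int] = None) -> List[int]:
    """Generate n substitute values; the same seed gives the same values."""
    # With seed=None, Python picks a fresh seed from the operating system,
    # which corresponds to leaving the seed field empty.
    rng = random.Random(seed)
    return [rng.randint(0, 9999) for _ in range(n)]
```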

Encoding

Select the encoding from the list or select Custom and define it manually. If you select Custom and leave the field empty, the supported
encodings depend on the JVM that you are using. This field is compulsory for the file
encoding.

When you set Function to Generate from file/list, define the file path in Extra
Parameter.

Output the original row

Select this check box to output original data rows in addition to the
substitute data. Outputting both the original and substitute data can be useful in debug
or test processes.
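
As a rough illustration of this option, the Python sketch below emits each original row followed by its substitute and flags them with the ORIGINAL_MARK column. The `with_original_rows` function, the `mask_row` callback and the ordering of original and substitute rows are assumptions, not the component's documented internals.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]


def with_original_rows(rows: Iterable[Row], mask_row: Callable[[Row], Row],
                       output_original: bool = True) -> Iterator[Row]:
    """Yield substitute rows, optionally preceded by the original rows."""
    for row in rows:
        if output_original:
            yield {**row, "ORIGINAL_MARK": True}
        yield {**mask_row(row), "ORIGINAL_MARK": False}
```

With the option cleared, only the substitute rows (marked false) are produced.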

Should null input return null

This check box is selected by
default. When selected, the component outputs null when
input values are null. Otherwise, the component returns
the default value when the input is null, that is an
empty string for string values, 0 for numeric values
and the current date for date values.

If the input is null, the Generate Sequence function will not return null, even
if the check box is selected.
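
The null-handling rule can be summarized with the following Python sketch. The `handle_null` function and its parameters are hypothetical; the default values (empty string, 0, current date) are the ones stated above, and the Generate Sequence exception is not modelled.

```python
import datetime
from typing import Any, Callable

# Default values returned for null input when the check box is cleared.
DEFAULTS = {str: "", int: 0, float: 0.0}


def handle_null(value: Any, column_type: type, mask: Callable[[Any], Any],
                null_returns_null: bool = True) -> Any:
    """Apply `mask` to non-null values; handle null according to the option."""
    if value is None:
        if null_returns_null:
            return None
        if column_type is datetime.date:
            return datetime.date.today()
        return DEFAULTS[column_type]
    return mask(value)
```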

Should empty input return empty

When this check box is selected, empty values are left unchanged in
the output data. Otherwise, the selected functions are applied to the input
data.

Send invalid data to “Invalid” output flow

This check box is selected by default.

  • Selected: When the data can be masked, they are sent to the
    main flow. Otherwise, the data are sent to the “Invalid” output flow.
  • Cleared: The data are sent to the main flow.

The data are considered invalid when:

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component is an intermediary step. It requires input and output flows.

Altering data values to restrict the use of actual sensitive data

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

With the tDataMasking component, you can replace
sensitive information such as credit card or social security numbers with realistic values,
allowing production data to be safely used for purposes such as testing and training.

This scenario describes a Job which uses:

  • the tFixedFlowInput component to generate
    personal data including credit card numbers,

  • the tDataMasking component to hide specific
    original data with random characters or figures,

  • the tFileOutputExcel component to output the
    substitute data set.

tDataMasking_221.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataMasking and
    tFileOutputExcel.
  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its
    Basic settings view in the Component tab.

    tDataMasking_222.png

  2. Create the schema through the Edit Schema
    button.

    tDataMasking_223.png

    In the open dialog box, click the [+] button
    and add the columns that will hold the initial input data.
  3. Click OK.
  4. In the Number of rows field, enter
    1.
  5. In the Mode area, select the Use Inline Content option.
  6. In the Content table, enter the customer data
    you want to replace with realistic values, for example:

Replacing actual data with realistic values

  1. Double-click tDataMasking to display the
    Basic settings view and define the component
    properties.

    tDataMasking_224.png

  2. If required, click Sync columns to retrieve
    the schema defined in the input component.
  3. Click the Edit schema button to open the
    schema dialog box.

    tDataMasking proposes one predefined
    read-only column as shown in the below capture.
    tDataMasking_225.png

    This column identifies by true or false if the
    output record is an original or a substitute record respectively.
  4. Move any of the input columns to the output schema if you want to show them in
    the results, click OK and accept to propagate
    the changes.
  5. In the Modifications table, click the [+] button to add four rows, and
    perform the following actions:

    • In the Input Column, select the
      columns whose content you want to substitute.
    • In the Category column, select
      from the list the category to which the function you want to use to
      mask data belongs.
    • In the Function column, select
      from the list the function you want to use to mask data.
    • When available, in the Method
      column, select from the list the method to be used by the function to
      mask data.
    • When available, in the Extra Parameter
      column, enter a value, a pattern or a path to be used by the function to
      mask data.
    In this example, the Job generates
    inauthentic credit card numbers, replaces the first three letters of first names,
    replaces last names with names from a local file and replaces the local part in
    email addresses with X characters.
  6. Click the Advanced settings tab and select
    the Output the original row check box.

    The Job will add the original data rows to the substitute data.
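
Three of the four modifications configured in this scenario can be approximated outside the Studio with the Python sketch below. It only illustrates the kind of transformation involved: the function names are hypothetical, the random substitution is not the component's algorithm, and the assumption that generated card numbers satisfy the Luhn checksum is not confirmed by this documentation.

```python
import random
import string


def luhn_check_digit(partial: str) -> str:
    """Compute the Luhn check digit for a digit string without its check digit."""
    total = 0
    for pos, ch in enumerate(reversed(partial)):
        d = int(ch)
        if pos % 2 == 0:  # every second digit, starting next to the check digit
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def generate_card_number(rng: random.Random, length: int = 16) -> str:
    """Generate an inauthentic card number with a valid Luhn check digit."""
    partial = "".join(rng.choice(string.digits) for _ in range(length - 1))
    return partial + luhn_check_digit(partial)


def replace_first_letters(value: str, n: int, rng: random.Random) -> str:
    """Replace the first n letters with random letters of the same case."""
    head = []
    for ch in value[:n]:
        if ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            head.append(rng.choice(pool))
        else:
            head.append(ch)
    return "".join(head) + value[n:]


def mask_email_local_part(email: str, char: str = "X") -> str:
    """Replace every character before the @ sign with `char`."""
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # assumption: values without an @ sign are left unchanged
    return char * len(local) + sep + domain
```

For example, `mask_email_local_part("jane.doe@example.com")` returns `XXXXXXXX@example.com`. Replacing last names from a local file is omitted because it depends on the file content.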

Configuring the output component and executing the Job

  1. Double-click the tFileOutputExcel component
    to display the Basic settings view and define
    the component properties.

    tDataMasking_226.png

  2. Set the destination file name as well as the sheet name and then select the
    Define all columns auto size check box.
  3. Save your Job and press F6 to execute
    it.

    The tDataMasking component substitutes data
    in the selected columns and writes the result in an output file.
  4. Right-click the output component and select Data Viewer
    to display the original and substituted data.

    tDataMasking_227.png

    tDataMasking outputs original and substitute
    rows marked respectively with true and false in the
    ORIGINAL_MARK column. It generates inauthentic credit
    card numbers, replaces the first three letters of first names, replaces last
    names with names from a local file and finally replaces the part before the @
    sign in email addresses by the names defined in the component basic
    settings.
    Sensitive personal information in the input data has been “hidden”, but the
    data still looks real and consistent. The substitute data is still usable for
    purposes other than production.

tDataMasking properties for Apache Spark Batch

These properties are used to configure tDataMasking running in the Spark Batch Job framework.

The Spark Batch
tDataMasking component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected
in the Job.

Click Edit schema to make changes to the schema. If the current schema is of
the Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

The output schema of this component contains read-only
columns:

  • TWEAK: Is generated when the
    Use tweaks with FF1 Encryption check box is
    selected. This column contains the tweak necessary to decrypt the data.
  • ORIGINAL_MARK: Identifies by true or false if the record
    is an original record or a substitute record respectively.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Modifications

Define in the table what fields to change and how to change
them:

Input Column: Select the column from
the input flow that contains the data to be masked.

These
modifications are based on the function you select in the Function column.

Category: select a category of masking functions from the list.

  • Character Handling
  • Data Handling
  • Number Handling
  • Bank Account Generation
  • Data Generation
  • Phone Number generation
  • SSN Generation
  • Bank Account Masking
  • Address Masking
  • Email Masking
  • Credit Card Masking
  • Phone Masking
  • SSN Masking
  • Set to null

Function: Select the function that
will hide or obfuscate the original data with substitutes. For example, you can replace
digits or letters with the substitute of your choice, replace values with synonyms from
an index file or nullify values.

The functions you can
select from the Function list depend
on the data type of the input column.

For example, if the column type is Long, you can use the Numeric variance
function. If the column type is String, the Numeric variance function is not
available. Also, the Function list for a Date column is date-specific: it lets
you decide the type of modification to apply to date values.

Method: Select the Basic method or one of the FF1 Format-Preserving Encryption
(FPE) algorithms, FF1 with AES or FF1 with SHA-2:

The Basic method is the default algorithm.

Note: Because the FF1 methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

The FF1 with AES method is based on the Advanced Encryption Standard in CBC
mode. The FF1 with SHA-2 method depends on the secure hash function HMAC-256.

Note: Java 8u161 is the minimum required version to use the FF1 with AES
method. To use this FPE method with Java versions earlier than 8u161, download
the Java Cryptography Extension (JCE) unlimited strength jurisdiction policy
files from the Oracle website.

The FF1 with AES and FF1 with SHA-2 methods require a password to be specified
in the Password for FF1 methods field of the Advanced settings to generate
unique masked values.

The Method list is only available for functions that use Format-Preserving
Encryption algorithms.

When using the
Replace all, Replace
characters between two positions
, Replace n first
digits
and Replace n last digits with FPE methods, you can select
an alphabet.

Characters that
belong to the selected alphabets are masked with characters from the same character type
within the selected alphabet.

When selecting the Best guess alphabet, masked values contain characters
from all alphabets represented in the input values. Best guess is the default alphabet.

Any unrecognized character is copied to the output as is.
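
As a rough illustration of the alphabet behavior described above, the following Python
sketch replaces each character with a random character of the same type and copies
anything it does not recognize. It is not the FF1 algorithm and not the component's
implementation: the function name replace_all_same_type and the three ASCII character
classes are assumptions made for this example.

```python
import random
import string


def replace_all_same_type(value, rng):
    """Replace each recognized character with a random one of the same type."""
    masked = []
    for ch in value:
        if ch.isascii() and ch.isdigit():
            masked.append(rng.choice(string.digits))
        elif ch.isascii() and ch.isupper():
            masked.append(rng.choice(string.ascii_uppercase))
        elif ch.isascii() and ch.islower():
            masked.append(rng.choice(string.ascii_lowercase))
        else:
            # Unrecognized characters are copied to the output as is.
            masked.append(ch)
    return "".join(masked)


# Hypothetical input: letters stay letters, digits stay digits,
# and the hyphen and the space are left untouched.
print(replace_all_same_type("AB-12 cd", random.Random()))
```

The output keeps the length and the shape of the input, which is the property that
makes the substitute usable in tests that validate field formats.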

Extra Parameter: This field is used by some of the functions and is disabled when not
applicable. When applicable, enter a number or a letter to determine the behavior of the
function you have selected.

Keep format: This option applies only to Strings. Select this check box to keep the
input format when using the Generate account number and keep original country, Generate
credit card number and keep original bank, Bank Account Masking, Credit Card Masking,
Phone Masking and SSN Masking functions or categories. That is to say, if there are
spaces, dots (‘.’), hyphens (‘-’) or slashes (‘/’) in the input, those characters are
kept in the output. If you select this check box when using Phone Masking functions, the
characters from the input that are not numbers are copied to the output as is.
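
The effect of Keep format can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function name mask_digits is hypothetical,
digits are replaced with plain random digits rather than with the component's masking
functions, and dropping the separators when the option is off is an assumption, since
the text above only describes what happens when the check box is selected.

```python
import random

# Separators named in the description: space, dot, hyphen and slash.
SEPARATORS = set(" .-/")


def mask_digits(value, rng, keep_format=True):
    """Replace every digit with a random digit, optionally keeping separators."""
    out = []
    for ch in value:
        if ch in SEPARATORS:
            if keep_format:
                out.append(ch)
            # Assumption: separators are dropped when keep_format is off.
        elif ch.isdigit():
            out.append(str(rng.randrange(10)))
        else:
            # Any other character is copied to the output as is.
            out.append(ch)
    return "".join(out)


rng = random.Random()
print(mask_digits("4111-1111 1111/1111", rng))                     # separators kept
print(mask_digits("4111-1111 1111/1111", rng, keep_format=False))  # digits only
```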

Advanced settings

Password for FF1 methods

Set the password
required for the FF1 with AES and FF1 with SHA-2 methods to generate unique masked
values. If the password is not set, a random password is created at each Job execution.
When using the FF1 with AES and FF1 with SHA-2 methods and a password, the seed from
the Seed for random generator field is not
used.

Use tweaks with FF1 Encryption

Select this
check box to use tweaks. A unique tweak is generated for each record and applies to
all data of a record.

If bijective masking is necessary, do not use this functionality. For more information
about tweaks, see the data masking functions.
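
To see why a per-record tweak and bijective masking do not go together, consider the toy
Python sketch below. It is deliberately not FF1: it shifts each digit by a keystream
derived with HMAC-SHA-256 from a password and a tweak, which is enough to show that the
same value generally masks differently under different tweaks and that the tweak is
needed to recover the original. The function name toy_mask and the sample tweaks are
assumptions made for the example.

```python
import hashlib
import hmac


def toy_mask(digits, password, tweak, decrypt=False):
    """Shift each digit by a keystream derived from the password and the tweak."""
    # Toy construction for illustration only; this is NOT the FF1 algorithm.
    digest = hmac.new(password.encode("utf-8"), tweak, hashlib.sha256).digest()
    sign = -1 if decrypt else 1
    shifted = []
    for position, digit in enumerate(digits):
        shift = digest[position % len(digest)] % 10
        shifted.append(str((int(digit) + sign * shift) % 10))
    return "".join(shifted)


value = "123456"
masked_a = toy_mask(value, "secret", b"record-1")
masked_b = toy_mask(value, "secret", b"record-2")

# The same input value is masked under two different tweaks.
print(masked_a, masked_b)
# Unmasking with the tweak used for masking restores the original digits.
print(toy_mask(masked_a, "secret", b"record-1", decrypt=True))
```

Because each record gets its own tweak, two records holding the same value no longer
share one masked value, so the mapping from original values to masked values stops
being one-to-one across the data set.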

Seed for random generator

Set a random number if you want to generate
the same sample of substitute data in each execution of the Job. The seed is not set by
default.

If you do not set the seed, the component
creates a new random seed for each Job execution. Repeating the execution with a
different seed will result in a different sample being generated.
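
The role of the seed can be illustrated with Python's standard random module. This is a
generic sketch of seeded pseudo-random generation, not the component's generator; the
function name generate_sample and the value range are assumptions.

```python
import random


def generate_sample(seed=None, size=5):
    """Return a list of pseudo-random numbers drawn from a seeded generator."""
    # With seed=None, Python picks a fresh seed, so each call differs,
    # which mirrors the behavior when the seed field is left empty.
    rng = random.Random(seed)
    return [rng.randrange(10_000) for _ in range(size)]


# Same seed: the same sample is generated on every execution.
print(generate_sample(42) == generate_sample(42))
# No seed: a new sample is generated each time.
print(generate_sample(), generate_sample())
```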

Encoding

Select the encoding from the list or select Custom and define it manually. If you select Custom and leave the field empty, the supported
encodings depend on the JVM that you are using. This field is compulsory for the file
encoding.

When you set Function to Generate from file/list, define the file path in Extra
Parameter.

Output the original row

Select this check box to output original data rows in addition to the
substitute data. Outputting both the original and substitute data can be useful in debug
or test processes.

Should null input return null

This check box is selected by
default. When selected, the component outputs null when
input values are null. Otherwise, the component returns
the default value when the input is null, that is an
empty string for string values, 0 for numeric values
and the current date for date values.

Should empty input return empty

When this check box is selected, empty values are left unchanged in
the output data. Otherwise, the selected functions are applied to the input
data.

Send invalid data to “Invalid” output flow
This check box is selected by default.

  • Selected: When the data can be masked, they are sent to the
    main flow. Otherwise, the data are sent to the “Invalid” output flow.
  • Cleared: The data are sent to the main flow.

The data are considered invalid when:

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend data
integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark
cluster for the whole Job. In addition, since the Job expects its dependent jar files
for execution, you must specify the directory in the file system to which these jar
files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage configuration area in the
      Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.

tDataMasking properties for Apache Spark Streaming

These properties are used to configure tDataMasking running in the Spark Streaming Job framework.

The Spark Streaming
tDataMasking component belongs to the Data Quality family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema and Edit Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

Click Sync columns to retrieve the schema from the previous component connected in the
Job.

Click Edit schema to make changes to the schema. If the current schema is of the
Repository type, three options are available:

  • View schema: choose this
    option to view the schema only.

  • Change to built-in property:
    choose this option to change the schema to Built-in for local changes.

  • Update repository connection:
    choose this option to change the schema stored in the repository and decide whether
    to propagate the changes to all the Jobs upon completion. If you just want to
    propagate the changes to the current Job, you can select No upon completion and choose this schema metadata
    again in the Repository Content
    window.

The output schema of this component contains read-only
columns:

  • TWEAK: Is generated when the
    Use tweaks with FF1 Encryption check box is
    selected. This column contains the tweak necessary to decrypt the data.
  • ORIGINAL_MARK: Indicates with true or false whether the record
    is an original record or a substitute record, respectively.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Modifications

Define in the table what fields to change and how to change
them:

Input Column: Select the column from
the input flow that contains the data to be masked.

These modifications are based on the function you select in the Function column.

Category: select a category of masking functions from the list.

  • Character Handling
  • Data Handling
  • Number Handling
  • Bank Account Generation
  • Data Generation
  • Phone Number generation
  • SSN Generation
  • Bank Account Masking
  • Address Masking
  • Email Masking
  • Credit Card Masking
  • Phone Masking
  • SSN Masking
  • Set to null

Function: Select the function that
will hide or obfuscate the original data with substitutes. For example, you can replace
digits or letters with the substitute of your choice, replace values with synonyms from
an index file or nullify values.

The functions you can
select from the Function list depend
on the data type of the input column.

For example, if the column type is Long, you can use the Numeric variance function. If the
column type is String, the Numeric variance function is not available. Similarly, the
Function list for a Date column is date-specific: it lets you choose the type of
modification to apply to date values.

Method: Select the Basic method or one of the FF1 algorithms (Format-Preserving
Encryption (FPE)), FF1 with AES or FF1 with SHA-2:

The Basic method is the default
algorithm.

Note: Because the FF1 masking methods are stronger, it is recommended to use the FF1
algorithms rather than the Basic method.

The FF1 with AES method is based on the Advanced Encryption Standard in CBC mode. The
FF1 with SHA-2 method depends on the secure hash function HMAC-256.

Note: Java 8u161 is the minimum
required version to use the FF1 with AES method.
To be able to use this FPE method with Java versions earlier than 8u161, download the
Java Cryptography Extension (JCE) unlimited strength jurisdiction policy files from
the Oracle website.

The FF1 with AES and
FF1 with SHA-2 methods require a password to
be specified in the Password for FF1 methods
field of the Advanced settings to generate unique
masked values.

The Method list is only available for functions that use Format-Preserving
Encryption algorithms.

When using the Replace all, Replace characters between two positions, Replace n first
digits and Replace n last digits functions with FPE methods, you can select an alphabet.

Characters that
belong to the selected alphabets are masked with characters from the same character type
within the selected alphabet.

When selecting the Best guess alphabet, masked values contain characters
from all alphabets represented in the input values. Best guess is the default alphabet.

Any unrecognized character is copied to the output as is.

Extra Parameter: This field is used by some of the functions and is disabled when not
applicable. When applicable, enter a number or a letter to determine the behavior of the
function you have selected.

Keep format: This option applies only to Strings. Select this check box to keep the
input format when using the Generate account number and keep original country, Generate
credit card number and keep original bank, Bank Account Masking, Credit Card Masking,
Phone Masking and SSN Masking functions or categories. That is to say, if there are
spaces, dots (‘.’), hyphens (‘-’) or slashes (‘/’) in the input, those characters are
kept in the output. If you select this check box when using Phone Masking functions, the
characters from the input that are not numbers are copied to the output as is.

Advanced settings

Password for FF1 methods

Set the password
required for the FF1 with AES and FF1 with SHA-2 methods to generate unique masked
values. If the password is not set, a random password is created at each Job execution.
When using the FF1 with AES and FF1 with SHA-2 methods and a password, the seed from
the Seed for random generator field is not
used.

Use tweaks with FF1 Encryption

Select this
check box to use tweaks. A unique tweak is generated for each record and applies to
all data of a record.

If bijective masking is necessary, do not use this functionality. For more information
about tweaks, see the data masking functions.

Seed for random generator

Set a random number if you want to generate
the same sample of substitute data in each execution of the Job. The seed is not set by
default.

If you do not set the seed, the component
creates a new random seed for each Job execution. Repeating the execution with a
different seed will result in a different sample being generated.

Encoding

Select the encoding from the list or select Custom and define it manually. If you select Custom and leave the field empty, the supported
encodings depend on the JVM that you are using. This field is compulsory for the file
encoding.

When you set Function to Generate from file/list, define the file path in Extra
Parameter.

Output the original row

Select this check box to output original data rows in addition to the
substitute data. Outputting both the original and substitute data can be useful in debug
or test processes.
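
A minimal Python sketch of this option, assuming rows are plain dictionaries: when the
option is on, each input row yields the original row followed by its substitute, and the
ORIGINAL_MARK column tells them apart. The function name emit_rows, the row order and
the presence of ORIGINAL_MARK when the option is off are assumptions made for the
example, not documented behavior.

```python
def emit_rows(row, mask_row, output_original=False):
    """Yield the substitute row, preceded by the original row when requested."""
    if output_original:
        # ORIGINAL_MARK is true for the untouched input row.
        yield {**row, "ORIGINAL_MARK": True}
    # ORIGINAL_MARK is false for the masked substitute row.
    yield {**mask_row(row), "ORIGINAL_MARK": False}


def fake_mask(row):
    # Hypothetical masking step: overwrite the name with a fixed placeholder.
    return {**row, "name": "XXXX"}


for out_row in emit_rows({"id": 1, "name": "Ann"}, fake_mask, output_original=True):
    print(out_row)
```

Filtering on ORIGINAL_MARK downstream makes it straightforward to compare each original
value with its substitute during debugging, and to drop the original rows afterwards.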

Should null input return null

This check box is selected by
default. When selected, the component outputs null when
input values are null. Otherwise, the component returns
the default value when the input is null, that is an
empty string for string values, 0 for numeric values
and the current date for date values.

Should empty input return empty

When this check box is selected, empty values are left unchanged in
the output data. Otherwise, the selected functions are applied to the input
data.
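
The two check boxes Should null input return null and Should empty input return empty
can be read as a small decision procedure, sketched below in Python. The function name
handle_missing, the column type labels and the mask callback are assumptions made for
the example; the default values are those listed above.

```python
from datetime import date


def handle_missing(value, column_type, mask,
                   null_returns_null=True, empty_returns_empty=True):
    """Apply the null and empty input options before calling the masking function."""
    if value is None:
        if null_returns_null:
            return None
        # Default values returned for null input when the option is cleared.
        defaults = {"string": "", "numeric": 0, "date": date.today()}
        return defaults[column_type]
    if value == "" and empty_returns_empty:
        # Empty values are left unchanged in the output.
        return ""
    return mask(value)


def upper_mask(value):
    # Hypothetical stand-in for a masking function.
    return value.upper()


print(handle_missing(None, "numeric", upper_mask))
print(handle_missing(None, "numeric", upper_mask, null_returns_null=False))
print(handle_missing("abc", "string", upper_mask))
```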

Send invalid data to “Invalid” output flow
This check box is selected by default.

  • Selected: When the data can be masked, they are sent to the
    main flow. Otherwise, the data are sent to the “Invalid” output flow.
  • Cleared: The data are sent to the main flow.

The data are considered invalid when:

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Usage

Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the
Run view to define the connection to a given Spark cluster
for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections
describing how to create, convert and configure a Talend Spark Streaming Job of the
Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to say traditional Talend data
integration Jobs.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark
cluster for the whole Job. In addition, since the Job expects its dependent jar files
for execution, you must specify the directory in the file system to which these jar
files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage configuration area in the
      Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark configuration tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component
yet.


Document get from Talend https://help.talend.com