tDataMasking
Hides original data with random characters or figures to protect the actual data
while having a functional substitute for occasions when it is not advisable to show
sensitive real data.
The masked data keeps looking real and consistent and remains usable for purposes such as
testing and training. Masking is most commonly needed for data that contains
Personally Identifiable Information (PII) or Sensitive Personal Data (SPD).
For further information, see Function behavior in common PII.
tDataMasking reads a data set row by
row and creates a structurally similar but inauthentic version of the
data after having applied specific functions on data fields. It
generates one row for each input row.
Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:
- Standard: see tDataMasking Standard properties.
  The component in this framework is available when you have subscribed to one of the Talend Platform products or Talend Data Fabric.
- Spark Batch: see tDataMasking properties for Apache Spark Batch.
  The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data Fabric.
- Spark Streaming: see tDataMasking properties for Apache Spark Streaming.
  The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data Fabric.
Function behavior in common PII
What is sensitive data
The definition of sensitive data is broad and may differ from one country to the
other or from one organization to the other. Basically, sensitive data can be
personal information or business information which includes anything that poses a
risk to the person or company in question.
Globally, credit and debit card data, for example, is considered sensitive. An
employee's salary details and any information that can be used to identify or locate
a person can also be considered sensitive data. A non-exhaustive list of personal
sensitive data includes: first and last names, email addresses, addresses,
Social Security Number (SSN), credit card numbers, bank account numbers, race,
gender, date of birth, salary and geolocation combined with time.
For further information about personal sensitive data, check Personally Identifiable Information.
Also, business sensitive data may include trade secrets, acquisition plans,
financial data and customer information, among other possibilities.
Functions and common PII
There are several functions in the tDataMasking component which vary according to the type of the data
column.
It is advisable to use the functions predefined in the component with
columns that hold personal information, such as first and last names, email addresses,
addresses, SSN, credit card numbers, bank account numbers, race, gender, date of birth
and salary.
Functions that are not self-explanatory are explained in the table below:
Function | Description |
---|---|
Set to null | This function returns |
Date Variance | This function applies only to Date values. It uses a For example, if the input date is If the input date is null, then the function returns the current date. If the given parameter is 0 or |
Keep year and set day and month to 01/01 | This function applies only to Date values. It requires no For example, if the input date is |
Generate Account Number | This function generates a valid French bank account number. A French IBAN number is a 27-character code. The numbers are |
Generate Account Number and keep original country | This function works like Generate Account If the input is a correct IBAN number, the function generates |
Generate Credit Card | This function generates a valid credit card number. It There are three types of credit card that can be generated: |
Generate credit card and keep original bank | This function works like Generate Credit If the input is a correct Visa, MasterCard or American |
Generate from Pattern | This function is applied only on Strings and it requires a It generates a value that matches the pattern given as – the – the – the – all other characters are kept as they are. You can generate several strings with the same argument For example, if the given pattern is This function does not work correctly if a comma ‘,’ is used |
Generate Phone Number | This function is applied only on Strings and requires no It generates a random phone number from different countries |
Generate Social Security Number (SSN) | This function is only used on Strings and requires no |
Generate unique SSN | This function is only used on Strings and requires no That is to say, if there are duplicates in the input data, If the input is null or in a wrong format, the function |
Generate Sequence (Note: This function is not supported in the Spark version) | This function can be applied on everything that is not a date |
Generate Uuid | This function is only applied on Strings and requires no This function uses the |
Generate value between two values | This function generates a value randomly chosen between two This function can be applied to any type of field. However, If the input is of Date type, the function returns the |
Keep characters between two positions | This function can be used on Strings and requires two The first two parameters represent the places of two elements If the input is null or if the parameter is in a wrong |
Remove Characters between two positions | This function has the same behavior as Keep characters between two positions but with a remove |
Replace characters between two positions | This function has the same behavior as Keep characters between two positions but with a replace When using the Replace characters between For example, if the input is Steven |
Keep n first digits and replace following ones | This function is used on Strings, Integers and Long values If the parameter is n, the function If the parameter is bigger than the input length, no |
Keep n last digits and replace previous ones | This function is the counterpart of Keep n |
Mask Address | This function can only be used on String values. It replaces Moreover, there is a list of key words that will not be You can give a parameter, it can either be a list of key |
Mask email full domain by character | This function can only be used on Strings. It replaces If you enter as a parameter something illegal like a string, For example, if the initial email is |
Mask email full domain with consistent items | This function can only be used on Strings. It replaces For example, if the initial email is |
Mask email left part of domain by character | This function can only be used on Strings. It replaces the If you enter as a parameter something illegal like a string, For example, if the initial email is |
Mask email left part of domain with consistent items | This function can only be used on Strings. It replaces the For example, if the initial email is |
Mask email local part by character | This function can only be used on Strings. It replaces If you enter as a parameter something illegal like a string, For example, if the initial email is |
Mask email local part with consistent items | This function can only be used on Strings. It replaces For example, if the initial email is |
Numeric Variance | This function applies only to numerical types (Integer, It takes a parameter that must be a number, this parameter |
Replace all | This function can be used on Strings and requires a If the parameter is X, the function |
Replace all digits | This function can be used on Strings and requires a Anything that is not a digit will not be changed. A null |
Replace all letters | This function can be used on Strings and requires a Anything that is not a letter will not be changed. A null |
Replace by consistent items from input list (or file) | This function modifies the input value by randomly selecting It is applied to Strings or numerical types and it ensures For example, you could use this function to generate SSNs. When using Replace by consistent items from input list (or file), the probability of generating duplicates can be calculated using the following formulas: where Using this approach, it is possible to calculate the For example, the probability that, in a group of n people, two people have the same birthday is the following: |
Replace by item from input list (or file) | This function has the same behavior as Replace by consistent item from input list, but it randomly selects |
Replace n first characters | If the parameter is n, the function If the parameter is bigger than the input length, all the You can enter a second parameter which is the replacement For example, if the input is Steven |
Replace n last characters | This function is the counterpart of Replace n first characters. |
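The duplicate probability mentioned for Replace by consistent items from input list (or file) follows the same reasoning as the birthday problem. The sketch below is only an illustration, assuming that n distinct input values are each mapped uniformly and independently to one of d list items. The function names `duplicate_probability` and `duplicate_probability_approx` and their parameters are hypothetical and are not part of Talend.

```python
import math


def duplicate_probability(n: int, d: int) -> float:
    """Exact probability that at least two of n uniform draws from d items coincide."""
    if n > d:
        return 1.0  # pigeonhole principle: more draws than distinct items
    p_all_distinct = 1.0
    for k in range(n):
        # The (k+1)-th draw must avoid the k items already used.
        p_all_distinct *= (d - k) / d
    return 1.0 - p_all_distinct


def duplicate_probability_approx(n: int, d: int) -> float:
    """Usual approximation 1 - exp(-n(n-1) / (2d)), reasonable when n is small relative to d."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * d))


if __name__ == "__main__":
    # Birthday example: 23 people, 365 possible birthdays.
    print(duplicate_probability(23, 365))
    print(duplicate_probability_approx(23, 365))
```

With the birthday figures, the exact value is slightly above one half, which shows how quickly duplicates become likely when the replacement list is short compared with the number of distinct input values.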
tDataMasking Standard properties
These properties are used to configure tDataMasking running in the Standard Job framework.
The Standard
tDataMasking component belongs to the Data Quality family.
The component in this framework is available when you have subscribed to one of
the Talend Platform products or Talend Data
Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to Click Sync columns to retrieve the schema from Click Edit schema to make changes to the schema.
The output schema of this component contains one read-only column, ORIGINAL_MARK. |
|
Built-In: You create and store the |
|
Repository: You have already created |
Modifications |
Define in the table what fields to change and how to change them:
Input Column: Select the column from the input flow for These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.
Function: Select the function that will decide what The Function list will vary according to the column type. For example, a column of a Long type will have a Numeric
Extra Parameter: This field is used by some of the
Keep format: this function is only used on Strings. |
Advanced settings
Seed for random generator |
Set a random number if you want to generate the same sample of substitute data in each Repeating the execution with a different value for this field will result in a different |
Output the original row |
Select this check box to output original data rows in addition to the substitute data. |
Should null input return null |
This check box is selected by default. When selected, the component outputs This parameter does not have an effect on the Generate |
Should empty input return empty |
When this check box is selected, the component returns the input values if they are empty. |
tStat |
Select this check box to gather the Job processing metadata at the Job level |
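The effect of the Seed for random generator field can be pictured with any seeded pseudo-random generator. The following sketch uses Python's `random.Random`, not the generator used by the component, and the helper `mask_digits` is hypothetical. It only illustrates that the same seed reproduces the same substitute data from one execution to the next.

```python
import random


def mask_digits(values, seed=None):
    """Replace every digit of each value with a pseudo-random digit.

    seed=None gives a different result on each run; a fixed seed makes
    the substitute data reproducible.
    """
    rng = random.Random(seed)
    masked = []
    for value in values:
        masked.append("".join(
            str(rng.randint(0, 9)) if ch.isdigit() else ch  # keep separators as they are
            for ch in value
        ))
    return masked


phones = ["271-555-9715", "211-555-7669"]
run_a = mask_digits(phones, seed=12345)
run_b = mask_digits(phones, seed=12345)  # same seed: identical substitute data
run_c = mask_digits(phones, seed=54321)  # different seed: a different sample
print(run_a == run_b)
```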
Usage
Usage rule |
This component is an intermediary step. It requires an input and |
Scenario: Altering data values to restrict the use of actual sensitive data
This scenario applies only to a subscription-based Talend Platform solution or Talend Data Fabric.
With the tDataMasking component, you can replace
sensitive information such as credit card or social security numbers with realistic values,
allowing production data to be safely used for purposes such as testing and training.
This scenario describes a Job which uses:
- the tFixedFlowInput component to generate personal data including credit card numbers,
- the tDataMasking component to hide specific original data with random characters or figures,
- the tFileOutputExcel component to output the substitute data set.

Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tDataMasking and tFileOutputExcel.
- Connect the three components together using the Main links.
Configuring the input component
- Double-click tFixedFlowInput to open its Basic settings view in the Component tab.
- Create the schema through the Edit Schema button. In the open dialog box, click the [+] button and add the columns that will hold the initial input data.
- Click OK.
- In the Number of rows field, enter 1.
- In the Mode area, select the Use Inline Content option.
- In the Content table, enter the customer data you want to replace with realistic values, for example:

    0|4244487462024688|Nowmer|Sheri|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|SheriNowmer@@Tlaxiaco.org
    1|3458687462024688||Sheri|A.|2433 Bailey Road|Tlaxiaco|Oaxaca|15057|Mexico|271-555-9715|SheriNowmer@Tlaxiaco.org.org
    2|4639587470586299|Whelply|Derrick|I.|2219 Dewing Avenue|Sooke|BC|17172|Canada|211-555-7669|DerrickWhelply@Sooke.org
    3|2541387475757600|Derry|Jeanne||7640 First Ave.|Issaquah|WA|73980|USA|656-555-2272|JeanneDerry@Issaquah.org
    4|7845987500482201|Spence|Michael|J.|337 Tosca Way|Burnaby|BC|74674|Canada|929-555-7279|MichaelSpence@Burnaby.org
    5|1547887514054179|Gutierrez|Maya||8668 Via Neruda|Novato|CA|57355|$$#|387-555-7172|MayaGutierrez@Novato.org
    6|5469887517782449|Damstra|Robert|F.|1619 Stillman Court|Lynnwood|WA|90792|$$#|922-555-5465|RobertDamstra@Lynnwood.org
    7|54896387521172800|Kanagaki|Rebecca||2860 D Mt. Hood Circle|||13343|Mexico|515-555-6247|RebeccaKanagaki@Tlaxiaco.org
    8|47859687539744377||Kim|H.|6064 Brodia Court|San Andres|DF|12942|Mexico|411-555-6825|Kim@Brunner@San Andresorg
    9|35698487544797658||Brenda|C.|7560 Trees Drive||BC|$$|Canada|815-555-3975|BrendaBlumberg@Richmond.org
    10|36521487568712234|Stanz|Darren|M.|1019 Kenwal Rd.|$$#|OR|82017|USA|847-555-5443|DarrenStanz@Lake Oswego.org
    ...
Replacing actual data with realistic values
- Double-click tDataMasking to display the Basic settings view and define the component properties.
- If required, click Sync columns to retrieve the schema defined in the input component.
- Click the Edit schema button to open the schema dialog box. tDataMasking proposes one predefined read-only column, ORIGINAL_MARK. This column identifies by true or false whether the output record is an original or a substitute record respectively.
- Move any of the input columns to the output schema if you want to show them in the results, click OK and accept to propagate the changes.
- In the Modifications table, click the [+] button to add four rows, and then:
  - in the Input Column, select the columns whose content you want to substitute,
  - in the Function column, select from the predefined list the function you want to use to generate the substitute data,
  - in the Parameter column, enter a value, a pattern or a path to be used by the function to substitute data.

  The Job will generate inauthentic credit card numbers, replace the first three letters of first names, replace last names with names from a local file and finally replace the part before the @ sign in email addresses by a series of X.
- Click the Advanced settings tab and select the Output the original row check box. The Job will add the original data rows to the substitute data.
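To get a feel for what the four modifications in this scenario produce, they can be approximated outside Talend. The sketch below is not the component's implementation. All function names, the replacement name list, the random lowercase replacement letters and the 16-digit card layout starting with 4 are assumptions made for illustration. The only standard piece is the Luhn check digit, which is what makes a card number pass common validity checks.

```python
import random
import string


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes partial + digit pass the Luhn check."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` is doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def generate_credit_card(rng: random.Random) -> str:
    """Build 15 digits starting with 4 (assumed layout) and append the check digit."""
    body = "4" + "".join(str(rng.randint(0, 9)) for _ in range(14))
    return body + luhn_check_digit(body)


def replace_n_first_chars(value: str, n: int, rng: random.Random) -> str:
    """Replace the first n characters with random lowercase letters."""
    n = min(n, len(value))
    head = "".join(rng.choice(string.ascii_lowercase) for _ in range(n))
    return head + value[n:]


def replace_from_list(names, rng: random.Random) -> str:
    """Pick a replacement at random from a list (standing in for a local file)."""
    return rng.choice(names)


def mask_email_local_part(email: str, char: str = "X") -> str:
    """Replace everything before the first @ sign with the given character."""
    local, at, domain = email.partition("@")
    if not at:
        return email  # no @ sign: this sketch leaves the value untouched
    return char * len(local) + at + domain


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
names = ["Smith", "Jones", "Brown"]  # hypothetical replacement names
record = {"card": "4244487462024688", "first": "Sheri",
          "last": "Nowmer", "email": "SheriNowmer@Tlaxiaco.org"}
masked = {
    "card": generate_credit_card(rng),
    "first": replace_n_first_chars(record["first"], 3, rng),
    "last": replace_from_list(names, rng),
    "email": mask_email_local_part(record["email"]),
}
print(masked)
```

How the component handles malformed values, such as the sample email addresses with two @ signs, is not covered here. This sketch simply splits on the first @ sign.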
Configuring the output component and executing the Job
- Double-click the tFileOutputExcel component to display the Basic settings view and define the component properties.
- Set the destination file name as well as the sheet name and then select the Define all columns auto size check box.
- Save your Job and press F6 to execute it. The tDataMasking component substitutes data in the selected columns and writes the result in an output file.
- Right-click the output component and select Data Viewer to display the original and substituted data. tDataMasking outputs original and substitute rows marked respectively with true and false in the ORIGINAL_MARK column. It generates inauthentic credit card numbers, replaces the first three letters of first names, replaces last names with names from a local file and finally replaces the part before the @ sign in email addresses by the names defined in the component basic settings.

  Sensitive personal information in the input data has been “hidden” but data keeps looking real and consistent. The substitute data is still usable for purposes other than production.
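The shape of the output when Output the original row is selected can be sketched as follows. This is a simplified model, not Talend code. The helper names `emit_rows` and `mask_row` are hypothetical, the masking rule is a stand-in, and emitting the original row before its substitute is an assumption about ordering.

```python
def emit_rows(rows, mask_row, output_original=True):
    """Yield one substitute row per input row, optionally preceded by the original."""
    for row in rows:
        if output_original:
            yield {**row, "ORIGINAL_MARK": True}       # untouched input row
        yield {**mask_row(row), "ORIGINAL_MARK": False}  # substitute row


def mask_row(row):
    # Hypothetical masking rule: hide the email local part behind X characters.
    local, at, domain = row["email"].partition("@")
    return {**row, "email": "X" * len(local) + at + domain}


rows = [{"id": 2, "email": "DerrickWhelply@Sooke.org"}]
result = list(emit_rows(rows, mask_row))
for out in result:
    print(out)
```

When the check box is cleared, the same model yields a single row per input row, which matches the statement that the component generates one row for each input row.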
tDataMasking properties for Apache Spark Batch
These properties are used to configure tDataMasking running in the Spark Batch Job framework.
The Spark Batch
tDataMasking component belongs to the Data Quality family.
The component in this framework is available when you have subscribed to any Talend Platform product with Big Data or Talend Data
Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to Click Sync columns to retrieve the schema from Click Edit schema to make changes to the schema.
The output schema of this component contains one read-only column, |
|
Built-In: You create and store the |
|
Repository: You have already created |
Modifications |
Define in the table what fields to change and how to change them:
Input Column: Select the column from the input flow for These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.
Function: Select the function that will decide what The Function list will vary according to the column type. For example, a column of a Long type will have a Numeric
Extra Parameter: This field is used by some of the
Keep format: this function is only used on Strings. |
Advanced settings
Seed for random generator |
Set a random number if you want to generate the same sample of substitute data in each Repeating the execution with a different value for this field will result in a different |
Output the original row |
Select this check box to output original data rows in addition to the substitute data. |
Should null input return null |
This check box is selected by default. When selected, the component outputs This parameter does not have an effect on the Generate |
Should empty input return empty |
When this check box is selected, the component returns the input values if they are empty. |
tStat |
Select this check box to gather the Job processing metadata at the Job level |
Usage
Usage rule |
This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only Note that in this documentation, unless otherwise |
Spark Connection |
You need to use the Spark Configuration tab in
the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Batch version of this component
yet.
tDataMasking properties for Apache Spark Streaming
These properties are used to configure tDataMasking running in the Spark Streaming Job framework.
The Spark Streaming
tDataMasking component belongs to the Data Quality family.
The component in this framework is available only if you have subscribed to Talend Real-time Big Data Platform or Talend Data
Fabric.
Basic settings
Schema and Edit schema |
A schema is a row description. It defines the number of fields (columns) to Click Sync columns to retrieve the schema from Click Edit schema to make changes to the schema.
The output schema of this component contains one read-only column, |
|
Built-In: You create and store the |
|
Repository: You have already created |
Modifications |
Define in the table what fields to change and how to change them:
Input Column: Select the column from the input flow for These modifications are based on the function you select in the Function column and the number of modifications you set in the Max Modification Count column.
Function: Select the function that will decide what The Function list will vary according to the column type. For example, a column of a Long type will have a Numeric
Extra Parameter: This field is used by some of the
Keep format: this function is only used on Strings. |
Advanced settings
Seed for random generator |
Set a random number if you want to generate the same sample of substitute data in each Repeating the execution with a different value for this field will result in a different |
Output the original row |
Select this check box to output original data rows in addition to the substitute data. |
Should null input return null |
This check box is selected by default. When selected, the component outputs This parameter does not have an effect on the Generate |
Should empty input return empty |
When this check box is selected, the component returns the input values if they are empty. |
tStat |
Select this check box to gather the Job processing metadata at the Job level |
Usage
Usage rule |
This component, along with the Spark Streaming component Palette it belongs to, appears This component is used as an intermediate step. You need to use the Spark Configuration tab in the This connection is effective on a per-Job basis. For further information about a Note that in this documentation, unless otherwise |
Spark Connection |
You need to use the Spark Configuration tab in
the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Streaming version of this component
yet.