tExtractRegexFields extracts fields from an input column using regular expression matching.
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- Standard: see tExtractRegexFields Standard properties. The component in this framework is available in all Talend products.
- MapReduce: see tExtractRegexFields MapReduce properties (deprecated). The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Batch: see tExtractRegexFields properties for Apache Spark Batch. The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
- Spark Streaming: see tExtractRegexFields properties for Apache Spark Streaming. This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
tExtractRegexFields Standard properties
These properties are used to configure tExtractRegexFields running in the Standard Job framework.
The Standard tExtractRegexFields component belongs to the Data Quality and the Processing families.
The component in this framework is available in all Talend products.
Basic settings
Field to split | Select an incoming field from the Field to split list.

Regex | Enter a regular expression according to the programming language you are using.

Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Click Sync columns to retrieve the schema from the previous component connected in the Job. Warning: Make sure that the output schema does not contain any column with the same name as an input column.

 | Built-In: You create and store the schema locally for this component only.

 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
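The expression entered in the Regex field uses capture groups, and each group is mapped, in order, to one column of the output schema. A minimal standalone Java sketch of that mechanism (the pattern and input value below are illustrative examples, not taken from a specific Job):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFieldsSketch {
    public static void main(String[] args) {
        // Example pattern: each capture group feeds one output column.
        Pattern p = Pattern.compile("([a-z]*)@([a-z]*)\\.([a-z]*)");
        Matcher m = p.matcher("anna@yahoo.net");
        if (m.matches()) {
            System.out.println("col1 = " + m.group(1)); // anna
            System.out.println("col2 = " + m.group(2)); // yahoo
            System.out.println("col3 = " + m.group(3)); // net
        }
    }
}
```

If the input value does not match the pattern, no groups are produced, which corresponds to the row being rejected or skipped depending on the Die on error setting.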
Advanced settings
Die on error | Select the check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows.

Check each row structure against schema | Select this check box to check whether the total number of columns in each row is consistent with the schema. If not, an error message is displayed on the console.

tStatCatcher Statistics | Select this check box to gather the Job processing metadata at the Job level as well as at each component level.
Global Variables
Global Variables | ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio documentation.
Usage
Usage rule | This component handles the flow of data, therefore it requires input and output components.
Extracting name, domain and TLD from e-mail addresses
This scenario describes a three-component Job in which tExtractRegexFields applies a regular expression to one column of the input data, email. The regular expression includes field identifiers for the user name, domain name, and Top-Level Domain (TLD) name portions of each e-mail address. If a given e-mail address is valid, the name, domain, and TLD are extracted and displayed on the console in three separate columns. Data in the other two input columns, id and age, is extracted and routed to the destination as well.

Setting up the Job
- Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tExtractRegexFields, and tLogRow.
- Connect tFileInputDelimited to tExtractRegexFields using a Row > Main link, and do the same to connect tExtractRegexFields to tLogRow.
Configuring the components
- Double-click the tFileInputDelimited component to open its Basic settings view in the Component tab.
- Click the [...] button next to the File name/Stream field to browse to the file from which you want to extract information.
  The input file used in this scenario is called test4. It is a text file that holds three columns: id, email, and age:
    id;email;age
    1;anna@yahoo.net;24
    2;diana@sohu.com;31
    3;fiona@gmail.org;20
  For more information, see tFileInputDelimited.
- Click Edit schema to define the data structure of this input file.
- Double-click the tExtractRegexFields component to open its Basic settings view.
- Select the column to split from the Field to split list: email in this scenario.
- Enter the regular expression you want to use to perform data matching in the Regex panel. In this scenario, the regular expression "([a-z]*)@([a-z]*)\\.([a-z]*)" is used to match the three parts of an e-mail address: user name, domain name, and TLD name. For more information about regular expressions, see http://en.wikipedia.org/wiki/Regular_expression.
- Click Edit schema to open the Schema of tExtractRegexFields dialog box, and click the plus button to add five columns for the output schema. In this scenario, we want to split the input email column into three columns in the output flow: name, domain, and tld. The two other input columns will be extracted as they are.
- Double-click the tLogRow component to open its Component view.
- In the Mode area, select Table (print values in cells of a table).
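The extraction this Job performs can be sketched in plain Java, independent of Talend. The sample rows and the regular expression come from the scenario; the splitting and printing logic below is a simplified stand-in for what tExtractRegexFields and tLogRow actually do:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailExtractDemo {
    public static void main(String[] args) {
        // Sample rows from the test4 input file (id;email;age).
        List<String> rows = List.of(
            "1;anna@yahoo.net;24",
            "2;diana@sohu.com;31",
            "3;fiona@gmail.org;20");
        Pattern regex = Pattern.compile("([a-z]*)@([a-z]*)\\.([a-z]*)");
        for (String row : rows) {
            String[] fields = row.split(";");
            Matcher m = regex.matcher(fields[1]);
            if (m.matches()) {
                // Five output columns: id | name | domain | tld | age
                System.out.println(fields[0] + "|" + m.group(1) + "|"
                    + m.group(2) + "|" + m.group(3) + "|" + fields[2]);
            }
        }
    }
}
```

For the first row this prints 1|anna|yahoo|net|24, which mirrors the five-column output schema defined above.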
Saving and executing the Job
- Press Ctrl+S to save your Job.
- Execute the Job by pressing F6 or clicking Run on the Run tab.

The tExtractRegexFields component matches all given e-mail addresses against the defined regular expression, extracts the name, domain, and TLD, and displays them on the console in three separate columns. The two other columns, id and age, are extracted as they are.
tExtractRegexFields MapReduce properties (deprecated)
These properties are used to configure tExtractRegexFields running in the MapReduce Job framework.
The MapReduce tExtractRegexFields component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Field to split | Select an incoming field from the Field to split list.

Regex | Enter a regular expression according to the programming language you are using.

Property type | Either Built-In or Repository.

 | Built-In: No property data stored centrally.

 | Repository: Select the repository file where the properties are stored. The properties are stored centrally under the Hadoop Cluster node of the Repository tree. The fields that come after are pre-filled in using the fetched data. For further information about the Hadoop Cluster node, see the Talend documentation.

Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Click Sync columns to retrieve the schema from the previous component connected in the Job. Warning: Make sure that the output schema does not contain any column with the same name as an input column.

 | Built-In: You create and store the schema locally for this component only.

 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Advanced settings
Die on error | Select the check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows.

Encoding | Select the encoding from the list or select Custom and define it manually.
Global Variables
Global Variables | ERROR_MESSAGE: the error message generated by the component when an error occurs. This is an After variable and it returns a string.

A Flow variable functions during the execution of a component while an After variable functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space to access the variable list and choose the variable to use from it.

For further information about variables, see the Talend Studio documentation.
Usage
Usage rule | In a Talend Map/Reduce Job, this component is used as an intermediate step, and the other components used along with it must be Map/Reduce components too. They generate native Map/Reduce code that can be executed directly in Hadoop.

You need to use the Hadoop Configuration tab in the Run view to define the connection to a given Hadoop distribution for the whole Job. This connection is effective on a per-Job basis.

For further information about Talend Map/Reduce Jobs, see the Talend Big Data documentation.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.
Related scenarios
No scenario is available for the Map/Reduce version of this component yet.
tExtractRegexFields properties for Apache Spark Batch
These properties are used to configure tExtractRegexFields running in the Spark Batch Job framework.
The Spark Batch tExtractRegexFields component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Prev.Comp.Column list | Select the column you need to extract data from.

Regex | Enter a regular expression according to the programming language you are using.

Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Click Sync columns to retrieve the schema from the previous component connected in the Job. Warning: Make sure that the output schema does not contain any column with the same name as an input column.

 | Built-In: You create and store the schema locally for this component only.

 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Die on error | Select the check box to stop the execution of the Job when an error occurs.
Advanced settings
Encoding | Select the encoding from the list or select Custom and define it manually.
Usage
Usage rule | This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection | In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.

This connection is effective on a per-Job basis.
Related scenarios
No scenario is available for the Spark Batch version of this component yet.
tExtractRegexFields properties for Apache Spark Streaming
These properties are used to configure tExtractRegexFields running in the Spark Streaming Job framework.
The Spark Streaming tExtractRegexFields component belongs to the Processing family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Prev.Comp.Column list | Select the column you need to extract data from.

Regex | Enter a regular expression according to the programming language you are using.

Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. Click Edit schema to make changes to the schema. Click Sync columns to retrieve the schema from the previous component connected in the Job. Warning: Make sure that the output schema does not contain any column with the same name as an input column.

 | Built-In: You create and store the schema locally for this component only.

 | Repository: You have already created the schema and stored it in the Repository. You can reuse it in various projects and Job designs.
Die on error | Select the check box to stop the execution of the Job when an error occurs.
Advanced settings
Encoding | Select the encoding from the list or select Custom and define it manually.
Usage
Usage rule | This component is used as an intermediate step.

This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job.

Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs.

Spark Connection | In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files.

This connection is effective on a per-Job basis.
Related scenarios
No scenario is available for the Spark Streaming version of this component yet.