Component family |
Processing/Fields |
|
Function |
tExtractRegexFields generates |
|
Purpose |
tExtractRegexFields allows you to |
|
Basic settings |
Field to split |
Select an incoming field from the Field to |
|
Regex |
Enter a regular expression according to the programming language |
Property type |
Either Built-in or Repository. This feature is available to |
|
Built-in: no property data stored |
||
Repository: reuse properties The fields that come after are pre-filled in using the fetched For further information about the Hadoop |
||
|
Schema and Edit |
A schema is a row description; it defines the number of fields to Click Edit schema to make changes to the schema. If the
Click Sync columns to retrieve Warning: Make sure that the output schema does not contain any column |
|
|
Built-in: You create and store |
|
|
Repository: The schema already |
Advanced settings |
Die on error |
Select this check box to stop the execution of the Job when an error occurs. Clear the check box to skip any rows on error and complete the process for error-free rows. |
|
Check each row structure against |
Select this check box to check whether the total number of columns This feature is not available to the Map/Reduce |
Encoding |
Select the encoding from the list or select Custom and This feature is available to the Map/Reduce version only. |
|
|
tStatCatcher Statistics |
Select this check box to gather the processing metadata at the Job |
Global Variables |
ERROR_MESSAGE: the error message generated by the A Flow variable functions during the execution of a component while an After variable To fill up a field or expression with a variable, press Ctrl + For further information about variables, see Talend Studio |
|
Usage |
This component handles flow of data therefore it requires input |
|
Usage in Map/Reduce Jobs |
If you have subscribed to one of the Talend solutions with Big Data, you can also You need to use the Hadoop Configuration tab in the This connection is effective on a per-Job basis. For further information about a Talend Map/Reduce Job, see the sections Note that in this documentation, unless otherwise explicitly stated, a scenario presents |
|
Log4j |
The activity of this component can be logged using the log4j feature. For more information on this feature, see Talend Studio User For more information on the log4j logging levels, see the Apache documentation at http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/Level.html. |
|
Limitation |
n/a |
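The Die on error option listed under Advanced settings can be pictured with a small sketch. This is only an illustration of the described behavior (stop on the first error, or skip failing rows and finish the error-free ones), not the component's generated code. The class name DieOnErrorSketch, the process and run methods, and the choice of a malformed row as the error are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DieOnErrorSketch {

    // Hypothetical per-row step: fails on rows that lack the three expected columns.
    static String process(String row) {
        String[] cols = row.split(";");
        if (cols.length != 3) {
            throw new IllegalArgumentException("malformed row: " + row);
        }
        return cols[1];
    }

    // Processes all rows; dieOnError decides whether a failing row aborts the run or is skipped.
    static List<String> run(List<String> rows, boolean dieOnError) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            try {
                out.add(process(row));
            } catch (RuntimeException e) {
                if (dieOnError) {
                    throw e; // check box selected: stop the whole run
                }
                // check box cleared: skip this row and continue with the next one
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("1;anna@yahoo.net;24", "broken", "3;fiona@gmail.org;20");
        System.out.println(run(rows, false));
    }
}
```

With the flag set to false, the malformed row is dropped and the two well-formed rows are still processed; with the flag set to true, the same input raises the exception at the second row.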
This scenario describes a three-component Job in which tExtractRegexFields applies a regular
expression to one column of the input data, email. The regular expression includes field
identifiers (capture groups) for the user name, domain name and Top-Level Domain (TLD)
portions of each e-mail address. If a given e-mail address is valid, the name, domain and TLD
are extracted and displayed on the console in three separate columns. Data in the other two
input columns, id and age, is extracted and routed to the destination as well.
- Drop the following components from the Palette onto the design workspace: tFileInputDelimited, tExtractRegexFields, and tLogRow.
- Connect tFileInputDelimited to tExtractRegexFields using a Row > Main link, and do the same to connect tExtractRegexFields to tLogRow.
- Double-click the tFileInputDelimited component to open its Basic settings view in the Component tab.
- Click the […] button next to the File name/Stream field to browse to the file from which you want to extract information. The input file used in this scenario is called test4. It is a text file that holds three columns: id, email, and age.

      id;email;age
      1;anna@yahoo.net;24
      2;diana@sohu.com;31
      3;fiona@gmail.org;20

  For more information, see tFileInputDelimited.
- Click Edit schema to define the data structure of this input file.
- Double-click the tExtractRegexFields component to open its Basic settings view.
- Select the column to split from the Field to split list: email in this scenario.
- Enter the regular expression you want to use to perform data matching in the Regex panel. In this scenario, the regular expression "([a-z]*)@([a-z]*).([a-z]*)" is used to match the three parts of an email address: user name, domain name and TLD name. For more information about regular expressions, see http://en.wikipedia.org/wiki/Regular_expression.
- Click Edit schema to open the [Schema of tExtractRegexFields] dialog box, and click the plus button to add five columns for the output schema. In this scenario, we want to split the input email column into three columns in the output flow: name, domain, and tld. The two other input columns will be extracted as they are.
- Double-click the tLogRow component to open its Component view.
- In the Mode area, select Table (print values in cells of a table).
- Press Ctrl+S to save your Job.
- Execute the Job by pressing F6 or clicking Run on the Run tab.
The tExtractRegexFields component matches each given e-mail address against the defined
regular expression and extracts the name, domain, and TLD, which are displayed on the
console in three separate columns. The two other columns, id and age, are
extracted as they are.
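The extraction in this scenario can be approximated in plain Java. The sketch below is not the code Talend generates; it only applies the scenario's expression to the three sample rows. The class name ExtractRegexFieldsSketch, the extract method, the output column order (id, name, domain, tld, age), the use of a whole-string match, and returning null for a non-matching address are all assumptions made for illustration. Note that the "." between the domain and TLD groups is unescaped, so it matches any single character; writing it as "\\." in a Java string would restrict it to a literal dot.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtractRegexFieldsSketch {

    // The scenario's expression: three capture groups for name, domain and TLD.
    // The "." between the last two groups is left unescaped, as in the scenario.
    static final Pattern EMAIL = Pattern.compile("([a-z]*)@([a-z]*).([a-z]*)");

    // Splits one "id;email;age" row into {id, name, domain, tld, age}.
    // Returns null when the email column does not match the whole expression (assumed behavior).
    static String[] extract(String row) {
        String[] cols = row.split(";");
        Matcher m = EMAIL.matcher(cols[1]);
        if (!m.matches()) {
            return null;
        }
        return new String[] { cols[0], m.group(1), m.group(2), m.group(3), cols[2] };
    }

    public static void main(String[] args) {
        String[] rows = { "1;anna@yahoo.net;24", "2;diana@sohu.com;31", "3;fiona@gmail.org;20" };
        for (String row : rows) {
            String[] out = extract(row);
            System.out.println(out == null ? "no match: " + row : String.join("|", out));
        }
    }
}
```

For the first sample row this prints 1|anna|yahoo|net|24, which corresponds to the five output columns described in the scenario.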