July 30, 2023

tVerifyEmail – Docs for ESB 7.x

tVerifyEmail

Verifies whether email addresses comply with
specific rules and corrects addresses that do not match the rules by using the content of
specific columns.

In local mode, Apache Spark 1.6.0, 2.0.0, 2.3.0 and 2.4.0 are supported.

Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks: Standard, Spark Batch and Spark Streaming.

tVerifyEmail Standard properties

These properties are used to configure tVerifyEmail running in the Standard Job framework.

The Standard
tVerifyEmail component belongs to the Data Quality family.

The component in this framework is available in Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and in Talend Data Fabric.

Basic settings

Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Edit Schema

Click Edit
schema
to make changes to the schema.

Note: If you
make changes, the schema automatically becomes built-in.

The output schema of tVerifyEmail has different read-only
columns depending on the options you select in the component Basic
settings
view. Read-only output columns include:

VerificationLevel: provides the verification
status of the processed email addresses, as follows:

VALID: the email address complies with the defined
rule.

INVALID: the email address does not comply with the
defined rule.

CORRECTED: the input email address does not comply with the
defined rule and has been corrected by using the content of the selected columns. This
status is available only when you select the Use column content
option in the LOCAL Part Options section.

VERIFIED: the email address exists at the domain.
This status is available only when you select the Check with mail
server callback option.

REJECTED: the email address does not exist at the domain.
This status is available only when you select the Check with mail
server callback option.

Suggested_Email: provides suggested content
for the email part before the @ sign. The string is built from the columns you
select in the Use column content view.

Column to validate

Select from the list the column you want to validate with tVerifyEmail.

Check the entire email with regular expression

Select this check box if you want to match the complete email address against a specific
regular expression.

Complete regular expression: enter the regular expression
against which you want to match email addresses.

This match is done as a first step to optimize the matching process and exclude addresses
that have problems before going any further to match the local and domain parts of email
addresses.
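
As an illustration of this first step, the following minimal Python sketch filters out addresses that fail a complete regular expression before the local and domain parts are looked at. The pattern and the function names are hypothetical and chosen for this example only; they are not the component's defaults, and the component itself uses Java regular expressions, whose syntax differs from Python's in some details.

```python
import re

# Hypothetical, deliberately simple pattern: something@something.something
# with no whitespace and a single @ sign. Not the component's default.
COMPLETE_EMAIL_REGEX = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def passes_first_step(email: str) -> bool:
    """Return True when the whole address matches the complete regular expression."""
    return COMPLETE_EMAIL_REGEX.fullmatch(email) is not None


def split_address(email: str):
    """Split an address into (local part, domain part) at the last @ sign."""
    local, _, domain = email.rpartition("@")
    return local, domain
```

Only addresses for which passes_first_step returns True would go on to the local part and domain part checks.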

LOCAL Part Options

Fields in this section will vary according to what option you select. “LOCAL part” in an
email address refers to the string before the @ sign.

Use regular expression: enter in the Pattern field the expression against which you want to check the
local part of the email address.

Use simplified pattern: enter in the Pattern field the simplified pattern against which you want
to check the local part of the email address. Select the Show
syntax of simplified pattern
option to display the syntax to use for
simplified patterns. For more information about the syntax, see Simplified pattern syntax for tVerifyEmail.

Use column content: use the fields in this view to
decide the content against which you want to check the local part of the email. If the local
part does not match what you have defined, it will be rewritten by using the content of the
fields.

Enable case-sensitive pattern matching: select this
check box to enable case-sensitive pattern matching of the local part of email addresses.
You can use case-sensitive pattern matching with each of the above options.
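
The effect of the Use regular expression option combined with the case-sensitivity check box can be sketched as follows. This is an approximation written for illustration: the function name is hypothetical, and it assumes that matching ignores case when the check box is cleared, which the option name suggests but this page does not state.

```python
import re


def local_part_matches(email: str, pattern: str, case_sensitive: bool = False) -> bool:
    """Check the string before the @ sign against a regular expression.

    Assumption: matching is case-insensitive unless case_sensitive is True.
    """
    local_part = email.rpartition("@")[0]
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(pattern, local_part, flags) is not None
```

For example, the local part JSmith matches the pattern [a-z]+ only while case_sensitive is False.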

DOMAIN Part Options

Fields in this view will vary according to what option you select.

Check the Top-level Domains and the following ones:
select this check box to verify the part of the email address which follows the last dot.
You can use the Additional Top-level Domains table to add
additional top-level domains against which you want to validate email addresses.

Check domains with a black list: select this option to
check email domains against the domains you define as black listed in the
Domain list table.

Check domains with a white list: select this option to
check email domains against the domains you define as white listed in the
Domain list table.
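
The domain checks described above can be pictured with the short sketch below. It rests on several assumptions that this page does not confirm: the default top-level domain set shown here is a tiny hypothetical stand-in for the component's built-in list, a black-listed domain is treated as failing the check, and a non-empty white list is treated as the only set of accepted domains.

```python
# Hypothetical stand-in for the component's built-in top-level domain list.
DEFAULT_TLDS = {"com", "org", "net"}


def check_domain(email, additional_tlds=(), black_list=(), white_list=()):
    """Return True when the domain part passes the TLD and list checks."""
    domain = email.rpartition("@")[2].lower()
    # The top-level domain is the part that follows the last dot.
    tld = domain.rpartition(".")[2]
    allowed_tlds = DEFAULT_TLDS | {t.lower().lstrip(".") for t in additional_tlds}
    if tld not in allowed_tlds:
        return False
    if domain in {d.lower() for d in black_list}:
        return False
    if white_list and domain not in {d.lower() for d in white_list}:
        return False
    return True
```

Adding xyz through additional_tlds, for instance, makes an address at example.xyz pass the top-level domain check that it would otherwise fail.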

Check with mail server callback

Select this check box to enable the verification of email addresses by
the SMTP server.

With this technique, the mail server verifies the complete address
(parts before and after the @ sign). It establishes a successful SMTP
connection to the mail exchanger (MX) of the email address. Then it
queries the exchanger and makes sure that it accepts the address as a
valid one. This is done in the same way as sending an email to the
address; however, the process is stopped after the mail exchanger accepts
or rejects the address.

It is not advisable to enable the SMTP verification when you have a
lot of email addresses with different domains to check as some mail
servers may not reply correctly and even black list your IP
address.

The following is a list of cases when the
SMTP verification will not work properly:

  • When the mail server requires authentication,
  • When the mail server has a security policy that may put your IP address
    on a black list and reject your queries,
  • When the mail server takes too long to reply (timeout),
  • Any other unexpected exception generated by the mail server.

When the mail server accepts all emails from a domain, the component cannot verify whether the email address exists or not.

In all these cases, the component results will only take into account
the results from the other rules you set in the component
settings.
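
The callback technique described above can be sketched with Python's standard smtplib module. This is an illustration, not the component's implementation. The function name, the sender address and the smtp_factory parameter are hypothetical. The MX host is passed in as an argument because looking it up requires a DNS query that the Python standard library does not provide. The mapping of reply codes to statuses is also an assumption: 2xx replies are read as acceptance, and 550, 551 and 553 as rejection of the mailbox.

```python
import smtplib

# Reply codes treated as "mailbox does not exist" in this sketch (assumption).
REJECT_CODES = {550, 551, 553}


def smtp_callback(email, mx_host, sender="verify@example.com",
                  smtp_factory=smtplib.SMTP, timeout=10):
    """Ask the mail exchanger whether it accepts `email`, without sending a message.

    Returns "VERIFIED", "REJECTED", or None when the callback is inconclusive
    (connection failure, time out, authentication required, other errors).
    """
    try:
        server = smtp_factory(mx_host, 25, timeout=timeout)
    except (OSError, smtplib.SMTPException):
        return None
    try:
        server.helo()
        server.mail(sender)
        # Stop here: the RCPT reply is all that is needed, no DATA is sent.
        code, _ = server.rcpt(email)
    except (OSError, smtplib.SMTPException):
        return None
    finally:
        try:
            server.quit()
        except (OSError, smtplib.SMTPException):
            pass
    if 200 <= code < 300:
        return "VERIFIED"
    if code in REJECT_CODES:
        return "REJECTED"
    return None
```

A None result corresponds to the cases listed above, where only the other rules set in the component would decide the status. Note that a server accepting every address for its domain would always yield VERIFIED here, which is why such servers make the callback uninformative.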

Advanced settings

tStatCatcher Statistics

Select this check box to gather the Job processing metadata at the Job level
as well as at each component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the
Talend Studio User Guide.

Usage

Usage rule

This component is usually used as an intermediate component, and it requires an
input component and an output component.

Simplified pattern syntax for tVerifyEmail

tVerifyEmail enables you to check the local part of email
addresses against a simplified pattern.

The following table lists the simplified pattern syntax elements.

Syntax      Equivalent regex   Description
9           [0-9]              A digit
a           [a-z]              A lowercase ASCII letter
A           [A-Z]              An uppercase ASCII letter
w           [a-z]+             One or more lowercase ASCII letters
W           [A-Z]+             One or more uppercase ASCII letters
?           .                  Any character
*           .*                 Any string
.           \.                 The period symbol
[-_+]       [-_+]              Any of the symbols found between square brackets
<pattern>   pattern            Any standard regular expression placed between angle brackets
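
To show how the syntax elements in the table combine, here is a small Python sketch that translates a simplified pattern into a regular expression. It is an illustration under stated assumptions, not the component's own translation code: the function name is hypothetical, the period is translated to an escaped dot so that it matches only a literal period as its description states, characters not listed in the table are treated as literals, and a regular expression placed between angle brackets is assumed not to contain a closing angle bracket itself.

```python
import re

# One-character syntax elements from the table and their regex equivalents.
_SIMPLE_TOKENS = {
    "9": "[0-9]",
    "a": "[a-z]",
    "A": "[A-Z]",
    "w": "[a-z]+",
    "W": "[A-Z]+",
    "?": ".",
    "*": ".*",
    ".": r"\.",
}


def simplified_to_regex(pattern: str) -> str:
    """Translate a simplified pattern into an equivalent regular expression."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "[":
            # Square brackets: keep the whole character class unchanged.
            end = pattern.index("]", i)
            out.append(pattern[i:end + 1])
            i = end + 1
        elif ch == "<":
            # Angle brackets: keep the enclosed regular expression, drop the brackets.
            end = pattern.index(">", i)
            out.append(pattern[i + 1:end])
            i = end + 1
        else:
            # Assumption: characters outside the table are matched literally.
            out.append(_SIMPLE_TOKENS.get(ch, re.escape(ch)))
            i += 1
    return "".join(out)
```

With this sketch, the simplified pattern a.w (one lowercase letter, a period, then one or more lowercase letters) becomes [a-z]\.[a-z]+, which matches a local part such as j.smith.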

Verifying email addresses against column content and domain names

This scenario applies only to Talend Data Management Platform, Talend Big Data Platform, Talend Real Time Big Data Platform, Talend Data Services Platform, Talend MDM Platform and Talend Data Fabric.

This scenario describes a Job which uses:

  • the tFixedFlowInput component to generate the
    email addresses to be analyzed,

  • the tVerifyEmail component to format the email
    addresses through the Talend email API,

  • the tFileOutputExcel component to output the
    formatted addresses in an .xls file.

tVerifyEmail_1.png

Setting up the Job

  1. Drop the following components from the Palette onto the design workspace: tFixedFlowInput, tVerifyEmail and
    tFileOutputExcel.
  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its
    Basic settings view in the Component tab.

    tVerifyEmail_2.png

  2. Create the schema through the Edit Schema
    button.

    In the open dialog box, click the [+] button
    and add the columns that will hold input address data. For this example, add
    firstname, lastname and
    email.
  3. Click OK.
  4. In the Number of rows field, enter
    1.
  5. In the Mode area, select the Use Inline Table option.
  6. In the Inline table, use the [+] button to add lines to the table and then enter the
    address data you want to analyze.

Verifying and formatting email addresses

  1. Double-click tVerifyEmail to display the
    Basic settings view and define the component
    properties.

    tVerifyEmail_3.png

  2. If required, click Sync columns to retrieve
    the schema defined in the input component.
  3. Click the Edit schema button to open the
    schema dialog box.

    tVerifyEmail proposes predefined read-only
    address columns as shown in the capture below.
    tVerifyEmail_4.png

    The VerificationLevel column returns the
    verification status of input email addresses. The SuggestedEmail column returns suggested content for the email part
    before the @ sign. This column is shown in the output schema only if you select
    the Use column content option in the LOCAL Part Options section. For further information about
    output columns, see tVerifyEmail Standard properties.
  4. Move any of the input columns to the output schema if you want to show them in
    the verification results, then click OK and accept
    the prompt to propagate the changes.
  5. From the Column to validate list, select the
    email column.
  6. In the LOCAL Part Options section, select the
    Use column content option.

    In this example, you want to check the email part before the @ sign to see if
    it starts with the first letter of the first name followed by the family name,
    all in lower case. If the local part does not match what you have defined,
    tVerifyEmail will rewrite it by using the
    parameters you define.
  7. In the DOMAIN Part Options, select:

    • the Check the default Top-level Domains and the
      following ones
      check box and define in the table the
      additional top-level domain against which you want to validate email
      addresses.

    • the Check domains with a black list
      check box and define in the Domain list
      table the domain to consider as black listed.

  8. Select the Check with mail server callback
    check box to enable the mail server to verify the complete address and accept or
    reject the email.
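
The rule set up in step 6 (first letter of the first name followed by the family name, all in lower case) can be sketched in Python as follows. This only illustrates the logic and is not the component's code: the function name and the sample values in the usage note are hypothetical, the comparison is assumed to be case-sensitive, and the suggestion is assumed to be the rebuilt local part joined to the original domain.

```python
def verify_with_column_content(firstname, lastname, email):
    """Return (status, suggested_email) for the local part rule of this scenario."""
    local, _, domain = email.rpartition("@")
    # Expected local part: first letter of the first name + family name, lower case.
    expected = (firstname[:1] + lastname).lower()
    if local == expected:
        return "VALID", None
    # Assumption: the suggestion keeps the original domain part.
    return "CORRECTED", f"{expected}@{domain}"
```

For a row holding John, Smith and john.smith@example.com, this sketch returns CORRECTED with the suggestion jsmith@example.com, while jsmith@example.com itself is returned as VALID.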

Configuring the output component and executing the Job

  1. Double-click the tFileOutputExcel component
    to display the Basic settings view and define
    the component properties.

    tVerifyEmail_5.png

  2. Set the destination file name as well as the sheet name and then select the
    Define all columns auto size check box.
  3. Save your Job and press F6 to execute
    it.

    The tVerifyEmail component analyzes email
    addresses and corrects those that do not match what you have defined in the
    local and domain part options.
  4. Right-click the output component and select Data
    Viewer
    to display the formatted email addresses.

    tVerifyEmail_6.png

    tVerifyEmail matches input addresses against
    the rule you set in the LOCAL part options
    section and the parameters you set for the domain names.
    The VerificationLevel output column returns
    the status as VALID, INVALID,
    CORRECTED or REJECTED, depending
    on the options you selected in the tVerifyEmail basic
    settings.
    All email addresses labeled as CORRECTED have a suggested
    address in the SuggestedEmail output column.

tVerifyEmail properties for Apache Spark Batch

These properties are used to configure tVerifyEmail running in the Spark Batch Job framework.

The Spark Batch
tVerifyEmail component belongs to the Data Quality family.

The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.

Basic settings

Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Edit Schema

Click Edit
schema
to make changes to the schema.

Note: If you
make changes, the schema automatically becomes built-in.

The output schema of tVerifyEmail has different read-only
columns depending on the options you select in the component Basic
settings
view. Read-only output columns include:

VerificationLevel: provides the verification
status of the processed email addresses, as follows:

VALID: the email address complies with the defined
rule.

INVALID: the email address does not comply with the
defined rule.

CORRECTED: the input email address does not comply with the
defined rule and has been corrected by using the content of the selected columns. This
status is available only when you select the Use column content
option in the LOCAL Part Options section.

VERIFIED: the email address exists at the domain.
This status is available only when you select the Check with mail
server callback option.

REJECTED: the email address does not exist at the domain.
This status is available only when you select the Check with mail
server callback option.

Suggested_Email: provides suggested content
for the email part before the @ sign. The string is built from the columns you
select in the Use column content view.

Column to validate

Select from the list the column you want to validate with tVerifyEmail.

Check the entire email with regular expression

Select this check box if you want to match the complete email address against a specific
regular expression.

Complete regular expression: enter the regular expression
against which you want to match email addresses.

This match is done as a first step to optimize the matching process and exclude addresses
that have problems before going any further to match the local and domain parts of email
addresses.

LOCAL Part Options

Fields in this section will vary according to what option you select. “LOCAL part” in an
email address refers to the string before the @ sign.

Use regular expression: enter in the Pattern field the expression against which you want to check the
local part of the email address.

Use simplified pattern: enter in the Pattern field the simplified pattern against which you want
to check the local part of the email address. Select the Show
syntax of simplified pattern
option to display the syntax to use for
simplified patterns. For more information about the syntax, see Simplified pattern syntax for tVerifyEmail.

Use column content: use the fields in this view to
decide the content against which you want to check the local part of the email. If the local
part does not match what you have defined, it will be rewritten by using the content of the
fields.

Enable case-sensitive pattern matching: select this
check box to enable case-sensitive pattern matching of the local part of email addresses.
You can use case-sensitive pattern matching with each of the above options.

DOMAIN Part Options

Fields in this view will vary according to what option you select.

Check the Top-level Domains and the following ones:
select this check box to verify the part of the email address which follows the last dot.
You can use the Additional Top-level Domains table to add
additional top-level domains against which you want to validate email addresses.

Check domains with a black list: select this option to
check email domains against the domains you define as black listed in the
Domain list table.

Check domains with a white list: select this option to
check email domains against the domains you define as white listed in the
Domain list table.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the
Talend Studio User Guide.

Usage

Usage rule

This component is used as an intermediate step.

This component, along with the Spark Batch component Palette it belongs to,
appears only when you are creating a Spark Batch Job.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional
Talend
data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.

tVerifyEmail properties for Apache Spark Streaming

These properties are used to configure tVerifyEmail running in the Spark Streaming Job framework.

The Spark Streaming
tVerifyEmail component belongs to the Data Quality family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Schema

A schema is a row description. It defines the number of fields
(columns) to be processed and passed on to the next component. When you create a Spark
Job, avoid the reserved word line when naming the
fields.

 

Built-In: You create and store the schema locally for this component
only.

 

Repository: You have already created the schema and stored it in the
Repository. You can reuse it in various projects and Job designs.

Edit Schema

Click Edit
schema
to make changes to the schema.

Note: If you
make changes, the schema automatically becomes built-in.

The output schema of tVerifyEmail has different read-only
columns depending on the options you select in the component Basic
settings
view. Read-only output columns include:

VerificationLevel: provides the verification
status of the processed email addresses, as follows:

VALID: the email address complies with the defined
rule.

INVALID: the email address does not comply with the
defined rule.

CORRECTED: the input email address does not comply with the
defined rule and has been corrected by using the content of the selected columns. This
status is available only when you select the Use column content
option in the LOCAL Part Options section.

VERIFIED: the email address exists at the domain.
This status is available only when you select the Check with mail
server callback option.

REJECTED: the email address does not exist at the domain.
This status is available only when you select the Check with mail
server callback option.

Suggested_Email: provides suggested content
for the email part before the @ sign. The string is built from the columns you
select in the Use column content view.

Column to validate

Select from the list the column you want to validate with tVerifyEmail.

Check the entire email with regular expression

Select this check box if you want to match the complete email address against a specific
regular expression.

Complete regular expression: enter the regular expression
against which you want to match email addresses.

This match is done as a first step to optimize the matching process and exclude addresses
that have problems before going any further to match the local and domain parts of email
addresses.

LOCAL Part Options

Fields in this section will vary according to what option you select. “LOCAL part” in an
email address refers to the string before the @ sign.

Use regular expression: enter in the Pattern field the expression against which you want to check the
local part of the email address.

Use simplified pattern: enter in the Pattern field the simplified pattern against which you want
to check the local part of the email address. Select the Show
syntax of simplified pattern
option to display the syntax to use for
simplified patterns. For more information about the syntax, see Simplified pattern syntax for tVerifyEmail.

Use column content: use the fields in this view to
decide the content against which you want to check the local part of the email. If the local
part does not match what you have defined, it will be rewritten by using the content of the
fields.

Enable case-sensitive pattern matching: select this
check box to enable case-sensitive pattern matching of the local part of email addresses.
You can use case-sensitive pattern matching with each of the above options.

DOMAIN Part Options

Fields in this view will vary according to what option you select.

Check the Top-level Domains and the following ones:
select this check box to verify the part of the email address which follows the last dot.
You can use the Additional Top-level Domains table to add
additional top-level domains against which you want to validate email addresses.

Check domains with a black list: select this option to
check email domains against the domains you define as black listed in the
Domain list table.

Check domains with a white list: select this option to
check email domains against the domains you define as white listed in the
Domain list table.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl +
Space
to access the variable list and choose the variable to use from it.

For further information about variables, see the
Talend Studio User Guide.

Usage

Usage rule

This component, along with the Spark Streaming component Palette it belongs to, appears
only when you are creating a Spark Streaming Job.

This component is used as an intermediate step.

You need to use the Spark Configuration tab in the
Run view to define the connection to a given Spark cluster
for the whole Job.

This connection is effective on a per-Job basis.

For further information about a Talend Spark Streaming Job, see the sections
describing how to create, convert and configure a Talend Spark Streaming Job
in the Talend Open Studio for Big Data Getting Started Guide.

Note that in this documentation, unless otherwise explicitly stated, a
scenario presents only Standard Jobs, that is to
say traditional
Talend
data integration Jobs.

Spark Connection

In the Spark
Configuration
tab in the Run
view, define the connection to a given Spark cluster for the whole Job. In
addition, since the Job expects its dependent jar files for execution, you must
specify the directory in the file system to which these jar files are
transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the
      Google Storage staging bucket
      field in the Spark configuration
      tab.

    • When using HDInsight, specify the blob to be used for Job
      deployment in the Windows Azure Storage
      configuration
      area in the Spark
      configuration
      tab.

    • When using Altus, specify the S3 bucket or the Azure
      Data Lake Storage for Job deployment in the Spark
      configuration
      tab.
    • When using Qubole, add a
      tS3Configuration to your Job to write
      your actual business data in the S3 system with Qubole. Without
      tS3Configuration, this business data is
      written in the Qubole HDFS system and destroyed once you shut
      down your cluster.
    • When using on-premise
      distributions, use the configuration component corresponding
      to the file system your cluster is using. Typically, this
      system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the
    configuration component corresponding to the file system your cluster is
    using, such as tHDFSConfiguration or
    tS3Configuration.

    If you are using Databricks without any configuration component present
    in your Job, your business data is written directly in DBFS (Databricks
    Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component
yet.


Document get from Talend https://help.talend.com