
tCollectAndCheck – Docs for ESB 7.x

tCollectAndCheck

Shows and validates the result of a component test.

tCollectAndCheck is available only in a test case about a given component and is added automatically to the test case you are using. For further information about test cases, see the Talend Studio User Guide.

tCollectAndCheck receives the output from the component being tested, loads the given reference files to be compared with this output, and returns whether the output matches the expected result.

Depending on the Talend product you are using, this component can be used in one, some, or all of the following Job frameworks:

tCollectAndCheck properties for Apache Spark Batch

These properties are used to configure tCollectAndCheck running in the Spark Batch Job framework.

The Spark Batch
tCollectAndCheck component belongs to the Technical family.

The component in this framework is available in all subscription-based Talend products with Big Data
and Talend Data Fabric.

Basic settings

Separator

Enter a character, a string, or a regular expression to separate fields for the transferred data (the separators and the reference data are illustrated in the sketch after these settings).

Line separator

The separator used to identify the end of a row.

Use context variable

If you have already created the context variable representing the reference file to be
used, select this check box and enter this variable in the Variable name field that is
displayed.

The syntax to call a variable is context.VariableName.

For further information about variables, see the Talend Studio User Guide.

Reference data

If you do not want to use context variables to represent the reference data to be used,
enter this reference data directly in this field.

Keep the order from the reference

If the RDDs to be checked are sorted, select this check box to keep your reference data
ordered.
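
How these settings fit together can be illustrated with a minimal Java sketch. It is a hypothetical illustration only, not the component's code: it assumes the reference data is a single string, splits it with the Line separator and Separator values, and compares it row by row with the collected output, that is, an ordered comparison as when Keep the order from the reference is selected. If Use context variable were selected, the reference string would come from the context variable (context.VariableName) rather than from the Reference data field. All class and method names are invented for the example.

  import java.util.Arrays;
  import java.util.List;

  // Hypothetical sketch: parse reference data with the configured separators
  // and compare it, row by row, with the rows collected from the tested component.
  public class ReferenceCheckSketch {

      // Returns true when every collected row matches the reference data.
      static boolean matches(String referenceData, String fieldSeparator,
                             String lineSeparator, List<String[]> collectedRows) {
          String[] referenceRows = referenceData.split(lineSeparator);
          if (referenceRows.length != collectedRows.size()) {
              return false;
          }
          for (int i = 0; i < referenceRows.length; i++) {
              // The -1 limit keeps trailing empty fields instead of dropping them.
              String[] expectedFields = referenceRows[i].split(fieldSeparator, -1);
              if (!Arrays.equals(expectedFields, collectedRows.get(i))) {
                  return false;
              }
          }
          return true;
      }

      public static void main(String[] args) {
          // Reference data using ";" as the field separator and "\n" as the line separator.
          String reference = "1;Alice\n2;Bob";
          List<String[]> collected = Arrays.asList(
                  new String[]{"1", "Alice"},
                  new String[]{"2", "Bob"});
          System.out.println(matches(reference, ";", "\n", collected)); // prints true
      }
  }

Note that String.split interprets its argument as a regular expression, which is consistent with the Separator field accepting a regular expression; a literal separator such as "|" would need to be escaped.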

Advanced settings

When the reference is empty, expect no incoming value

By default, this check box is cleared, meaning that when a field in the reference data is empty, the test expects an equally empty field in the incoming datasets being verified in order to validate the test result.

If you want the test to expect no value when the reference is empty, select this check box.
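
The difference can be shown with a short, hypothetical Java sketch; it is not the component's implementation, and the class and method names are invented for the example.

  // Hypothetical sketch: how one reference field might be compared with one
  // incoming field, depending on the "expect no incoming value" option.
  public class EmptyReferenceSketch {

      static boolean fieldMatches(String referenceField, String incomingField,
                                  boolean expectNoValueWhenEmpty) {
          if (referenceField.isEmpty()) {
              return expectNoValueWhenEmpty
                      ? incomingField == null      // option selected: expect no value at all
                      : "".equals(incomingField);  // default: expect an equally empty field
          }
          return referenceField.equals(incomingField);
      }

      public static void main(String[] args) {
          System.out.println(fieldMatches("", "", false));   // true: empty matches empty
          System.out.println(fieldMatches("", null, false)); // false: an empty field was expected, but no value came
          System.out.println(fieldMatches("", null, true));  // true: no value was expected
      }
  }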

Usage

Usage rule

This component is used as an end component and requires an input link.

This component is added automatically to a test case being created to show the test result
in the console of the Run view.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.

    • When using on-premise distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Batch version of this component
yet.

tCollectAndCheck properties for Apache Spark Streaming

These properties are used to configure tCollectAndCheck running in the Spark Streaming Job framework.

The Spark Streaming
tCollectAndCheck component belongs to the Technical family.

This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.

Basic settings

Separator

Enter a character, a string, or a regular expression to separate fields for the transferred data (the three separators and the reference data are illustrated in the sketch after these settings).

Line separator

The separator used to identify the end of a row.

Micro batch separator

Enter the separator used to identify the end of a micro batch in the
data stream.

Use context variable

If you have already created the context variable representing the reference file to be
used, select this check box and enter this variable in the Variable name field that is
displayed.

The syntax to call a variable is context.VariableName.

For further information about variables, see the Talend Studio User Guide.

Reference data

If you do not want to use context variables to represent the reference data to be used,
enter this reference data directly in this field.

Keep the order from the reference

If the RDDs to be checked are sorted, select this check box to keep your reference data
ordered.
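
As a hypothetical illustration, not the component's code, the following Java sketch splits a reference string first with the Micro batch separator, then with the Line separator, then with the Separator; all names in it are invented for the example.

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical sketch: reference data for a streaming test is split into
  // micro batches, then rows, then fields, using the three separators.
  public class StreamingReferenceSketch {

      static List<List<String[]>> parse(String referenceData, String microBatchSeparator,
                                        String lineSeparator, String fieldSeparator) {
          List<List<String[]>> batches = new ArrayList<>();
          for (String batch : referenceData.split(microBatchSeparator)) {
              List<String[]> rows = new ArrayList<>();
              for (String row : batch.split(lineSeparator)) {
                  // The -1 limit keeps trailing empty fields instead of dropping them.
                  rows.add(row.split(fieldSeparator, -1));
              }
              batches.add(rows);
          }
          return batches;
      }

      public static void main(String[] args) {
          // Two micro batches separated by "#", rows by "\n", fields by ";".
          String reference = "1;Alice\n2;Bob#3;Carol";
          List<List<String[]>> batches = parse(reference, "#", "\n", ";");
          System.out.println(batches.size());           // prints 2
          System.out.println(batches.get(1).get(0)[1]); // prints Carol
      }
  }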

Advanced settings

When the reference is empty, expect no incoming value

By default, this check box is cleared, meaning that when a field in the reference data is empty, the test expects an equally empty field in the incoming datasets being verified in order to validate the test result.

If you want the test to expect no value when the reference is empty, select this check box.

Usage

Usage rule

This component is used as an end component and requires an input link.

This component is added automatically to a test case being created to show the test result
in the console of the Run view.

Spark Connection

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:

  • Yarn mode (Yarn client or Yarn cluster):

    • When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.

    • When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.

    • When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.

    • When using Qubole, add a tS3Configuration to your Job to write your actual business data in the S3 system with Qubole. Without tS3Configuration, this business data is written in the Qubole HDFS system and destroyed once you shut down your cluster.

    • When using on-premise distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.

  • Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.

    If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

This connection is effective on a per-Job basis.

Related scenarios

No scenario is available for the Spark Streaming version of this component
yet.


Document retrieved from Talend: https://help.talend.com