tSample
Returns a sample subset of the data being processed.
tSample generates a sample dataset from the incoming flow.
Depending on the Talend product you are using, this component can be used in one, some or all of the following Job frameworks:
- MapReduce: see tSample MapReduce properties (deprecated).
- Spark Batch: see tSample properties for Apache Spark Batch.
tSample MapReduce properties (deprecated)
These properties are used to configure tSample running in the MapReduce Job framework.
The MapReduce tSample component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
The MapReduce framework is deprecated from Talend 7.3 onwards. Use Talend Jobs for Apache Spark to accomplish your integration tasks.
Basic settings
Schema and Edit |
A schema is a row description. It defines the number of fields
Click Edit |
Sampling fraction |
Enter the sample size ratio to the data being processed. For |
Use a seed for random number generator |
Enter a positive seed number (the starting number for a random |
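The interplay between the sampling fraction and the seed can be sketched in a few lines of Python. This is only an illustration of the general idea, not the code the component generates: it assumes that each record is kept independently with a probability equal to the fraction, and the function name sample_rows and its parameters are hypothetical.

```python
import random

def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Assumption: per-record random selection; the component's actual
    sampling logic may differ.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # A fixed seed makes the random sequence, and so the sample, repeatable.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))
first = sample_rows(rows, fraction=0.1, seed=42)
second = sample_rows(rows, fraction=0.1, seed=42)
```

With this approach the sample size is only approximately fraction times the number of input rows, and running the function twice with the same seed returns the same rows.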
Global Variables
Global Variables |
ERROR_MESSAGE: the error message generated by the
A Flow variable functions during the execution of a component while an After variable
To fill up a field or expression with a variable, press Ctrl +
For further information about variables, see |
Usage
Usage rule |
This component is an intermediate component that passes sampled datasets to the
Note that the knowledge of statistics and sampling is
This component, along with the MapReduce family it belongs to, appears only when you are
Note that in this documentation, unless otherwise |
Hadoop Connection |
You need to use the Hadoop Configuration tab in the
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the MapReduce version of this component yet.
tSample properties for Apache Spark Batch
These properties are used to configure tSample running in the Spark Batch Job framework.
The Spark Batch tSample component belongs to the Processing family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Schema and Edit |
A schema is a row description. It defines the number of fields
Click Edit |
Sampling with replacement |
Select this check box to perform the sampling with replacement to |
Sampling fraction |
Enter the sample size ratio to the data being processed. For |
Use a seed for random number generator |
Enter a positive seed number (the starting number for a random |
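The difference between sampling with and without replacement can be illustrated with a short Python sketch. It is not the component's or Spark's implementation: it assumes a fixed sample size of fraction times the row count, and the function name sample_fraction and its parameters are hypothetical.

```python
import random

def sample_fraction(rows, fraction, with_replacement=False, seed=None):
    """Draw round(fraction * len(rows)) rows from `rows`.

    Assumption: a fixed-size draw; an engine such as Spark may instead
    decide row by row, giving only an approximate sample size.
    """
    if fraction < 0:
        raise ValueError("fraction must not be negative")
    if not rows:
        return []
    rng = random.Random(seed)  # same seed, same sample
    k = round(fraction * len(rows))
    if with_replacement:
        # A row can be drawn more than once, so k may exceed len(rows).
        return rng.choices(rows, k=k)
    if k > len(rows):
        raise ValueError("fraction above 1 requires sampling with replacement")
    # Each row can appear at most once.
    return rng.sample(rows, k)

rows = ["a", "b", "c", "d", "e"]
with_dups = sample_fraction(rows, 2.0, with_replacement=True, seed=7)
unique = sample_fraction(rows, 0.6, with_replacement=False, seed=7)
```

With replacement, the same row may appear several times in the output and the fraction can be greater than 1; without replacement, every output row is distinct and the fraction cannot exceed 1.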
Usage
Usage rule |
This component is an intermediate component that passes sampled datasets to the
Note that the knowledge of statistics and sampling is
This component, along with the Spark Batch component Palette it belongs to,
Note that in this documentation, unless otherwise explicitly stated, a |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Batch version of this component yet.