tSample
Returns a sample subset of the data being processed.
tSample generates a sample dataset from the incoming flow.
Depending on the Talend solution you are using, this component can be used in one, some or all of the following Job frameworks:
- MapReduce: see tSample MapReduce properties.
- Spark Batch: see tSample properties for Apache Spark Batch.
tSample MapReduce properties
These properties are used to configure tSample running in the MapReduce Job framework.
The MapReduce tSample component belongs to the Processing family.
The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.
Basic settings
| Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to ... Click Edit schema to make changes to the schema. |
| Sampling fraction | Enter the ratio of the sample size to the size of the data being processed. |
| Use a seed for random number generator | Enter a positive seed number, that is, the starting number for the random number generator. |
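How the sampling fraction and the seed interact can be illustrated outside of Talend. The following is only a minimal sketch in plain Python, assuming that each row is kept independently with a probability equal to the sampling fraction (Bernoulli sampling). The function name `sample_rows` and its parameters are hypothetical and are not part of the component; the component's actual implementation is not described on this page.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; not the tSample implementation.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # A fixed seed makes the sequence of random draws reproducible.
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(1000))
first = sample_rows(rows, fraction=0.1, seed=42)
second = sample_rows(rows, fraction=0.1, seed=42)
print(len(first), first == second)
```

Under this assumption the sample size is only approximately `fraction` times the input size, and reusing the same seed on the same input returns the same sample.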
Global Variables
| Global Variables | ERROR_MESSAGE: the error message generated by the component. A Flow variable functions during the execution of a component, while an After variable functions after the execution of the component. To fill up a field or expression with a variable, press Ctrl + Space to access the variable list. For further information about variables, see ... |
Usage
| Usage rule | This component is an intermediate component that passes sampled datasets to the ... Note that the knowledge of statistics and sampling is ... This component, along with the MapReduce family it belongs to, appears only when you are ... Note that in this documentation, unless otherwise ... |
| Hadoop Connection | You need to use the Hadoop Configuration tab in the ... This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Map/Reduce version of this component yet.
tSample properties for Apache Spark Batch
These properties are used to configure tSample running in the Spark Batch Job framework.
The Spark Batch tSample component belongs to the Processing family.
The component in this framework is available only if you have subscribed to one of the Talend solutions with Big Data.
Basic settings
| Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to ... Click Edit schema to make changes to the schema. |
| Sampling with replacement | Select this check box to perform the sampling with replacement to ... |
| Sampling fraction | Enter the ratio of the sample size to the size of the data being processed. |
| Use a seed for random number generator | Enter a positive seed number, that is, the starting number for the random number generator. |
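Sampling with replacement means that the same row can appear more than once in the sample. The following is only a minimal sketch in plain Python, assuming that each row is repeated a Poisson-distributed number of times whose mean is the sampling fraction, which is a common way to sample with replacement over distributed data. The helper names `poisson_draw` and `sample_with_replacement` are hypothetical, and whether the component works this way internally is an assumption.

```python
import math
import random


def poisson_draw(rng, mean):
    """Draw a Poisson-distributed count with the given mean (Knuth's method)."""
    threshold = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(rows, fraction, seed=None):
    """Repeat each row a random number of times; the mean count is `fraction`.

    Hypothetical helper for illustration only; not the tSample implementation.
    """
    if fraction < 0:
        raise ValueError("fraction must not be negative")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    sampled = []
    for row in rows:
        sampled.extend([row] * poisson_draw(rng, fraction))
    return sampled


rows = list(range(1000))
with_dupes = sample_with_replacement(rows, fraction=0.5, seed=7)
again = sample_with_replacement(rows, fraction=0.5, seed=7)
print(len(with_dupes), len(set(with_dupes)), with_dupes == again)
```

In this sketch the fraction is the expected number of times each row is selected, so the sample can contain duplicates and a fraction above 1 yields a sample larger than the input.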
Usage
| Usage rule | This component is an intermediate component that passes sampled datasets to the ... Note that the knowledge of statistics and sampling is ... This component, along with the Spark Batch component Palette it belongs to, appears only ... Note that in this documentation, unless otherwise ... |
| Spark Connection | You need to use the Spark Configuration tab in the Run view to define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files. This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Batch version of this component yet.