tModelEncoder
Performs featurization operations to transform data into the format expected by the
model training components such as tLogisticRegressionModel or tKMeansModel.
tModelEncoder receives data from its
preceding components, applies a wide range of feature-processing algorithms to transform
the selected columns of this data, and sends the result to the model training component that
follows in order to train and create a predictive model.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
-
Spark Batch: see tModelEncoder properties for Apache Spark Batch.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
-
Spark Streaming: see tModelEncoder properties for Apache Spark Streaming.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
ML feature-processing algorithms in Talend
This table presents the feature processing algorithms you can use in the tModelEncoder component.
1: These are the parameters you can set in the Parameters column of the
Transformation table in the Basic settings tab of tModelEncoder. If you do not set
any parameters yourself, the default ones, if any, are used.
HashingTF
As a text-processing algorithm, HashingTF converts input data into fixed-length feature vectors to reflect
the importance of a term (a word or a sequence of words) by calculating the frequency
with which these terms appear in the input data.
This algorithm is available in Spark Batch
and Spark Streaming Jobs.
In a Spark Batch Job, it is typically
used along with the IDF (Inverse document frequency) algorithm to make the weight
calculation more reliable. In the context of a Talend Spark Job, you need to add a second tModelEncoder to apply the Inverse document
frequency algorithm to the output of the HashingTF computation.
The data must already be
segmented before being sent to the HashingTF
computation; therefore, if the data to be used has not been segmented, you need to use
another tModelEncoder to apply the Tokenizer algorithm or the Regex
tokenizer algorithm to prepare the data.
For further
details about the HashingTF implementation in
Spark, see HashingTF from the Spark
documentation.
- type of the input column: Object
- type of the output column: Vector
| Parameter | Description |
|---|---|
| numFeatures | The number of features that defines the dimension of the feature vectors. For example, you can set numFeatures to 2^20 (1,048,576) to define the dimension. If you do not set this parameter, the default value, 2^18 (262,144), is used. The output vectors are sparse vectors. For example, a sparse vector written (3,[0,2],[1.0,3.0]) has size 3 and carries the values 1.0 and 3.0 at the indices 0 and 2. For further information about how to read a sparse vector, see the section about local vectors in the Spark documentation. |
For further information about the Spark API of HashingTF, see ML HashingTF.
It can be used to prepare data for the Classification or the Clustering
components from the Machine Learning family in order to create a sentiment analysis
model.
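To illustrate what this transformation does underneath, here is a minimal Scala sketch of the Spark ML HashingTF transformer itself (not of the tModelEncoder settings); the sample data and column names are hypothetical:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HashingTFSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, "good morning world"),
  (1, "good night world")
)).toDF("id", "sentence")

// Segment the text first, the equivalent of a preceding tModelEncoder applying Tokenizer.
val tokens = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)

// numFeatures defines the dimension of the hashed feature space (the default is 2^18).
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 20)
tf.transform(tokens).select("tf").show(false)
```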
Inverse document frequency
As a text-processing algorithm, Inverse document frequency (IDF) is often used to process
the output of the HashingTF computation in order
to downplay the importance of the terms that appear in too many documents.
This
algorithm is available in Spark Batch Jobs.
It requires a
tModelEncoder component performing the HashingTF computation to provide input data.
For further details about the IDF
implementation in Spark, see IDF from the Spark
documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| minDocFreq | The minimum number of documents that should contain a term for that term to be taken into account in the weight calculation. The default is minDocFreq=0. |

For further information about the Spark API of IDF, see IDF.
It can be used to prepare data for the Classification or the
Clustering components from the Machine Learning family in order to create a sentiment
analysis model.
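As a rough illustration of the underlying Spark behavior, the following Scala sketch chains HashingTF and IDF the way two tModelEncoder components would be chained; the sample data, column names and the minDocFreq value are hypothetical:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IDFSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, "spark is great"), (1, "spark is fast"), (2, "hello world")
)).toDF("id", "sentence")

val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").transform(words)

// minDocFreq ignores terms that appear in fewer documents than this threshold.
val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").setMinDocFreq(2).fit(tf)
idfModel.transform(tf).select("tfidf").show(false)
```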
Word2Vector
Word2Vector
transforms a document into a feature vector, for use in other learning computations such
as text similarity calculation.
This algorithm is available in Spark Batch Jobs.
For further details about the Word2Vector implementation in Spark, see Word2Vec from the Spark documentation.
In a Spark Batch Job, this algorithm expects:
- type of the input column: List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| maxIter | Maximum number of iterations for obtaining the optimal result. For example, maxIter=5. The default is maxIter=1. |
| minCount | Minimum number of times a token should appear to be included in the vocabulary of the Word2Vector model. The default is minCount=5. |
| numPartitions | Number of partitions. |
| seed | The random seed number. |
| stepSize | Size of the step for each iteration. This defines the learning rate. The default is stepSize=0.025. |
| vectorSize | Size of each feature vector. The default is vectorSize=100, meaning each feature is represented by a vector of 100 values. |
If you need to set several parameters, separate these
parameters using semicolons (;), for example,
maxIter=5;minCount=4.
For further information about
the Spark API of Word2Vector, see Word2Vec.
It
can be used to prepare data for the Classification or the Clustering components from the
Machine Learning family in order to, for example, find similar user comments about a
product.
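For reference, here is a minimal Scala sketch of the Spark ML Word2Vec estimator that these parameters configure; the sample comments and parameter values are hypothetical:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Word2VecSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  "the product works well".split(" "),
  "the product broke quickly".split(" ")
).map(Tuple1.apply)).toDF("words")

// vectorSize, minCount and maxIter mirror the semicolon-separated Talend parameters.
val w2v = new Word2Vec().setInputCol("words").setOutputCol("features")
  .setVectorSize(50).setMinCount(1).setMaxIter(5)
w2v.fit(docs).transform(docs).show(false)
```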
CountVectorizer
CountVectorizer extracts the
most frequent terms from a collection of text documents and converts these terms into
vectors of token counts.
This algorithm is available in Spark Batch Jobs.
It requires a tModelEncoder
component performing the Tokenizer
or the Regex tokenizer computation
to provide input data of the List
type.
For further information about the CountVectorizer implementation in Spark, see CountVectorizer.
- type of the input column: List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| minDF | The minimum number of documents in which a term must appear in order to be included in the vocabulary. The default value is minDF=1. If you put a value between 0 and 1 (exclusive), it is read as a fraction of the documents rather than an absolute count. |
| minTF | The minimum number of times a term must appear in a document for its occurrences to be counted in that document. The default value is minTF=1. |
| vocabSize | The maximum size of the vocabulary built by CountVectorizer. The default value is 2^18 (262,144). |
For further information about the Spark API of CountVectorizer, see CountVectorizer.
It is often used to process text in terms of text mining for the
Classification or the Clustering components.
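The following Scala sketch shows the corresponding Spark ML CountVectorizer estimator; the token lists and parameter values are hypothetical:

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountVectorizerSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, Seq("a", "b", "c")),
  (1, Seq("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// vocabSize caps the vocabulary; minDF is the minimum number of documents a term must appear in.
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
  .setVocabSize(3).setMinDF(2)
cv.fit(docs).transform(docs).show(false)
```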
Binarizer
Using the given threshold, Binarizer transforms a feature into a binary feature
whose value is either 1.0 or 0.0.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Binarizer implementation in Spark, see Binarizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| threshold | The threshold used to binarize the continuous features: feature values greater than the threshold are set to 1.0 and the other values to 0.0. The default is threshold=0.0. |
For further information about the Spark API of Binarizer, see ML Binarizer.
It can be used to prepare data for the Classification or the
Clustering components from the Machine Learning family in order to,
for example, estimate whether a user comment indicates this user's
satisfaction or dissatisfaction.
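A minimal Scala sketch of the underlying Spark ML Binarizer transformer, with hypothetical scores and threshold:

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BinarizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.5))).toDF("id", "score")

// Values above the threshold become 1.0, the others 0.0.
val binarizer = new Binarizer().setInputCol("score").setOutputCol("binary").setThreshold(0.5)
binarizer.transform(df).show()
```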
Bucketizer
Bucketizer segments continuous
features into a column of feature buckets using the boundary values you define.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Bucketizer implementation in Spark, see Bucketizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| splits | The parameter used to define the boundary values of the buckets; the values must be in strictly increasing order. For example, you can put splits=Double.NEGATIVE_INFINITY, -0.5, 0.0, 0.5, Double.POSITIVE_INFINITY to create four buckets that cover all Double values. |
For further information about the Spark API of Bucketizer, see Bucketizer.
It can be used to prepare categorical data for training
classification or clustering models.
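The splits parameter maps onto the Spark ML Bucketizer as sketched below in Scala; the sample values are hypothetical:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BucketizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, -999.9), (1, -0.5), (2, 0.2), (3, 999.9)))
  .toDF("id", "value")

// The splits array plays the role of the splits=... parameter described above.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val bucketizer = new Bucketizer().setInputCol("value").setOutputCol("bucket").setSplits(splits)
bucketizer.transform(df).show()
```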
Discrete Cosine Transform (DCT)
Discrete Cosine Transform in
Spark implements the one-dimensional DCT-II to transform a real-valued vector in the
time domain into another real-valued vector of the same length in the frequency
domain. That is to say, the input data is converted into a series of cosine waves
oscillating at different frequencies.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the DCT implementation in Spark, see
DCT.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| inverse | The boolean used to indicate whether to perform the inverse DCT (inverse=true) or the forward DCT (inverse=false). By default, it is inverse=false. |
For further information about the Spark API of Discrete Cosine Transform, see DCT.
It is widely used to process images and audio for training
related classification or clustering models.
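A minimal Scala sketch of the Spark ML DCT transformer that the inverse parameter configures; the vectors are hypothetical:

```scala
import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DCTSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
  Tuple1(Vectors.dense(-1.0, 2.0, 4.0, -7.0))
)).toDF("features")

// inverse=false applies the forward DCT-II; inverse=true would apply the inverse transform.
val dct = new DCT().setInputCol("features").setOutputCol("featuresDCT").setInverse(false)
dct.transform(df).show(false)
```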
MinMaxScaler
MinMaxScaler rescales each
feature vector into a fixed range.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
This algorithm is available in Spark Batch Jobs.
For further information about the MinMaxScaler implementation in Spark, see MinMaxScaler.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| min | The lower bound of the rescaled range. The default is min=0.0. |
| max | The upper bound of the rescaled range. The default is max=1.0. |
For further information about the Spark API of MinMaxScaler, see MinMaxScaler.
It is used to normalize features to fit within a certain range.
This is typically used in image processing, for example, to
normalize data about pixel intensities.
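The min and max parameters correspond to the Spark ML MinMaxScaler estimator, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MinMaxScalerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1)), (1, Vectors.dense(2.0, 1.1)), (2, Vectors.dense(3.0, 10.1))
)).toDF("id", "features")

// min and max are the bounds of the rescaled range, as in the Talend parameters.
val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled").setMin(0.0).setMax(1.0)
scaler.fit(df).transform(df).show(false)
```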
N-gram
N-gram converts a tokenized
string (often words) into an array of comma-separated n-grams. Within each n-gram,
words are separated by a space. For example, when creating 2-grams, the string
Good morning World will be converted to (good morning,
morning world).
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires a tModelEncoder
component performing the Tokenizer
or the Regex tokenizer computation
to provide input data of the List
type.
For further information about the N-gram implementation in Spark, see NGram.
- type of the input column: List
- type of the output column: List
| Parameter | Description |
|---|---|
| n | The length of each n-gram, that is, the number of tokens it contains; it cannot be less than 1. The default is n=2. |
For further information about the Spark API of N-gram, see NGram.
It is often used in natural language processing such as speech
recognition to prepare data for the related Classification or
Clustering models.
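The n parameter maps onto the Spark ML NGram transformer, as in this minimal Scala sketch reproducing the 2-gram example above:

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NGramSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Seq("good", "morning", "world"))
)).toDF("id", "words")

// n=2 produces the bigrams "good morning" and "morning world".
val ngram = new NGram().setInputCol("words").setOutputCol("ngrams").setN(2)
ngram.transform(df).show(false)
```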
Normalizer
Normalizer normalizes each
vector of the input data to have unit norm so as to improve the performance of
learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further information about the Normalizer implementation in Spark, see Normalizer from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| p | The p-norm value used to normalize the vectors. The default is p=2, meaning the Euclidean (L2) norm is used. |
For further information about the Spark API of Normalizer, see Normalizer.
It can be used to normalize the result of the TF-IDF
computation in order to eventually improve the performance of text
classification (by tLogisticRegressionModel for example) or text
clustering.
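A minimal Scala sketch of the Spark ML Normalizer transformer configured by the p parameter; the vectors are hypothetical:

```scala
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NormalizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)), (1, Vectors.dense(2.0, 1.0, 1.0))
)).toDF("id", "features")

// p=2 normalizes each vector with the Euclidean (L2) norm.
val normalizer = new Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(2.0)
normalizer.transform(df).show(false)
```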
One hot encoder
One hot encoder enables the
algorithms that expect continuous features to use categorical features by mapping
the column of label indices of the categorical features to a column of binary
code.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
You can use another tModelEncoder
component with the String indexer
algorithm to create this column of label indices.
For further information about the OneHotEncoder implementation in Spark, see OneHotEncoder from the Spark documentation.
- type of the input column: Double
- type of the output column: Vector
| Parameter | Description |
|---|---|
| dropLast | The boolean used to indicate whether to drop the last category. The default is dropLast=true, meaning that the last category is not given its own position in the output vector; it is represented by a vector of all zeros. |
For further information about the Spark API of One hot encoder, see OneHotEncoder.
It can be used to provide feature data to the Classification or
the Clustering components, such as tLogisticRegressionModel.
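The following Scala sketch illustrates the combination described above (a String indexer producing label indices, then a One hot encoder), using the Spark 2.x API in which OneHotEncoder is a plain transformer; in Spark 3.x it is an estimator that must be fit first. The country values are hypothetical:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OneHotSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, "FR"), (1, "US"), (2, "FR"), (3, "DE")
)).toDF("id", "country")

// String indexer produces the column of label indices expected by One hot encoder.
val indexed = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
  .fit(df).transform(df)

// dropLast=true (the default) omits the last category from the binary vector.
// Note: in Spark 3.x, OneHotEncoder must be fit before transform.
val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")
  .setDropLast(true)
encoder.transform(indexed).show(false)
```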
PCA
PCA implements an orthogonal transformation to convert vectors of
correlated features into vectors of linearly uncorrelated features.
This can project high-dimensional feature vectors to low-dimensional
feature vectors.
This algorithm is available in Spark Batch Jobs.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the PCA implementation in Spark, see PCA from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| k | The number of principal components to keep, that is, the dimension of the output feature vectors. |
For further information about the Spark API of PCA, see PCA.
Typically, it can be used to prepare features for resolving
clustering problems.
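A minimal Scala sketch of the Spark ML PCA estimator that the k parameter configures; the vectors and the value of k are hypothetical:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PCASketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
  Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)),
  Tuple1(Vectors.dense(6.0, 1.0, 9.0, 8.0, 9.0))
)).toDF("features")

// k is the number of principal components kept in the projected vectors.
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3)
pca.fit(df).transform(df).show(false)
```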
Polynomial expansion
Polynomial expansion expands
the input features so as to improve the performance of learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Polynomial
expansion implementation in Spark, see PolynomialExpansion from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| degree | The polynomial degree to expand to. The default is degree=2, meaning to expand the input features into a 2-degree polynomial space. |
For further information about the Spark API of Polynomial expansion, see Polynomial expansion.
It can be used to process feature data for the Classification or
the Clustering components, such as tLogisticRegressionModel.
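The degree parameter corresponds to the Spark ML PolynomialExpansion transformer, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PolyExpansionSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 1.0)), Tuple1(Vectors.dense(3.0, -1.0))
)).toDF("features")

// degree=2 expands (x, y) into (x, x^2, y, x*y, y^2).
val polyExpansion = new PolynomialExpansion().setInputCol("features")
  .setOutputCol("polyFeatures").setDegree(2)
polyExpansion.transform(df).show(false)
```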
QuantileDiscretizer
QuantileDiscretizer reads a
column of continuous features, analyzes a sample of this feature data,
and accordingly outputs a column of categorical features that groups the data of the
continuous features into roughly equal parts.
This algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2
on EMR 5.8.
For further information about the QuantileDiscretizer implementation in Spark, see
QuantileDiscretizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| numBuckets | The maximum number of buckets into which the continuous features are grouped. |
For further information about the Spark API of QuantileDiscretizer, see Quantile Discretizer.
It can be used to prepare categorical data for training
classification or clustering models.
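A minimal Scala sketch of the Spark ML QuantileDiscretizer estimator configured by numBuckets; the sample values are hypothetical:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("QuantileDiscretizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)))
  .toDF("id", "hour")

// numBuckets is the maximum number of roughly equal-sized buckets.
val discretizer = new QuantileDiscretizer().setInputCol("hour").setOutputCol("bucket").setNumBuckets(3)
discretizer.fit(df).transform(df).show()
```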
Regex tokenizer
Regex tokenizer performs
advanced tokenization based on given regex patterns.
For further details about the RegexTokenizer implementation in Spark, see RegexTokenizer from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: String
- type of the output column: Object and List
| Parameter | Description |
|---|---|
| gaps | The boolean used to indicate whether the regex pattern matches the gaps (delimiters) between tokens (gaps=true) or the tokens themselves (gaps=false). By default, this parameter is set to true and the default delimiter pattern matches one or more whitespace characters (\s+). |
| pattern | The parameter used to define the regex pattern applied for the tokenization. |
| minTokenLength | The minimum length of the tokens to be kept; shorter tokens are filtered out. The default is minTokenLength=1. |
If you need to set several parameters, separate these parameters
using semicolons (;), for example, gaps=true;minTokenLength=4.
For further information about the Spark API of Regex tokenizer, see RegexTokenizer.
It is often used to process text in terms of text mining for the
Classification or the Clustering components, such as tRandomForestModel, in order to create,
for example, a spam filtering model.
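A minimal Scala sketch of the Spark ML RegexTokenizer transformer that these parameters configure; the sample sentence, pattern and minimum token length are hypothetical:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RegexTokenizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, "Logistic regression, explained simply")))
  .toDF("id", "sentence")

// gaps=true means the pattern matches delimiters; minTokenLength filters out short tokens.
val tokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words")
  .setPattern("\\W+").setGaps(true).setMinTokenLength(4)
tokenizer.transform(df).show(false)
```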
Tokenizer
Tokenizer breaks input text
(often sentences) into individual terms (often words).
Note that
these words are all converted to lowercase.
For further details
about the Tokenizer implementation in Spark, see
Tokenizer from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: String
- type of the output column: Object and
List
You do not need to set any additional parameters for
Tokenizer.
For further
information about the Spark API of Tokenizer,
see Tokenizer.
It is often used to process text in terms of text mining for the Classification or the
Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering
model.
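Since Tokenizer takes no parameters, the Spark ML transformer it maps onto can be sketched in a few lines of Scala; the sample sentence is hypothetical:

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TokenizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, "Good morning World"))).toDF("id", "sentence")

// Tokenizer splits on whitespace and lowercases the resulting terms.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(df).show(false)
```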
SQLTransformer
SQLTransformer
allows you to implement feature transformation using Spark SQL statements. It is subject
to the limitations indicated in the Spark documentation.
For further
information about these limitations and the SQLTransformer implementation in Spark, see SQLTransformer from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: All types. You need to select the column to be used in your SQL statement.
- type of the output column: All types. You need to define it depending on your SQL statement.
| Parameter | Description |
|---|---|
| statement | The Spark SQL statement to be used to select and/or transform the input data. |
For further information about the Spark API of SQLTransformer, see SQLTransformer.
It gives you the flexibility to extract and transform data to prepare
features for other machine learning algorithms, or to directly perform queries on the
results of other machine learning algorithms.
Note that you can
also efficiently perform Spark SQL queries using tSqlRow or join data using tMap to
prepare the data.
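A minimal Scala sketch of the Spark ML SQLTransformer that the statement parameter configures; in Spark SQL, the keyword __THIS__ stands for the input dataset, and the sample columns are hypothetical:

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SQLTransformerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

// The statement is a Spark SQL query run against the input dataset (__THIS__).
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
```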
Standard scaler
Standard
scaler standardizes each input vector so that its features have unit standard deviation (unit
variance), a common property of the standard normal distribution. The standardized data can improve the
convergence rate and prevent features with very large variances from exerting an overly
large influence during model training.
For further details about the
StandardScaler implementation in Spark, see
StandardScaler from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| withMean | The boolean parameter used to indicate whether to center each vector with the mean before scaling, that is, to shift the feature values so that they have zero mean. By default, this parameter is set to withMean=false. |
| withStd | The boolean parameter used to indicate whether to scale the input data to unit standard deviation. By default, withStd is set to true. |
If you need to set several parameters, separate these
parameters using semicolons (;), for example,
withMean=true;withStd=true.
Note that if you set both
parameters to false, Standard
scaler will actually do nothing.
For further
information about the Spark API of Standard
scaler, see StandardScaler.
It can be used to prepare data for the Classification or the Clustering
components, such as tKMeansModel.
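The withMean and withStd parameters correspond to the Spark ML StandardScaler estimator, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StandardScalerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 10.0)), (1, Vectors.dense(2.0, 20.0)), (2, Vectors.dense(3.0, 60.0))
)).toDF("id", "features")

// withStd=true scales to unit standard deviation; withMean=true also centers the data.
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaled")
  .setWithStd(true).setWithMean(true)
scaler.fit(df).transform(df).show(false)
```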
StopWordsRemover
StopWordsRemover filters out stop words from the input word strings.
It requires a tModelEncoder
component performing the Tokenizer or the
Regex tokenizer computation to provide input
data of the List type.
For further details about the StopWordsRemover
implementation in Spark, see StopWordsRemover from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: List
-
type of the output column: List
| Parameter | Description |
|---|---|
| caseSensitive | The boolean used to indicate whether filtering out stop words is case sensitive. The default is caseSensitive=false. |
| stopWords | It defines the list of stop words to be used for the filtering. By default, the built-in list of English stop words is used. If you need to use a custom list of stop words, you can define it with this parameter. |
For further information about what is a stop word, see Stop words on wikipedia.
For further
information about the Spark API of StopWordsRemover, see StopWordsRemover.
It removes the most common words that often do not carry much
meaning, in order to reduce noise as much as possible in text processing.
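A minimal Scala sketch of the Spark ML StopWordsRemover transformer that these parameters configure; the token list is hypothetical:

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StopWordsSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon"))
)).toDF("id", "words")

// caseSensitive=false (the default) ignores case when matching the stop-word list.
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
  .setCaseSensitive(false)
remover.transform(df).show(false)
```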
String indexer
String indexer
generates indices for categorical features (string-type labels). These indices can be
used by other algorithms such as One hot encoder
to build equivalent continuous features.
The indices are ordered by
frequency: the most frequent label gets the index 0.
For
further details about the StringIndexer
implementation in Spark, see StringIndexer from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: String
-
type of the output column: Double
You do not need to set any additional parameters for
String indexer.
For
further information about the Spark API of String
indexer, see StringIndexer.
String indexer, along with
One hot encoder, enables algorithms that expect
continuous features to use categorical features.
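Since String indexer takes no parameters, its Spark ML counterpart can be sketched briefly in Scala; the country values are hypothetical:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StringIndexerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, "FR"), (1, "US"), (2, "FR"), (3, "FR"), (4, "DE")
)).toDF("id", "country")

// The most frequent label ("FR") receives index 0.0, the next most frequent 1.0, and so on.
val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
indexer.fit(df).transform(df).show()
```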
Vector indexer
Vector indexer
identifies categorical feature columns based on your definition of the
maxCategories parameter and indexes the categories from each of the
identified columns, starting from 0. The other columns are declared as continuous
feature columns and are not indexed.
For further details about the
VectorIndexer implementation in Spark, see VectorIndexer from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| maxCategories | The parameter used to set the threshold indicating whether a vector column represents categorical features: a column with more than maxCategories distinct values is declared continuous and is not indexed. The default is maxCategories=20. |
For further information about the Spark API of Vector indexer, see VectorIndexer.
Vector indexer gives indices to
categorical features so that algorithms such as the Decision Trees computations run by
tRandomForestModel can handle the categorical
features appropriately.
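A minimal Scala sketch of the Spark ML VectorIndexer estimator configured by maxCategories; the vectors and threshold are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorIndexerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, -1.5)), Tuple1(Vectors.dense(0.0, 2.3)), Tuple1(Vectors.dense(1.0, 7.1))
)).toDF("features")

// Columns with at most maxCategories distinct values are treated as categorical and indexed;
// the others are declared continuous and left as they are.
val indexer = new VectorIndexer().setInputCol("features").setOutputCol("indexed").setMaxCategories(2)
indexer.fit(df).transform(df).show(false)
```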
Vector assembler
Vector
assembler combines selected input columns into one single vector column that
can be used by other algorithms or machine learning computations that expect vector
features.
Note that Vector
assembler does not re-calculate the features taken from different columns.
It only combines these feature columns into one single vector but keeps the features as
they are.
When you select Vector
assembler, the Input column column
of the Transformation table in the Basic settings view of tModelEncoder is deactivated and you need to use the inputCols parameter in the Parameters
column to select the input columns to be combined.
For further
details about the VectorAssembler implementation
in Spark, see VectorAssembler from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: numeric types, boolean type and vector type
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| inputCols | The parameter used to indicate the input columns to be combined into one single vector column. |
For further information about the Spark API of Vector assembler, see VectorAssembler.
Vector assembler prepares
feature vectors for the Logistic Regression computations or the Decision Tree
computations run by components such as tLogisticRegressionModel and tRandomForestModel.
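The inputCols parameter maps onto the Spark ML VectorAssembler transformer, sketched below in Scala; the column names and values are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorAssemblerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 18.0, 1.0, true))).toDF("id", "hours", "clicks", "premium")

// inputCols lists the columns to be combined, unchanged, into one single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("hours", "clicks", "premium"))
  .setOutputCol("features")
assembler.transform(df).show(false)
```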
ChiSqSelector
ChiSqSelector
determines feature relevance to given feature categories based on a Chi-Squared test of
independence and then selects the features the most relevant to those categories.
For further details about the ChiSqSelector implementation in Spark, see ChiSqSelector from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: Vector, Double and List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| featuresCol | The input column that provides the features to be selected by ChiSqSelector. The type of this column must be Vector. |
| labelCol | The input column that provides the categories for the features to be selected. |
| numTopFeatures | The number of the features that ChiSqSelector defines as the most relevant to the categories and selects. The default is numTopFeatures=50. |
For example, in an analysis of loan validation, a column
called features and a column called label
have been prepared: the former column carries features about the borrower candidates
such as address, age and
income, and the latter column carries the categories that indicate
whether to validate the loan for each candidate. You need to put
featuresCol=features;labelCol=label to make use of these columns
by ChiSqSelector.
In
addition, if you want to select only the top 1 most relevant feature, put
numTopFeatures=1 and then the income
feature will be selected; if you put numTopFeatures=2 instead,
the top 2 most relevant features will be selected, that is to say, the
income and the age features.
For further information about the Spark API of ChiSqSelector, see ChiSqSelector.
It can be used to yield the features with the most predictive power for
training classification or clustering models.
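The loan-validation example above can be sketched against the Spark ML ChiSqSelector estimator as follows; the data values are hypothetical, and for simplicity the features column carries only the age and income features:

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ChiSqSelectorSketch").master("local[*]").getOrCreate()
// Hypothetical loan data: features = (age, income), label = 1.0 for a validated loan.
val df = spark.createDataFrame(Seq(
  (Vectors.dense(25.0, 30000.0), 0.0),
  (Vectors.dense(40.0, 80000.0), 1.0),
  (Vectors.dense(35.0, 75000.0), 1.0)
)).toDF("features", "label")

// featuresCol, labelCol and numTopFeatures mirror the Talend parameters described above.
val selector = new ChiSqSelector().setFeaturesCol("features").setLabelCol("label")
  .setNumTopFeatures(1).setOutputCol("selected")
selector.fit(df).transform(df).show(false)
```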
RFormula
RFormula allows you
to generate feature vectors along with their feature labels. It is subject to the
limitations indicated in the Spark documentation.
For further
information about these limitations and the RFormula implementation in Spark, see RFormula from the Spark documentation.
This
algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2 on EMR
5.8.
- type of the input column: String types and numeric types. You need to use the R formula to select the columns to be used.
- type of the output column: Vector for the features and Double for the labels
Parameter | Description |
---|---|
featuresCol |
The output column used to carry the feature data. |
labelCol |
The output column used to carry the feature labels. |
formula |
The R formula to be applied. |
For example, if you put
featuresCol=features;labelCol=label;formula=clicked ~ country +
hour in the Parameters column of the
Transformation table, you need to add the
features column and the label column to
the output schema and set the former column to the type Vector
and the latter to the type Double. Then during the
transformation, the R formula defined using the formula parameter
is applied on the input columns, features are generated into the
features column and feature labels into the
label column. This example is based on the one explained in RFormula from the Spark documentation.
For further information about the Spark API of RFormula, see R Formula.
It allows you to
apply R formulas to prepare feature data.
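The clicked ~ country + hour example maps onto the Spark ML RFormula estimator as sketched below in Scala; the sample rows are hypothetical:

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RFormulaSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// formula, featuresCol and labelCol mirror the semicolon-separated Talend parameters.
val formula = new RFormula().setFormula("clicked ~ country + hour")
  .setFeaturesCol("features").setLabelCol("label")
formula.fit(df).transform(df).show(false)
```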
VectorSlicer
VectorSlicer
reads a feature vector, selects features from this vector based on the values of the indices parameter and writes the selected features into a new
vector in the output column.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the VectorSlicer implementation in Spark, see VectorSlicer from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| indices | The numeric indices of the features to be selected and output. The features in the output vector are ordered according to the indices you give. |
For further information about the Spark API of VectorSlicer, see VectorSlicer. But note that the
names parameter is not supported by tModelEncoder.
It allows you to be precise in
selecting the features you want to use.
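A minimal Scala sketch of the Spark ML VectorSlicer transformer that the indices parameter configures; the input vector and the chosen indices are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorSlicerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(-2.0, 2.3, 0.0, 5.1, 1.2))
)).toDF("features")

// indices selects the second and the fifth features; name-based selection is not supported in tModelEncoder.
val slicer = new VectorSlicer().setInputCol("features").setOutputCol("sliced").setIndices(Array(1, 4))
slicer.transform(df).show(false)
```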
tModelEncoder properties for Apache Spark Batch
These properties are used to configure tModelEncoder running in the Spark Batch Job framework.
The Spark Batch
tModelEncoder component belongs to the Machine Learning family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. An output column must be named differently from any of the input columns, because the transformed feature data is written to new output columns rather than overwriting the input columns. Click Edit schema to make changes to the schema. |
Transformation table | Complete this table using columns from the input and the output schemas and the feature-processing algorithms to be applied to them. The algorithms available in the Transformation column vary depending on the type of the input column you select. For further information about the algorithms available for each type of input data, see ML feature-processing algorithms in Talend. |
Usage
Usage rule | This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenario
For a scenario in which tModelEncoder is used, see Creating a classification model to filter spam.
tModelEncoder properties for Apache Spark Streaming
These properties are used to configure tModelEncoder running in the Spark Streaming Job framework.
The Spark Streaming
tModelEncoder component belongs to the Machine Learning family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. An output column must be named differently from any of the input columns, because the transformed feature data is written to new output columns rather than overwriting the input columns. Click Edit schema to make changes to the schema. |
Transformation table | Complete this table using columns from the input and the output schemas and the feature-processing algorithms to be applied to them. The algorithms available in the Transformation column vary depending on the type of the input column you select. For further information about the algorithms available for each type of input data, see ML feature-processing algorithms in Talend. |
Usage
Usage rule | This component is used as an intermediate step. This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Streaming version of this component
yet.