tModelEncoder
Performs featurization operations to transform data into the format expected by the
model training components such as tLogisticRegressionModel or tKMeansModel.
tModelEncoder receives data from its
preceding components, applies a wide range of feature-processing algorithms to transform
the selected columns of this data, and sends the result to the model training component that
follows in order to train and create a predictive model.
Depending on the Talend
product you are using, this component can be used in one, some or all of the following
Job frameworks:
-
Spark Batch: see tModelEncoder properties for Apache Spark Batch.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
-
Spark Streaming: see tModelEncoder properties for Apache Spark Streaming.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
ML feature-processing algorithms in Talend
This table presents the feature processing algorithms you can use in the tModelEncoder component.
1: These are the parameters you can set in the Parameters column of the
Transformation table in the Basic settings tab of tModelEncoder. If you do not set
any parameters yourself, the default ones, if any, are used.
HashingTF
As a text-processing algorithm, HashingTF converts input data into fixed-length feature vectors to reflect
the importance of a term (a word or a sequence of words) by calculating the frequency
with which these terms appear in the input data.
This algorithm is available in Spark Batch
and Spark Streaming Jobs.
In a Spark Batch Job, it is typically
used along with the IDF (Inverse document frequency) algorithm to make the weight
calculation more reliable. In the context of a Talend Spark Job, you need to add a second tModelEncoder to apply the Inverse document
frequency algorithm to the output of the HashingTF computation.
The data must already be
segmented before being sent to the HashingTF
computation; therefore, if the data to be used has not been segmented, you need to use
another tModelEncoder to apply the Tokenizer algorithm or the Regex
tokenizer algorithm to prepare the data.
For further
details about the HashingTF implementation in
Spark, see HashingTF from the Spark
documentation.
- type of the input column: Object
- type of the output column: Vector
| Parameter | Description |
|---|---|
| numFeatures | The number of features that defines the dimension of the feature vectors. For example, you can set numFeatures to 2^20 (1,048,576) to define the dimension. If you do not set this parameter, the default value, 2^18 (262,144), is used. The output vectors are sparse vectors. For example, a sparse vector written (3,[0,2],[1.0,3.0]) has size 3 and carries the values 1.0 and 3.0 at the indices 0 and 2. For further information about how to read a sparse vector, see the section about local vectors in the Spark documentation. |
For further information about the Spark API of HashingTF, see ML HashingTF.
It can be used to prepare data for the Classification or the Clustering
components from the Machine Learning family in order to create a sentiment analysis
model.
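To illustrate what this transformation does underneath, here is a minimal Scala sketch of the Spark ML HashingTF transformer itself (not of the tModelEncoder settings); the sample data and column names are hypothetical:

```scala
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("HashingTFSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, "good morning world"),
  (1, "good night world")
)).toDF("id", "sentence")

// Segment the text first, the equivalent of a preceding tModelEncoder applying Tokenizer.
val tokens = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)

// numFeatures defines the dimension of the hashed feature space (the default is 2^18).
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").setNumFeatures(1 << 20)
tf.transform(tokens).select("tf").show(false)
```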
Inverse document frequency
As a text-processing algorithm, Inverse document frequency (IDF) is often used to process
the output of the HashingTF computation in order
to downplay the importance of the terms that appear in too many documents.
This
algorithm is available in Spark Batch Jobs.
It requires a
tModelEncoder component performing the HashingTF computation to provide input data.
For further details about the IDF
implementation in Spark, see IDF from the Spark
documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| minDocFreq | The minimum number of documents that should contain a term for that term to be taken into account in the weight calculation. The default is minDocFreq=0. |

For further information about the Spark API of IDF, see IDF.
It can be used to prepare data for the Classification or the
Clustering components from the Machine Learning family in order to create a sentiment
analysis model.
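As a rough illustration of the underlying Spark behavior, the following Scala sketch chains HashingTF and IDF the way two tModelEncoder components would be chained; the sample data, column names and the minDocFreq value are hypothetical:

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("IDFSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, "spark is great"), (1, "spark is fast"), (2, "hello world")
)).toDF("id", "sentence")

val words = new Tokenizer().setInputCol("sentence").setOutputCol("words").transform(docs)
val tf = new HashingTF().setInputCol("words").setOutputCol("tf").transform(words)

// minDocFreq ignores terms that appear in fewer documents than this threshold.
val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").setMinDocFreq(2).fit(tf)
idfModel.transform(tf).select("tfidf").show(false)
```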
Word2Vector
Word2Vector
transforms a document into a feature vector, for use in other learning computations such
as text similarity calculation.
This algorithm is available in Spark Batch Jobs.
For further details about the Word2Vector implementation in Spark, see Word2Vec from the Spark documentation.
In a Spark Batch Job, this algorithm expects:
- type of the input column: List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| maxIter | Maximum number of iterations for obtaining the optimal result. For example, maxIter=5. The default is maxIter=1. |
| minCount | Minimum number of times a token should appear to be included in the vocabulary of the Word2Vector model. The default is minCount=5. |
| numPartitions | Number of partitions. |
| seed | The random seed number. |
| stepSize | Size of the step for each iteration. This defines the learning rate. The default is stepSize=0.025. |
| vectorSize | Size of each feature vector. The default is vectorSize=100, meaning each feature is represented by a vector of 100 values. |
If you need to set several parameters, separate these
parameters using semicolons (;), for example,
maxIter=5;minCount=4.
For further information about
the Spark API of Word2Vector, see Word2Vec.
It
can be used to prepare data for the Classification or the Clustering components from the
Machine Learning family in order to, for example, find similar user comments about a
product.
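For reference, here is a minimal Scala sketch of the Spark ML Word2Vec estimator that these parameters configure; the sample comments and parameter values are hypothetical:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("Word2VecSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  "the product works well".split(" "),
  "the product broke quickly".split(" ")
).map(Tuple1.apply)).toDF("words")

// vectorSize, minCount and maxIter mirror the semicolon-separated Talend parameters.
val w2v = new Word2Vec().setInputCol("words").setOutputCol("features")
  .setVectorSize(50).setMinCount(1).setMaxIter(5)
w2v.fit(docs).transform(docs).show(false)
```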
CountVectorizer
CountVectorizer extracts the
most frequent terms from a collection of text documents and converts these terms into
vectors of token counts.
This algorithm is available in Spark Batch Jobs.
It requires a tModelEncoder
component performing the Tokenizer
or the Regex tokenizer computation
to provide input data of the List
type.
For further information about the CountVectorizer implementation in Spark, see CountVectorizer.
- type of the input column: List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| minDF | The minimum number of documents in which a term must appear in order to be included in the vocabulary. The default value is minDF=1. If you put a value between 0 and 1 (exclusive), it is read as a fraction of the documents rather than an absolute count. |
| minTF | The minimum number of times a term must appear in a document for its occurrences to be counted in that document. The default value is minTF=1. |
| vocabSize | The maximum size of the vocabulary built by CountVectorizer. The default value is 2^18 (262,144). |
For further information about the Spark API of CountVectorizer, see CountVectorizer.
It is often used to process text in terms of text mining for the
Classification or the Clustering components.
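The following Scala sketch shows the corresponding Spark ML CountVectorizer estimator; the token lists and parameter values are hypothetical:

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CountVectorizerSketch").master("local[*]").getOrCreate()
val docs = spark.createDataFrame(Seq(
  (0, Seq("a", "b", "c")),
  (1, Seq("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// vocabSize caps the vocabulary; minDF is the minimum number of documents a term must appear in.
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
  .setVocabSize(3).setMinDF(2)
cv.fit(docs).transform(docs).show(false)
```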
Binarizer
Using the given threshold, Binarizer transforms a feature into a binary feature
whose value is either 1.0 or 0.0.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Binarizer implementation in Spark, see Binarizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| threshold | The threshold used to binarize the continuous features: feature values greater than the threshold are set to 1.0 and the other values to 0.0. The default is threshold=0.0. |
For further information about the Spark API of Binarizer, see ML Binarizer.
It can be used to prepare data for the Classification or the
Clustering components from the Machine Learning family in order to,
for example, estimate whether a user comment indicates this user's
satisfaction or dissatisfaction.
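A minimal Scala sketch of the underlying Spark ML Binarizer transformer, with hypothetical scores and threshold:

```scala
import org.apache.spark.ml.feature.Binarizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BinarizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.5))).toDF("id", "score")

// Values above the threshold become 1.0, the others 0.0.
val binarizer = new Binarizer().setInputCol("score").setOutputCol("binary").setThreshold(0.5)
binarizer.transform(df).show()
```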
Bucketizer
Bucketizer segments continuous
features into a column of feature buckets using the boundary values you define.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Bucketizer implementation in Spark, see Bucketizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| splits | The parameter used to define the boundary values of the buckets; the values must be in strictly increasing order. For example, you can put splits=Double.NEGATIVE_INFINITY, -0.5, 0.0, 0.5, Double.POSITIVE_INFINITY to create four buckets that cover all Double values. |
For further information about the Spark API of Bucketizer, see Bucketizer.
It can be used to prepare categorical data for training
classification or clustering models.
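The splits parameter maps onto the Spark ML Bucketizer as sketched below in Scala; the sample values are hypothetical:

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BucketizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, -999.9), (1, -0.5), (2, 0.2), (3, 999.9)))
  .toDF("id", "value")

// The splits array plays the role of the splits=... parameter described above.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)
val bucketizer = new Bucketizer().setInputCol("value").setOutputCol("bucket").setSplits(splits)
bucketizer.transform(df).show()
```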
Discrete Cosine Transform (DCT)
Discrete Cosine Transform in
Spark implements the one-dimensional DCT-II to transform a real-valued vector in the
time domain into another real-valued vector of the same length in the frequency
domain. That is to say, the input data is converted into a series of cosine waves
oscillating at different frequencies.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the DCT implementation in Spark, see
DCT.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| inverse | The boolean used to indicate whether to perform the inverse DCT (inverse=true) or the forward DCT (inverse=false). By default, it is inverse=false. |
For further information about the Spark API of Discrete Cosine Transform, see DCT.
It is widely used to process images and audio for training
related classification or clustering models.
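A minimal Scala sketch of the Spark ML DCT transformer that the inverse parameter configures; the vectors are hypothetical:

```scala
import org.apache.spark.ml.feature.DCT
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DCTSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 1.0, -2.0, 3.0)),
  Tuple1(Vectors.dense(-1.0, 2.0, 4.0, -7.0))
)).toDF("features")

// inverse=false applies the forward DCT-II; inverse=true would apply the inverse transform.
val dct = new DCT().setInputCol("features").setOutputCol("featuresDCT").setInverse(false)
dct.transform(df).show(false)
```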
MinMaxScaler
MinMaxScaler rescales each
feature vector into a fixed range.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
This algorithm is available in Spark Batch Jobs.
For further information about the MinMaxScaler implementation in Spark, see MinMaxScaler.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| min | The lower bound of the rescaled range. The default is min=0.0. |
| max | The upper bound of the rescaled range. The default is max=1.0. |
For further information about the Spark API of MinMaxScaler, see MinMaxScaler.
It is used to normalize features to fit within a certain range.
This is typically used in image processing, for example, to
normalize data about pixel intensities.
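The min and max parameters correspond to the Spark ML MinMaxScaler estimator, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("MinMaxScalerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1)), (1, Vectors.dense(2.0, 1.1)), (2, Vectors.dense(3.0, 10.1))
)).toDF("id", "features")

// min and max are the bounds of the rescaled range, as in the Talend parameters.
val scaler = new MinMaxScaler().setInputCol("features").setOutputCol("scaled").setMin(0.0).setMax(1.0)
scaler.fit(df).transform(df).show(false)
```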
N-gram
N-gram converts a tokenized
string (often words) into an array of comma-separated n-grams. Within each n-gram,
words are separated by a space. For example, when creating 2-grams, the string
Good morning World will be converted to (good morning,
morning world).
This algorithm is available in Spark Batch and Spark Streaming Jobs.
It requires a tModelEncoder
component performing the Tokenizer
or the Regex tokenizer computation
to provide input data of the List
type.
For further information about the N-gram implementation in Spark, see NGram.
- type of the input column: List
- type of the output column: List
| Parameter | Description |
|---|---|
| n | The length of each n-gram, that is, the number of tokens it contains; it cannot be less than 1. The default is n=2. |
For further information about the Spark API of N-gram, see NGram.
It is often used in natural language processing such as speech
recognition to prepare data for the related Classification or
Clustering models.
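The n parameter maps onto the Spark ML NGram transformer, as in this minimal Scala sketch reproducing the 2-gram example above:

```scala
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NGramSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Seq("good", "morning", "world"))
)).toDF("id", "words")

// n=2 produces the bigrams "good morning" and "morning world".
val ngram = new NGram().setInputCol("words").setOutputCol("ngrams").setN(2)
ngram.transform(df).show(false)
```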
Normalizer
Normalizer normalizes each
vector of the input data to have unit norm so as to improve the performance of
learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further information about the Normalizer implementation in Spark, see Normalizer from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| p | The p-norm value used to normalize the vectors. The default is p=2, meaning the Euclidean (L2) norm is used. |
For further information about the Spark API of Normalizer, see Normalizer.
It can be used to normalize the result of the TF-IDF
computation in order to eventually improve the performance of text
classification (by tLogisticRegressionModel for example) or text
clustering.
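A minimal Scala sketch of the Spark ML Normalizer transformer configured by the p parameter; the vectors are hypothetical:

```scala
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("NormalizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.5, -1.0)), (1, Vectors.dense(2.0, 1.0, 1.0))
)).toDF("id", "features")

// p=2 normalizes each vector with the Euclidean (L2) norm.
val normalizer = new Normalizer().setInputCol("features").setOutputCol("normFeatures").setP(2.0)
normalizer.transform(df).show(false)
```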
One hot encoder
One hot encoder enables the
algorithms that expect continuous features to use categorical features by mapping
the column of label indices of the categorical features to a column of binary
code.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
You can use another tModelEncoder
component with the String indexer
algorithm to create this column of label indices.
For further information about the OneHotEncoder implementation in Spark, see OneHotEncoder from the Spark documentation.
- type of the input column: Double
- type of the output column: Vector
| Parameter | Description |
|---|---|
| dropLast | The boolean used to indicate whether to drop the last category. The default is dropLast=true, meaning that the last category is not given its own position in the output vector; it is represented by a vector of all zeros. |
For further information about the Spark API of One hot encoder, see OneHotEncoder.
It can be used to provide feature data to the Classification or
the Clustering components, such as tLogisticRegressionModel.
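The following Scala sketch illustrates the combination described above (a String indexer producing label indices, then a One hot encoder), using the Spark 2.x API in which OneHotEncoder is a plain transformer; in Spark 3.x it is an estimator that must be fit first. The country values are hypothetical:

```scala
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("OneHotSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, "FR"), (1, "US"), (2, "FR"), (3, "DE")
)).toDF("id", "country")

// String indexer produces the column of label indices expected by One hot encoder.
val indexed = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
  .fit(df).transform(df)

// dropLast=true (the default) omits the last category from the binary vector.
// Note: in Spark 3.x, OneHotEncoder must be fit before transform.
val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")
  .setDropLast(true)
encoder.transform(indexed).show(false)
```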
PCA
PCA implements an orthogonal transformation to convert vectors of
correlated features into vectors of linearly uncorrelated features.
This can project high-dimensional feature vectors to low-dimensional
feature vectors.
This algorithm is available in Spark Batch Jobs.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the PCA implementation in Spark, see PCA from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| k | The number of principal components to keep, that is, the dimension of the output feature vectors. |
For further information about the Spark API of PCA, see PCA.
Typically, it can be used to prepare features for resolving
clustering problems.
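A minimal Scala sketch of the Spark ML PCA estimator that the k parameter configures; the vectors and the value of k are hypothetical:

```scala
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PCASketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0)),
  Tuple1(Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)),
  Tuple1(Vectors.dense(6.0, 1.0, 9.0, 8.0, 9.0))
)).toDF("features")

// k is the number of principal components kept in the projected vectors.
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3)
pca.fit(df).transform(df).show(false)
```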
Polynomial expansion
Polynomial expansion expands
the input features so as to improve the performance of learning computations.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
For further details about the Polynomial
expansion implementation in Spark, see PolynomialExpansion from the Spark documentation.
- type of the input column: Vector
- type of the output column: Vector
| Parameter | Description |
|---|---|
| degree | The polynomial degree to expand to. The default is degree=2, meaning to expand the input features into a 2-degree polynomial space. |
For further information about the Spark API of Polynomial expansion, see Polynomial expansion.
It can be used to process feature data for the Classification or
the Clustering components, such as tLogisticRegressionModel.
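The degree parameter corresponds to the Spark ML PolynomialExpansion transformer, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("PolyExpansionSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(2.0, 1.0)), Tuple1(Vectors.dense(3.0, -1.0))
)).toDF("features")

// degree=2 expands (x, y) into (x, x^2, y, x*y, y^2).
val polyExpansion = new PolynomialExpansion().setInputCol("features")
  .setOutputCol("polyFeatures").setDegree(2)
polyExpansion.transform(df).show(false)
```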
QuantileDiscretizer
QuantileDiscretizer reads a
column of continuous features, analyzes a sample of this feature data,
and accordingly outputs a column of categorical features that groups the data of the
continuous features into roughly equal parts.
This algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2
on EMR 5.8.
For further information about the QuantileDiscretizer implementation in Spark, see
QuantileDiscretizer from the Spark documentation.
- type of the input column: Double
- type of the output column: Double
| Parameter | Description |
|---|---|
| numBuckets | The maximum number of buckets into which the continuous features are grouped. |
For further information about the Spark API of QuantileDiscretizer, see Quantile Discretizer.
It can be used to prepare categorical data for training
classification or clustering models.
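A minimal Scala sketch of the Spark ML QuantileDiscretizer estimator configured by numBuckets; the sample values are hypothetical:

```scala
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("QuantileDiscretizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)))
  .toDF("id", "hour")

// numBuckets is the maximum number of roughly equal-sized buckets.
val discretizer = new QuantileDiscretizer().setInputCol("hour").setOutputCol("bucket").setNumBuckets(3)
discretizer.fit(df).transform(df).show()
```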
Regex tokenizer
Regex tokenizer performs
advanced tokenization based on given regex patterns.
For further details about the RegexTokenizer implementation in Spark, see RegexTokenizer from the Spark documentation.
This algorithm is available in Spark Batch and Spark Streaming Jobs.
- type of the input column: String
- type of the output column: Object and List
| Parameter | Description |
|---|---|
| gaps | The boolean used to indicate whether the regex pattern matches the gaps (delimiters) between tokens (gaps=true) or the tokens themselves (gaps=false). By default, this parameter is set to true and the default delimiter pattern matches one or more whitespace characters (\s+). |
| pattern | The parameter used to define the regex pattern applied for the tokenization. |
| minTokenLength | The minimum length of the tokens to be kept; shorter tokens are filtered out. The default is minTokenLength=1. |
If you need to set several parameters, separate these parameters
using semicolons (;), for example, gaps=true;minTokenLength=4.
For further information about the Spark API of Regex tokenizer, see RegexTokenizer.
It is often used to process text in terms of text mining for the
Classification or the Clustering components, such as tRandomForestModel, in order to create,
for example, a spam filtering model.
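A minimal Scala sketch of the Spark ML RegexTokenizer transformer that these parameters configure; the sample sentence, pattern and minimum token length are hypothetical:

```scala
import org.apache.spark.ml.feature.RegexTokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RegexTokenizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, "Logistic regression, explained simply")))
  .toDF("id", "sentence")

// gaps=true means the pattern matches delimiters; minTokenLength filters out short tokens.
val tokenizer = new RegexTokenizer().setInputCol("sentence").setOutputCol("words")
  .setPattern("\\W+").setGaps(true).setMinTokenLength(4)
tokenizer.transform(df).show(false)
```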
Tokenizer
Tokenizer breaks input text
(often sentences) into individual terms (often words).
Note that
these words are all converted to lowercase.
For further details
about the Tokenizer implementation in Spark, see
Tokenizer from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: String
- type of the output column: Object and
List
You do not need to set any additional parameters for
Tokenizer.
For further
information about the Spark API of Tokenizer,
see Tokenizer.
It is often used to process text in terms of text mining for the Classification or the
Clustering components, such as tRandomForestModel, in order to create, for example, a spam filtering
model.
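Since Tokenizer takes no parameters, the Spark ML transformer it maps onto can be sketched in a few lines of Scala; the sample sentence is hypothetical:

```scala
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TokenizerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, "Good morning World"))).toDF("id", "sentence")

// Tokenizer splits on whitespace and lowercases the resulting terms.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(df).show(false)
```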
SQLTransformer
SQLTransformer
allows you to implement feature transformation using Spark SQL statements. It is subject
to the limitations indicated in the Spark documentation.
For further
information about these limitations and the SQLTransformer implementation in Spark, see SQLTransformer from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: All types. You need to select the column to be used in your SQL statement.
- type of the output column: All types. You need to define it depending on your SQL statement.
| Parameter | Description |
|---|---|
| statement | The Spark SQL statement to be used to select and/or transform the input data. |
For further information about the Spark API of SQLTransformer, see SQLTransformer.
It gives you the flexibility to extract and transform data to prepare
features for other machine learning algorithms, or to directly perform queries on the
results of other machine learning algorithms.
Note that you can
also efficiently perform Spark SQL queries using tSqlRow or join data using tMap to
prepare the data.
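A minimal Scala sketch of the Spark ML SQLTransformer that the statement parameter configures; in Spark SQL, the keyword __THIS__ stands for the input dataset, and the sample columns are hypothetical:

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SQLTransformerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

// The statement is a Spark SQL query run against the input dataset (__THIS__).
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")
sqlTrans.transform(df).show()
```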
Standard scaler
Standard
scaler standardizes each input vector so that its features have unit standard deviation (unit
variance), a common property of the standard normal distribution. The standardized data can improve the
convergence rate and prevent features with very large variances from exerting an overly
large influence during model training.
For further details about the
StandardScaler implementation in Spark, see
StandardScaler from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| withMean | The boolean parameter used to indicate whether to center each vector with the mean before scaling, that is, to shift the feature values so that they have zero mean. By default, this parameter is set to withMean=false. |
| withStd | The boolean parameter used to indicate whether to scale the input data to unit standard deviation. By default, withStd is set to true. |
If you need to set several parameters, separate these
parameters using semicolons (;), for example,
withMean=true;withStd=true.
Note that if you set both
parameters to false, Standard
scaler will actually do nothing.
For further
information about the Spark API of Standard
scaler, see StandardScaler.
It can be used to prepare data for the Classification or the Clustering
components, such as tKMeansModel.
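The withMean and withStd parameters correspond to the Spark ML StandardScaler estimator, sketched below in Scala with hypothetical vectors:

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StandardScalerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 10.0)), (1, Vectors.dense(2.0, 20.0)), (2, Vectors.dense(3.0, 60.0))
)).toDF("id", "features")

// withStd=true scales to unit standard deviation; withMean=true also centers the data.
val scaler = new StandardScaler().setInputCol("features").setOutputCol("scaled")
  .setWithStd(true).setWithMean(true)
scaler.fit(df).transform(df).show(false)
```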
StopWordsRemover
StopWordsRemover filters out stop words from the input word strings.
It requires a tModelEncoder
component performing the Tokenizer or the
Regex tokenizer computation to provide input
data of the List type.
For further details about the StopWordsRemover
implementation in Spark, see StopWordsRemover from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: List
-
type of the output column: List
| Parameter | Description |
|---|---|
| caseSensitive | The boolean used to indicate whether filtering out stop words is case sensitive. The default is caseSensitive=false. |
| stopWords | It defines the list of stop words to be used for the filtering. By default, the built-in list of English stop words is used. If you need to use a custom list of stop words, you can define it with this parameter. |
For further information about what is a stop word, see Stop words on wikipedia.
For further
information about the Spark API of StopWordsRemover, see StopWordsRemover.
It removes the most common words that often do not carry much
meaning, in order to reduce noise as much as possible in text processing.
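A minimal Scala sketch of the Spark ML StopWordsRemover transformer that these parameters configure; the token list is hypothetical:

```scala
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StopWordsSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, Seq("I", "saw", "the", "red", "balloon"))
)).toDF("id", "words")

// caseSensitive=false (the default) ignores case when matching the stop-word list.
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
  .setCaseSensitive(false)
remover.transform(df).show(false)
```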
String indexer
String indexer
generates indices for categorical features (string-type labels). These indices can be
used by other algorithms such as One hot encoder
to build equivalent continuous features.
The indices are ordered by
frequency: the most frequent label gets the index 0.
For
further details about the StringIndexer
implementation in Spark, see StringIndexer from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: String
-
type of the output column: Double
You do not need to set any additional parameters for
String indexer.
For
further information about the Spark API of String
indexer, see StringIndexer.
String indexer, along with
One hot encoder, enables algorithms that expect
continuous features to use categorical features.
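Since String indexer takes no parameters, its Spark ML counterpart can be sketched briefly in Scala; the country values are hypothetical:

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("StringIndexerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (0, "FR"), (1, "US"), (2, "FR"), (3, "FR"), (4, "DE")
)).toDF("id", "country")

// The most frequent label ("FR") receives index 0.0, the next most frequent 1.0, and so on.
val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
indexer.fit(df).transform(df).show()
```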
Vector indexer
Vector indexer
identifies categorical feature columns based on your definition of the
maxCategories parameter and indexes the categories from each of the
identified columns, starting from 0. The other columns are declared as continuous
feature columns and are not indexed.
For further details about the
VectorIndexer implementation in Spark, see VectorIndexer from the Spark
documentation.
This algorithm is available in Spark Batch Jobs.
In a Spark Batch Job, this algorithm expects:
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| maxCategories | The parameter used to set the threshold indicating whether a vector column represents categorical features: a column with more than maxCategories distinct values is declared continuous and is not indexed. The default is maxCategories=20. |
For further information about the Spark API of Vector indexer, see VectorIndexer.
Vector indexer gives indices to
categorical features so that algorithms such as the Decision Trees computations run by
tRandomForestModel can handle the categorical
features appropriately.
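A minimal Scala sketch of the Spark ML VectorIndexer estimator configured by maxCategories; the vectors and threshold are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorIndexerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(1.0, -1.5)), Tuple1(Vectors.dense(0.0, 2.3)), Tuple1(Vectors.dense(1.0, 7.1))
)).toDF("features")

// Columns with at most maxCategories distinct values are treated as categorical and indexed;
// the others are declared continuous and left as they are.
val indexer = new VectorIndexer().setInputCol("features").setOutputCol("indexed").setMaxCategories(2)
indexer.fit(df).transform(df).show(false)
```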
Vector assembler
Vector
assembler combines selected input columns into one single vector column that
can be used by other algorithms or machine learning computations that expect vector
features.
Note that Vector
assembler does not re-calculate the features taken from different columns.
It only combines these feature columns into one single vector but keeps the features as
they are.
When you select Vector
assembler, the Input column column
of the Transformation table in the Basic settings view of tModelEncoder is deactivated and you need to use the inputCols parameter in the Parameters
column to select the input columns to be combined.
For further
details about the VectorAssembler implementation
in Spark, see VectorAssembler from the Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: numeric types, boolean type and vector type
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| inputCols | The parameter used to indicate the input columns to be combined into one single vector column. |
For further information about the Spark API of Vector assembler, see VectorAssembler.
Vector assembler prepares
feature vectors for the Logistic Regression computations or the Decision Tree
computations run by components such as tLogisticRegressionModel and tRandomForestModel.
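The inputCols parameter maps onto the Spark ML VectorAssembler transformer, sketched below in Scala; the column names and values are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorAssemblerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq((0, 18.0, 1.0, true))).toDF("id", "hours", "clicks", "premium")

// inputCols lists the columns to be combined, unchanged, into one single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("hours", "clicks", "premium"))
  .setOutputCol("features")
assembler.transform(df).show(false)
```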
ChiSqSelector
ChiSqSelector
determines feature relevance to given feature categories based on a Chi-Squared test of
independence and then selects the features the most relevant to those categories.
For further details about the ChiSqSelector implementation in Spark, see ChiSqSelector from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
- type of the input column: Vector, Double and List
- type of the output column: Vector
| Parameter | Description |
|---|---|
| featuresCol | The input column that provides the features to be selected by ChiSqSelector. The type of this column must be Vector. |
| labelCol | The input column that provides the categories for the features to be selected. |
| numTopFeatures | The number of the features that ChiSqSelector defines as the most relevant to the categories and selects. The default is numTopFeatures=50. |
For example, in an analysis of loan validation, a column
called features and a column called label
have been prepared: the former column carries features about the borrower candidates
such as address, age and
income, and the latter column carries the categories that indicate
whether to validate the loan for each candidate. You need to put
featuresCol=features;labelCol=label to make use of these columns
by ChiSqSelector.
In
addition, if you want to select only the top 1 most relevant feature, put
numTopFeatures=1 and then the income
feature will be selected; if you put numTopFeatures=2 instead,
the top 2 most relevant features will be selected, that is to say, the
income and the age features.
For further information about the Spark API of ChiSqSelector, see ChiSqSelector.
It can be used to yield the features with the most predictive power for
training classification or clustering models.
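The loan-validation example above can be sketched against the Spark ML ChiSqSelector estimator as follows; the data values are hypothetical, and for simplicity the features column carries only the age and income features:

```scala
import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ChiSqSelectorSketch").master("local[*]").getOrCreate()
// Hypothetical loan data: features = (age, income), label = 1.0 for a validated loan.
val df = spark.createDataFrame(Seq(
  (Vectors.dense(25.0, 30000.0), 0.0),
  (Vectors.dense(40.0, 80000.0), 1.0),
  (Vectors.dense(35.0, 75000.0), 1.0)
)).toDF("features", "label")

// featuresCol, labelCol and numTopFeatures mirror the Talend parameters described above.
val selector = new ChiSqSelector().setFeaturesCol("features").setLabelCol("label")
  .setNumTopFeatures(1).setOutputCol("selected")
selector.fit(df).transform(df).show(false)
```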
RFormula
RFormula allows you
to generate feature vectors along with their feature labels. It is subject to the
limitations indicated in the Spark documentation.
For further
information about these limitations and the RFormula implementation in Spark, see RFormula from the Spark documentation.
This
algorithm is available in Spark Batch Jobs but is not compatible with Spark 2.2 on EMR
5.8.
- type of the input column: String types and numeric types. You need to use the R formula to select the columns to be used.
- type of the output column: Vector for the features and Double for the labels
Parameter | Description |
---|---|
featuresCol |
The output column used to carry the feature data. |
labelCol |
The output column used to carry the feature labels. |
formula |
The R formula to be applied. |
For example, if you put
featuresCol=features;labelCol=label;formula=clicked ~ country +
hour in the Parameters column of the
Transformation table, you need to add the
features column and the label column to
the output schema and set the former column to the type Vector
and the latter to the type Double. Then during the
transformation, the R formula defined using the formula parameter
is applied on the input columns, features are generated into the
features column and feature labels into the
label column. This example is based on the one explained in RFormula from the Spark documentation.
For further information about the Spark API of RFormula, see R Formula.
It allows you to
apply R formulas to prepare feature data.
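The clicked ~ country + hour example maps onto the Spark ML RFormula estimator as sketched below in Scala; the sample rows are hypothetical:

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RFormulaSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  (7, "US", 18, 1.0), (8, "CA", 12, 0.0), (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// formula, featuresCol and labelCol mirror the semicolon-separated Talend parameters.
val formula = new RFormula().setFormula("clicked ~ country + hour")
  .setFeaturesCol("features").setLabelCol("label")
formula.fit(df).transform(df).show(false)
```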
VectorSlicer
VectorSlicer
reads a feature vector, selects features from this vector based on the values of the indices parameter and writes the selected features into a new
vector in the output column.
It requires a preceding component performing a vector-generating transformation to provide input data of the Vector type. For example, this component can be:
- a tModelEncoder component using algorithms such as Vector assembler
- a tMatchModel component.
For further information about the VectorSlicer implementation in Spark, see VectorSlicer from Spark
documentation.
This algorithm is available in Spark Batch and Spark Streaming
Jobs.
-
type of the input column: Vector
-
type of the output column: Vector
| Parameter | Description |
|---|---|
| indices | The numeric indices of the features to be selected and output. The features in the output vector are ordered according to the indices you give. |
For further information about the Spark API of VectorSlicer, see VectorSlicer. But note that the
names parameter is not supported by tModelEncoder.
It allows you to be precise in
selecting the features you want to use.
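A minimal Scala sketch of the Spark ML VectorSlicer transformer that the indices parameter configures; the input vector and the chosen indices are hypothetical:

```scala
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("VectorSlicerSketch").master("local[*]").getOrCreate()
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(-2.0, 2.3, 0.0, 5.1, 1.2))
)).toDF("features")

// indices selects the second and the fifth features; name-based selection is not supported in tModelEncoder.
val slicer = new VectorSlicer().setInputCol("features").setOutputCol("sliced").setIndices(Array(1, 4))
slicer.transform(df).show(false)
```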
tModelEncoder properties for Apache Spark Batch
These properties are used to configure tModelEncoder running in the Spark Batch Job framework.
The Spark Batch
tModelEncoder component belongs to the Machine Learning family.
The component in this framework is available in all Talend Platform products with Big Data and in Talend Data Fabric.
Basic settings
Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. An output column must be named differently from any of the input columns, because the transformed feature data is written to new output columns rather than overwriting the input columns. Click Edit schema to make changes to the schema. |
Transformation table | Complete this table using columns from the input and the output schemas and the feature-processing algorithms to be applied to them. The algorithms available in the Transformation column vary depending on the type of the input column you select. For further information about the algorithms available for each type of input data, see ML feature-processing algorithms in Talend. |
Usage
Usage rule | This component is used as an intermediate step. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenario
For a scenario in which tModelEncoder is used, see Creating a classification model to filter spam.
tModelEncoder properties for Apache Spark Streaming
These properties are used to configure tModelEncoder running in the Spark Streaming Job framework.
The Spark Streaming
tModelEncoder component belongs to the Machine Learning family.
This component is available in Talend Real Time Big Data Platform and Talend Data Fabric.
Basic settings
Schema and Edit schema | A schema is a row description. It defines the number of fields (columns) to be processed and passed on to the next component. An output column must be named differently from any of the input columns, because the transformed feature data is written to new output columns rather than overwriting the input columns. Click Edit schema to make changes to the schema. |
Transformation table | Complete this table using columns from the input and the output schemas and the feature-processing algorithms to be applied to them. The algorithms available in the Transformation column vary depending on the type of the input column you select. For further information about the algorithms available for each type of input data, see ML feature-processing algorithms in Talend. |
Usage
Usage rule | This component is used as an intermediate step. This component, along with the Spark Streaming component Palette it belongs to, appears only when you are creating a Spark Streaming Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Spark Connection |
In the Spark
Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Related scenarios
No scenario is available for the Spark Streaming version of this component
yet.