August 15, 2023

Natural Language Processing using Talend Studio – Docs for ESB 6.x

Using Talend Studio and machine learning on Spark, you can teach computers to
understand how humans learn and use natural language.

What is natural language processing

Natural language processing tasks include:

  • text tokenization, which divides a text into basic units such as words or
    punctuation marks;

  • sentence splitting, which divides the input into sentences, based on
    ending characters, such as periods or question marks; and

  • named entity recognition, which finds and classifies person names, dates,
    locations and organizations in a text.
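
The first two tasks above can be illustrated with a minimal sketch in plain
Python. This is not how Talend Studio implements them: the function names
split_sentences and tokenize and the regular expressions are assumptions chosen
for illustration, and real tokenizers handle many more cases (abbreviations,
numbers, quotes).

```python
import re

def split_sentences(text):
    """Split text after '.', '?' or '!' when followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Alice met Bob in Paris. Did they visit Talend?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

Named entity recognition is not covered by this sketch, because it requires a
trained model rather than fixed rules.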

Natural language processing is useful to:

  • extract person names or company names from textual resources;

  • group forum discussions together by topics;

  • find discussions where people are mentioned but do not participate in the
    discussion; or

  • link entities.

Natural language processing can help you create links between user profiles
and mentions in the text, between persons and organizations, or between persons and any
other information that may be used for re-identification.


Machine learning with Spark usually consists of two phases: the first phase computes a model
based on historical data and mathematical heuristics, and the second phase applies
the model to text data. In Talend Studio, the first phase is
implemented by two Jobs:

  • the first one with the tNLPPreprocessing and the
    tNormalize components; and

  • the second one with the tNLPModel component.

The second phase is implemented by a third Job with the
tNLPPredict component.

In this workflow, tNLPPreprocessing:

  • divides a text sample into tokens; and

  • cleans the text sample by removing all HTML tags.

Then, tNormalize converts tokens to the CoNLL format.

You can then manually label the tokens and add optional features by editing the
files. For example, you can label person names with PER.
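
As a rough illustration of what a labeled file might look like, the sketch below
writes one token per line followed by a label, with a blank line between
sentences, in the general style of CoNLL data. The exact column layout produced
by tNormalize is not shown in this document, so the tab-separated two-column
layout, the O label for tokens outside any entity, and the to_conll helper are
all assumptions. The person_names set stands in for the manual labeling step.

```python
def to_conll(sentences, person_names=frozenset()):
    """Render tokenized sentences as 'token<TAB>label' lines.

    Tokens found in person_names get the PER label; all others get O.
    A blank line marks the end of each sentence.
    """
    lines = []
    for tokens in sentences:
        for token in tokens:
            label = "PER" if token in person_names else "O"
            lines.append(f"{token}\t{label}")
        lines.append("")
    return "\n".join(lines)

# Hypothetical sample: in practice the labels are added by hand.
sample = [["Alice", "met", "Bob", "."]]
print(to_conll(sample, person_names={"Alice", "Bob"}))
```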


Next, you can use the tokenized sample text you labeled with
tNLPModel in the second Job, where tNLPModel:

  • generates features for each token; and

  • trains a classification model.
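
The two steps above can be sketched in a few lines of Python. This is only a
toy illustration under stated assumptions: the features in token_features are
hypothetical, and the multiclass perceptron is a stand-in chosen for brevity,
not the algorithm that tNLPModel uses.

```python
from collections import defaultdict

def token_features(tokens, i):
    """Hypothetical features for tokens[i]: word form, shape, left neighbor."""
    word = tokens[i]
    prev = tokens[i - 1].lower() if i > 0 else "<START>"
    return [
        f"word={word.lower()}",
        f"is_title={word.istitle()}",
        f"suffix2={word[-2:].lower()}",
        f"prev={prev}",
    ]

def predict(weights, feats):
    """Return the label with the highest summed feature weight.

    Labels are visited in sorted order, so ties go to the first label.
    """
    return max(sorted(weights),
               key=lambda label: sum(weights[label][f] for f in feats))

def train_perceptron(examples, labels, epochs=10):
    """Train a multiclass perceptron: on each mistake, reward the gold
    label's weights and penalize the wrongly predicted label's weights."""
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for feats, gold in examples:
            guess = predict(weights, feats)
            if guess != gold:
                for f in feats:
                    weights[gold][f] += 1.0
                    weights[guess][f] -= 1.0
    return weights

# Hand-labeled training sentence (labels are illustrative).
sentence = ["Alice", "met", "Bob", "in", "Paris", "."]
gold = ["PER", "O", "PER", "O", "O", "O"]
examples = [(token_features(sentence, i), gold[i]) for i in range(len(sentence))]

weights = train_perceptron(examples, labels=["O", "PER"])
predicted = [predict(weights, feats) for feats, _ in examples]
print(list(zip(sentence, predicted)))
```

A single training sentence is far too little data for a useful model; the point
is only to show how labeled tokens become feature lists and then weights.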

tNLPPredict labels text data automatically using the
classification model generated by tNLPModel.

For example, you can extract the named entities labeled <PER>.
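
To make the idea concrete, the sketch below turns tokens and their predicted
labels into tagged text. The output format of tNLPPredict is not shown in this
document, so the closing </PER> tag, the space-joined output, and the
tag_entities helper are assumptions made for illustration.

```python
def tag_entities(tokens, labels):
    """Wrap each run of consecutive tokens sharing a non-O label
    in <LABEL>...</LABEL> and join everything with spaces."""
    out = []
    i = 0
    while i < len(tokens):
        label = labels[i]
        if label == "O":
            out.append(tokens[i])
            i += 1
            continue
        # Extend the span while the label stays the same.
        j = i
        while j < len(tokens) and labels[j] == label:
            j += 1
        span = " ".join(tokens[i:j])
        out.append(f"<{label}>{span}</{label}>")
        i = j
    return " ".join(out)

tokens = ["Alice", "Smith", "met", "Bob", "."]
labels = ["PER", "PER", "O", "PER", "O"]
print(tag_entities(tokens, labels))
```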


Source: Talend documentation.