Natural Language Processing using Talend Studio
understand how humans learn and use natural language.
What is natural language
text tokenization, which divides a text into basic units such as words or
sentence splitting, which divides the input into sentences, based on
ending characters, such as periods or question marks; and
named entity recognition, which finds and classifies person names, dates,
locations and organizations in a text.
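The first two tasks above can be illustrated with a minimal Python sketch. It is not how Talend Studio implements them: the function names `split_sentences` and `tokenize` are hypothetical, and the rules (split after `.`, `?` or `!`; treat each punctuation mark as its own token) are simplifying assumptions. Named entity recognition is left out because it normally requires a trained model.

```python
import re

def split_sentences(text):
    """Split text into sentences after '.', '?' or '!' followed by whitespace.

    Naive rule for illustration only: abbreviations such as "Dr." would
    be split incorrectly.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "Alice joined Talend in Paris. Did Bob reply?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

for sentence, sentence_tokens in zip(sentences, tokens):
    print(sentence, "->", sentence_tokens)
```

Real tokenizers handle abbreviations, numbers and language-specific rules that this sketch ignores.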
extract person names or company names from textual resources;
group forum discussions together by topics;
find discussions where people are mentioned but don’t participate in the
Natural language processing can help you create links between user profiles
and mentions in the text, between persons and organizations, or between persons and any
other information that may be used for re-identification.
based on historical data and mathematical heuristics, and the second phase applies
the model to text data. In Talend Studio, the first phase is
implemented by two Jobs:
the first one with the tNLPPreprocessing and the
tNormalize components; and
the second one with the tNLPModel component.
The second phase is implemented by a third Job with the
divides a text sample into tokens; and
cleans the text sample by removing all HTML tags.
Then, tNormalize converts tokens to the CoNLL format.
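As a rough illustration of what a CoNLL-style file looks like, the sketch below writes one token per line, followed by a tab and a label column, with a blank line between sentences. The helper name `to_conll` and the default label `O` (the usual "outside any entity" marker in CoNLL-style data) are assumptions made for this example. The exact columns that tNormalize produces are not described here.

```python
def to_conll(sentences, default_label="O"):
    """Render tokenized sentences as CoNLL-style text.

    sentences: a list of token lists, one list per sentence.
    default_label: placeholder label (assumption) written for every token,
    to be replaced later by manual labeling.
    """
    lines = []
    for tokens in sentences:
        for token in tokens:
            lines.append(f"{token}\t{default_label}")
        lines.append("")  # blank line marks the end of a sentence
    return "\n".join(lines)

conll_text = to_conll([["Alice", "works", "."], ["Hi", "!"]])
print(conll_text)
```

A file in this shape can then be edited so that the placeholder in the second column is replaced by the label that applies to each token.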
files. For example, you can label person names with
tNLPModel in the second Job where
generates features for each token; and
trains a classification model.
tNLPPredict labels text data automatically using the
classification model generated by tNLPModel.
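The two phases described above (generate features and train a model, then apply the model to new text) can be sketched in plain Python. This is a toy example under stated assumptions, not the algorithm used by tNLPModel or tNLPPredict: the classifier is a simple perceptron, the features in `token_features` are hypothetical, and the labels `PER` and `O` are illustrative placeholders.

```python
from collections import defaultdict

def token_features(tokens, i):
    """Hypothetical features for the token at position i."""
    word = tokens[i]
    return {
        "bias",
        "lower=" + word.lower(),
        "is_title=" + str(word.istitle()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"),
    }

def predict(weights, labels, feats):
    """Return the label with the highest summed feature weight."""
    def score(label):
        return sum(weights[(f, label)] for f in feats)
    return max(labels, key=score)

def train(sentences, epochs=10):
    """Phase 1: learn weights from (tokens, labels) pairs with a perceptron."""
    weights = defaultdict(float)
    labels = sorted({label for _, gold in sentences for label in gold})
    for _ in range(epochs):
        for tokens, gold in sentences:
            for i, y in enumerate(gold):
                feats = token_features(tokens, i)
                guess = predict(weights, labels, feats)
                if guess != y:
                    # Reward the correct label, penalize the wrong guess.
                    for f in feats:
                        weights[(f, y)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights, labels

def tag(weights, labels, tokens):
    """Phase 2: label each token of new text with the trained model."""
    return [predict(weights, labels, token_features(tokens, i))
            for i in range(len(tokens))]

# Tiny hand-labeled training set; "PER" and "O" are illustrative labels.
train_data = [
    (["Alice", "met", "Bob", "."], ["PER", "O", "PER", "O"]),
    (["Carol", "called", "Dave", "."], ["PER", "O", "PER", "O"]),
]
weights, labels = train(train_data)
print(tag(weights, labels, ["Erin", "emailed", "Frank", "."]))
```

With so little training data the model mostly learns that capitalized tokens are names, which is why production systems rely on much larger labeled corpora and richer features.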