Natural Language Processing using Talend Studio
Natural language processing helps computers understand how humans learn and use natural language.
What is natural language processing?
Natural language processing includes tasks such as:
- text tokenization, which divides a text into basic units such as words or punctuation marks;
- sentence splitting, which divides the input into sentences based on ending characters, such as periods or question marks; and
- named entity recognition, which finds and classifies person names, dates, locations and organizations in a text.
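To make the first two tasks concrete, here is a minimal Python sketch of tokenization and sentence splitting with regular expressions. It is an illustration only: the function names `tokenize` and `split_sentences` are hypothetical, and nothing here reflects how Talend components implement these tasks.

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    """Split after a period, question mark or exclamation mark followed by whitespace."""
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

sample = "Where is Anna? She moved to Paris."
print(split_sentences(sample))
print(tokenize(sample))
```

A rule this simple breaks on abbreviations such as "Dr.", which is one reason production tools rely on trained models or richer rule sets.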
Natural language processing can be used to:
- extract person names or company names from textual resources;
- group forum discussions together by topic;
- find discussions where people are mentioned but do not participate in the discussion; or
- link entities.
Natural language processing can help you create links between user profiles
and mentions in the text, between persons and organizations, or between persons and any
other information that may be used for re-identification.
Workflow
Natural language processing works in two phases: the first phase builds a model based on historical data and mathematical heuristics, and the second phase applies the model to text data. In Talend Studio, the first phase is implemented by two Jobs:
- the first one with the tNLPPreprocessing and the tNormalize components; and
- the second one with the tNLPModel component.
The second phase is implemented by a third Job with the tNLPPredict component.
In the first Job, tNLPPreprocessing:
- divides a text sample into tokens; and
- cleans the text sample by removing all HTML tags.
Then, tNormalize converts tokens to the CoNLL format.
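As a rough illustration of what this first Job produces, the Python sketch below strips HTML tags, tokenizes the text, and writes one token per line in a CoNLL-style layout. The exact columns emitted by tNormalize are not described here, so the two-column layout (token, then a placeholder label `O`) is an assumption, and the function names are hypothetical.

```python
import re
from html import unescape

def strip_html(text):
    """Replace HTML tags with spaces, then decode entities such as &amp;."""
    return unescape(re.sub(r"<[^>]+>", " ", text))

def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_conll(tokens, default_label="O"):
    """One token per line, tab-separated from a placeholder label (assumed layout)."""
    return "\n".join(f"{tok}\t{default_label}" for tok in tokens)

raw = "<p>Anna Smith joined <b>Talend</b>.</p>"
print(to_conll(tokenize(strip_html(raw))))
```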
You then label the tokens in the CoNLL files. For example, you can label person names with PER.
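A labeled file might look like the sample embedded in the Python sketch below, where each line holds a token and its label. The sample sentence, the tab separator, and the `O` label for tokens outside any entity are assumptions made for illustration; only the `PER` label comes from the text above.

```python
# Hypothetical labeled sample: token, tab, label.
labeled_sample = "Anna\tPER\nSmith\tPER\njoined\tO\nTalend\tO\n.\tO"

def read_conll(data):
    """Parse CoNLL-style lines into (token, label) pairs, skipping blank lines."""
    pairs = []
    for line in data.splitlines():
        if line.strip():
            token, label = line.split("\t")
            pairs.append((token, label))
    return pairs

pairs = read_conll(labeled_sample)
# Collect the tokens that were labeled as person names.
print([tok for tok, lab in pairs if lab == "PER"])
```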
The labeled files are then passed to tNLPModel in the second Job, where tNLPModel:
- generates features for each token; and
- trains a classification model.
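The two steps above can be sketched in Python as follows. The features and the learning algorithm that tNLPModel actually uses are not stated in this text, so everything here is an assumption: the feature set is hypothetical, and a simple multiclass perceptron is swapped in as a stand-in for the real classifier.

```python
from collections import defaultdict

def token_features(tokens, i):
    """Describe token i by its own form and its neighbors (hypothetical feature set)."""
    tok = tokens[i]
    return [
        "bias",
        "lower=" + tok.lower(),
        "is_title=" + str(tok.istitle()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<START>"),
        "next=" + (tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>"),
    ]

def predict(weights, feats, labels):
    """Return the label with the highest summed feature weight."""
    return max(labels, key=lambda lab: sum(weights[(f, lab)] for f in feats))

def train(sentences, epochs=5):
    """Perceptron training: on a mistake, reward the true label and penalize the guess."""
    weights = defaultdict(float)
    labels = sorted({lab for sent in sentences for _, lab in sent})
    for _ in range(epochs):
        for sent in sentences:
            tokens = [tok for tok, _ in sent]
            for i, (_, gold) in enumerate(sent):
                feats = token_features(tokens, i)
                guess = predict(weights, feats, labels)
                if guess != gold:
                    for f in feats:
                        weights[(f, gold)] += 1.0
                        weights[(f, guess)] -= 1.0
    return weights, labels

# Tiny hand-labeled training set, invented for this example.
training = [
    [("Anna", "PER"), ("joined", "O"), ("Talend", "O"), (".", "O")],
    [("Yesterday", "O"), ("Bob", "PER"), ("left", "O"), (".", "O")],
]
weights, labels = train(training)
```

Two sentences are far too little data for a useful model; the point is only to show how per-token features feed a classifier that learns one weight per feature and label.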
tNLPPredict labels text data automatically using the
classification model generated by tNLPModel.
In the output, person names are marked with <PER> labels.
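As an illustration only, the Python sketch below shows one way predicted labels could be rendered as inline tags around runs of labeled tokens. The exact output layout of tNLPPredict is not shown in this text, so the `<PER>...</PER>` wrapping, the `O` label for unlabeled tokens, and the function name `render_labels` are all assumptions.

```python
from itertools import groupby

def render_labels(pairs, outside="O"):
    """Wrap each run of consecutive tokens sharing a non-outside label in <LABEL>...</LABEL>."""
    chunks = []
    for label, group in groupby(pairs, key=lambda p: p[1]):
        text = " ".join(tok for tok, _ in group)
        chunks.append(text if label == outside else f"<{label}>{text}</{label}>")
    return " ".join(chunks)

# Hypothetical predictions for a short sentence.
predicted = [("Anna", "PER"), ("Smith", "PER"), ("joined", "O"), ("Talend", "O")]
print(render_labels(predicted))
```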