August 15, 2023

Converting the tokenized text to the CoNLL format – Docs for ESB 6.x

Before you can learn a classification model from a text, you must divide the text
into tokens and convert it to the CoNLL format, with one token per row, using
tNormalize.

  1. Double-click the tNLPPreprocessing component to open its
    Basic settings view and define its properties.

    use_case_tnlppreprocessing3.png


    1. Click Sync columns to retrieve the
      schema from the previous component connected in the Job.

    2. From the NLP Library list, select the library to
      be used for tokenization. In this example,
      ScalaNLP is used.
  2. From the Column to preprocess list, select the column
    that holds the text to be divided into tokens, which is
    message in this example.
  3. Double-click the tFilterColumns component to open its
    Basic settings view and define its properties.
  4. Click Edit schema and add the
    tokens column to the output schema, because this is
    the column to be normalized, then click OK to
    validate.
  5. Double-click the tNormalize component to open its
    Basic settings view and define its properties.

    use_case_tnlppreprocessing5.png


    1. Click Sync columns to retrieve the
      schema from the previous component connected in the Job.

    2. From the Column to normalize list, select
      tokens.
    3. In the Item separator field, enter
      "\t" to separate tokens using a tab in the
      output file.
  6. Double-click the tFileOutputDelimited component to open
    its Basic settings view and define its properties.

    use_case_tnlppreprocessing6.png


    1. Click Sync columns to retrieve the
      schema from the previous component connected in the Job.

    2. In the Folder field, specify the path to the
      folder where the CoNLL files will be stored.
    3. In the Row Separator field, enter
      "\n".
    4. In the Field Separator field, enter
      "\t" to separate fields with a tab.

  7. Press F6 to save and execute the
    Job.

The output files are created in the specified folder. The files contain a single
column with one token per row.

use_case_tnlppreprocessing7.png
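
The effect of the normalization step can be sketched outside the Job in a few
lines of Python. This is only an illustration under stated assumptions, not the
component's implementation: the function names and the sample text are
hypothetical, the tokens value is assumed to hold tab-separated tokens, and
dropping empty items is a choice made for this sketch.

```python
from typing import List


def normalize_tokens(tokens_field: str, item_separator: str = "\t") -> List[str]:
    """Split one delimited 'tokens' value into one token per row.

    Empty items are dropped (an assumption of this sketch).
    """
    return [tok for tok in tokens_field.split(item_separator) if tok]


def to_single_column(rows: List[str], row_separator: str = "\n") -> str:
    """Join the rows into the text of a single-column output file."""
    return row_separator.join(rows) + row_separator


# Hypothetical tokenized message, with tokens separated by tabs.
tokens_field = "John\tSmith\tlives\tin\tParis\t."
rows = normalize_tokens(tokens_field)
output = to_single_column(rows)
print(output, end="")
```

Each token ends up on its own line, which matches the single-column layout
described above.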

Before you learn a classification model from this text data, manually label
person names with PER and all other tokens with O:

use_case_tnlppreprocessing8.png
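
To show what the labeled data looks like, the following Python sketch writes a
label next to each token. It is a hedged illustration only: the labeling
described here is done by hand, so the function name, the lookup set of person
names and the sample tokens are hypothetical, and the layout (token, tab,
label) is an assumption based on the tab field separator configured earlier.

```python
from typing import Iterable, List, Set


def label_tokens(
    tokens: Iterable[str],
    person_names: Set[str],
    field_separator: str = "\t",
) -> List[str]:
    """Return one 'token<sep>label' line per token.

    Tokens found in person_names get the PER label, all others get O.
    """
    lines = []
    for tok in tokens:
        label = "PER" if tok in person_names else "O"
        lines.append(f"{tok}{field_separator}{label}")
    return lines


# Hypothetical example: a hand-made set of person names stands in for manual labeling.
labeled = label_tokens(["John", "lives", "in", "Paris"], {"John"})
print("\n".join(labeled))
```

A lookup set like this can only pre-fill obvious cases; the labels still need
to be reviewed by hand before they are used as training data.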


Source: Talend documentation, https://help.talend.com