August 17, 2023

tLoqateAddressRow – Docs for ESB 5.x

tLoqateAddressRow


Warning

This component will be available in the Palette of
Talend Studio on the condition that you have subscribed to one of
the Talend Platform products.

This address management component is the result of Talend collaboration with Loqate,
one of the world leaders for high quality, accurate location information.

For more information about the enterprise and its software tools, please visit
http://www.loqate.com/.

tLoqateAddressRow properties

Component family

Data Quality

 

Function

tLoqateAddressRow parses, verifies, cleanses,
standardizes, transliterates, and formats international addresses.

This component uses the Loqate Global Knowledge Repository containing definitive
address and geographic reference data for over 240 countries in multiple languages
and character sets.

tLoqateAddressRow uses the Q4, 2012
release.

Purpose

tLoqateAddressRow enables you to parse
structured or unstructured text into labeled address fields. It automatically puts address
components into the correct address field.

You can compare address data against reference data to ensure that it is
accurate and complete. You can correct spellings, add missing address data such as
city, city area, region or postcode, and enrich addresses with other elements such as
latitude, longitude and other relevant data.

Basic settings

Schema

A schema is a row description. It defines the number of fields to be processed
and passed on to the next component. The schema is either Built-in or stored remotely in the Repository.

Since version 5.6, both the Built-In mode and the Repository mode are
available in any of the Talend solutions.

 

 

Built-in: You create the schema and store it
locally for this component only. Related topic: see Talend Studio User Guide.

 

 

Repository: You have already created the schema
and stored it in the Repository. You can reuse it in various projects and job
designs. Related topic: see Talend Studio User Guide.

 

Edit Schema

Click the […] button and define the input and
output schema of the address data.

Make sure to define in the output schema all columns necessary to output the
formatted data you want to get from tLoqateAddressRow.

 

Input Address

Address field: add lines to the table and
select from the component predefined list the fields that will hold the input
address.

tLoqateAddressRow provides a long list of
individual fields because some countries have more complex addressing structures
than others. For further information about the input fields, see Address fields in tLoqateAddressRow.

Input Column: add lines to the table and select
from the list the columns that hold the input address. The input schema can have one
or multiple columns and can have columns that do not represent address data.

 

Output Address

Address field: add lines to the table and
select from the component predefined list the fields that will hold the output
address. The component will map the values of these fields to the output columns you
set in the table.

tLoqateAddressRow provides a long list of
individual fields because some countries have more complex addressing structures
than others. For further information about the output fields, see Address fields in tLoqateAddressRow.

Output Column: add lines to the table and
select from the list the columns that will hold the output address.

If an output column in the Output Address table has exactly the same name as an
input column, the input column value will be overwritten by the value given by
tLoqateAddressRow.

In the output schema, there are two output standard columns that are read-only:

STATUS: returns the status of processing input addresses. For
further information about process status, see Process status in tLoqateAddressRow.

ACCURACYCODE: returns the verification code for the processed
address. For further information about what values this code is made up of and the
implications of each segment, see Address verification codes in tLoqateAddressRow.

 

Loqate Data Path

Set the path to the Loqate Global Knowledge Repository provided by Loqate and
installed locally.

You must order and download the Loqate Local API and the Global Knowledge
Repository from http://www.loqate.com/. tLoqateAddressRow uses the Q4, 2012 release.

Advanced settings

Server options

Set the server options as follows:

Address Line Separator: define the string
which will separate the output address components within the output address fields.
The default separator is the line break string (<BR>).

Default Country: select the country name for
which the ISO 3166-1 alpha-3 code should be used when parsing data if no
identifiable country is found in an input record.

Forced Country: select the country name for
which the ISO 3166-1 alpha-3 code should be used for all input records when parsing
data.

Output Script: use this option to
transliterate the output address.

Select Latin to encode the parsing results in
Latin, or western characters.

Select Native to encode the parsing results
using the country script.

Below is a list of the character sets (scripts) and languages tLoqateAddressRow can transliterate:

Latn – Latin (Western characters),

Cyrl – Cyrillic (Russia),

Grek – Greek (Greece),

Hebr – Hebrew (Israel),

Hani – Kanji (Japan),

Hans – simplified Chinese (China),

Arab – Arabic (United Arab Emirates),

Thai – Thai (Thailand),

Hang – Hangul (South Korea),

Native – output in the native script wherever possible.

Minimum match score: specify the minimum match
score a record must reach in order not to be reverted. The default value is zero,
and valid values are between zero and 100.

This option is very helpful when you want to get, in the output fields, the
input data if a specific level of verification (minimum match score) was not
reached.

tStatCatcher Statistics

Select this check box to collect log data at the component level.

Global Variables

ERROR_MESSAGE: the error message generated by the
component when an error occurs. This is an After variable and it returns a string. This
variable functions only if the Die on error check box is
cleared, if the component has this check box.

A Flow variable functions during the execution of a component while an After variable
functions after the execution of the component.

To fill up a field or expression with a variable, press Ctrl + Space
to access the variable list and choose the variable to use from it.

For further information about variables, see Talend Studio
User Guide.

Usage

This component is an intermediary step. It requires both an input flow and an
output flow.

Limitation

n/a

Address fields in tLoqateAddressRow

Some countries have more complex addressing structures than others. As such, the use of
individual fields in this component will vary based on the input country and the available
reference data.

The table below lists all input and output fields in tLoqateAddressRow. [in] designates a field that can be used on input,
[out] designates a field that may be present on output, and [in,out] designates a field that
can be used for both input and output.

Field name

Description

Address [in,out]

used to specify the full mailing address in the relevant country.

Address1, Address2, ... Address12
[in,out]

used to specify input data for the address line in the relevant country, split
into individual address lines.

DeliveryAddress [out]

used to specify the full address including line breaks without the
Organization, Locality, AdministrativeArea
and PostalCode fields.

DeliveryAddress1, DeliveryAddress2, ... DeliveryAddress12 [out]

used to specify the individual lines contained within the
DeliveryAddress field.

Country [in]

used to provide the country name or code.

CountryName [out]

used to provide the ISO 3166 official country name.

ISO3166-2 [out]

used to provide the ISO 3166 2-character country code.

ISO3166-3 [out]

used to provide the ISO 3166 3-character country code.

ISO3166-N [out]

used to provide the ISO 3166 3-digit numeric country code.

SuperAdministrativeArea [in,out]

used to provide the largest geographic data element within a country.

AdministrativeArea [in,out]

used to provide the most common geographic data element within a country. For
instance, USA State, and Canadian Province.

SubAdministrativeArea [in,out]

used to provide the smallest geographic data element within a country. For
instance, USA County.

Locality [in,out]

used to provide the most common population center data element within a country.
For instance, USA City, Canadian Municipality.

DependentLocality [in,out]

used to provide a smaller population center data element, dependent on the
contents of the Locality field. For instance, Turkish
Neighborhood.

DoubleDependentLocality [in,out]

used to provide the smallest population center data element, dependent on both
the contents of the Locality and DependentLocality fields.
For instance, UK Village.

Thoroughfare [in,out]

used to provide the most common street or block data element within a country.
For instance, USA Street.

DependentThoroughfare [in,out]

used to provide the dependent street or block data element within a country. For
instance, UK Dependent Street.

Building [in,out]

used to provide the descriptive name identifying an individual location, if such
a name exists.

Premise [in,out]

used to provide the alphanumeric code identifying an individual location, if
such a code exists.

SubBuilding [in,out]

used to provide the secondary identifiers for a particular delivery point. For
instance, “FLAT 1” or “SUITE 212”.

PostalCode [in,out]

used to provide the complete postal code for a particular delivery point, if
such detail can be determined.

PostalCodePrimary [out]

used to provide the primary postal code used for a particular country. For
instance, USA Zip, Canadian Postcode, Indian PINcode.

PostalCodeSecondary [out]

used to provide secondary postal code information, if used in a particular
country and if such detail can be determined and reference data is available. For
instance, USA Zip Plus 4.

Organization [in,out]

used to provide the business name associated with a particular delivery point,
if such a name exists.

PostBox [out]

used to provide the post box for a particular delivery point, if it
exists.

Unmatched [out]

used to list any words that could not be matched to a particular address
component.

Latitude [out]

used to provide the WGS 84 latitude in decimal degrees format.

Longitude [out]

used to provide the WGS 84 longitude in decimal degrees format.

GeoAccuracy [out]

used to provide the GeoAccuracy code. For further information, see GeoAccuracy Code.

GeoDistance [out]

used to provide the radius of accuracy in meters, giving an indication of the
likely maximum distance between the given geocode and the physical location.
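
Multi-line output fields such as Address and DeliveryAddress carry their lines joined by the Address Line Separator option, which defaults to <BR>. The following is a minimal sketch, in plain Java, of how a downstream step might split such a value back into individual lines. The class name AddressLines and the sample address are made up for illustration and are not part of the component.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: splits a multi-line address value on the line separator. */
final class AddressLines {
    // Default separator named in the Server options of this component.
    static final String DEFAULT_SEPARATOR = "<BR>";

    static List<String> split(String address, String separator) {
        List<String> lines = new ArrayList<>();
        if (address == null || address.isEmpty()) {
            return lines;
        }
        // Pattern.quote treats the separator as literal text, not as a regex.
        for (String part : address.split(Pattern.quote(separator))) {
            String line = part.trim();
            if (!line.isEmpty()) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample value, not real component output.
        String sample = "1 Example Street<BR>Sampletown<BR>12345";
        for (String line : split(sample, DEFAULT_SEPARATOR)) {
            System.out.println(line);
        }
    }
}
```

If a different separator is set in the Advanced settings, the same string would need to be passed as the second argument.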

GeoAccuracy Code

The GeoAccuracy code is made up of the following values:

  • The geocoding status.

  • The geocoding level.

For example, the P3 geoaccuracy code implies:

  • P: a single geocode was found matching the input address.

  • 3: the geocode level is Thoroughfare.

The tables below give a detailed description of the geocoding status and level.

Geocoding status

Description

P (Point)

a single geocode was found matching the input address.

I (Interpolated)

a geocode could be interpolated from the location of the input address within a
range.

A (Average)

multiple candidate geocodes were found to match the input address, and an
average of these was returned.

U (Unable to geocode)

a geocode was not able to be generated for the input address.

Geocoding level

Description

5

delivery point (PostBox or SubBuilding).

4

premises (Premises or Building).

3

thoroughfare.

2

locality.

1

administrative area.

0

none.
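
As an illustration of the two tables above, the following sketch decodes a two-character GeoAccuracy value such as P3 into readable text. It assumes the value is always one status letter followed by one level digit, as in the P3 example. The class name GeoAccuracy is hypothetical and not part of the component.

```java
/** Hypothetical helper: decodes a GeoAccuracy value such as "P3". */
final class GeoAccuracy {
    // Index = geocoding level, taken from the level table above.
    private static final String[] LEVELS = {
        "none", "administrative area", "locality",
        "thoroughfare", "premises", "delivery point"
    };

    static String describe(String code) {
        if (code == null || code.length() != 2) {
            throw new IllegalArgumentException("Unexpected GeoAccuracy code: " + code);
        }
        String status;
        switch (code.charAt(0)) {
            case 'P': status = "point"; break;
            case 'I': status = "interpolated"; break;
            case 'A': status = "average"; break;
            case 'U': status = "unable to geocode"; break;
            default:
                throw new IllegalArgumentException("Unknown geocoding status in: " + code);
        }
        int level = code.charAt(1) - '0';
        if (level < 0 || level >= LEVELS.length) {
            throw new IllegalArgumentException("Unknown geocoding level in: " + code);
        }
        return status + ", level " + level + " (" + LEVELS[level] + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("P3")); // prints "point, level 3 (thoroughfare)"
    }
}
```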

Address verification codes in tLoqateAddressRow

The tLoqateAddressRow component outputs an
ACCURACYCODE column. This column holds the verification codes for processed
addresses.

The verification code is made up of the following values:

Verification code values

Description

The verification status

gives the overall result of the verification process (V, P, U, A, C or R).

The post-processed verification match level

gives the level to which the input data matches the available reference data once all
changes and additions performed during the verification process have been taken into account.

The pre-processed verification match level

gives the level to which the input data matches the available reference data prior to any
changes or additions performed during the verification process.

The parsing status

indicates whether all components of the input data were identified and parsed
(I or U).

The lexicon identification match level

gives the level to which the input data has some recognized form, through pattern
matching and lexicon matching.

The context identification match level

gives the level to which the input data can be recognized based on the context in which
it appears.

The postcode status

gives the status of the postal code, from P0 to P8.

The match score

gives the similarity between the input data and the closest reference data match, as a
percentage between 0 and 100.

For example, the V44-I44-P3-100 verification code implies:

  • Verification status = V (verified): a complete match was made between the input
    address and a single record from the available reference data.

  • Post-processed verification match level = 4 (premises): the level to which the input
    data matches the available reference data once all changes and additions performed during
    the verification process have been taken into account.

  • Pre-processed verification match level = 4 (premises): the level to which the input
    data matches the available reference data prior to any changes or additions performed
    during the verification process.

  • Parsing status = I (identified and parsed): all components of the input data have been
    able to be identified and placed into output fields.

  • Lexicon identification match level = 4 (premises): using pattern matching, a numeric
    value or word has been identified as a premises number or name.

  • Context identification match level = 4 (premises): using the least accurate form of
    matching, a numeric value or word has been identified as a premises number or name.

  • Postcode Status = P3 (added): the primary postal code for the country has been
    added.

  • Match score = 100 (complete similarity): the input data and closest reference data
    match completely.

The following sections explain each segment of the verification code in more
detail.
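
The layout of the verification code can be sketched in plain Java as follows. This is a minimal parser that assumes every ACCURACYCODE value follows the four hyphen-separated groups seen in the V44-I44-P3-100 example. The class name AccuracyCode and its field names are hypothetical, chosen for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: splits an ACCURACYCODE value into its eight segments. */
final class AccuracyCode {
    // Assumed layout, inferred from the example in the text:
    // <status><post><pre>-<parsing><lexicon><context>-<postcode status>-<match score>
    private static final Pattern LAYOUT = Pattern.compile(
        "([VPUACR])([0-5])([0-5])-([IU])([0-5])([0-5])-(P[0-8])-(\\d{1,3})");

    final char verificationStatus;
    final int postProcessedLevel;
    final int preProcessedLevel;
    final char parsingStatus;
    final int lexiconLevel;
    final int contextLevel;
    final String postcodeStatus;
    final int matchScore;

    private AccuracyCode(Matcher m) {
        verificationStatus = m.group(1).charAt(0);
        postProcessedLevel = Integer.parseInt(m.group(2));
        preProcessedLevel = Integer.parseInt(m.group(3));
        parsingStatus = m.group(4).charAt(0);
        lexiconLevel = Integer.parseInt(m.group(5));
        contextLevel = Integer.parseInt(m.group(6));
        postcodeStatus = m.group(7);
        matchScore = Integer.parseInt(m.group(8));
    }

    static AccuracyCode parse(String code) {
        Matcher m = LAYOUT.matcher(code == null ? "" : code.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected ACCURACYCODE: " + code);
        }
        AccuracyCode parsed = new AccuracyCode(m);
        if (parsed.matchScore > 100) {
            throw new IllegalArgumentException("Match score above 100 in: " + code);
        }
        return parsed;
    }

    public static void main(String[] args) {
        AccuracyCode c = parse("V44-I44-P3-100");
        System.out.println("status=" + c.verificationStatus
            + " post=" + c.postProcessedLevel
            + " pre=" + c.preProcessedLevel
            + " parsing=" + c.parsingStatus
            + " lexicon=" + c.lexiconLevel
            + " context=" + c.contextLevel
            + " postcode=" + c.postcodeStatus
            + " score=" + c.matchScore);
    }
}
```

Rejecting anything that does not match the assumed layout, rather than guessing, keeps malformed values from being silently misread.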

Verification status

The verification status can be one of the following:

Status

Description

V (Verified)

the address was parsed and an exact match in the reference data was found for
all the address components.

P (Partially Verified)

the reference data has more detail than the input data for the address. The
address was parsed and most of the components of the address were matched against
the reference data.

U (Unverified)

the input data could not be parsed. The output fields will contain the input
data.

A (Ambiguous)

more than one item in the reference data matches the input data.

C (Conflict)

individual address components are valid, but the address is not valid when
combining the components together.

R (Reverted)

the address was parsed and verified but a minimum acceptable level of
verification was not reached. The output fields will contain the input
data.

Post-processed verification match level

The post-processed verification match level gives the level to which the input data
matches the available reference data once all changes and additions performed during the
verification process have been taken into account.

Match level

Description

5

delivery point (PostBox or SubBuilding).

4

premises (Premises or Building).

3

thoroughfare.

2

locality.

1

administrative area.

0

none.

Pre-processed verification match level

The pre-processed verification match level gives the level to which the input data
matches the available reference data prior to any changes or additions performed during the
verification process.

Match level

Description

5

delivery point (PostBox or SubBuilding).

4

premises (Premises or Building).

3

thoroughfare.

2

locality.

1

administrative area.

0

none.

Parsing status

The parsing status can be one of the following:

  • I (identified and parsed): all input data was identified and placed
    into different address fields.

  • U (unable to parse): not all input data was identified and
    parsed.

Lexicon identification match level

The lexicon identification match level gives the level to which the input data has some
recognized form, through the use of:

  • pattern matching, for example a numeric value could be a premises number,
    and

  • lexicon matching, for example rd could be a
    Thoroughfare type (road) and
    London could be a Locality.

Match level

Description

5

delivery point (PostBox or SubBuilding).

4

premises (Premises or Building).

3

thoroughfare.

2

locality.

1

administrative area.

0

none.

Context identification match level

The context identification match level gives the level to which the input data can be
recognized based on the context in which it appears.

This is the least accurate form of matching and is based on identifying a word as, for
instance, a Thoroughfare based on it being preceded by something that could be
a Premise, and followed by something that could be a Locality, the
latter items being identified through a match against the reference data or the
lexicon.

Match level

Description

5

delivery point (PostBox or SubBuilding).

4

premises (Premise or Building).

3

thoroughfare.

2

locality.

1

administrative area.

0

none.

Postcode status

The postcode status can be one of the following values:

Status

Description

P8

PostalCodePrimary and PostalCodeSecondary are
verified.

P7

PostalCodePrimary is verified and
PostalCodeSecondary is added or changed.

P6

PostalCodePrimary is verified.

P5

PostalCodePrimary is verified with small change.

P4

PostalCodePrimary is verified with large change.

P3

PostalCodePrimary is added.

P2

PostalCodePrimary is identified by lexicon.

P1

PostalCodePrimary is identified by context.

P0

PostalCodePrimary is empty.
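
Because the postcode statuses are ordered from P0 to P8, a downstream check can work on the digit alone. The sketch below is one possible reading of the table above, not part of the component. The class name PostcodeStatus and its method names are hypothetical.

```java
/** Hypothetical helper: interprets a postcode status such as "P7". */
final class PostcodeStatus {
    /** Returns the digit of the status (0 to 8). */
    static int rank(String status) {
        if (status == null || !status.matches("P[0-8]")) {
            throw new IllegalArgumentException("Unexpected postcode status: " + status);
        }
        return status.charAt(1) - '0';
    }

    /** P4 to P8 are the rows the table describes as "PostalCodePrimary is verified". */
    static boolean primaryVerified(String status) {
        return rank(status) >= 4;
    }

    /** P7 and P8 are the only rows that mention PostalCodeSecondary. */
    static boolean mentionsSecondary(String status) {
        return rank(status) >= 7;
    }

    public static void main(String[] args) {
        System.out.println("P3 primary verified: " + primaryVerified("P3")); // false: P3 means added
        System.out.println("P7 primary verified: " + primaryVerified("P7")); // true
    }
}
```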

Match score

The match score gives the similarity between the input data and closest reference data
match as a percentage between 0 and 100. 100% means complete similarity.

Process status in tLoqateAddressRow

The tLoqateAddressRow component outputs a
STATUS column. This column holds the status of processing input addresses, as
follows:

Status

Description

psOK

the process completed normally. The score must be examined to determine result
accuracy.

psException

an exception occurred during processing input records, normally as a result of
malformed input data.

psServerUninitialized

the process could not be completed because the server has not been
initialized.

psInvalidInputRecord

the input record contains invalid data, often due to the supply of
non-UTF8/Unicode data.
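
A downstream step may want to separate rows that completed normally from rows that did not. The following sketch maps the four STATUS values listed above to short explanations. It assumes the column contains exactly these strings. The class name ProcessStatus is hypothetical.

```java
/** Hypothetical helper: interprets the STATUS column of an output row. */
final class ProcessStatus {
    /** True only for psOK. The match score still has to be examined separately. */
    static boolean completed(String status) {
        return "psOK".equals(status);
    }

    static String describe(String status) {
        switch (status == null ? "" : status) {
            case "psOK":
                return "completed normally; examine the score for accuracy";
            case "psException":
                return "exception during processing, normally malformed input data";
            case "psServerUninitialized":
                return "server has not been initialized";
            case "psInvalidInputRecord":
                return "invalid input data, often non-UTF8/Unicode";
            default:
                return "unknown status: " + status;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"psOK", "psException", "psServerUninitialized", "psInvalidInputRecord"};
        for (String s : samples) {
            System.out.println(s + " -> " + describe(s));
        }
    }
}
```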

Scenario: Parsing addresses against Loqate data

This scenario describes a three-component Job that:

  • uses the tFixedFlowInput component to generate the
    address data to be analyzed,

  • uses the tLoqateAddressRow component to parse,
    standardize and format the US addresses generated by the tFixedFlowInput component,

  • uses a tFileOutputExcel component to output the
    correctly formatted addresses in an .xls file.


Prerequisites: Before being able to use the tLoqateAddressRow component, you must order and download the Loqate
Local API and the Global Knowledge Repository from http://www.loqate.com/.

tLoqateAddressRow uses the Q4, 2012 release.

Setting up the Job

  1. Drop the following components from the Palette onto
    the design workspace: tFixedFlowInput, tLoqateAddressRow and tFileOutputExcel.

  2. Connect the three components together using the Main links.

Configuring the input component

  1. Double-click tFixedFlowInput to open its Basic settings view in the Component tab.

  2. Create the schema through the Edit Schema
    button.

    In the open dialog box, click the plus button and add the columns that will hold the
    information in the input address, in this example: address_input,
    COUNTRY and data_description.

  3. Click OK.

  4. In the Number of rows field, set the number of rows
    as 1.

  5. In the Mode area, select the Use Inline Content (delimited file) option, and set the row and field
    separators in the corresponding fields.

  6. In the Content table, enter the address data you
    want to analyze, for example:

Configuring the tLoqateAddressRow component

  1. Double-click tLoqateAddressRow to display the
    Basic settings view and define the component
    properties.

  2. Click the Edit schema button and define in the
    output schema all the columns necessary to hold the formatted address you want to get
    from tLoqateAddressRow.


    Two output columns are read-only: STATUS and
    ACCURACYCODE. The first column returns the status of processing input
    addresses. For further information about process status, see Process status in tLoqateAddressRow. The second column
    returns the verification code for the processed address. For further information about
    what values this code is made up of and the implications of each segment, see Address verification codes in tLoqateAddressRow.

    In this example, using the same address_input
    column in the output schema will output the input address. This can help you
    compare how the address elements were parsed and standardized.

  3. Click OK and accept to propagate the
    changes.

  4. In the Input Address table:

    • add lines in the table,

    • in the Address Field column, click a line and
      select from the list, predefined in the component, the fields that hold the input
      address, Address and Country in this example.

    • in the Input Column column, click a line and
      select from the list of the input schema the columns that hold the input address,
      address_input and COUNTRY in this example.

  5. In the Output Address table:

    • add lines in the table,

    • in the Address Field column, click a line and
      select from the list, predefined in the component, the fields that will hold the
      output address.

      The component will map the values of these fields to the output columns you set
      in this table.

      tLoqateAddressRow provides a long list of
      individual fields because some countries have more complex addressing structures
      than others. For further information about the output fields, see Address fields in tLoqateAddressRow.

    • in the Output Column column, click a line and
      select from the list the columns that will hold the standardized output
      address.

  6. In the Loqate Data Path field, set the path to the
    Loqate data folder provided by Loqate and installed locally.

Setting a JVM argument and finalizing the Job

  1. Double-click the tFileOutputExcel component to
    display the Basic settings view and define the
    component properties.

  2. Set the destination file name as well as the sheet name and then select the
    Include header and Define all columns auto size check boxes.

  3. Click the Run tab and then in the open view click
    Advanced settings.

  4. Select the Use specific JVM arguments check box and
    then click New….

  5. In the pop-up window, set the following JVM argument:
    -Djava.library.path=<path/to/libloqatejava.dll/folder/>.

    In this argument, you must indicate the folder where the loqate library, called
    libloqatejava.so on Linux or loqatejava.dll
    on Windows, is installed.

    Without the correct JVM argument setting, the following error is to be expected:
    java.lang.Error: java.lang.UnsatisfiedLinkError.

  6. Save your Job and press F6 to execute it.

    The tLoqateAddressRow reads the input address data.
    It parses, verifies, cleanses, standardizes addresses and gives the result in the output
    rows you defined in the output schema.


    tLoqateAddressRow matches input address data
    against the Loqate data file you downloaded locally.

    The STATUS standard output column returns the psOK status
    for all address rows. This means that the verification process of all address rows could
    be completed successfully by the component. For further information about process
    status, see Process status in tLoqateAddressRow.

    The ACCURACYCODE standard output column returns a verification code for
    each of the processed address rows. For example, the first verification code
    V44-I45-P7-100 means:

    • Verification status = V (verified): a complete match was made between the input
      address and a single record from the available reference data.

    • Post-processed verification match level = 4 (premises): the level to which the
      input data matches the available reference data once all changes and additions
      performed during the verification process have been taken into account.

    • Pre-processed verification match level = 4 (premises): the level to which the
      input data matches the available reference data prior to any changes or additions
      performed during the verification process.

    • Parsing status = I (identified and parsed): all components of the input data
      have been able to be identified and placed into output fields.

    • Lexicon identification match level = 4 (premises): using pattern matching, a
      numeric value or word has been identified as a premises number or name.

    • Context identification match level = 5 (delivery point, PostBox or SubBuilding):
      a numeric value or word has been identified as a post box number or sub building
      name.

    • Postcode Status = P7 (added): the primary postal code for the country has been
      verified and a secondary postal code has been added.

    • Match score = 100 (complete similarity): the input data and closest reference
      data match completely.

    For further information about what values this code is made up of and the
    implications of each segment, see Address verification codes in tLoqateAddressRow.
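
The JVM argument set in step 5 can be sanity-checked outside the Job before running it. The sketch below looks through the directories on java.library.path for the library file named in that step. It only tests that the file exists and does not load it. The class name LoqateLibraryCheck is hypothetical, and the file names are the two given in the text.

```java
import java.io.File;
import java.util.Locale;

/** Hypothetical standalone check for the native library folder. */
final class LoqateLibraryCheck {
    /** File name per the text: loqatejava.dll on Windows, libloqatejava.so on Linux. */
    static String libraryFileName(String osName) {
        boolean windows = osName != null
            && osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "loqatejava.dll" : "libloqatejava.so";
    }

    /** Returns the first directory in pathList that contains fileName, or null if none does. */
    static File findOnPath(String pathList, String fileName) {
        if (pathList == null) {
            return null;
        }
        for (String dir : pathList.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate.getParentFile();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String name = libraryFileName(System.getProperty("os.name"));
        File dir = findOnPath(System.getProperty("java.library.path"), name);
        if (dir == null) {
            System.out.println(name + " not found on java.library.path");
        } else {
            System.out.println(name + " found in " + dir);
        }
    }
}
```

Running it with the same -Djava.library.path value as the Job shows whether the folder is correct before an UnsatisfiedLinkError appears at execution time.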


Document get from Talend https://help.talend.com