August 15, 2023

The T-Swoosh algorithm – Docs for ESB 6.x

The T-Swoosh algorithm is based on the same idea as the Simple VSR Matcher algorithm,
but it creates a master record instead of considering existing records to be master
records.

To create master records, you can define survivorship rules that decide which attribute
values survive.

There are two types of survivorship rules:

  • The rules related to matching keys: each attribute used as a matching key can
    have a specific survivorship rule.

  • The default rules: they are applied to all the attributes of the same data type
    (Boolean, String, Date, Number).

If a column is a matching key, the rule related to matching keys specific to this column
is applied.

If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.
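
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the precedence (matching-key rule, then default rule for the data type, then Most common), not the component's actual implementation. The helper names resolve_rule, concatenate and longest, and the columns city and active, are hypothetical.

```python
from collections import Counter

def most_common(values):
    # Fallback: the value that occurs most often (first seen wins a tie).
    return Counter(values).most_common(1)[0][0]

def concatenate(values):
    # Hypothetical matching-key rule: join all values.
    return ", ".join(values)

def longest(values):
    # Hypothetical default rule for strings: keep the longest value.
    return max(values, key=len)

def resolve_rule(column, data_type, key_rules, default_rules):
    """Pick the survivorship function for one column."""
    if column in key_rules:                  # 1. rule specific to a matching key
        return key_rules[column]
    # 2. default rule for the data type, else 3. Most common
    return default_rules.get(data_type, most_common)

key_rules = {"fullName": concatenate}        # fullName is a matching key
default_rules = {"String": longest}          # no default defined for Boolean

print(resolve_rule("fullName", "String", key_rules, default_rules).__name__)
print(resolve_rule("city", "String", key_rules, default_rules).__name__)
print(resolve_rule("active", "Boolean", key_rules, default_rules).__name__)
```

The three calls resolve to concatenate, longest and most_common respectively.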

[Figure: tswoosh_algorithm_example_tmatchgroup.png]

Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.

For example, take the following set of records as input:

    id   fullName
    1    John Doe
    2    Donna Lewis
    3    John B. Doe
    4    Johnnie B. Doe

The survivorship rule uses the Concatenate function, with a comma (,) as the parameter
that separates the values.

At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:

  1. The algorithm takes record 1 and compares it with an empty set of records. Since
    record 1 does not match any record, it is added to the set of master records.
    The queue now contains record 2, record 3 and record 4. The lookup contains
    record 1.

  2. The algorithm takes record 2 and compares it with record 1. Since record 2 does
    not match any record, it is added to the set of master records. The queue
    now contains record 3 and record 4. The lookup contains record 1 and record
    2.

  3. The algorithm takes record 3 and compares it with record 1. Record 3 matches
    record 1, so record 1 and record 3 are merged to create a new master record
    called record 1,3. Record 1 is removed from the lookup. The queue now
    contains record 4 and record 1,3. The lookup contains record 2.

  4. The algorithm takes record 4 and compares it with record 2. Since it is not a
    match, record 4 is added to the set of master records. The queue now contains
    record 1,3. The lookup contains record 2 and record 4.

  5. The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
    1,3 matches record 4, so record 1,3 and record 4 are merged to create a new
    master record called record 1,3,4. Record 4 is removed from the lookup.
    Since record 1,3 was the result of a previous merge, it is removed from the
    table as well. The queue now contains record 1,3,4. The lookup contains record 2.

  6. The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
    match, record 1,3,4 is added to the set of master records. The queue is now
    empty. The lookup contains record 1,3,4 and record 2.
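
The six steps above can be reproduced with a minimal Python sketch of the queue and lookup loop. It rests on several assumptions: record 4 is taken to be Johnnie B. Doe, as in the output table. The same_person matcher is a toy stand-in (same last name, same first four letters of the first name) for the real matching functions. Merging uses Concatenate with a comma followed by a space. All function names are hypothetical.

```python
from collections import deque

def same_person(name_a, name_b):
    # Toy matcher (assumption): same last name and same 4-letter first-name prefix.
    a, b = name_a.split(), name_b.split()
    return a[-1] == b[-1] and a[0][:4] == b[0][:4]

def matches(rec_a, rec_b):
    # Two records match if any pair of their source names matches.
    return any(same_person(x, y) for x in rec_a["names"] for y in rec_b["names"])

def merge(rec_a, rec_b):
    # Build a new master record, keeping source rows in id order.
    pairs = sorted(zip(rec_a["ids"] + rec_b["ids"], rec_a["names"] + rec_b["names"]))
    return {"ids": [i for i, _ in pairs], "names": [n for _, n in pairs]}

def t_swoosh(rows):
    queue = deque({"ids": [i], "names": [name]} for i, name in rows)
    lookup = []                                  # current set of master records
    while queue:
        current = queue.popleft()
        partner = next((r for r in lookup if matches(current, r)), None)
        if partner is None:
            lookup.append(current)               # no match: becomes a master
        else:
            lookup.remove(partner)               # merged record leaves the lookup
            queue.append(merge(current, partner))  # new master is examined again
    return lookup

rows = [(1, "John Doe"), (2, "Donna Lewis"),
        (3, "John B. Doe"), (4, "Johnnie B. Doe")]
masters = t_swoosh(rows)
for m in masters:
    print(",".join(map(str, m["ids"])), "|", ", ".join(m["names"]))
```

With this toy matcher the loop ends with two master records, record 2 and record 1,3,4, the same final lookup as in step 6.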

The output will look like this:

    id     fullName                               GRP_ID  GRP_SIZE  MASTER  SCORE  GRP_QUALITY
    1,3,4  John Doe, John B. Doe, Johnnie B. Doe  0       3         true    1.0    0.72
    1      John Doe                               0       0         false   0.72   0
    3      John B. Doe                            0       0         false   0.72   0
    4      Johnnie B. Doe                         0       0         false   0.78   0
    2      Donna Lewis                            1       1         true    1.0    1.0
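
One way to read the group columns is sketched below. It assumes that GRP_SIZE on a master row counts the source records in the group and that GRP_QUALITY is the lowest SCORE among them. That reading is consistent with the values shown above, but it is an assumption here, not documented behavior. The function name group_columns is hypothetical.

```python
def group_columns(member_scores):
    """Derive the master-row columns from the members' scores (assumed semantics)."""
    return {
        "GRP_SIZE": len(member_scores),      # number of source records in the group
        "GRP_QUALITY": min(member_scores),   # weakest match in the group
    }

# Group 0: records 1, 3 and 4, scored against master record 1,3,4.
print(group_columns([0.72, 0.72, 0.78]))
# Group 1: Donna Lewis alone, so the only score is 1.0.
print(group_columns([1.0]))
```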


Document retrieved from Talend: https://help.talend.com