The T-Swoosh algorithm creates a master record instead of considering existing records to be master
records.
To create master records, you can design survivorship rules to decide which attribute
will survive.
There are two types of survivorship rules:
- The rules related to matching keys: each attribute used as a matching key can
  have a specific survivorship rule.
- The default rules: they are applied to all the attributes of the same data type
  (Boolean, String, Date, Number).
If a column is a matching key, the rule related to matching keys specific to this column
is applied.
If the column is not a matching key, the default survivorship rule for this data type is
applied. If the default survivorship rule is not defined for the data type, the
Most common survivorship function is used.
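The order in which a rule is chosen for a column can be sketched as follows. This is a minimal illustration, not the product's implementation: the names `resolve_rule`, `key_rules`, `default_rules`, `concatenate` and `longest` are assumptions made for this example, and only the fallback to a most-common function comes from the description above.

```python
from collections import Counter


def most_common(values):
    """Fallback rule: keep the value that occurs most often.

    On a tie, the value seen first wins (Counter ordering, Python 3.7+).
    """
    return Counter(values).most_common(1)[0][0]


def concatenate(values, sep=","):
    """Illustrative rule: join all values with a separator."""
    return sep.join(str(v) for v in values)


def longest(values):
    """Illustrative rule (assumption): keep the longest string value."""
    return max(values, key=len)


def resolve_rule(column, data_type, key_rules, default_rules):
    """Pick the survivorship function for one column.

    1. If the column is a matching key with its own rule, use that rule.
    2. Otherwise use the default rule defined for the column's data type.
    3. If no default rule exists for that type, fall back to most_common.
    """
    if column in key_rules:
        return key_rules[column]
    return default_rules.get(data_type, most_common)


# Hypothetical configuration: one matching key and one default rule.
key_rules = {"fullName": concatenate}
default_rules = {"String": longest}

rule_for_key = resolve_rule("fullName", "String", key_rules, default_rules)
rule_for_string = resolve_rule("city", "String", key_rules, default_rules)
rule_for_number = resolve_rule("age", "Number", key_rules, default_rules)
```

Keeping the lookup in one small function makes the precedence explicit: the matching-key rule always wins, and the most-common fallback only applies when nothing else is configured.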
Each time two records are merged to create a new master record, this new master record is
added to the queue of records to be examined. The two records that are merged are
removed from the lookup table.
For example, take the following set of records as input:
| id | fullName |
|---|---|
| 1 | John Doe |
| 2 | Donna Lewis |
| 3 | John B. Doe |
| 4 | Johnnie B. Doe |
The survivorship rule uses the Concatenate function with a comma (,) as the
parameter to separate values.
At the beginning of the process, the queue contains all the input records and the lookup
is empty. To process the input records, the algorithm iterates until the queue is
empty:
- The algorithm takes record 1 and compares it with an empty set of records. Since
  record 1 does not match any record, it is added to the set of master records.
  The queue now contains record 2, record 3 and record 4. The lookup table contains
  record 1.
- The algorithm takes record 2 and compares it with record 1. Since record 2 does
  not match any record, it is added to the set of master records. The queue
  now contains record 3 and record 4. The lookup table contains record 1 and
  record 2.
- The algorithm takes record 3 and compares it with record 1. Record 3 matches
  record 1, so record 1 and record 3 are merged to create a new master record
  called record 1,3. The queue now contains record 4 and record 1,3. The lookup
  table contains record 2.
- The algorithm takes record 4 and compares it with record 2. Since it is not a
  match, record 4 is added to the set of master records. The queue now contains
  record 1,3. The lookup table contains record 2 and record 4.
- The algorithm takes record 1,3 and compares it with record 2 and record 4. Record
  1,3 matches record 4, so record 1,3 and record 4 are merged to create a new
  master record called record 1,3,4. Record 4 is removed from the lookup table.
  Since record 1,3 was itself the result of a previous merge, it is also removed
  from the table. The queue now contains record 1,3,4. The lookup table contains
  record 2.
- The algorithm takes record 1,3,4 and compares it with record 2. Since it is not a
  match, record 1,3,4 is added to the set of master records. The queue is now
  empty. The lookup table contains record 1,3,4 and record 2.
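The queue and lookup table mechanics walked through above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual implementation: the matcher `same_surname` is a toy stand-in for real matching rules, `concat_merge` only mimics a Concatenate-style survivorship rule, and record 4 uses the name shown for it in the output table (Johnnie B. Doe).

```python
from collections import deque


def same_surname(a, b):
    """Toy matcher (assumption): two records match if any of their names
    end with the same last token, compared case-insensitively."""
    surnames_a = {name.split()[-1].lower() for name in a["names"]}
    surnames_b = {name.split()[-1].lower() for name in b["names"]}
    return bool(surnames_a & surnames_b)


def concat_merge(a, b):
    """Toy merge (assumption): keep every id and name of both records,
    placing the record with the lower first id first."""
    first, second = sorted((a, b), key=lambda r: r["ids"][0])
    return {"ids": first["ids"] + second["ids"],
            "names": first["names"] + second["names"]}


def t_swoosh(records, matches, merge):
    """Process the queue until it is empty and return the master records."""
    queue = deque(records)   # records still to be examined
    lookup = []              # current set of master records
    while queue:
        current = queue.popleft()
        partner = next((r for r in lookup if matches(current, r)), None)
        if partner is None:
            # No match: the record becomes a master record.
            lookup.append(current)
        else:
            # Match: drop the matched record from the lookup table and
            # queue the merged record so it is examined again.
            lookup.remove(partner)
            queue.append(merge(partner, current))
    return lookup


records = [
    {"ids": [1], "names": ["John Doe"]},
    {"ids": [2], "names": ["Donna Lewis"]},
    {"ids": [3], "names": ["John B. Doe"]},
    {"ids": [4], "names": ["Johnnie B. Doe"]},
]
masters = t_swoosh(records, same_surname, concat_merge)
for master in masters:
    print(",".join(map(str, master["ids"])), "|", ", ".join(master["names"]))
```

Re-queuing the merged record is what lets record 1,3 later pick up record 4: a merged record is compared again against everything that has entered the lookup table since.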
The output will look like this:
| id | fullName | GRP_ID | GRP_SIZE | MASTER | SCORE | GRP_QUALITY |
|---|---|---|---|---|---|---|
| 1,3,4 | John Doe, John B. Doe, Johnnie B. Doe | 0 | 3 | true | 1.0 | 0.72 |
| 1 | John Doe | 0 | 0 | false | 0.72 | 0 |
| 3 | John B. Doe | 0 | 0 | false | 0.72 | 0 |
| 4 | Johnnie B. Doe | 0 | 0 | true | 0.78 | 0 |
| 2 | Donna Lewis | 1 | 1 | true | 1.0 | 1.0 |