Code&Data Insights

[Data Analytics] Entity Linkage - Atomic String Similarity | Gap Distance | Jaccard Distance | Jaro Similarity | Jaro-Winkler similarity

paka_corn 2023. 10. 24. 03:45

< Data Linkage >

:  the process of identifying and connecting records or data entries that correspond to the same real-world entity or individual in one or more data sources.

 

- Improves data quality and integrity

- Fosters re-use of existing data sources

- Optimizes storage space (duplicate records are merged instead of stored twice)

 

 

[ Atomic String Similarity ] 

Why is atomic string similarity important?

- Information retrieval : measures how similar two strings are

- Database quality : keeps data atomic and avoids duplicate records

- Text matching and pattern recognition

 

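- Edit distance : comparing the cost of insert / update (substitute) / delete

When two characters differ, the mismatch can be fixed either by insert + delete or by a single substitution.

if substitution is charged the same as insert + delete (insert = delete = 1, substitute = 2)

-> insert+delete = 2 , substitute = 2

=> the cost is SAME!

 

else (every operation costs the same: insert = delete = substitute = 1)

-> insert+delete = 2 , substitute = 1

=> substitution cost is lower than insertion+deletion!

 

When the two characters are the same, no operation is needed (cost 0).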
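The cost comparison above can be checked with a small dynamic-programming sketch. This is only an illustration, not the post's own code: the function name `edit_distance` and its parameters `ins`, `dele`, and `sub` are hypothetical, and identical characters are assumed to cost 0.

```python
def edit_distance(s1: str, s2: str, ins: int = 1, dele: int = 1, sub: int = 1) -> int:
    """Weighted edit distance between s1 and s2 (illustrative sketch)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    dp = [[0] * cols for _ in range(rows)]

    # Turning a prefix into the empty string needs only deletions,
    # building a prefix from the empty string needs only insertions.
    for i in range(1, rows):
        dp[i][0] = i * dele
    for j in range(1, cols):
        dp[0][j] = j * ins

    for i in range(1, rows):
        for j in range(1, cols):
            same = s1[i - 1] == s2[j - 1]
            dp[i][j] = min(
                dp[i - 1][j] + dele,                      # delete from s1
                dp[i][j - 1] + ins,                       # insert into s1
                dp[i - 1][j - 1] + (0 if same else sub),  # keep or substitute
            )
    return dp[-1][-1]


print(edit_distance("cat", "cut"))          # 1: one substitution at cost 1
print(edit_distance("cat", "cut", sub=2))   # 2: same as delete + insert
```

With `sub=2` a substitution never beats delete + insert, so the two cost schemes differ only when `sub` is lower than `ins + dele`.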

 

[ Gap Distance ]

-> Inserting a run of characters (a gap) always costs open gap + insert cost once, and then extend gap * (number of characters in the gap) is added

gap distance = open gap + insert cost + extend gap * (number of characters)

** A space is also counted as a character! **
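As a rough sketch of the gap distance formula above, the cost of one gap can be computed directly from the inserted text. The function name `gap_cost` and the default weights (1.0, 1.0, 0.1) are assumptions made for illustration; the post does not give concrete values.

```python
def gap_cost(inserted: str,
             open_gap: float = 1.0,
             insert_cost: float = 1.0,
             extend_gap: float = 0.1) -> float:
    """Cost of inserting `inserted` as one gap (hypothetical weights)."""
    if not inserted:
        return 0.0  # nothing inserted, so no gap is opened
    # len() counts every character, including spaces.
    return open_gap + insert_cost + extend_gap * len(inserted)


# "B. " has three characters because the trailing space counts too.
print(gap_cost("B. ", open_gap=1.0, insert_cost=1.0, extend_gap=0.5))  # 3.5
```

Because the open gap and insert costs are paid only once, one long gap is cheaper than several short gaps of the same total length.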

 

 

 

 

[ Jaro Similarity ]

- Example : DEIS vs DESI

Steps 

1) Calculate the matching distance

matching distance = floor( max(|S1|, |S2|) / 2 ) - 1 = floor( max(4,4) / 2 ) - 1 = 1 

 

DEIS

0123

 

DESI

0123

 

** The matching distance must be rounded down with the floor function! ** (1.5 -> 1) 

 

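2) Find C and T 

C (number of common characters: equal characters whose positions differ by at most the matching distance) = 4 

T (number of transpositions among the common characters) = 1

-> 'IS' & 'SI' : two characters in swapped order count as one transposition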

 

3) Find JaroSim

|S1| = 4, |S2|= 4

JaroSim = 1/3 * ( C/|S1| + C/|S2| + (C - T)/C )

JaroSim = 1/3 * ( 4/4 + 4/4 + 3/4) = 0.9167
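The three steps above can be sketched in Python as follows. This is a minimal illustration under common conventions rather than a reference implementation: the name `jaro_similarity` is hypothetical, matches are assigned greedily from left to right, and T is taken as half the number of common characters that appear in a different order.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity following the three steps in the text (sketch)."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0

    # Step 1: matching distance, rounded down by integer division.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    # Step 2a: find common characters within the matching distance.
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    common = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0

    # Step 2b: count transpositions (out-of-order common characters / 2).
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    # Step 3: JaroSim = 1/3 * (C/|S1| + C/|S2| + (C - T)/C)
    return (common / len(s1)
            + common / len(s2)
            + (common - transpositions) / common) / 3


print(round(jaro_similarity("DEIS", "DESI"), 4))  # 0.9167
```

For DEIS vs DESI the sketch finds C = 4 and T = 1, which reproduces the hand calculation.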

 

 

[ Jaro-Winkler similarity ]

Jaro-Winkler similarity = JaroSim + p * L * (1 - JaroSim)

p = 0.1 (prefix scaling factor)

L = length of the common prefix shared by the two strings (commonly capped at 4 characters) 
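As a small sketch of the formula above, the prefix bonus can be applied on top of an already computed Jaro similarity. The function name `jaro_winkler`, the `jaro_sim` argument, and the `max_prefix=4` cap are assumptions for illustration; the cap follows the common convention and is not stated in the post.

```python
def jaro_winkler(s1: str, s2: str, jaro_sim: float,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    """Apply the Winkler prefix bonus to a given Jaro similarity (sketch)."""
    # L: number of leading characters the two strings share, up to max_prefix.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return jaro_sim + p * prefix * (1 - jaro_sim)


# DEIS vs DESI: JaroSim = 11/12 (about 0.9167), common prefix "DE" gives L = 2.
print(round(jaro_winkler("DEIS", "DESI", 11 / 12), 4))  # 0.9333
```

Strings that share a longer prefix receive a larger boost, and the result never exceeds 1 as long as p * L stays at or below 1.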

 

 

 

 

 

 
