Code&Data Insights
[Data Analytics] Entity Linkage - Atomic String Similarity | Gap Distance | Jaccard Distance | Jaro Similarity | Jaro-Winkler similarity
paka_corn 2023. 10. 24. 03:45
< Data Linkage >
: the process of identifying and connecting records or data entries that correspond to the same real-world entity or individual in one or more data sources.
- Improves data quality and integrity
- Fosters re-use of existing data sources
- Optimizes storage space
[ Atomic String Similarity ]
Why is atomic string similarity important?
- Information retrieval : measures how similar two strings are
- Database quality : keeps data atomic in the database and avoids duplicate records
- Text matching and pattern recognition
[ Edit Distance ]
- Whether a substitution (update) is cheaper than an insertion plus a deletion depends on the cost model (insert = 1 and delete = 1 in both cases)
if substitution costs 2
-> insert+delete = 2 , substitute = 2
=> the cost is the SAME!
else (insert, substitute, and delete all cost 1)
-> insert+delete = 2 , substitute = 1
=> substitution costs less than insertion+deletion!
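The cost comparison above can be checked with a small sketch. This is a plain dynamic-programming edit distance, not code from the original post; the function name `edit_distance` and the parameter `sub_cost` are made up for illustration, and insert/delete are assumed to cost 1 each.

```python
def edit_distance(s1: str, s2: str, sub_cost: int = 1) -> int:
    """Edit distance with insert = delete = 1 and a configurable substitution cost."""
    # prev[j] = distance from the empty string to the first j characters of s2
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from the first i characters of s1 to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else sub_cost  # matching characters are free
            curr.append(min(
                prev[j] + 1,         # delete c1
                curr[j - 1] + 1,     # insert c2
                prev[j - 1] + cost,  # keep (match) or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("cat", "cut", sub_cost=1))  # 1 (one substitution)
print(edit_distance("cat", "cut", sub_cost=2))  # 2 (substitute, or delete + insert)
```

With `sub_cost=2`, substituting and delete+insert tie, so the result is the same whichever path is taken.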
[ Gap Distance ]
-> To insert a string as a gap, the open-gap cost and the insert cost are always added once, and then the extend-gap cost is added for each character
gap distance = open gap + insert cost + extend gap * (number of characters)
** a space is also treated as a character! **
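A minimal sketch of the gap formula above, taking it at face value. The name `gap_cost`, its parameters, and the cost values in the example are all hypothetical. Returning 0 when there is no gap is also an assumption, not something the post states.

```python
def gap_cost(num_chars: int, open_gap: float, insert_cost: float, extend_gap: float) -> float:
    """Cost of one gap: open gap + insert cost + extend gap * (number of characters)."""
    if num_chars <= 0:
        return 0.0  # assumption: no gap, no cost
    return open_gap + insert_cost + extend_gap * num_chars


# Hypothetical costs: opening = 1.0, insert = 1.0, each character in the gap = 0.5
print(gap_cost(3, open_gap=1.0, insert_cost=1.0, extend_gap=0.5))  # 3.5
```

Because a space counts as a character, a gap such as "A B" would be passed in as 3 characters, not 2.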
[ Jaro Similarity ]
- Example : DEIS vs DESI
Steps
1) Calculate the matching distance
matching distance = floor(max(4,4)/2) - 1 = 1
DEIS
0123
DESI
0123
** the matching distance is always rounded down with the floor function! ** (e.g. 1.5 -> 1)
2) Find C and T
C (number of common characters, i.e. characters that match within the matching distance) = 4
T (number of transpositions = half the number of characters in C that appear in a different order) = 1
-> 'IS' & 'SI'
3) Find JaroSim
|S1| = 4, |S2| = 4
JaroSim = 1/3 * ( C/|S1| + C/|S2| + (C-T)/C )
        = 1/3 * ( 4/4 + 4/4 + 3/4 ) = 0.9167
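The three steps above can be sketched in Python as follows. This is one common way to implement Jaro similarity (greedy left-to-right matching inside the window), not code from the original post; the function name `jaro_similarity` and the handling of empty strings are assumptions.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Jaro similarity following the steps: matching distance, C and T, then the formula."""
    if not s1 or not s2:
        return 1.0 if s1 == s2 else 0.0  # assumption: two empty strings count as identical

    # Step 1: matching distance, rounded down (never below 0)
    match_dist = max(max(len(s1), len(s2)) // 2 - 1, 0)

    # Step 2a: C = characters of s1 that find an unused equal character in s2
    # within match_dist positions
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    common = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - match_dist)
        hi = min(len(s2), i + match_dist + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0

    # Step 2b: T = half the number of matched characters that are out of order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    # Step 3: JaroSim = 1/3 * (C/|S1| + C/|S2| + (C-T)/C)
    return (common / len(s1) + common / len(s2)
            + (common - transpositions) / common) / 3


print(round(jaro_similarity("DEIS", "DESI"), 4))  # 0.9167
```

For DEIS vs DESI the sketch finds C = 4 and T = 1, which reproduces the hand calculation.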
[ Jaro-Winkler similarity ]
Jaro-Winkler similarity = JaroSim + P * L * (1 - JaroSim)
P = 0.1 (prefix scaling factor)
L = length of the common prefix of the two strings (capped at 4 in the standard definition)
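A small sketch of the formula above, applied to the DEIS vs DESI example. The function name `jaro_winkler` and its parameters are made up for illustration. It takes an already computed Jaro score as input. The cap of 4 on the prefix length follows the usual Winkler definition and is an assumption here, since the post does not state it.

```python
def jaro_winkler(jaro_sim: float, s1: str, s2: str,
                 p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler similarity = JaroSim + P * L * (1 - JaroSim)."""
    # L: number of leading characters the two strings share, capped at max_prefix
    prefix_len = 0
    for a, b in zip(s1, s2):
        if a != b or prefix_len == max_prefix:
            break
        prefix_len += 1
    return jaro_sim + p * prefix_len * (1 - jaro_sim)


# DEIS vs DESI: JaroSim = 11/12 (about 0.9167), common prefix "DE" -> L = 2
print(round(jaro_winkler(11 / 12, "DEIS", "DESI"), 4))  # 0.9333
```

The shared prefix "DE" lifts the score from 0.9167 to 0.9333, which is the point of the Winkler adjustment: strings that agree at the start are rewarded.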