Jaro-Winkler distance

From Wikipedia, the free encyclopedia

The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicate detection). The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

The Jaro distance metric states that given two strings $s 1$ and $s 2$ , their distance $d j$ is:

$d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right)$

where:

$m$ is the number of matching characters (see below);
$t$ is the number of transpositions (see below).

Two characters from $s 1$ and $s 2$ respectively, are considered matching only if they are not farther than $\left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1$ .

Each character of $s 1$ is compared with all its matching characters in $s 2$ . The number of matching (but different) characters divided by two defines the number of transpositions.

Jaro-Winkler distance uses a prefix scale $p$ which gives more favourable ratings to strings that match from the beginning for a set prefix length $\ell$ . Given two strings $s 1$ and $s 2$ , their Jaro-Winkler distance $d w$ is:

$d_w = d_j + (\ell p (1 - d_j))$

where:

$d j$ is the Jaro distance for strings $s 1$ and $s 2$
$\ell$ is the length of common prefix at the start of the string up to a maximum of 4 characters
$p$ is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler's work is $p = 0.1$

Although often referred to as a distance metric, the Jaro–Winkler distance is actually not a metric in the mathematical sense of that term.

1 Example
2 References
3 See also
4 External links

[edit] Example

Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro-Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings.

Given the strings $s 1$ MARTHA and $s 2$ MARHTA we find:

$m = 6$
$| s 1 | = 6$
$| s 2 | = 6$
There are mismatched characters T/H and H/T leading to $t = \frac{2}{2} = 1$

We find a Jaro score of:

$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-1}{6}\right) = 0.944$

To find the Jaro-Winkler score using the standard weight $d = 0.1$ , we continue to find:

$\ell = 3$

Thus:

d w = 0.944 + (3 * 0.1(1 - 0.944)) = 0.961

Given the strings $s 1$ DWAYNE and $s 2$ DUANE we find:

$m = 4$
$| s 1 | = 6$
$| s 2 | = 5$
$t = 0$

We find a Jaro score of:

$d_j = \frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4-0}{4}\right) = 0.822$

To find the Jaro-Winkler score using the standard weight $d = 0.1$ , we continue to find:

$\ell = 1$

Thus:

d w = 0.822 + (1 * 0.1(1 - 0.822)) = 0.84

Given the strings $s 1$ DIXON and $s 2$ DICKSONX we find:

	D	I	X	O	N
D	1	0	0	0	0
I	0	1	0	0	0
C	0	0	0	0	0
K	0	0	0	0	0
S	0	0	0	0	0
O	0	0	0	1	0
N	0	0	0	0	1
X	0	0	0	0	0

$m = 4$ Note that the two Xs are not considered matches because they are outside the match window of 3.
$| s 1 | = 5$
$| s 2 | = 8$
$t = 0$

We find a Jaro score of:

$d_j = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{8} + \frac{4-0}{4}\right) = 0.767$

To find the Jaro-Winkler score using the standard weight $d = 0.1$ , we continue to find:

$\ell = 2$

Thus:

d w = 0.767 + (2 * 0.1(1 - 0.767)) = 0.813

[edit] References

Jaro, M. A. (1989). "Advances in record linking methodology as applied to the 1985 census of Tampa Florida". Journal of the American Statistical Society 84 (406): 414–420.
Jaro, M. A. (1995). "Probabilistic linkage of large public health data file". Statistics in Medicine 14: 491–498. doi:10.1002/sim.4780140510.
Winkler, W. E. (1999). "The state of record linkage and current research problems". Statistics of Income Division, Internal Revenue Service Publication R99/04.
Winkler, W. E. (2006). "Overview of Record Linkage and Current Research Directions". Research Report Series, RRS.