Jaro-Winkler distance

From Wikipedia, the free encyclopedia

The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicate detection). The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.

The Jaro distance metric states that given two strings s1 and s2, their distance dj is:

d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right)

where:

  • m is the number of matching characters (see below);
  • t is the number of transpositions (see below).

Two characters from s1 and s2 respectively, are considered matching only if they are not farther than \left\lfloor\frac{\max(|s_1|,|s_2|)}{2}\right\rfloor-1.

Each character of s1 is compared with all its matching characters in s2. The number of matching (but different) characters divided by two defines the number of transpositions.

Jaro-Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length \ell. Given two strings s1 and s2, their Jaro-Winkler distance dw is:

d_w = d_j + (\ell p (1 - d_j))

where:

  • dj is the Jaro distance for strings s1 and s2
  • \ell is the length of common prefix at the start of the string up to a maximum of 4 characters
  • p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler's work is p = 0.1

Although often referred to as a distance metric, the Jaro–Winkler distance is actually not a metric in the mathematical sense of that term.

Contents

[edit] Example

Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro-Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings.

Given the strings s1 MARTHA and s2 MARHTA we find:

  • m = 6
  • | s1 | = 6
  • | s2 | = 6
  • There are mismatched characters T/H and H/T leading to t = \frac{2}{2} = 1

We find a Jaro score of:

d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-1}{6}\right) = 0.944

To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:

  • \ell = 3

Thus:

dw = 0.944 + (3 * 0.1(1 − 0.944)) = 0.961

Given the strings s1 DWAYNE and s2 DUANE we find:

  • m = 4
  • | s1 | = 6
  • | s2 | = 5
  • t = 0

We find a Jaro score of:

d_j = \frac{1}{3}\left(\frac{4}{6} + \frac{4}{5} + \frac{4-0}{4}\right) = 0.822

To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:

  • \ell = 1

Thus:

dw = 0.822 + (1 * 0.1(1 − 0.822)) = 0.84


Given the strings s1 DIXON and s2 DICKSONX we find:

D I X O N
D 1 0 0 0 0
I 0 1 0 0 0
C 0 0 0 0 0
K 0 0 0 0 0
S 0 0 0 0 0
O 0 0 0 1 0
N 0 0 0 0 1
X 0 0 0 0 0


  • m = 4 Note that the two Xs are not considered matches because they are outside the match window of 3.
  • | s1 | = 5
  • | s2 | = 8
  • t = 0

We find a Jaro score of:

d_j = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{8} + \frac{4-0}{4}\right) = 0.767

To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:

  • \ell = 2

Thus:

dw = 0.767 + (2 * 0.1(1 − 0.767)) = 0.813

[edit] References

[edit] See also

record linkage, census

[edit] External links

Languages