Jaro-Winkler distance
From Wikipedia, the free encyclopedia
The Jaro-Winkler distance (Winkler, 1999) is a measure of similarity between two strings. It is a variant of the Jaro distance metric (Jaro, 1989, 1995) and mainly used in the area of record linkage (duplicate detection). The higher the Jaro-Winkler distance for two strings is, the more similar the strings are. The Jaro-Winkler distance metric is designed and best suited for short strings such as person names. The score is normalized such that 0 equates to no similarity and 1 is an exact match.
The Jaro distance metric states that given two strings s1 and s2, their distance dj is:
where:
- m is the number of matching characters (see below);
- t is the number of transpositions (see below).
Two characters from s1 and s2 respectively, are considered matching only if they are not farther than
.
Each character of s1 is compared with all its matching characters in s2. The number of matching (but different) characters divided by two defines the number of transpositions.
Jaro-Winkler distance uses a prefix scale p which gives more favourable ratings to strings that match from the beginning for a set prefix length
. Given two strings s1 and s2, their Jaro-Winkler distance dw is:
where:
- dj is the Jaro distance for strings s1 and s2
is the length of common prefix at the start of the string up to a maximum of 4 characters- p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler's work is p = 0.1
Although often referred to as a distance metric, the Jaro–Winkler distance is actually not a metric in the mathematical sense of that term.
Contents |
[edit] Example
Note that Winkler's "reference" C code differs in at least two ways from published accounts of the Jaro-Winkler metric. First is his use of a typo table (adjwt) and also some optional additional tolerance for long strings.
Given the strings s1 MARTHA and s2 MARHTA we find:
- m = 6
- | s1 | = 6
- | s2 | = 6
- There are mismatched characters T/H and H/T leading to

We find a Jaro score of:
To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:
Thus:
- dw = 0.944 + (3 * 0.1(1 − 0.944)) = 0.961
Given the strings s1 DWAYNE and s2 DUANE we find:
- m = 4
- | s1 | = 6
- | s2 | = 5
- t = 0
We find a Jaro score of:
To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:
Thus:
- dw = 0.822 + (1 * 0.1(1 − 0.822)) = 0.84
Given the strings s1 DIXON and s2 DICKSONX we find:
| D | I | X | O | N | |
| D | 1 | 0 | 0 | 0 | 0 |
| I | 0 | 1 | 0 | 0 | 0 |
| C | 0 | 0 | 0 | 0 | 0 |
| K | 0 | 0 | 0 | 0 | 0 |
| S | 0 | 0 | 0 | 0 | 0 |
| O | 0 | 0 | 0 | 1 | 0 |
| N | 0 | 0 | 0 | 0 | 1 |
| X | 0 | 0 | 0 | 0 | 0 |
- m = 4 Note that the two Xs are not considered matches because they are outside the match window of 3.
- | s1 | = 5
- | s2 | = 8
- t = 0
We find a Jaro score of:
To find the Jaro-Winkler score using the standard weight d = 0.1, we continue to find:
Thus:
- dw = 0.767 + (2 * 0.1(1 − 0.767)) = 0.813
[edit] References
- Jaro, M. A. (1989). "Advances in record linking methodology as applied to the 1985 census of Tampa Florida". Journal of the American Statistical Society 84 (406): 414–420.
- Jaro, M. A. (1995). "Probabilistic linkage of large public health data file". Statistics in Medicine 14: 491–498. doi:.
- Winkler, W. E. (1999). "The state of record linkage and current research problems". Statistics of Income Division, Internal Revenue Service Publication R99/04.
- Winkler, W. E. (2006). "Overview of Record Linkage and Current Research Directions". Research Report Series, RRS.









