Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
- Hadley Wickham
We often work with datasets where there is a unique identifier that can be used to link or filter the data, typically involving some type of a join.
However, sometimes the identifiers that we wish to use to join are near matches. For instance, maybe you are joining a dataset by names, where one dataset has John, and the dataset you are joining has john. Or maybe there are text entry errors and it is instead Jon.
What you need is a way to join data that is a little bit fuzzy. Enter David Robinson’s excellent fuzzyjoin package.
Introducing an example - schools in the Virgin Islands
There are 20 schools listed in the public school database for the National Center for Education Statistics in the Virgin Islands. I’ve downloaded the dataset and you can pull it from my github repo through the code below. A scrollable kableExtra table to explore below (disabling a lot of features just to see the table):
Enter the fuzzyjoin package, which allows the data to be messy in many ways, depending on what you need. Some of the ways to join outlined in the package are as follows:
difference_join - joins that are numeric and within a specified distance
geo_join - joins that use distances based on latitude and longitude
regex_join - joins that look for common regex patterns (text and position)
stringdist_join - joins that take into account small differences in the string
Let’s focus on stringdist_join for special case uses.
Joins that ignore case
Joins that ignore case, where you don’t have to mutate using something like the stringr str_to_lower on the data to change it is a big win. We can use it to match eulalie rivera.
The way that I’ve taken advantage of this is in dealing with typos or data where someone may have subtly different names for a school (e.g., one has the word school while the other drops it). What is nice about the fuzzyjoin package is that you can do the join. What is less nice is that it requires you to do a line check afterwards, especially when you are loose with the distances.
Let’s look at the second row, Joseph Elementary School, and have a maximum distance of 8.
You can see the power, and the danger, of the fuzzyjoin package here. It’s amazing that it picks up the two schools that also have Joseph in their name, but it also suggests that you could change a few letters in the beginning and form the name of another school.
When we used fuzzyjoin, we would arrange the data by the school name and then the distance_col variable so that we could easily do a line check and choose the best match (if available).
Matching off of multiple columns
As a side note, it is nice to use the fuzzyjoin package for seeing errors in multiple columns. Here we can see how it works with the third row of the dataset.
What is nice about this is that, in review, you can see easily through the distance_col variable where the changes were found. For instance, maybe you want the join to include state, but you want to exclude this example because you know that schools in other states may have the same name; it would just be as simple as adding a filter at the end to take out stringdist.state > 0.
Other notes
Others have used fuzzyjoin to great effect - I am inspired reading this vignette in particular on geo_joins
Citation
BibTeX citation:
@online{russell2025,
author = {Russell, John},
title = {Using Fuzzyjoin to Work with {NCES} Data},
date = {2025-01-17},
url = {https://drjohnrussell.github.io/posts/2025-01-17-fuzzyjoin-in-action/},
langid = {en}
}