Tidy data structure, messy data

Like families, tidy datasets are all alike but every messy dataset is messy in its own way.
- Hadley Wickham

We often work with datasets where there is a unique identifier that can be used to link or filter the data, typically involving some type of a join.

However, sometimes the identifiers that we wish to use to join are near matches. For instance, maybe you are joining a dataset by names, where one dataset has John, and the dataset you are joining has john. Or maybe there are text entry errors and it is instead Jon.

What you need is a way to join data that is a little bit fuzzy. Enter David Robinson’s excellent fuzzyjoin package.

Introducing an example - schools in the Virgin Islands

There are 20 schools listed in the public school database for the National Center for Education Statistics in the Virgin Islands. I’ve downloaded the dataset and you can pull it from my github repo through the code below. A scrollable kableExtra table to explore below (disabling a lot of features just to see the table):

Code in R

library(tidyverse)
library(fuzzyjoin)
library(kableExtra)
VI <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/VIschools.csv")

VI |> 
  kbl() |> 
  kable_paper() |> ##nice theme
  kable_styling(bootstrap_options = c("striped", "hover")) |> 
  scroll_box(height = "200px") ##adds scrolling

Public Schools of Virgin Islands
schoolname	nces	district	address	state	low_grade	hi_grade	total
Alfredo Andrews Elementary School	780000200002	Saint Croix School District	RFD 1 KINGSHILL	VI	PK	06	495
Claude O. Markoe Elementary School	780000200006	Saint Croix School District	PLOTS 71 75 MARS HILL	VI	PK	06	403
Eulalie Rivera	780000200011	Saint Croix School District	ROUTE 1 GROVE PLACE	VI	PK	08	645
St. Croix Educational Complex High School	780000200013	Saint Croix School District	RR2 KINGSHILL	VI	09	12	893
Juanita Gardine	780000200021	Saint Croix School District	ESTATE RICHMOND	VI	PK	08	305
Lew Muckle Elementary School	780000200023	Saint Croix School District	317 SION FARM	VI	PK	06	339
Pearl B. Larsen	780000200028	Saint Croix School District	ESTATE ST PETERS	VI	PK	08	432
Ricardo Richards Elementary School	780000200029	Saint Croix School District	491 BARREN SPOT	VI	PK	06	392
St. Croix Central High School	780000200030	Saint Croix School District	RSD 2 KINGSHILL	VI	09	12	717
John H. Woodson Junior High School	780000200037	Saint Croix School District	RURAL ROUTE 1 KINGSHILL	VI	07	08	461
Charlotte Amalie High School	780003000005	Saint Thomas - Saint John School District	8 and 9 ESTATE THOMAS	VI	09	12	1076
Ivanna Eudora Kean High School	780003000015	Saint Thomas - Saint John School District	1 and 2 ESTate NAZARETH	VI	09	12	738
Jane E. Tuitt Elementary School	780003000018	Saint Thomas - Saint John School District	19 LEVOKI STRAEDE	VI	KG	04	154
Joseph Gomez Elementary School	780003000019	Saint Thomas - Saint John School District	142 ANNAS RETREAT	VI	PK	05	462
Joseph Sibilly Elementary School	780003000020	Saint Thomas - Saint John School District	14 15 16 ESTATE ELIZABETH	VI	PK	05	226
Julius E. Sprauve School	780003000022	Saint Thomas - Saint John School District	14 18 ESTATE ENIGHED	VI	PK	08	225
Lockhart Elementary School	780003000024	Saint Thomas - Saint John School District	41 ESTATE THOMAS	VI	KG	08	977
Ulla F. Muller Elementary School	780003000026	Saint Thomas - Saint John School District	7B ESTATE CONTANT	VI	PK	05	401
Yvonne E. Milliner-Bowsky Elementary School	780003000027	Saint Thomas - Saint John School District	15B and 16 ESTATE MANDAHL	VI	PK	05	433
Bertha C. Boschulte Middle School	780003000034	Saint Thomas - Saint John School District	9 1 and 12A BOVONI	VI	06	08	538

Let’s say we have a dataset with the following names of schools, and we want to pull in information from NCES.

Code in R

sample <- read_csv("https://github.com/drjohnrussell/drjohnrussell.github.io/raw/refs/heads/master/posts/2025-01-17-fuzzyjoin-in-action/data/sample.csv")

sample |> 
  kbl() |> 
  kable_paper()

Dataset to join
schoolname	state
eulalie rivera	VI
Joseph Elementary School	VI
Alfredo Andrews Elementary School	NY

Enter the fuzzyjoin package, which allows the data to be messy in many ways, depending on what you need. Some of the ways to join outlined in the package are as follows:

difference_join - joins that are numeric and within a specified distance
geo_join - joins that use distances based on latitude and longitude
regex_join - joins that look for common regex patterns (text and position)
stringdist_join - joins that take into account small differences in the string

Let’s focus on stringdist_join for special case uses.

Joins that ignore case

Joins that ignore case, where you don’t have to mutate using something like the stringr str_to_lower on the data to change it is a big win. We can use it to match eulalie rivera.

This is what would happen as is:

Code in R

sample[1,1] |> 
  inner_join(VI,by=c("schoolname")) |> 
  kbl() |> 
  kable_paper()

An empty inner join
schoolname	nces	district	address	state	low_grade	hi_grade	total

On the other hand, we can use the stringdist_join and set ignore_case to equal TRUE.

Code in R

sample[1,1] |> 
  stringdist_join(VI,
                  by=c("schoolname"),
                  max_dist=0,
                  mode="inner",
                  ignore_case=TRUE) |> 
  kbl() |> 
  kable_paper()

A successful join
schoolname.x	schoolname.y	nces	district	address	state	low_grade	hi_grade	total
eulalie rivera	Eulalie Rivera	780000200011	Saint Croix School District	ROUTE 1 GROVE PLACE	VI	PK	08	645

Joins that take advantage of string distance

The way that I’ve taken advantage of this is in dealing with typos or data where someone may have subtly different names for a school (e.g., one has the word school while the other drops it). What is nice about the fuzzyjoin package is that you can do the join. What is less nice is that it requires you to do a line check afterwards, especially when you are loose with the distances.

Let’s look at the second row, Joseph Elementary School, and have a maximum distance of 8.

Code in R

sample[2,1] |> 
  stringdist_join(VI,
                   by="schoolname",
                   max_dist=8,
                   mode="inner",
                   ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()

Multiple rows loosely match
schoolname.x	schoolname.y	nces	district	address	state	low_grade	hi_grade	total	stringdistance
Joseph Elementary School	Joseph Gomez Elementary School	7.80003e+11	Saint Thomas - Saint John School District	142 ANNAS RETREAT	VI	PK	05	462	6
Joseph Elementary School	Joseph Sibilly Elementary School	7.80003e+11	Saint Thomas - Saint John School District	14 15 16 ESTATE ELIZABETH	VI	PK	05	226	8
Joseph Elementary School	Lockhart Elementary School	7.80003e+11	Saint Thomas - Saint John School District	41 ESTATE THOMAS	VI	KG	08	977	7

You can see the power, and the danger, of the fuzzyjoin package here. It’s amazing that it picks up the two schools that also have Joseph in their name, but it also suggests that you could change a few letters in the beginning and form the name of another school.

When we used fuzzyjoin, we would arrange the data by the school name and then the distance_col variable so that we could easily do a line check and choose the best match (if available).

Matching off of multiple columns

As a side note, it is nice to use the fuzzyjoin package for seeing errors in multiple columns. Here we can see how it works with the third row of the dataset.

Code in R

sample[3,] |> 
  stringdist_join(VI,
                  by=c("schoolname","state"),
                  mode="inner",
                  max_dist = 4,
                  ignore_case=TRUE,
                  distance_col="stringdistance") |> 
  kbl() |> 
  kable_paper()

schoolname.x	state.x	schoolname.y	nces	district	address	state.y	low_grade	hi_grade	total	schoolname.stringdistance	state.stringdistance	stringdistance
Alfredo Andrews Elementary School	NY	Alfredo Andrews Elementary School	780000200002	Saint Croix School District	RFD 1 KINGSHILL	VI	PK	06	495	0	2	NA

What is nice about this is that, in review, you can see easily through the distance_col variable where the changes were found. For instance, maybe you want the join to include state, but you want to exclude this example because you know that schools in other states may have the same name; it would just be as simple as adding a filter at the end to take out stringdist.state > 0.

Other notes

Others have used fuzzyjoin to great effect - I am inspired reading this vignette in particular on geo_joins

Citation

BibTeX citation:

@online{russell2025,
  author = {Russell, John},
  title = {Using Fuzzyjoin to Work with {NCES} Data},
  date = {2025-01-17},
  url = {https://drjohnrussell.github.io/posts/2025-01-17-fuzzyjoin-in-action/},
  langid = {en}
}

For attribution, please cite this work as:

Russell, John. 2025. “Using Fuzzyjoin to Work with NCES Data.” January 17, 2025. https://drjohnrussell.github.io/posts/2025-01-17-fuzzyjoin-in-action/.