
ODMeans: an R package for global and local cluster detection on origin-destination data

C. Heredia, S. Moreno, W. F. Yushimito · SoftwareX, vol. 26 (2024), article 101732

An R package that implements the OD-means model, applying it to origin-destination data to find global and local travel patterns and plotting the resulting clusters on a map.


The OD-means model finds both the global and local travel patterns of a city from origin-destination data. This entry is about the software side of that work: ODMeans, an R package that implements the model so it can be used without building the algorithm from scratch. It is available on CRAN, includes a real taxi dataset to experiment with, and can plot the resulting clusters directly on a map.

The package is built around two main functions. The first, odmeans(), applies the model to the data and returns the clusters: it takes the origin-destination trips and produces both the global and local patterns. The second, odmeans_graph(), takes that result and plots it on a map, drawing each cluster as an arrow that goes from its origin to its destination, with global and local patterns shown in different colors.

The data is provided as a table of trips, where each row contains the latitude and longitude of the origin and the latitude and longitude of the destination. Each row is therefore a single origin-destination pair, the unit the model works on.
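As a rough illustration of that layout, the sketch below builds a tiny trip table in base R. The column names and coordinates are made up for the example; the columns the package actually expects are described in its CRAN documentation.

```r
# Three made-up trips; the column names are illustrative, not the package's own
trips <- data.frame(
  origin_lat = c(-33.4372, -33.4489, -33.4170),
  origin_lon = c(-70.6506, -70.6693, -70.6060),
  dest_lat   = c(-33.4263, -33.4569, -33.4372),
  dest_lon   = c(-70.6170, -70.6483, -70.6506)
)

# Each row is one origin-destination pair: four numeric coordinates
str(trips)
nrow(trips)  # number of trips in the table
```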

The odmeans() function takes five main hyperparameters. Two of them set the initial number of clusters: numKGlobal for the global hierarchy and numKLocal for the local one. Since the model adjusts the number of clusters on its own, these only define the starting point. The other three are thresholds: limitSeparationGlobal, the drop in within-cluster distance required to split a global cluster; limitSeparationLocal, the same threshold applied in the local hierarchy; and distHierarchical, the distance between a cluster’s origin and destination below which the local hierarchy is triggered.

Beyond these five, the function takes two distance thresholds, maxDistGlobal and maxDistLocal, one for each hierarchy, which set the radius within which separate origins or destinations are treated as the same point. There is also kmeans_pp, a logical flag that initializes the centroids with k-means++ when set to TRUE. Every parameter is described in full in the package documentation on CRAN.

Using the package looks like this. After installing it from CRAN, we load the taxi data that comes included, run the model, and plot the result. The taxi data is the same dataset of 452,166 trips from Santiago described in the model’s paper, and it ships with the package as ODMeansTaxiData.

install.packages("ODMeans")
library(ODMeans)

# Load the taxi dataset included with the package
data(ODMeansTaxiData)

# Run OD-means
set.seed(42)
odmeans_data <- odmeans(ODMeansTaxiData, 10, 300, 1000, 2200, 3, 50, 100)

# Plot the clusters on a map
odmeans_plot <- odmeans_graph(odmeans_data, "ODMeans Taxi Graph", "roadmap", 11, FALSE)

The call to odmeans() follows the order of the function’s arguments: 10 initial global clusters, a global separation threshold of 300, a global merge distance of 1000 meters, a local-hierarchy trigger distance of 2200 meters, 3 initial local clusters, a local separation threshold of 50, and a local merge distance of 100 meters.
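For readability, the same call can be sketched with named arguments, using the parameter names introduced earlier. This assumes the package is installed and that those names match the function's formal arguments exactly; the data is still passed positionally as the first argument.

```r
# Hyperparameters from the example call, keyed by the names used in the text
params <- list(
  numKGlobal            = 10,    # initial number of global clusters
  limitSeparationGlobal = 300,   # split threshold in the global hierarchy
  maxDistGlobal         = 1000,  # merge distance in the global hierarchy
  distHierarchical      = 2200,  # trigger distance for the local hierarchy
  numKLocal             = 3,     # initial number of local clusters
  limitSeparationLocal  = 50,    # split threshold in the local hierarchy
  maxDistLocal          = 100    # merge distance in the local hierarchy
)

# Only run the model when the package is actually available
if (requireNamespace("ODMeans", quietly = TRUE)) {
  library(ODMeans)
  data(ODMeansTaxiData)
  set.seed(42)
  # Unnamed first element fills the data argument; the rest match by name
  odmeans_data <- do.call(odmeans, c(list(ODMeansTaxiData), params))
}
```

Keeping the hyperparameters in one named list makes it easier to see which number plays which role, and to change one threshold without miscounting positions.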

One practical note on odmeans_graph(): it draws the map background using Google Maps, so it needs a Google Maps API key to work. The key is registered in the R session with the register_google() function before calling the graph. You can get one from the Google Maps Platform, and the details are covered in the CRAN documentation.
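A minimal sketch of that setup follows. register_google() comes from the ggmap package; reading the key from an environment variable named GOOGLE_MAPS_API_KEY is an assumption made here to keep the key out of the script, not something the package requires.

```r
# Read the key from an environment variable (the variable name is hypothetical)
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

if (nzchar(api_key) && requireNamespace("ggmap", quietly = TRUE)) {
  # Register the key for this R session before calling odmeans_graph()
  ggmap::register_google(key = api_key)
} else {
  message("No Google Maps API key registered; odmeans_graph() cannot draw the map background.")
}
```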

The odmeans() function returns an S3 object similar to the one produced by R’s own kmeans() function, with eight properties that describe the result.

Property | Description
--- | ---
cluster | A vector with one entry per trip, indicating which cluster each trip was assigned to.
centers | A matrix with four columns, one row per cluster, giving each centroid as origin latitude, origin longitude, destination latitude, and destination longitude.
totss | The total sum of squares, a measure of the total variance in the data.
withinss | A vector with the within-cluster sum of squares for each cluster.
tot.withinss | The total within-cluster sum of squares, the sum of the withinss vector.
betweenss | The between-cluster sum of squares, the difference between totss and tot.withinss.
size | A vector with the number of trips in each cluster.
level_hierarchy | A vector marking each cluster as either global or local.

Seven of these properties are the same ones kmeans returns, which makes the output easy to work with if you already know that object. The one that is specific to OD-means is level_hierarchy: it tells you, for each cluster, whether it belongs to the global hierarchy or the local one, which is what lets you separate the general patterns from the local ones in the result.
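To make the structure concrete without requiring the package, the sketch below uses base R's kmeans() on synthetic trips as a stand-in for an odmeans() result, then attaches a made-up level_hierarchy vector. The "Global" and "Local" labels, and the idea that the i-th label describes cluster i, are assumptions for illustration; check the CRAN documentation for the exact values the package returns.

```r
set.seed(1)

# Synthetic trips (not real data): four coordinate columns, one row per trip
n <- 60
trips <- cbind(
  origin_lat = rnorm(n, -33.45, 0.02),
  origin_lon = rnorm(n, -70.65, 0.02),
  dest_lat   = rnorm(n, -33.42, 0.02),
  dest_lon   = rnorm(n, -70.60, 0.02)
)

# Stand-in result: plain k-means, which carries the seven shared properties
fit <- kmeans(trips, centers = 3)

# The sum-of-squares properties are related as the table describes
stopifnot(isTRUE(all.equal(fit$tot.withinss, sum(fit$withinss))))
stopifnot(isTRUE(all.equal(fit$betweenss, fit$totss - fit$tot.withinss)))

# Hypothetical hierarchy labels, one per cluster (assumed format)
fit$level_hierarchy <- c("Global", "Global", "Local")

# Select the trips assigned to clusters labeled as local
local_ids   <- which(fit$level_hierarchy == "Local")
local_trips <- trips[fit$cluster %in% local_ids, , drop = FALSE]

# The number of selected trips matches the sizes of the local clusters
nrow(local_trips) == sum(fit$size[local_ids])
```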

Conclusion

ODMeans makes the OD-means model available as a tool that anyone can use. Instead of implementing the two-hierarchy algorithm from scratch, you can install the package, pass in your origin-destination data, and get back the global and local patterns, along with a function to plot them on a map. It also ships with a real taxi dataset, so the model can be tried out before applying it to your own data.

If you work with origin-destination data and want to use it, the package is available on CRAN under the name ODMeans. And if you have any questions about the package or the model behind it, you are welcome to reach out at [email protected].