← All research

OD-means: a clustering model for global and local travel patterns

· IEEE Transactions on Intelligent Transportation Systems (2021) C. Heredia, S. Moreno, W. F. Yushimito

A clustering model that represents each taxi trip as an origin-destination pair and finds both the global and local movement patterns of a city, applied to taxi data from Santiago, Chile.


People move through cities constantly, and understanding how they do so matters for transportation planning. Studying that movement requires data, and the traditional source has been travel surveys. Surveys, however, are limited in both reach and accuracy. GPS-equipped taxi fleets offer an alternative that addresses both problems: they cover much of the city and record where trips actually begin and end. For this work we used GPS traces from taxis operating in Santiago, Chile, applying clustering methods to recover the city’s movement patterns.

The problem

We started by applying the classical clustering methods to our data: k-means, DBSCAN, and Infomap. The goal was to recover the travel patterns of the city, but the patterns these methods produced did not represent the real movement in the data. This is not a matter of tuning the parameters. It comes from how each model is built and how it processes the trips, and as a result each one fails in its own way.

DBSCAN forms clusters by looking at the density of points, growing a cluster wherever trips are packed closely together. In a city like Santiago, the densest concentration of trips sits around downtown, so that is where the clusters form. Almost every pattern found by DBSCAN ends up pointing toward the center, while flows that are spread across the rest of the city, such as trips to and from the airport, are never captured.

In the case of k-means, each cluster is represented by a centroid, which is the average of the trips assigned to it. When trips that share a similar origin but go to different destinations fall into the same cluster, that average lands between the real destinations, at a point where few trips actually go. The clearest case is the flow from the airport: instead of resolving into separate trips toward downtown and toward the financial district, it collapses into a single arrow pointing to the empty space between them.

Infomap detects communities through random walks over a grid of the city. Instead of separating Santiago into meaningful regions, it merges almost everything into a single large cluster. This happens because downtown still concentrates much of the city’s commercial and work activity, a remaining trait of Santiago’s monocentric structure, and that dominant center pulls the whole network into one community.

The approach: OD-means

OD-means stands for Origin-Destination means. Each trip is described by a single point in four dimensions: the latitude and longitude of its origin, together with the latitude and longitude of its destination. A trip is therefore treated as one object that carries both of its endpoints at once, rather than clustering origins and destinations as if they were separate.

Running k-means on these points already recovers the general patterns of the city, the large flows between major origins and destinations. But it fails in two ways. First, some of its clusters are simply wrong, like the airport flow described earlier, where the centroid lands between two real destinations and points to a place where few trips go. When we inspected this cluster by hand, we found that it was actually two separate travel patterns. Second, the small clusters look uninformative, just short trips concentrated in one area. When we inspected them by hand as well, we found that each one was actually hiding several distinct local patterns within the same zone. To solve these problems we developed OD-means, a new clustering model that captures both global and local patterns through an adaptive algorithm.

OD-means has two hierarchies, global and local.

The global hierarchy works as follows. First we run k-means on the OD-pairs. Then we analyze each cluster by running k-means again on the trips inside it, splitting it into two. To decide whether the split is worth keeping, we compare the within-cluster distance (WCD) of the original cluster against the sum of the WCDs of the two new sub-clusters. If the WCD drops by more than a threshold α, the cluster held more than one pattern, so we accept the split; otherwise we keep the original cluster. After checking every cluster, we run k-means again over the updated set and repeat until convergence. A large α keeps only the splits that reduce the WCD substantially, giving fewer and broader patterns, while a small α accepts smaller reductions and gives more patterns with fewer trips each.

After the global hierarchy converges, we merge origins and destinations that belong to the same place. Drop-offs at the airport, for example, can spread along a few hundred meters of road and look like many destinations when they are one. Any origins or destinations within one kilometer of each other are treated as the same point, using the average of their coordinates.

The local hierarchy recovers the patterns hidden inside the small clusters. After the global hierarchy, some clusters describe trips whose average origin and average destination are close together. When that distance falls below a threshold β, we take the trips in that cluster and run the global procedure on them again, on their own, and add the patterns it finds to the final set.

The results

The data comes from Fantaxico, a taxi app that has operated in Santiago since 2013. It records anonymous GPS pulses every 15 seconds, which amounts to around 300 GB of traces collected over several months between 2014 and 2016. From these traces the origin and destination of each trip were extracted, and after removing trips outside the city and trips shorter than 30 meters, we were left with 452,166 trips. The average trip is 10.6 kilometers, and most of them happen on weekdays.

In Santiago, OD-means recovered the structure of the city’s movement that the other methods missed. The global hierarchy found the main flows, and the local hierarchy found the patterns hidden inside small areas.

The global hierarchy captured several general patterns: the traffic to and from the airport, the trips to and from downtown and the financial district, and some highly populated areas in the periphery. It also captured the travel patterns to and from three small towns outside Santiago that usually serve as dormitory towns, Peñaflor, Talagante, and Calera de Tango. These were kept separate from the cluster that captures the trips to San Bernardo, another highly populated area on the periphery. One cluster also detected the traffic pattern between downtown and an industrial area.

Analyzing the local hierarchy, we also observed interesting local patterns. In the center we found several local clusters that capture the travel patterns around downtown, and on the right side local patterns in a high-income side of the city and the financial district. At the south, three short patterns surround a big empty space. Each of them corresponds to a populated middle-class area surrounding La Pintana, a very low income area, and no trips are related to it.

The cluster near the airport also called our attention. It grouped three different trips: from the airport to downtown, and to and from the bus station to Gran Valparaíso, the second largest urban agglomeration in Chile, made up of the twin coastal cities of Viña del Mar and Valparaíso.

Conclusion

OD-means recovers both the global and the local patterns of movement in a city, something the other methods we tried could not do. By treating each trip as an origin-destination pair and analyzing the clusters at two levels, it separates the large flows from the smaller patterns that usually stay hidden inside them. We validated the results against what we already knew about Santiago, and the model also revealed local movements that are not obvious at first sight.

If you would like to read about it in more detail, the full study is published in IEEE Transactions on Intelligent Transportation Systems. And if you work with origin-destination data and want to try it, the method is available as an R package on CRAN under the name ODMeans. If you have any questions about the model, or just want to talk about clustering and urban mobility, you are welcome to reach out at [email protected].