Plotting very large datasets meaningfully, using datashader
There are many approaches to plotting large datasets, but most of them are unsatisfactory. Here we first show some of the issues, then demonstrate how the datashader library helps make large datasets truly practical.
We'll use part of the well-studied NYC Taxi trip database, with the locations of all NYC taxi pickups and dropoffs from the month of January 2015. Although we know what the data is, let's approach it as if we are doing data mining, and see what it takes to understand the dataset from scratch.
NOTE: This dataset is also explorable through the Datashader example dashboard. From inside the examples directory, run:
DS_DATASET=nyc_taxi panel serve --show dashboard.ipynb
import dask.dataframe as dd

usecols = ['dropoff_x','dropoff_y','pickup_x','pickup_y','dropoff_hour','pickup_hour','passenger_count']
%time df = dd.read_parquet('data/nyc_taxi_wide.parq')[usecols].persist()
df.tail()
CPU times: user 1.94 s, sys: 1.37 s, total: 3.31 s Wall time: 3.31 s
As you can see, this file contains about 12 million pickup and dropoff locations (in Web Mercator coordinates), with passenger counts.
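For readers unfamiliar with Web Mercator, the following is a minimal sketch of how a longitude/latitude pair maps to the kind of x/y values stored in columns such as `pickup_x` and `pickup_y`. The function name `lonlat_to_web_mercator` and the Times Square coordinates are illustrative assumptions, not part of the dataset or of this notebook; the formula is the standard spherical Web Mercator projection.

```python
import math

# Semi-major axis of the WGS84 ellipsoid in metres, used as the sphere
# radius by the Web Mercator projection.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Hypothetical helper: project degrees of longitude/latitude to
    Web Mercator x/y in metres (spherical formula)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Approximate location of Times Square (assumed here for illustration).
x, y = lonlat_to_web_mercator(-73.9855, 40.7580)
print(f"x = {x:.0f} m, y = {y:.0f} m")
```

Coordinates in this projection are in metres from the intersection of the equator and the prime meridian, which is why NYC locations have large negative x values and y values of several million.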
1000-point scatterplot: undersampling
Any plotting program should be able to handle a plot of 1000 datapoints. Here the points are initially overplotting each other, but if you hit the Reset button (top right of plot) to zoom in a bit, nearly all of them should be clearly visible in the following Bokeh plot of a random 1000-point sample. If you know what to look for, you can even see the outline of Manhattan Island and Central Park from the pattern of dots. We've included geographic map data here to help get you situated, though for a genuine data mining task in an abstract data space you might not have any such landmarks. In any case, because this plot is discarding 99.99% of the data, it reveals very little of what might be contained in the dataset, a problem called undersampling.
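The 99.99% figure follows directly from the sample size. As a rough sketch, assuming the approximate total of 12 million rows quoted above (not an exact count measured here), the arithmetic and a simple random choice of row indices look like this; the constant names and the seed are illustrative only:

```python
import random

TOTAL_ROWS = 12_000_000  # approximate row count from the text (assumption)
SAMPLE_SIZE = 1000

# Fraction of the dataset that survives the sampling step.
kept_fraction = SAMPLE_SIZE / TOTAL_ROWS
discarded_percent = 100 * (1 - kept_fraction)
print(f"kept {kept_fraction:.4%}, discarded {discarded_percent:.2f}%")

# Draw 1000 distinct row positions uniformly at random, without replacement.
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
sample_idx = rng.sample(range(TOTAL_ROWS), SAMPLE_SIZE)
```

With so few rows kept, any structure that occupies a small part of the data space is unlikely to be represented by even a single point.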
import numpy as np
import holoviews as hv
from holoviews import opts
from holoviews.element.tiles import StamenTerrain
hv.extension('bokeh')