Non-geographical Analysis#

Most of the datashader examples use geographic data, because it is so easily interpreted, but datashading will help exploration of any data dimensions. Here let’s start by plotting trip_distance versus fare_amount for the 12-million-point NYC taxi dataset from nyc_taxi.ipynb.

import numpy as np
import holoviews as hv
import holoviews.operation.datashader as hd
import datashader as ds
from holoviews import opts
hv.extension('bokeh')
opts.defaults(
    opts.Scatter(width=800, height=500, color='blue'),
    opts.RGB(width=800, height=500),
    opts.Curve(width=800))

Load NYC Taxi data#

These data have been transformed from the original database to a parquet file. It should take about 5 seconds to load (compared to 10-20 seconds when stored in the inefficient CSV file format…).

import dask.dataframe as dd

usecols = ['trip_distance','fare_amount','tip_amount','passenger_count']

%%time
df = dd.read_parquet('data/nyc_taxi_wide.parq')[usecols].persist()
CPU times: user 891 ms, sys: 144 ms, total: 1.04 s
Wall time: 1.04 s
df.tail()
          trip_distance  fare_amount  tip_amount  passenger_count
11842089            1.0          5.5        1.25                2
11842090            0.8          6.0        2.00                2
11842091            3.4         13.5        0.00                1
11842092            1.3         10.5        2.25                1
11842093            0.7          5.5        0.00                1

1,000 points reveal the expected linear relationship#

samples = df.sample(frac=1e-4)
scatter = hv.Scatter(samples, 'trip_distance', 'fare_amount')
labelled = scatter.redim.label(trip_distance="Distance, miles", fare_amount="Fare, $")
labelled.redim.range(trip_distance=(0, 20), fare_amount=(0,40)).opts(size=5)

10,000 points show more detailed, systematic patterns in fares and times#

Perhaps there are different metering options, along with granularity in how times and fares are counted; in any case, the times and fares do not uniformly populate any region of this space:

samples = df.sample(frac=1e-3)
scatter = hv.Scatter(samples, 'trip_distance', 'fare_amount')
labelled = scatter.redim.label(trip_distance="Distance, miles", fare_amount="Fare, $")
labelled.redim.range(trip_distance=(0, 20), fare_amount=(0,40)).opts(alpha=0.05, size=1)

Datashader reveals additional detail, especially when zooming in#

You can now see that there are a lot of points below the linear boundary, representing long trips for very little cost (presumably GPS errors?).

scatter = hv.Scatter(df, 'trip_distance', 'fare_amount')
ranged = scatter.redim.range(trip_distance=(0, 20), fare_amount=(0,40))
shaded = hd.spread(hd.datashade(ranged))
labelled = shaded.redim.label(trip_distance="Distance, miles", fare_amount="Fare, $")
labelled
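The claim about points below the linear boundary can also be checked directly rather than visually. Here is a minimal sketch using a toy stand-in for the taxi dataframe (the real one is a 12-million-row Dask frame); the thresholds of 10 miles and 5 dollars are illustrative assumptions, not values from the dataset:

```python
import pandas as pd

# Toy stand-in for the taxi dataframe (assumption: not the real data).
toy = pd.DataFrame({
    'trip_distance': [1.0, 15.0, 3.4, 18.0, 0.7],
    'fare_amount':   [5.5,  3.0, 13.5,  2.5, 5.5],
})

# Long trips (>10 miles) with implausibly low fares (<$5): these are the
# points below the linear boundary, presumably GPS or data-entry errors.
suspect = toy[(toy.trip_distance > 10) & (toy.fare_amount < 5)]
print(len(suspect))
```

On the real `df`, the same boolean filter would give a count of how many of the 12 million trips fall in that suspicious region.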

Here we’re using the default histogram-equalized color mapping to reveal density differences across this space. If we instead use a linear mapping (same code as above, with normalization='linear'), we can mainly see that there are a lot of values near the origin, while everything else is colored with the same minimum color (light blue by default):

shaded = hd.spread(hd.datashade(ranged, normalization='linear'))
labelled = shaded.redim.label(trip_distance="Distance, miles", fare_amount="Fare, $")
labelled

Fares are discretized to the nearest 50 cents, making patterns less visible. Still, there is both an upward trend in tips as fares increase (as expected) and, surprisingly, a large number of tips higher than the fare itself:

scatter = hv.Scatter(df, 'tip_amount', 'fare_amount')
ranged = scatter.redim.range(tip_amount=(0,40), fare_amount=(0,20))
shaded = hd.spread(hd.datashade(ranged))
labelled = shaded.redim.label(tip_amount="Tip, $", fare_amount="Fare, $")
labelled

Interestingly, tips go down when the number of passengers is greater than 1:

scatter = hv.Scatter(df, 'passenger_count', 'tip_amount')
ranged = scatter.redim.range(tip_amount=(0,60), passenger_count=(-0.5,6.5))
shaded = hd.spread(hd.datashade(ranged, x_sampling=0.15))
labelled = shaded.redim.label(passenger_count="Passengers", tip_amount="Tip, $")
labelled

Here we’ve reduced the resolution along the x axis so that this inherently discrete data produces visible horizontal line segments instead of isolated points.

The above plots use the HoloViews library, which builds Bokeh and Matplotlib plots from high-level specifications. For instance, Datashader currently only provides 2D aggregates, but you can easily make a zoomable one-dimensional histogram using HoloViews to dynamically collapse across a second dimension:

dataset = hv.Points(df, kdims=['fare_amount', 'trip_distance'], vdims=[]).select(fare_amount=(0,60))
agg = hd.rasterize(dataset, aggregator=ds.count(), streams=[hv.streams.RangeX()], x_sampling=0.5, width=500, height=2)
agg.apply.reduce(trip_distance=np.sum)

Here datashader is aggregating over both fare_amount and trip_distance, but trip_distance was specified to have only a height of 2, because it will be further collapsed to create the histogram being displayed. You can now use the wheel zoom tool when hovering over the x axis, and the plot will zoom in or out, dynamically resampling at the given location to make a new histogram (as long as there is a live Python server running).

In this particular plot, there is a very wide range of fare amounts, with an implausibly high maximum fare of over 4,000 dollars. Zooming in to the bulk of the data shows that nearly all fares fall between 4 and 20 dollars, following something like a gamma distribution, and that they are discretized to the nearest 50 cents in this dataset.
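The 50-cent discretization can be sanity-checked numerically as well as visually. A minimal sketch on toy fare values (an assumption standing in for `df.fare_amount`; the real check would run on that column):

```python
import numpy as np

# Toy fares standing in for df.fare_amount (illustrative values only).
fares = np.array([4.0, 5.5, 6.0, 13.5, 10.5, 7.25, 20.0])

# Fraction of fares landing exactly on a 50-cent grid; on the real data
# this should be close to 1.0 if fares are discretized as described.
remainder = np.mod(fares, 0.5)
on_grid = np.isclose(remainder, 0.0) | np.isclose(remainder, 0.5)
print(on_grid.mean())
```

The `np.isclose` calls guard against floating-point remainders that land a hair below 0.5 instead of wrapping to 0.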