Ship Traffic#

Exploring AIS vessel-traffic data#

This Jupyter notebook demonstrates how to use the Datashader-based rendering in HoloViews to explore and analyze US Coast Guard Automatic Identification System (AIS) vessel-location data. AIS data includes vessels identified by their Maritime Mobile Service Identity numbers along with other data such as vessel type. Data is provided here for January 2020 (200 million datapoints), but additional months and years of data can be downloaded for US coastal areas from marinecadastre.gov, and with slight modifications the same code here should work for AIS data available for other regions. This notebook also illustrates a workflow for visualizing large categorical datasets in general, letting users interact with individual datapoints even though the data itself is never sent to the browser for plotting.

import os, requests, numpy as np, pandas as pd, holoviews as hv, holoviews.operation.datashader as hd
import dask.dataframe as dd, panel as pn, colorcet as cc, datashader as ds
import spatialpandas as sp, spatialpandas.io, spatialpandas.geometry, spatialpandas.dask

from PIL import Image
from holoviews.util.transform import lon_lat_to_easting_northing as ll2en
from dask.diagnostics import ProgressBar

hv.extension('bokeh', 'matplotlib')

Vessel categories#

AIS pings come with an associated integer VesselType, which broadly labels what sort of vessel it is. Different types of vessels are used for different purposes and behave differently, as we will see below when we color-code the location of each ping by the VesselType using Datshader.

Here, we’ll use type names defined in a separate file constructed using lists of 100+ AIS Vessel Types. We’ve further collapsed those types into a smaller number of vessel categories:

vessel_types=pd.read_csv("AIS_categories.csv")
vessel_types.iloc[34:37]
num desc category category_desc
34 34 Diving ops 5 Diving
35 35 Military ops 6 Military
36 36 Sailing 7 Sailing

We’ll further reduce the category to the 6 most common, with the rest as Other.

categories = {r.num: r.category if r.category in [0,2,3,19,12,18] else 21 for i, r in vessel_types.iterrows()}
categories[np.NaN] = 0

def category_desc(val):
    """Return description for the category with the indicated integer value"""
    return vessel_types[vessel_types.category==val].iloc[0].category_desc

vessel_mapping = dict(zip(vessel_types.num.to_list(), vessel_types.category.to_list()))

We can print the resulting categories by number, with their description:

groups = {categories[i]: category_desc(categories[i]) for i in vessel_types.num.unique()}
print(" ".join([f"{k}:{v}" for k,v in sorted(groups.items())]))
0:Unknown 2:Fishing 3:Towing 12:Tug 18:Passenger 19:Cargo 21:Other

We’ll map these categories to colors defined by colorcet and construct a color key and legend to use for plotting:

colors    = cc.glasbey_bw_minc_20_minl_30
color_key = {list(groups.keys())[i]:tuple(int(e*255.) for e in v) for i,v in 
              enumerate(colors[:(len(groups))][::-1])}
legend    = hv.NdOverlay({groups[k]: hv.Points([0,0], label=str(groups[k])).opts(
                             color=cc.rgb_to_hex(*v), size=0) 
                          for k, v in color_key.items()})
legend.options(xaxis='bare',yaxis='bare', title='', toolbar=None)

Load traffic data#

Next we will load the data from disk, either directly from a spatially indexed Parquet file (if previously cached) or from the raw CSV files. We’ll also project the location data to the coordinate system we will use later for plotting. There’s a lot of data to process, so we’ll use Dask to ensure that we use all the cores available on this machine. Dask breaks a dataset into partitions that can be processed in parallel, so here we define a function for dealing with one partition’s worth of data, along with a schema showing what the final dataframe’s columnar structure will be.

def convert_partition(df):
    east, north = ll2en(df.LON.astype('float32'), df.LAT.astype('float32'))
    return sp.GeoDataFrame({
        'geometry': sp.geometry.PointArray((east, north)),
        'MMSI':     df.MMSI.fillna(0).astype('int32'),
        'category': df.VesselType.replace(categories).astype('int32')})

example = sp.GeoDataFrame({
    'geometry': sp.geometry.PointArray([], dtype='float32'),
    'MMSI':     np.array([], dtype='int32'),
    'category': np.array([], dtype='int32')})

Next we will define the function that will load our data, reading a much-smaller (and much faster to load) previously cached Parquet-format file from disk if available. To use files covering other date ranges, just download them and change basedir and/or basename to match them.

basedir = './data/AIS_2020_01_broadcast.parq/ship_traffic/'
basename = 'AIS_2020_01'
index = 'MMSI'
dfcols = ['MMSI', 'LON', 'LAT', 'VesselType']
vesselcols = ['MMSI', 'IMO', 'CallSign', 'VesselName', 'VesselType', 'Length', 'Width']

def load_data(spatial_index=False):
    cache_file = basedir+basename+'_broadcast.parq'
    vessels_file = basedir+basename+'_vessels.parq'
    
    if (os.path.exists(cache_file) and os.path.exists(vessels_file)):
        print('Reading vessel info file')
        vessels = dd.read_parquet(vessels_file).compute()

        print('Reading parquet file')
        gdf = sp.io.read_parquet_dask(cache_file).persist()
        
    else:
        csvs = basedir+basename+'*.csv'
        with ProgressBar():
            print('Writing vessel info file')
            df = dd.read_csv(csvs, usecols=vesselcols, assume_missing=True)
            vessels = df.groupby(index).last().reset_index().compute()
            vessels[index] = vessels[index].astype('int32')
            vessels.to_parquet(vessels_file)

            print('Reading CSV files')
            gdf = dd.read_csv(csvs, usecols=dfcols, assume_missing=True)
            gdf = gdf.map_partitions(convert_partition, meta=example).persist()

            print('Writing parquet file')
            gdf = gdf.pack_partitions_to_parquet(cache_file, npartitions=64).persist()
    
    with ProgressBar():            
        if spatial_index: 
            print('Building spatial index') # Takes a couple of minutes for 1 month's data
            gdf = gdf.build_sindex().persist()
        gdf['category'] = gdf['category'].astype('category').cat.as_known()

    return gdf, vessels

Now let’s actually load the data, using the disk cache and memory cache if available. If you set spatial_index to True above it should speed up selection of individual points in the final plot, though load_data will then take several minutes rather than a few seconds.

%%time
df, vessels = pn.state.as_cached('df', load_data)
Reading vessel info file
Reading parquet file
[                                        ] | 0% Completed | 1.82 ms
[####                                    ] | 10% Completed | 105.25 ms
[########                                ] | 20% Completed | 206.61 ms
[###############                         ] | 38% Completed | 307.38 ms
[######################                  ] | 56% Completed | 408.24 ms
[##############################          ] | 76% Completed | 509.05 ms
[#####################################   ] | 93% Completed | 609.91 ms
[########################################] | 100% Completed | 710.93 ms

CPU times: user 11.5 s, sys: 2.1 s, total: 13.6 s
Wall time: 4.1 s

Here we can see that this is a collection of points (latitude and longitude projected to Web Mercator) with an associated integer MMSI and category value:

df.head(1)
geometry MMSI category
hilbert_distance
4673612 Point([-19029536.0, 775638.5]) 737898816 0

Predefined locations#

We’ll provide interactive plots later that let you zoom in anywhere you like, but first let’s highlight a few specific areas of interest for those without a live Python process to interact with:

def ranges(lon_range, lat_range):
    x_range, y_range = ll2en(lon_range, lat_range)
    return dict(x=tuple(x_range), y=tuple(y_range))

x_range, y_range = ll2en([-54,-132], [15,51])
bounds = dict(x=tuple(x_range), y=tuple(y_range))

loc = {
    'Continental US':     ranges((-132.0,  -54.0), (15.0, 51.0)),
    'Vancouver Area':     ranges((-126.0, -120.7), (47.5, 49.5)),
    'NY and NJ':          ranges(( -75.6,  -71.3), (39.4, 41.1)),
    'Gulf of Mexico':     ranges(( -98.0,  -81.0), (23.8, 32.0)),
    'Gulf Coast':         ranges(( -98.0,  -87.0), (25.2, 31.0)),
    'Louisiana Coast':    ranges(( -91.5,  -87.8), (28.4, 30.1)),
    'Mississipi Delta':   ranges(( -90.1,  -89.2), (28.65,29.15)),
    'Louisiana East Bay': ranges(( -89.37, -89.28),(28.86,28.9)),
    'Bering Sea':         ranges((-171.0, -159.0), (52.0, 56.0)),
    'Hawaii':             ranges((-160.0, -154.5), (19.5, 22.1)),
    'LA to San Diego':    ranges((-120.5, -117.0), (32.6, 34.1)),
    'Great Lakes':        ranges(( -89.0,  -77.0), (41.2, 46.1)),
    'Chesapeake Bay':     ranges(( -78.0,  -71.0), (36.4, 39.6)),
    'Pamlico Sound, NC':  ranges(( -80.0,  -72.5), (33.1, 36.8)),
    'Savannah, GA':       ranges(( -81.2,  -80.3), (31.85,32.25)),
    'Florida':            ranges(( -90.0,  -74.5), (23.3, 31.0)),
    'Puerto Rico':        ranges(( -68.1,  -64.2), (17.4, 19.5))}

We can easily render these to PNGs using HoloViews to call Datashader and render the results using Matplotlib:

hv.output(backend='matplotlib')
hv.opts.defaults(
    hv.opts.RGB(xaxis=None, yaxis=None, axiswise=True, bgcolor='black'),
    hv.opts.Layout(hspace=0.0, vspace=0.1, sublabel_format=None, framewise=True, fig_size=400))

plots = [hd.datashade(hv.Points(df), color_key=color_key, cmap=cc.fire, width=1000, height=600,
                                dynamic=True, x_range=ranges['x'], y_range=ranges['y']).relabel(region)
         for region, ranges in loc.items()]
hv.Layout(plots).cols(1)

Even more structure is visible if we color by the vessel category using the color key we defined above:

plots = [hd.datashade(hv.Points(df, vdims='category'), color_key=color_key,
                                aggregator=ds.count_cat('category'), width=1000, height=600,
                                dynamic=True, x_range=ranges['x'], y_range=ranges['y']).relabel(region)
         for region, ranges in loc.items()]
hv.Layout(plots).cols(1)
hv.output(backend='bokeh')

Interactive plots#

To let you explore the data yourself, we can plot it using the Bokeh backend, which provides JavaScript-based interactive plotting in a web browser.

To zoom in & interact with the plot, click the “Wheel zoom” tool in the Bokeh toolbar on the side of the plot and click and drag the plot in order to look around, or use the “Box Zoom” tool to select an area of interest. As you zoom in, finer-grained detail will emerge and fill in, as long as you have a live Python process running to render the data dynamically using Datashader. Depending on the size of the dataset and your machine, updating the plot might take a few seconds.

pts    = hv.Points(df, vdims=['category']).redim.range(**loc['Continental US'])
points = hd.dynspread(hd.datashade(pts, aggregator=ds.count_cat('category'), color_key=color_key))

tiles  = hv.element.tiles.ESRI().opts(alpha=0.4, bgcolor="black").opts(responsive=True, min_height=600)
labels = hv.element.tiles.StamenLabels().opts(alpha=0.7, level='glyph')

tiles * points * labels * legend.opts(xaxis='bare',yaxis='bare', title='')

Using the color key, you can see that vessel behavior is highly dependent on category, with very different patterns of motion between these categories (and presumably the other categories not shown). E.g. fishing vessels tend to hug the coasts in meandering patterns, while cargo vessels travel along straight lines further from the coast. If you zoom in to a river, you can see that passenger vessels tend to travel across narrow waterways, while towing and cargo vessels travel along them. Zooming and panning (using the tools at the right) reveal other patterns at different locations and scales.

Selecting specific datapoints#

Datashader renders data into a screen-sized array of values or pixels, which allows it to handle much larger volumes of data than can be sent to a web browser. What if you what to interact with the underlying data, e.g. to get information about clusters of datapoints or even individual datapoints? For instance, here’s a challenge: look up at the “NY and NJ” plot above, and you’ll see some pink circles and lines that turn out to be in the middle of New York State and Pennsylvania, far from the ocean. What could those be?

To find out more about particular datapoints that we see, we can use HoloViews and Bokeh tools to watch for the x,y location of a tap, then query the underlying dataset for a ping in that region, and then then highlight it on top of the main plot.

columns = ['VesselName', 'MMSI', 'IMO', 'VesselType']

def format_vessel_type(num):
    if np.isnan(num): num = 0
    return f'{num:.0f} ({vessel_types.loc[num].desc})'

def brief_vessel_record(df):
    return df.iloc[:1].merge(vessels, on='MMSI').merge(vessel_types, on='category')[['geometry']+columns]
pts2        = hv.Points(df, vdims=['category']).redim.range(**loc['Vancouver Area'])
pointsp     = hd.dynspread(hd.datashade(pts2, color_key=color_key, aggregator=ds.count_cat('category'), min_alpha=90))

max_hits    = pn.widgets.IntSlider(value=2, start=1, end=10, name="Max hits", sizing_mode='stretch_width')
highlighter = hd.inspect_points.instance(streams=[hv.streams.Tap], transform=brief_vessel_record,
                                         x=-13922122, y=6184391) # optional initial values for static web page

highlight   = highlighter(pointsp).opts(color='white', tools=["hover"], marker='square', 
                                        size=10, fill_alpha=0)

#tiles * pointsp * highlight * legend

We could view the result above by uncommenting the last line, but let’s just go ahead and make a little Panel app so we can add a few extra interactive features. First, some code to fetch a photo of the selected vessels, if available, plus additional info about each vessel:

def get_photo_url(mmsi):
    headers = {'User-Agent': 'Mozilla/5.0'}
    r=requests.get(f'{ship_url}{mmsi}', allow_redirects=True, headers=headers)
    ship_id = [el for el in r.url.split('/') if el.startswith('shipid')]
    if ship_id == []: return ''
    ship_id =ship_id[0].replace('shipid:','')
    return f"https://photos.marinetraffic.com/ais/showphoto.aspx?shipid={ship_id}&size=thumb300&stamp=false"

def get_photos(df=None, n_records=2):
    photos = []
    if df is not None and 'MMSI' in df.columns:
        for mmsi in df.iloc[:n_records].MMSI.drop_duplicates().to_list():
            try:
                url = get_photo_url(mmsi)
                response = requests.get(url, stream=True)
                im = Image.open(response.raw)
                photos += [pn.Column('<b>MMSI: %s' % mmsi,im)]               
            except Exception:
                pass
    return pn.Row(*([pn.Spacer(sizing_mode='stretch_width')]+photos+[pn.Spacer(sizing_mode='stretch_width')]))

ship_url = 'https://tinyurl.com/aispage/mmsi:'
def full_vessel_record(df, n_records=2):
    "Given a dataframe that includes MMSI values, return with URL, vessel info added"
    if not len(df.columns):
        return None
    df_with_info  = df.iloc[:n_records].merge(vessels, on='MMSI')
    df_with_types = df_with_info.merge(vessel_types, how='left', left_on='VesselType', right_on='num')[columns]
    df_with_types['URL'] = df_with_types.MMSI.apply(lambda x: f'{ship_url}{x}')
    df_with_types.VesselType = df_with_types.VesselType.apply(format_vessel_type)
    result = pd.DataFrame(df_with_types).drop_duplicates()
    return pn.pane.DataFrame(result, index=False, render_links=True, na_rep='', sizing_mode='stretch_width')

Then we can build an app with some widgets for controlling the visibility of the background map, the data, and the text labels, plus a table showing information about the selected vessel and a photo if available:

photos        = pn.bind(get_photos, df=highlighter.param.hits, n_records=max_hits)
table         = pn.bind(full_vessel_record, df=highlighter.param.hits, n_records=max_hits)
sopts         = dict(start=0, end=1, sizing_mode='stretch_width')
map_opacity   = pn.widgets.FloatSlider(value=0.7, name="Map opacity",   **sopts)
data_opacity  = pn.widgets.FloatSlider(value=1.0, name="Data opacity",  **sopts)
label_opacity = pn.widgets.FloatSlider(value=0.9, name="Label opacity", **sopts)
overlay       = (tiles.apply.opts(alpha=map_opacity) *
                 pointsp.apply.opts(alpha=data_opacity) *
                 labels.apply.opts(alpha=label_opacity) * highlight * legend)

description = """
## US AIS vessel traffic data, Jan 2020

Zoom or pan to explore the data, then click to select
a particular data point to see more information about
it (after a delay). You may need to zoom in before a
point is selectable.
"""

pn.Column(description,
    pn.Row(map_opacity, data_opacity, label_opacity, max_hits),
    overlay, table, photos, sizing_mode='stretch_width').servable()

Here you should be able to explore this dataset fully, clicking on any interesting datapoints or clusters, including the “Unknown” circles in New York State and Pennsylvania mentioned above (which turn out to be false readings for a US Coast Guard ship based in Guam, clearly not able to be on land in those patterns; perhaps another example of a previously reported GPS disruption). There are no doubt many other interesting patterns to discover here!

The above app should run fine in a Jupyter notebook, but it can also be launched as a separate web server using panel serve --port 5006 ship_traffic.ipynb, allowing you to let other people explore this dataset as well. And you can adapt the code in this notebook to work with just about any other data that you can map onto the x and y axes of a plot, including categorical data if available.