Written by Jean-Luc Stevens
Created: November 20, 2019
Last updated: August 2, 2021

IEX, the Investors Exchange, is a transparent stock exchange that discourages high-frequency trading and makes historical trading data publicly available. The data is offered in the form of daily pcap files where each packet corresponds to a stock trade.

Even with this specialized pcap file format, these daily records can exceed a gigabyte in size. In this notebook, we will develop a dashboard that allows us to explore every single trade that happened in a day, including the associated metadata. To visualize all this data at once, both rapidly and interactively, we will use datashader via the HoloViews API.

The IEX stock data is saved in two pcap-based formats called TOPS and DEEP. These formats are complex enough that parsing the trades with standard packet-loading tools is non-trivial. For this reason, the trades for Monday 21st of October 2019 are supplied as a CSV file that has been generated from the original pcap file using the IEXTools library.
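Before loading the real file, it can help to see the shape of the data we expect. The following is a minimal sketch, assuming the CSV has the four columns described below (timestamp in integer nanoseconds, symbol, size and price); the symbol and values are hypothetical:

```python
import io
import pandas as pd

# Tiny synthetic sample mimicking the assumed structure of the IEX CSV:
# integer nanosecond timestamps plus symbol, size and price columns.
sample = io.StringIO(
    "timestamp,symbol,size,price\n"
    "1571659200000000000,ZIEXT,100,25.00\n"
    "1571659200000000000,ZIEXT,50,25.01\n"
)
df_sample = pd.read_csv(sample)
print(df_sample.dtypes)  # timestamp parses as int64, price as float64
```

Parsing the timestamps as plain integers (rather than dates) matches how the real file is loaded below; the conversion to datetimes is done explicitly later on.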

In [1]:
import datetime
import pandas as pd

df = pd.read_csv('./data/IEX_2019-10-21.csv')
print('Dataframe loaded containing %d events' % len(df))

We can now look at the head of this DataFrame to see its structure:

In [ ]:
df.head()


Each row above corresponds to a stock trade, where 'price' indicates the stock price, 'size' indicates the number of shares traded, and 'symbol' specifies which stock was traded. Every trade also has a timestamp specified in nanoseconds.

Note that multiple trades can occur on the same timestamp.
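To illustrate this point with a tiny synthetic DataFrame (the symbols and prices here are hypothetical, not taken from the real dataset), pandas can count repeated timestamps directly:

```python
import pandas as pd

# Three synthetic trades, the first two sharing one nanosecond timestamp
df_demo = pd.DataFrame({
    'timestamp': [1571659200000000000, 1571659200000000000, 1571659200000000005],
    'symbol':    ['AAPL', 'MSFT', 'AAPL'],
    'size':      [100, 50, 25],
    'price':     [236.4, 137.1, 236.5],
})
# duplicated() flags every repeat of an already-seen timestamp
dupes = df_demo.timestamp.duplicated().sum()
print(dupes)  # 1
```

This matters later when hovering: several trades can sit at exactly the same x-position.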

## Visualizing trades with Spikes

We can now load HoloViews with the Bokeh plotting extension to start visualizing some of this data:

In [ ]:
import holoviews as hv
from bokeh.models import HoverTool
hv.config.image_rtol = 10e-3 # Fixes datetime issue at high zoom level
hv.extension('bokeh')


One way to visualize events that occur over time is to use the Spikes element. Here we look at the first hundred spikes in this dataframe:

In [ ]:
hv.Spikes(df.head(100), ['timestamp'], ['symbol', 'size', 'price']).opts(
    xrotation=90, tools=['hover'], spike_length=1, position=0)


As in the dataframe tables shown above, the timestamps are expressed as integers counting the nanoseconds since the Unix epoch (UTC). While many domains may use integers as their time axis (e.g., CPU cycles for processor events), in this case we would like to recover the timestamp as a date.

We will do this in two steps: (1) map the integers to datetime64[ns] to obtain datetime objects, and (2) subtract four hours to convert from UTC to the local time at the exchange (located in New Jersey):

In [ ]:
df.timestamp = df.timestamp.astype('datetime64[ns]')
df.timestamp -= datetime.timedelta(hours=4)
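As a quick sanity check of these two steps, here is the same conversion applied to a single hypothetical nanosecond timestamp chosen to fall at 13:30:00 UTC on this trading day, i.e. the 9:30am market open in New Jersey:

```python
import datetime
import pandas as pd

# Hypothetical trade at 2019-10-21 13:30:00 UTC, as nanoseconds since the epoch
ts = pd.Series([1571664600000000000]).astype('datetime64[ns]')
ts -= datetime.timedelta(hours=4)  # UTC -> local exchange time (EDT)
print(ts[0])  # 2019-10-21 09:30:00
```

The same two operations are applied in-place to the full 'timestamp' column above.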


In the spikes plot above, every line corresponds to a trade, where the position along the x-axis indicates the time at which that trade occurred. If you hover over the spikes, you can view the timestamp values for the trades underneath the cursor as well as their corresponding stock symbols.

Using Bokeh we can only visualize a small number of trades effectively, but using datashader we can visualize all 1.2 million trades available:

In [ ]:
from holoviews.operation.datashader import spikes_aggregate

spikes = hv.Spikes(df, ['timestamp'], ['symbol', 'size', 'price'])
rasterized = spikes_aggregate(spikes, aggregator='count', spike_length=1).opts(
    width=600, colorbar=True, cmap='blues', yaxis=None, xrotation=90,
    default_tools=['xwheel_zoom', 'xpan', 'xbox_zoom'])
rasterized


Using the count aggregator, we can see the density of trades over time, colormapped such that the darkest blue indicates the highest trade density and white indicates a lack of trades at that time. In the next notebook we will aggregate over the 'size' column to visualize a more useful metric, namely the trade volume.

We use the spike_length plot option to give all the spikes a fixed height regardless of any value dimensions specified. The Spikes element also supports variable heights according to a value dimension, as seen on the reference page.

Note that the above plot is interactive: when you zoom in, datashader will recompute and update the visualization accordingly. When zoomed out, you will notice that trade volume goes up at the end of the day - these are all the trades being made at the last minute before the exchange closes for that day!

Using Bokeh, we can view individual trade metadata with the hover tool for a small number of trades; using datashader, we can visualize a very large number of trades at once. Now we can investigate whether it is possible to have the best of both worlds.

When zoomed out, viewing all the trades for an entire day, hover information wouldn't be particularly useful: a single pixel would represent the accumulation of many trades across many neighboring timestamps, which means hovering could not target any specific trade effectively.

Instead, what we can do is use datashader to visualize trades at all zoom levels and use Bokeh to make hover information visible once sufficiently zoomed in to point to individual trades. This avoids the problem of pushing the entire dataset to your web browser and makes Bokeh's hover feature available when it is most useful.

### Using HoloViews to build custom interactivity

Enabling Bokeh hover information at a fixed zoom level is not the best approach, as different days will have different trade volumes and no fixed window will be appropriate for all datasets.

Instead, what we want to do is to always show hover information for as many trades as Bokeh can effectively handle. In other words, we want to count how many trades can be handled by the browser and use that number to decide whether or not hover information should be made available.

Counting trades is as simple as counting the number of rows in the pandas DataFrame that datashader is rendering for a given zoom level. We don't need to hook into the HoloViews datashader operation to achieve this: instead, we can disable the normal interactive behavior and rebuild it using HoloViews streams and the apply method.

Before demonstrating this approach, we will define a custom Bokeh hover tool that formats our datetime timestamps nicely:

In [ ]:
hover = HoverTool(
    tooltips=[
        ('Symbol', '@symbol'),
        ('Size', '@size'),
        ('Price', '@price'),
        ('Timestamp', '@timestamp{%F %H:%M %Ss %3Nms}')],
    formatters={'@timestamp': 'datetime'})


Next we declare a RangeX stream to get the visible x-range of the plot from Bokeh, using our spikes object as the source:

In [ ]:
range_stream = hv.streams.RangeX(source=spikes)


Using this stream, we can write a callback that uses the supplied x-ranges to do the following:

1. First it slices all the spikes in the dataset down to those visible in the current viewport (spikes[pd.to_datetime(low):pd.to_datetime(high)])
2. Next it checks whether there are fewer than 600 visible spikes. If so, it returns this sliced set of spikes; otherwise it returns ranged.iloc[:0], a Spikes object containing zero spikes.
3. Finally it plots these spikes with a length of one and makes them invisible (we only want the associated hover information).
In [ ]:
def xrange_filter(spikes, x_range):
    low, high = (None, None) if x_range is None else x_range
    ranged = spikes[pd.to_datetime(low):pd.to_datetime(high)]
    return (ranged if len(ranged) < 600 else ranged.iloc[:0]).opts(spike_length=1, alpha=0)


Finally, we can combine our range_stream with this callback using the apply method on spikes. This creates a DynamicMap that will offer the hover information for 600 or fewer spikes once sufficiently zoomed in. The only thing left to do is to overlay this on top of the interactive, zoomable rasterization generated by the datashader operation:

In [ ]:
filtered = spikes.apply(xrange_filter, streams=[range_stream])
hover_filtered = filtered.opts(tools=[hover])
rasterized * hover_filtered


Try zooming into the last 500 millisecond region before the exchange closes to see the last few stock symbols that were traded on that day.

## Next steps

This notebook illustrates how a large number of events (1.2 million) can be visualized interactively with datashader and how we can inspect the data for individual events by zooming in and using the hover tool.

Visualizing all the data at once in this way allows you to see the overall structure of the data and identify any particularities in it. For instance, the increase in trading volume at the end of the day is immediately obvious, and by zooming in it is possible to identify a handful of trades that occur after 4pm, after the bulk of trading has ceased.

What this visualization fails to offer is any way to identify the trading patterns of individual stocks out of the entire volume of trades. The hover tool only activates when zoomed in and there is no suitable method for partitioning out the trading times by stock. The next notebook will extend the approach developed here to analyze the most traded stocks on this day.

This web page was generated from a Jupyter notebook and not all interactivity will work on this website. Right click to download and run locally for full Python-backed interactivity.