1a-flights-histogram.ipynb

This notebook covers how to get data via eland (https://github.com/elastic/eland) from Elasticsearch and discusses some different approaches how to visualize small multiples of binned histograms with eland and Altair. It covers the basic requirements to successfully create and publish a custom visualization from Jupyter to Kibana.

In [1]:
import pandas as pd
import altair as alt
import eland as ed
import json
import numpy as np
import matplotlib.pyplot as plt
alt.data_transformers.disable_max_rows()
Out[1]:
DataTransformerRegistry.enable('default')

Connect to the 'flights' example dataset, can be installed via Kibana.

In [2]:
df = ed.DataFrame('localhost:9200', 'kibana_sample_data_flights')
df.head()
Out[2]:
AvgTicketPrice Cancelled Carrier Dest DestAirportID DestCityName DestCountry DestLocation DestRegion DestWeather ... FlightTimeMin Origin OriginAirportID OriginCityName OriginCountry OriginLocation OriginRegion OriginWeather dayOfWeek timestamp
-Cl-InUB9GmdWRW_Y91Q 841.265642 False Kibana Airlines Sydney Kingsford Smith International Airport SYD Sydney AU {'lon': '151.177002', 'lat': '-33.94609833'} SE-BD Rain ... 1030.770416 Frankfurt am Main Airport FRA Frankfurt am Main DE {'lon': '8.570556', 'lat': '50.033333'} DE-HE Sunny 0 2020-10-05 00:00:00
-Sl-InUB9GmdWRW_Y91Q 882.982662 False Logstash Airways Venice Marco Polo Airport VE05 Venice IT {'lon': '12.3519', 'lat': '45.505299'} IT-34 Sunny ... 464.389481 Cape Town International Airport CPT Cape Town ZA {'lon': '18.60169983', 'lat': '-33.96480179'} SE-BD Clear 0 2020-10-05 18:27:00
-il-InUB9GmdWRW_Y91Q 190.636904 False Logstash Airways Venice Marco Polo Airport VE05 Venice IT {'lon': '12.3519', 'lat': '45.505299'} IT-34 Cloudy ... 0.000000 Venice Marco Polo Airport VE05 Venice IT {'lon': '12.3519', 'lat': '45.505299'} IT-34 Rain 0 2020-10-05 17:11:14
-yl-InUB9GmdWRW_Y91Q 181.694216 True Kibana Airlines Treviso-Sant'Angelo Airport TV01 Treviso IT {'lon': '12.1944', 'lat': '45.648399'} IT-34 Clear ... 222.749059 Naples International Airport NA01 Naples IT {'lon': '14.2908', 'lat': '40.886002'} IT-72 Thunder & Lightning 0 2020-10-05 10:33:28
_Cl-InUB9GmdWRW_Y91Q 730.041778 False Kibana Airlines Xi'an Xianyang International Airport XIY Xi'an CN {'lon': '108.751999', 'lat': '34.447102'} SE-BD Clear ... 785.779071 Licenciado Benito Juarez International Airport AICM Mexico City MX {'lon': '-99.072098', 'lat': '19.4363'} MX-DIF Damaging Wind 0 2020-10-05 05:13:00

5 rows × 27 columns

Output the data frame's info. Note its low memory footprint because of the data residing in Elasticsearch.

In [3]:
df.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, -Cl-InUB9GmdWRW_Y91Q to -ip-InUB9GmdWRW_cBCr
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   Cancelled           13059 non-null  bool
 2   Carrier             13059 non-null  object
 3   Dest                13059 non-null  object
 4   DestAirportID       13059 non-null  object
 5   DestCityName        13059 non-null  object
 6   DestCountry         13059 non-null  object
 7   DestLocation        13059 non-null  object
 8   DestRegion          13059 non-null  object
 9   DestWeather         13059 non-null  object
 10  DistanceKilometers  13059 non-null  float64
 11  DistanceMiles       13059 non-null  float64
 12  FlightDelay         13059 non-null  bool
 13  FlightDelayMin      13059 non-null  int64
 14  FlightDelayType     13059 non-null  object
 15  FlightNum           13059 non-null  object
 16  FlightTimeHour      13059 non-null  object
 17  FlightTimeMin       13059 non-null  float64
 18  Origin              13059 non-null  object
 19  OriginAirportID     13059 non-null  object
 20  OriginCityName      13059 non-null  object
 21  OriginCountry       13059 non-null  object
 22  OriginLocation      13059 non-null  object
 23  OriginRegion        13059 non-null  object
 24  OriginWeather       13059 non-null  object
 25  dayOfWeek           13059 non-null  int64
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(4), int64(2), object(18)
memory usage: 96.0 bytes

Next we output small multiples of binned histograms via matplotlib.

The following code is great because it's short and concise. However, we cannot deploy these bitmap based charts to Kibana and make them dynamic. And as of Elastic Stack 7.9, these kind of charts (both small multiples and binned histograms) are a bit cumbersome to create with native Kibana tools.

In [4]:
df_number = df.select_dtypes(include=np.number)
df_number.hist(figsize=[6,10])
plt.show()

Let's try to create the same chart type using Altair, here's a most basic example with dummy data:

In [5]:
source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

alt.Chart(source, height=160, width=120).mark_bar().encode(
    x='a',
    y='b'
)
Out[5]:

To move forward, we convert the eland based data frame to a native pandas one. Note how this one requires more memory because it's kept locally now.

In [6]:
df_altair = ed.eland_to_pandas(df_number)
df_altair.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, -Cl-InUB9GmdWRW_Y91Q to -ip-InUB9GmdWRW_cBCr
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   AvgTicketPrice      13059 non-null  float64
 1   DistanceKilometers  13059 non-null  float64
 2   DistanceMiles       13059 non-null  float64
 3   FlightDelayMin      13059 non-null  int64
 4   FlightTimeMin       13059 non-null  float64
 5   dayOfWeek           13059 non-null  int64
dtypes: float64(4), int64(2)
memory usage: 714.2+ KB
In [7]:
df_altair.head()
Out[7]:
AvgTicketPrice DistanceKilometers DistanceMiles FlightDelayMin FlightTimeMin dayOfWeek
-Cl-InUB9GmdWRW_Y91Q 841.265642 16492.326654 10247.856676 0 1030.770416 0
-Sl-InUB9GmdWRW_Y91Q 882.982662 8823.400140 5482.606665 0 464.389481 0
-il-InUB9GmdWRW_Y91Q 190.636904 0.000000 0.000000 0 0.000000 0
-yl-InUB9GmdWRW_Y91Q 181.694216 555.737767 345.319439 180 222.749059 0
_Cl-InUB9GmdWRW_Y91Q 730.041778 13358.244200 8300.428125 0 785.779071 0

The above cell shows the data frame structure with the same schema as it was stored in Elasticsearch. This type of format could be used, but it's more cumbersome to work with in Altair/Vega. For more details, see Altair's docs on "Long-form vs. Wide-form Data" https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data

The next cell converts the data frame to "long form" which makes it more easy to split data based on an entity/attribute.

In [8]:
df_melt = df_altair.melt(var_name='attribute', value_name='value')
df_melt.head()
Out[8]:
attribute value
0 AvgTicketPrice 841.265642
1 AvgTicketPrice 882.982662
2 AvgTicketPrice 190.636904
3 AvgTicketPrice 181.694216
4 AvgTicketPrice 730.041778

Finally we're able to recreate the original chart with Altair/Vega:

In [9]:
data = df_melt

chart = alt.Chart(data).mark_bar().encode(
    alt.X('value:Q', bin=True, title=''),
    alt.Y('count()', title=''),
    tooltip=[
        alt.Tooltip('value:Q', bin=True, title='x'),
        alt.Tooltip('count()', title='y')
    ]
).properties(
    width=130,
    height=130
)

alt.ConcatChart(
    concat=[
      chart.transform_filter(alt.datum.attribute == value).properties(title=value)
      for value in sorted(data.attribute.unique())
    ],
    columns=3
).resolve_axis(
    x='independent',
    y='independent'
).resolve_scale(
    x='independent',
    y='independent'
)
Out[9]: