Home

To make sure the Vega charts render correctly, view the notebook not from the Github repo but the published website here: https://walterra.github.io/jupyter2kibana/viz-1a-flights-histogram.html

1a-flights-histogram.ipynb

This notebook covers how to get data via eland (https://github.com/elastic/eland) from Elasticsearch and discusses some different approaches how to visualize small multiples of binned histograms with eland and Altair. It covers the basic requirements to successfully create and publish a custom visualization from Jupyter to Kibana.

In [1]:
import datetime
import pandas as pd
import altair as alt
import eland as ed
from elasticsearch import Elasticsearch
import elastic_transport
import json
import numpy as np
import matplotlib.pyplot as plt
import urllib3
import logging
import requests
import warnings

alt.data_transformers.disable_max_rows()
logging.getLogger("elastic_transport").setLevel(logging.ERROR)

# Suppress insecure SSL connection warnings
# In dev environments with the `verify_certs=False`
# you might want to reduce those warnings.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
urllib3.disable_warnings(elastic_transport.SecurityWarning)

# For rendering the notebook to HTML hide all warnings
warnings.filterwarnings('ignore')

Connect to the 'flights' example dataset, can be installed via Kibana.

In [2]:
with open('config.json') as config_file:
  es_config = json.load(config_file)

# First instantiate an 'Elasticsearch' instance with the supplied config
es = Elasticsearch(
    hosts=[es_config['es_client']],
    basic_auth=[
        es_config['user'],
        es_config['password']
    ],
    # Only in development environments with self signed certificates fall back to use `verify_certs=False`
    verify_certs=False
)

df = ed.DataFrame(es, 'kibana_sample_data_flights')
df.head()
Out[2]:
AvgTicketPrice Cancelled Carrier Dest DestAirportID DestCityName DestCountry DestLocation DestRegion DestWeather ... FlightTimeMin Origin OriginAirportID OriginCityName OriginCountry OriginLocation OriginRegion OriginWeather dayOfWeek timestamp
FK7wlYUB84Z0EQSGKFH6 841.265642 False Kibana Airlines Sydney Kingsford Smith International Airport SYD Sydney AU {'lat': '-33.94609833', 'lon': '151.177002'} SE-BD Rain ... 1030.770416 Frankfurt am Main Airport FRA Frankfurt am Main DE {'lat': '50.033333', 'lon': '8.570556'} DE-HE Sunny 0 2023-01-02 00:00:00
Fa7wlYUB84Z0EQSGKFH6 882.982662 False Logstash Airways Venice Marco Polo Airport VE05 Venice IT {'lat': '45.505299', 'lon': '12.3519'} IT-34 Sunny ... 464.389481 Cape Town International Airport CPT Cape Town ZA {'lat': '-33.96480179', 'lon': '18.60169983'} SE-BD Clear 0 2023-01-02 18:27:00
Fq7wlYUB84Z0EQSGKFH6 190.636904 False Logstash Airways Venice Marco Polo Airport VE05 Venice IT {'lat': '45.505299', 'lon': '12.3519'} IT-34 Cloudy ... 0.000000 Venice Marco Polo Airport VE05 Venice IT {'lat': '45.505299', 'lon': '12.3519'} IT-34 Rain 0 2023-01-02 17:11:14
F67wlYUB84Z0EQSGKFH6 181.694216 True Kibana Airlines Treviso-Sant'Angelo Airport TV01 Treviso IT {'lat': '45.648399', 'lon': '12.1944'} IT-34 Clear ... 222.749059 Naples International Airport NA01 Naples IT {'lat': '40.886002', 'lon': '14.2908'} IT-72 Thunder & Lightning 0 2023-01-02 10:33:28
GK7wlYUB84Z0EQSGKFH6 730.041778 False Kibana Airlines Xi'an Xianyang International Airport XIY Xi'an CN {'lat': '34.447102', 'lon': '108.751999'} SE-BD Clear ... 785.779071 Licenciado Benito Juarez International Airport AICM Mexico City MX {'lat': '19.4363', 'lon': '-99.072098'} MX-DIF Damaging Wind 0 2023-01-02 05:13:00

5 rows × 27 columns

Output the data frame's info. Note its low memory footprint because of the data residing in Elasticsearch.

In [3]:
df.info()
<class 'eland.dataframe.DataFrame'>
Index: 13059 entries, FK7wlYUB84Z0EQSGKFH6 to Fq7wlYUB84Z0EQSGMYQh
Data columns (total 27 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   AvgTicketPrice      13059 non-null  float64       
 1   Cancelled           13059 non-null  bool          
 2   Carrier             13059 non-null  object        
 3   Dest                13059 non-null  object        
 4   DestAirportID       13059 non-null  object        
 5   DestCityName        13059 non-null  object        
 6   DestCountry         13059 non-null  object        
 7   DestLocation        13059 non-null  object        
 8   DestRegion          13059 non-null  object        
 9   DestWeather         13059 non-null  object        
 10  DistanceKilometers  13059 non-null  float64       
 11  DistanceMiles       13059 non-null  float64       
 12  FlightDelay         13059 non-null  bool          
 13  FlightDelayMin      13059 non-null  int64         
 14  FlightDelayType     13059 non-null  object        
 15  FlightNum           13059 non-null  object        
 16  FlightTimeHour      13059 non-null  object        
 17  FlightTimeMin       13059 non-null  float64       
 18  Origin              13059 non-null  object        
 19  OriginAirportID     13059 non-null  object        
 20  OriginCityName      13059 non-null  object        
 21  OriginCountry       13059 non-null  object        
 22  OriginLocation      13059 non-null  object        
 23  OriginRegion        13059 non-null  object        
 24  OriginWeather       13059 non-null  object        
 25  dayOfWeek           13059 non-null  int64         
 26  timestamp           13059 non-null  datetime64[ns]
dtypes: bool(2), datetime64[ns](1), float64(4), int64(2), object(18)
memory usage: 64.000 bytes
Elasticsearch storage usage: 5.777 MB

Next we output small multiples of binned histograms via matplotlib.

The following code is great because it's short and concise. However, we cannot deploy these bitmap based charts to Kibana and make them dynamic. And as of Elastic Stack 7.9, these kind of charts (both small multiples and binned histograms) are a bit cumbersome to create with native Kibana tools.

In [4]:
df_number = df.select_dtypes(include=np.number)
df_number.hist(figsize=[6,10])
plt.show()

Let's try to create the same chart type using Altair, here's a most basic example with dummy data:

In [5]:
source = pd.DataFrame({
    'a': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'b': [28, 55, 43, 91, 81, 53, 19, 87, 52]
})

alt.Chart(source, height=160, width=120).mark_bar().encode(
    x='a',
    y='b'
)
Out[5]:

To move forward, we convert the eland based data frame to a native pandas one. Note how this one requires more memory because it's kept locally now.

In [6]:
df_altair = ed.eland_to_pandas(df_number)
df_altair.info()
<class 'pandas.core.frame.DataFrame'>
Index: 13059 entries, FK7wlYUB84Z0EQSGKFH6 to Fq7wlYUB84Z0EQSGMYQh
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   AvgTicketPrice      13059 non-null  float64
 1   DistanceKilometers  13059 non-null  float64
 2   DistanceMiles       13059 non-null  float64
 3   FlightDelayMin      13059 non-null  int64  
 4   FlightTimeMin       13059 non-null  float64
 5   dayOfWeek           13059 non-null  int64  
dtypes: float64(4), int64(2)
memory usage: 714.2+ KB
In [7]:
df_altair.head()
Out[7]:
AvgTicketPrice DistanceKilometers DistanceMiles FlightDelayMin FlightTimeMin dayOfWeek
FK7wlYUB84Z0EQSGKFH6 841.265642 16492.326654 10247.856676 0 1030.770416 0
Fa7wlYUB84Z0EQSGKFH6 882.982662 8823.400140 5482.606665 0 464.389481 0
Fq7wlYUB84Z0EQSGKFH6 190.636904 0.000000 0.000000 0 0.000000 0
F67wlYUB84Z0EQSGKFH6 181.694216 555.737767 345.319439 180 222.749059 0
GK7wlYUB84Z0EQSGKFH6 730.041778 13358.244200 8300.428125 0 785.779071 0

The above cell shows the data frame structure with the same schema as it was stored in Elasticsearch. This type of format could be used, but it's more cumbersome to work with in Altair/Vega. For more details, see Altair's docs on "Long-form vs. Wide-form Data" https://altair-viz.github.io/user_guide/data.html#long-form-vs-wide-form-data

The next cell converts the data frame to "long form" which makes it more easy to split data based on an entity/attribute.

In [8]:
df_melt = df_altair.melt(var_name='attribute', value_name='value')
df_melt.head()
Out[8]:
attribute value
0 AvgTicketPrice 841.265642
1 AvgTicketPrice 882.982662
2 AvgTicketPrice 190.636904
3 AvgTicketPrice 181.694216
4 AvgTicketPrice 730.041778

Finally we're able to recreate the original chart with Altair/Vega:

In [9]:
data = df_melt

chart = alt.Chart(data).mark_bar().encode(
    alt.X('value:Q', bin=True, title=''),
    alt.Y('count()', title=''),
    tooltip=[
        alt.Tooltip('value:Q', bin=True, title='x'),
        alt.Tooltip('count()', title='y')
    ]
).properties(
    width=130,
    height=130
)

alt.ConcatChart(
    concat=[
      chart.transform_filter(alt.datum.attribute == value).properties(title=value)
      for value in sorted(data.attribute.unique())
    ],
    columns=3
).resolve_axis(
    x='independent',
    y='independent'
).resolve_scale(
    x='independent', 
    y='independent'
)
Out[9]:

To achieve the same with the data still in Elasticsearch, we need to change the example using a remote URL.

Note this is meant as a middle step towards our final chart specification for demonstration purposes. This one doesn't work with security enabled for Elasticsearch, so take care. You'll need to add the following settings to your elasticsearch.yml, again take care, this isn't recommended for production configs at all:

xpack.security.enabled: false
http.cors.allow-origin: "/.*/"
http.cors.enabled: true

The other difference to the previous code is that instead of using panda's .melt() we're using Vega's .transform_fold() to transpose the data from wide to long form.

The important bit is that we're using Altair only to do data transformations and not raw Python or pandas code, otherwise we're not able to publish the chart specification to Kibana later on.

In [10]:
# Note: To create the Vega spec using Altair we reference ES via URL first. This will only work
# for non-secured ES instances. If your ES instance runs using SSL and/or authentication the chart
# in this cell will render empty. You can still save the visualization in Kibana correctly in the
# next cell because there the URL gets replaced with an Elasticsearch query
# to be used via the Kibana Vega plugin.

# WARNING:
# Do the following approach using a proxy only for demo purposes in a development environment.
# It will expose a secured ES instance unsecured!
# To make this work for demo purposes run the nodejs based proxy in a separate terminal like this:
# NODE_TLS_REJECT_UNAUTHORIZED='0' node proxy

# URL as ES endpoint
# url = 'http://localhost:9220/kibana_sample_data_flights/_search?size=10000'

# URL static fallback
url = 'https://walterra.github.io/jupyter2kibana/data/kibana_sample_data_flights.json'

url_data = alt.Data(url=url, format=alt.DataFormat(property='hits.hits',type='json'))

fields = [
    'AvgTicketPrice',
    'DistanceKilometers',
    'DistanceMiles',
    'FlightDelayMin',
    'FlightTimeMin',
    'dayOfWeek'
]

rename_dict = dict((a, 'datum._source.'+a) for a in fields)

url_chart = alt.Chart(url_data).transform_calculate(**rename_dict).transform_fold(
    fields,
    as_=['attribute', 'value']
).mark_bar().encode(
    alt.X('value:Q', bin=True, title=''),
    alt.Y('count()', title=''),
    tooltip=[
        alt.Tooltip('value:Q', bin=True, title='x'),
        alt.Tooltip('count()', title='y')
    ]
).properties(
    width=150,
    height=150
)

url_charts = alt.ConcatChart(
    concat=[
      url_chart.transform_filter(alt.datum.attribute == attribute).properties(title=attribute)
      for attribute in sorted(fields)
    ],
    columns=2
).resolve_axis(
    x='independent',
    y='independent'
).resolve_scale(
    x='independent', 
    y='independent'
)

url_charts
Out[10]:

Next we're picking up the Vega spec from the chart above, apply some options and save it as a Saved Object in Kibana.

In [11]:
def saveVegaVis(index, visName, altairChart):
    chart_json = json.loads(altairChart.to_json())
    chart_json['data']['url'] = {
        "%context%": True,
        "index": index,
        "body": { 
            "size": 10000
        }
    }

    visState = {
      "type": "vega",
      "aggs": [],
      "params": {
        "spec": json.dumps(chart_json, sort_keys=True, indent=4, separators=(',', ': ')),
      },
      "title": visName
    }

    visSavedObject={
        "attributes" : {
          "title" : visName,
          "visState" : json.dumps(visState, sort_keys=True, indent=4, separators=(',', ': ')),
          "uiStateJSON" : "{}",
          "description" : "",
          "version" : 1,
          "kibanaSavedObjectMeta" : {
            "searchSourceJSON" : json.dumps({
              "query": {
                "language": "kuery",
                "query": ""
              },
              "filter": []
            }),
          }
        },
    }
    
    return requests.post(
        es_config['kibana_client'] + '/api/saved_objects/visualization/' + visName,
        json=visSavedObject,
        auth=(es_config['user'], es_config['password']),
        headers={"kbn-xsrf":"jupyter2kibana"},
        # Only in development environments with self signed certificates fall back to use `verify=False`
        verify=False
    )
In [12]:
r = saveVegaVis('kibana_sample_data_flights', 'def-vega-1', url_charts)
print(r.status_code)
409