Data transformers¶
Before a Vega-Lite or Vega specification can be passed to a renderer, it typically has to be transformed in a number of ways:
- A pandas DataFrame has to be sanitized and serialized to JSON.
- The rows of a DataFrame might need to be sampled or limited to a maximum number.
- The DataFrame might be written to a .csv or .json file for performance reasons.
These data transformations are managed by the data transformation API of Altair.
Note
The data transformation API of Altair should not be confused with the transform
API of Vega and Vega-Lite.
A data transformer is a Python function that takes a Vega-Lite data dict or a pandas DataFrame and returns a transformed version of either of these types:
from typing import Union

import pandas as pd

Data = Union[dict, pd.DataFrame]

def data_transformer(data: Data) -> Data:
    # Transform and return the data
    return transformed_data
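As a concrete illustration of this signature, here is a minimal, hypothetical transformer (not part of Altair) that truncates DataFrame column names and passes dicts through unchanged:

```python
import pandas as pd

def shorten_column_names(data):
    # Hypothetical example transformer: truncate DataFrame column
    # names to 10 characters; leave dict data untouched.
    if isinstance(data, pd.DataFrame):
        return data.rename(columns=lambda name: name[:10])
    return data

df = pd.DataFrame({"a_very_long_column_name": [1, 2, 3]})
shorten_column_names(df).columns.tolist()  # ['a_very_lon']
```

Any function with this shape can participate in the data transformation pipeline described below.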
Built-in data transformers¶
Altair includes a default set of data transformers with the following signatures.
Raise a MaxRowsError if a DataFrame has more than max_rows rows:

limit_rows(data, max_rows=5000)
Randomly sample a DataFrame (without replacement) before visualizing:
sample(data, n=None, frac=None)
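The behavior of these two transformers can be sketched with plain pandas. This is a simplified reimplementation for illustration, not Altair's actual code:

```python
import pandas as pd

class MaxRowsError(ValueError):
    """Raised when a DataFrame exceeds the allowed number of rows."""

def limit_rows(data, max_rows=5000):
    # Refuse to proceed rather than silently truncating the data
    if isinstance(data, pd.DataFrame) and len(data) > max_rows:
        raise MaxRowsError(f"The DataFrame has more than {max_rows} rows.")
    return data

def sample(data, n=None, frac=None):
    # Randomly sample rows without replacement
    if isinstance(data, pd.DataFrame):
        return data.sample(n=n, frac=frac)
    return data

df = pd.DataFrame({"x": range(10)})
limit_rows(df, max_rows=100)  # returns df unchanged
len(sample(df, n=3))          # 3
```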
Convert a DataFrame to a separate .json file before visualization:

to_json(data, prefix='altair-data')

Convert a DataFrame to a separate .csv file before visualization:

to_csv(data, prefix='altair-data')

Convert a DataFrame to inline JSON values before visualization:

to_values(data)
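For illustration, the essence of to_values can be sketched as a conversion of each DataFrame row into a dict, wrapped in the "values" key that Vega-Lite expects. This is a simplified sketch, not Altair's actual implementation:

```python
import pandas as pd

def to_values(data):
    # Inline the DataFrame as a list of row dicts under the
    # "values" key, the inline-data form Vega-Lite expects
    if isinstance(data, pd.DataFrame):
        return {"values": data.to_dict(orient="records")}
    return data

df = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
to_values(df)
# {'values': [{'x': 1, 'y': 'a'}, {'x': 2, 'y': 'b'}]}
```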
Piping¶
Multiple data transformers can be piped together using pipe:

from altair import pipe, limit_rows, to_values

pipe(data, limit_rows(max_rows=10000), to_values)
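pipe simply threads a value through each function in turn, so pipe(x, f, g) is equivalent to g(f(x)). Its behavior can be sketched in a few lines (a simplified stand-in for the pipe Altair re-exports from toolz):

```python
from functools import reduce

def pipe(data, *funcs):
    # Thread `data` through each function left to right:
    # pipe(x, f, g) == g(f(x))
    return reduce(lambda value, func: func(value), funcs, data)

pipe(3, lambda x: x + 1, lambda x: x * 10)  # 40
```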
Managing data transformers¶
Altair maintains a registry of data transformers, which includes a default data transformer that is automatically applied to all DataFrames before rendering.
To see the registered transformers:
>>> import altair as alt
>>> alt.data_transformers.names()
['default', 'json', 'csv']
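Under the hood, such a registry maps names to transformer functions and keeps track of which one is enabled. A minimal sketch of the pattern (Altair's real registry has more features, such as per-transformer options and context-manager enabling):

```python
class TransformerRegistry:
    """Simplified sketch of a named-transformer registry."""

    def __init__(self):
        self._registry = {}
        self._active = None

    def register(self, name, func):
        # Make a transformer available under a name
        self._registry[name] = func

    def enable(self, name):
        # Select which registered transformer is applied
        self._active = self._registry[name]

    def names(self):
        return sorted(self._registry)

    def __call__(self, data):
        # Apply the currently enabled transformer
        return self._active(data)

registry = TransformerRegistry()
registry.register("identity", lambda data: data)
registry.enable("identity")
registry.names()          # ['identity']
registry({"values": []})  # {'values': []}
```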
The default data transformer is the following:
def default_data_transformer(data):
    return pipe(data, limit_rows, to_values)
The json and csv data transformers will save a DataFrame to a temporary .json or .csv file before rendering. There are a number of performance advantages to these two data transformers:
- The full dataset will not be saved in the notebook document.
- The performance of the Vega-Lite/Vega JavaScript appears to be better for standalone JSON/CSV files than for inline values.
There are disadvantages to the JSON/CSV data transformers:
- The DataFrame will be exported to a temporary .json or .csv file that sits next to the notebook.
- The notebook will not be able to re-render the visualization without that temporary file (or without re-running the cell).
In our experience, the performance improvement is significant enough that we recommend using the json data transformer for any large datasets:
alt.data_transformers.enable('json')
We hope that others will write additional data transformers. Imagine, for example, a transformer that saves the dataset to a JSON file on S3; it could be registered and enabled as:
alt.data_transformers.register('s3', lambda data: pipe(data, to_s3('mybucket')))
alt.data_transformers.enable('s3')