.. _data-transformers:

Data transformers
=================

Before a Vega-Lite or Vega specification can be passed to a renderer, it typically
has to be transformed in a number of ways:

* Pandas Dataframe has to be sanitized and serialized to JSON.
* The rows of a Dataframe might need to be sampled or limited to a maximum number.
* The Dataframe might be written to a ``.csv`` of ``.json`` file for performance
  reasons.

These data transformations are managed by the data transformation API of Altair.

.. note::

    The data transformation API of Altair should not be confused with the ``transform``
    API of Vega and Vega-Lite.

A data transformer is a Python function that takes a Vega-Lite data ``dict`` or
Pandas ``DataFrame`` and returns a transformed version of either of these types::

    from typing import Union
    Data = Union[dict, pd.DataFrame]

    def data_transformer(data: Data) -> Data:
        # Transform and return the data
        return transformed_data

Built-in data transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~

Altair includes a default set of data transformers with the following signatures.

Raise a ``MaxRowsError`` if a Dataframe has more than ``max_rows`` rows::

    limit_rows(data, max_rows=5000)

Randomly sample a DataFrame (without replacement) before visualizing::

    sample(data, n=None, frac=None)

Convert a Dataframe to a separate ``.json`` file before visualization::

    to_json(data, prefix='altair-data'):

Convert a Dataframe to a separate ``.csv`` file before visualization::

    to_csv(data, prefix='altair-data'):

Convert a Dataframe to inline JSON values before visualization::

    to_values(data):

Piping
~~~~~~

Multiple data transformers can be piped together using ``pipe``::

    from altair import pipe, limit_rows, to_values
    pipe(data, limit_rows(10000), to_values)

Managing data transformers
~~~~~~~~~~~~~~~~~~~~~~~~~~

Altair maintains a registry of data transformers, which includes a default
data transformer that is automatically applied to all Dataframes before rendering.

To see the registered transformers::

    >>> import altair as alt
    >>> alt.data_transformers.names()
    ['default', 'json', 'csv']

The default data transformer is the following::

    def default_data_transformer(data):
        return pipe(data, limit_rows, to_values)

The ``json`` and ``csv`` data transformers will save a Dataframe to a temporary
``.json`` or ``.csv`` file before rendering. There are a number of performance
advantages to these two data transformers:

* The full dataset will not be saved in the notebook document.
* The performance of the Vega-Lite/Vega JavaScript appears to be better
  for standalone JSON/CSV files than for inline values.

There are disadvantages of the JSON/CSV data transformers:

* The Dataframe will be exported to a temporary ``.json`` or ``.csv``
  file that sits next to the notebook.
* That notebook will not be able to re-render the visualization without
  that temporary file (or re-running the cell).

In our experience, the performance improvement is significant enough that
we recommend using the ``json`` data transformer for any large datasets::

    alt.data_transformers.enable('json')

We hope that others will write additional data transformers - imagine a
transformer which saves the dataset to a JSON file on S3, which could
be registered and enabled as::

    alt.data_transformers.register('s3', lambda data: pipe(data, to_s3('mybucket')))
    alt.data_transformers.enable('s3')