Data Transformations¶
It is often necessary to transform or filter data in the process of visualizing it. In Altair you can do this one of two ways:
- Before the chart definition, using standard Pandas data transformations.
- Within the chart definition, using Vega-Lite’s data transformation tools.
In most cases, we suggest that you use the first approach, because it is more straightforward for those who are familiar with data manipulation in Python, and because the Pandas package offers much more flexibility than Vega-Lite in the available data manipulations.
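For example, the first approach might look something like the following minimal sketch, where the DataFrame df and its columns are purely hypothetical:
import altair as alt
import pandas as pd

# Hypothetical data: any DataFrame with 'category' and 'value' columns would do.
df = pd.DataFrame({'category': ['a', 'a', 'b', 'b', 'c', 'c'],
                   'value': [1, 4, 2, 8, 3, 6]})

# Filter with standard Pandas operations before building the chart.
filtered = df[df['value'] > 1]

alt.Chart(filtered).mark_bar().encode(
    x='category:N',
    y='mean(value):Q',
)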
The second approach becomes useful when the data source is not a dataframe, but, for example, a URL pointer to a JSON or CSV file. It can also be useful in a compound chart where different views of the dataset require different transformations.
This second approach – specifying data transformations within the chart
specification itself – can be accomplished using the transform_*
methods of top-level objects:
Method | Description
---|---
transform() | Generic transform; passes keywords to any of the following methods.
transform_aggregate() | Create a new data column by aggregating an existing column.
transform_bin() | Create a new data column by binning an existing column.
transform_calculate() | Create a new data column using an arithmetic calculation on an existing column.
transform_filter() | Select a subset of data based on a condition.
transform_lookup() | One-sided join of two datasets based on a lookup key.
transform_timeunit() | Discretize/group a date by a time unit (day, month, year, etc.).
transform_window() | Compute a windowed aggregation.
We will see some examples of these transforms in the following sections.
Aggregate Transforms¶
There are two ways to aggregate data within Altair: within the encoding itself, or using a top-level aggregate transform.
The aggregate property of a field definition can be used to compute aggregate summary statistics (e.g., median, min, max) over groups of data.
If at least one of the fields in the specified encoding channels contains an aggregate, the resulting visualization will show aggregated data. In this case, all fields without a specified aggregation function are treated as group-by fields in the aggregation process.
For example, the following bar chart aggregates the mean of Acceleration, grouped by the number of Cylinders:
import altair as alt
from vega_datasets import data
cars = data.cars.url
alt.Chart(cars).mark_bar().encode(
y='Cylinders:O',
x='mean(Acceleration):Q',
)
The Altair shorthand string:
# ...
x='mean(Acceleration):Q',
# ...
is made available for convenience, and is equivalent to the longer form:
# ...
x=alt.X(field='Acceleration', aggregate='mean', type='quantitative'),
# ...
For more information on shorthand encoding specifications, see Binning and Aggregation.
The same plot can be shown using an explicitly computed aggregation, using the
transform_aggregate()
method:
alt.Chart(cars).mark_bar().encode(
y='Cylinders:O',
x='mean_acc:Q'
).transform_aggregate(
mean_acc='mean(Acceleration)',
groupby=["Cylinders"]
)
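Note that transform_aggregate() can compute several named aggregates in a single call, each exposed as a new column for the encodings; here is a brief sketch (the column names mean_acc and max_acc are arbitrary):
alt.Chart(cars).mark_point().encode(
    x='mean_acc:Q',
    y='max_acc:Q',
    color='Cylinders:O'
).transform_aggregate(
    mean_acc='mean(Acceleration)',
    max_acc='max(Acceleration)',
    groupby=['Cylinders']
)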
For a list of available aggregates, see Binning and Aggregation.
Bin Transforms¶
As with Aggregate Transforms, there are two ways to apply a bin transform in Altair: within the encoding itself, or using a top-level bin transform.
A common application of a bin transform is creating a histogram:
import altair as alt
from vega_datasets import data
movies = data.movies.url
alt.Chart(movies).mark_bar().encode(
alt.X("IMDB_Rating:Q", bin=True),
y='count()',
)
But a bin transform can be useful in other applications; for example, here we bin a continuous field to create a discrete color map:
import altair as alt
from vega_datasets import data
cars = data.cars.url
alt.Chart(cars).mark_point().encode(
x='Horsepower:Q',
y='Miles_per_Gallon:Q',
color=alt.Color('Acceleration:Q', bin=alt.Bin(maxbins=5))
)
In the first case we set bin=True, which uses the default bin settings. In the second case, we exercise finer control over the bin parameters by passing a Bin object.
If you are using the same binnings in multiple chart components, it can be useful
to instead define the binning at the top level, using the transform_bin() method.
Here is the above histogram created using a top-level bin transform:
import altair as alt
from vega_datasets import data
movies = data.movies.url
alt.Chart(movies).mark_bar().encode(
x='binned_rating:O',
y='count()',
).transform_bin(
'binned_rating', field='IMDB_Rating'
)
And here is the transformed color scale using a top-level bin transform:
import altair as alt
from vega_datasets import data
cars = data.cars.url
alt.Chart(cars).mark_point().encode(
x='Horsepower:Q',
y='Miles_per_Gallon:Q',
color='binned_acc:O'
).transform_bin(
'binned_acc', 'Acceleration', bin=alt.Bin(maxbins=5)
)
The advantage of the top-level transform is that the same named field can be used in multiple places in the chart if desired. Note the slight difference in binning behavior between the encoding-based binnings (which preserve the range of the bins) and the transform-based binnings (which collapse each bin to a single representative value).
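For example, because the binned field is named, it can be reused in more than one encoding channel; here is a brief sketch that reuses binned_rating for both the x-axis and the color of the histogram above:
alt.Chart(movies).mark_bar().encode(
    x='binned_rating:O',
    y='count()',
    color='binned_rating:O'
).transform_bin(
    'binned_rating', field='IMDB_Rating'
)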
Calculate Transform¶
The calculate transform allows the user to define new fields in the dataset which are calculated from other fields using an expression syntax.
As a simple example, here we take a simple sequence of integers and compute some trigonometric quantities from it:
import altair as alt
import pandas as pd
data = pd.DataFrame({'t': range(101)})
alt.Chart(data).mark_line().encode(
x='x:Q',
y='y:Q',
order='t:Q'
).transform_calculate(
x='cos(datum.t * PI / 50)',
y='sin(datum.t * PI / 25)'
)
Each argument within transform_calculate is a Vega expression string, which uses a well-defined set of JavaScript-style operations to calculate a new field from existing ones.
To streamline building these Vega expressions in Python, Altair provides the altair.expr module, which offers constants and functions that allow such expressions to be constructed with Python syntax; for example:
from altair import expr, datum
alt.Chart(data).mark_line().encode(
x='x:Q',
y='y:Q',
order='t:Q'
).transform_calculate(
x=expr.cos(datum.t * expr.PI / 50),
y=expr.sin(datum.t * expr.PI / 25)
)
Altair expressions are designed to output valid Vega expressions. The benefit of
using them is that proper syntax is ensured by the Python interpreter, and tab
completion of the expr
submodule can be used to explore the
available functions and constants.
These expressions can also be used when constructing a Filter Transform, as we shall see next.
Filter Transform¶
The filter transform removes objects from a data stream based on a provided
filter expression, selection, or other filter predicate. A filter can be
added at the top level of a chart using the Chart.transform_filter()
method. The argument to transform_filter
can be one of a number of
expressions and objects:
- A Vega expression, expressed as a string or built using the expr module
- A Field predicate, such as FieldOneOfPredicate, FieldRangePredicate, FieldEqualPredicate, FieldLTPredicate, FieldGTPredicate, FieldLTEPredicate, or FieldGTEPredicate
- A Selection predicate or object created by selection()
- A Logical operand that combines any of the above
We'll show a brief example of each of these in the following sections.
Filter Expression¶
A filter expression uses the Vega expression language, either specified
directly as a string, or built using the expr
module.
This can be useful, for example, for selecting only a subset of the data:
import altair as alt
from altair import datum
from vega_datasets import data
pop = data.population.url
alt.Chart(pop).mark_area().encode(
x='age:O',
y='people:Q',
).transform_filter(
(datum.year == 2000) & (datum.sex == 1)
)
Notice that, as in the Calculate Transform above, data values are referenced via the name datum.
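The same filter can also be written directly as a Vega expression string; the following sketch should produce the same chart as above:
alt.Chart(pop).mark_area().encode(
    x='age:O',
    y='people:Q',
).transform_filter(
    'datum.year == 2000 && datum.sex == 1'
)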
Field Predicates¶
Field predicates overlap somewhat in function with expression predicates, but have the advantage that their contents are validated by the schema. Examples are:
- FieldEqualPredicate evaluates whether a field is equal to a particular value
- FieldOneOfPredicate evaluates whether a field is among a list of specified values
- FieldRangePredicate evaluates whether a continuous field is within a range of values
- FieldLTPredicate evaluates whether a continuous field is less than a given value
- FieldGTPredicate evaluates whether a continuous field is greater than a given value
- FieldLTEPredicate evaluates whether a continuous field is less than or equal to a given value
- FieldGTEPredicate evaluates whether a continuous field is greater than or equal to a given value
Here is an example of a FieldEqualPredicate
used to select just the
values from year 2000 as in the above chart:
import altair as alt
from vega_datasets import data
pop = data.population.url
alt.Chart(pop).mark_line().encode(
x='age:O',
y='sum(people):Q',
color='year:O'
).transform_filter(
alt.FieldEqualPredicate(field='year', equal=2000)
)
A FieldOneOfPredicate
is similar, but allows selection of any number
of specific values:
import altair as alt
from vega_datasets import data
pop = data.population.url
alt.Chart(pop).mark_line().encode(
x='age:O',
y='sum(people):Q',
color='year:O'
).transform_filter(
alt.FieldOneOfPredicate(field='year', oneOf=[1900, 1950, 2000])
)
A FieldRangePredicate() allows selecting values within a particular continuous range:
import altair as alt
from vega_datasets import data
pop = data.population.url
alt.Chart(pop).mark_line().encode(
x='age:O',
y='sum(people):Q',
color='year:O'
).transform_filter(
alt.FieldRangePredicate(field='year', range=[1960, 2000])
)
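The remaining comparison predicates follow the same pattern; for instance, here is a brief sketch using FieldGTEPredicate to keep only the age groups of 50 and above:
alt.Chart(pop).mark_line().encode(
    x='age:O',
    y='sum(people):Q',
    color='year:O'
).transform_filter(
    alt.FieldGTEPredicate(field='age', gte=50)
)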
Selection Predicates¶
Selection predicates can be used to filter data based on a selection. While
these can be constructed directly using a SelectionPredicate
class,
in Altair it is often more convenient to construct them using the
selection()
function. For example, this chart uses a multi-selection
that allows the user to click or shift-click on the bars in the bottom chart
to select the data to be shown in the top chart:
import altair as alt
from vega_datasets import data
pop = data.population.url
selection = alt.selection_multi(fields=['year'])
top = alt.Chart().mark_line().encode(
x='age:O',
y='sum(people):Q',
color='year:O'
).properties(
width=600, height=200
).transform_filter(
selection
)
bottom = alt.Chart().mark_bar().encode(
x='year:O',
y='sum(people):Q',
color=alt.condition(selection, alt.value('steelblue'), alt.value('lightgray'))
).properties(
width=600, height=100
).add_selection(
selection
)
alt.vconcat(
top, bottom,
data=pop
)
Logical Operands¶
At times it is useful to combine several types of predicates into a single selection. This can be accomplished using the various logical operand classes:
These are not yet part of the Altair interface
(see Issue 693)
but can be constructed explicitly; for example, here we plot US population
distributions for all data except the years 1950-1960,
by applying a LogicalNotPredicate
schema to a FieldRangePredicate
:
import altair as alt
from vega_datasets import data
pop = data.population.url
alt.Chart(pop).mark_line().encode(
x='age:O',
y='sum(people):Q',
color='year:O'
).properties(
width=600, height=200
).transform_filter(
{'not': alt.FieldRangePredicate(field='year', range=[1900, 1950])}
)
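The other logical operands follow the same dictionary-based pattern; for example, here is a sketch combining two predicates with an 'and', keeping only entries after 1950 for one of the two sex codes in the dataset:
alt.Chart(pop).mark_line().encode(
    x='age:O',
    y='sum(people):Q',
    color='year:O'
).transform_filter(
    {'and': [alt.FieldGTPredicate(field='year', gt=1950),
             alt.FieldEqualPredicate(field='sex', equal=2)]}
)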
Lookup Transform¶
The lookup transform extends a primary data source by looking up values from
another data source; it is similar to a one-sided join. A lookup can be added
at the top level of a chart using the Chart.transform_lookup()
method.
By way of example, imagine you have two sources of data that you would like to combine and plot: one is a list of names of people along with their age and height, and the other is some information about which groups they belong to. This example data is available in vega_datasets:
import altair as alt
from vega_datasets import data
people = data.lookup_people()
groups = data.lookup_groups()
We know how to visualize each of these datasets separately; for example:
top = alt.Chart(people).mark_square(size=200).encode(
x=alt.X('age:Q', scale=alt.Scale(zero=False)),
y=alt.Y('height:Q', scale=alt.Scale(zero=False)),
color='name:N',
tooltip='name:N'
).properties(
width=400, height=200
)
bottom = alt.Chart(groups).mark_rect().encode(
x='person:N',
y='group:O'
).properties(
width=400, height=100
)
alt.vconcat(top, bottom)
If we would like to plot features that reference both datasets (for example, the
average age within each group), we need to combine the two datasets.
This can be done either as a data preprocessing step, using tools available
in Pandas, or as part of the visualization using a LookupTransform
in Altair.
Combining Datasets with pandas.merge¶
Pandas provides a wide range of tools for merging and joining datasets; see Merge, Join, and Concatenate for some detailed examples. For the above data, we can merge the data and create a combined chart as follows:
import pandas as pd
merged = pd.merge(groups, people, how='left',
left_on='person', right_on='name')
alt.Chart(merged).mark_bar().encode(
x='mean(age):Q',
y='group:O'
)
We specify a left join, meaning that for each entry in the “person” column of groups, we look up the corresponding entry in the “name” column of people and add its values to the row. From this merged data, we can easily create a bar chart representing the mean age in each group.
Combining Datasets with a Lookup Transform¶
For some data sources (e.g. data available at a URL, or data that is streaming),
it is desirable to have a means of joining data without having to download
it for pre-processing in Pandas.
This is where Altair’s transform_lookup()
comes in.
To reproduce the above combined plot by combining datasets within the
chart specification itself, we can do the following:
alt.Chart(groups).mark_bar().encode(
x='mean(age):Q',
y='group:O'
).transform_lookup(
lookup='person',
from_=alt.LookupData(data=people, key='name',
fields=['age', 'height'])
)
Here lookup
names the field in the groups dataset on which we will match,
and the from_
argument specifies a LookupData
structure where
we supply the second dataset, the lookup key, and the fields we would like to
extract.
Example: Lookup Transforms for Geographical Visualization¶
Lookup transforms are often particularly important for geographic visualization, where it is common to combine tabular datasets with datasets that specify geographic boundaries to be visualized; for example, here is a visualization of unemployment rates per county in the US:
import altair as alt
from vega_datasets import data
counties = alt.topo_feature(data.us_10m.url, 'counties')
unemp_data = data.unemployment.url
alt.Chart(counties).mark_geoshape().encode(
color='rate:Q'
).transform_lookup(
lookup='id',
from_=alt.LookupData(unemp_data, 'id', ['rate'])
).properties(
projection={'type': 'albersUsa'},
width=500, height=300
)
TimeUnit Transform¶
TimeUnit transforms are used to discretize dates and times within Altair. As with the Aggregate Transforms and Bin transforms discussed above, they can be defined either as part of the encoding, or as a top-level transform.
These are the available time units:
"year"
,"yearquarter"
,"yearquartermonth"
,"yearmonth"
,"yearmonthdate"
,"yearmonthdatehours"
,"yearmonthdatehoursminutes"
,"yearmonthdatehoursseconds"
."quarter"
,"quartermonth"
"month"
,"monthdate"
"date"
(Day of month, i.e., 1 - 31)"day"
(Day of week, i.e., Monday - Friday)"hours"
,"hoursminutes"
,"hoursminutesseconds"
"minutes"
,"minutesseconds"
"seconds"
,"secondsmilliseconds"
"milliseconds"
TimeUnit Within Encoding¶
Any temporal field definition can include a timeUnit
argument to discretize
the temporal data.
For example, here we plot a dataset that consists of hourly temperature measurements in Seattle during the year 2010:
import altair as alt
from vega_datasets import data
temps = data.seattle_temps.url
alt.Chart(temps).mark_line().encode(
x='date:T',
y='temp:Q'
)
The plot is too busy due to the number of data points squeezed into such a short time span; we can make it a bit cleaner by discretizing the dates, for example by month, and plotting only the mean monthly temperature:
alt.Chart(temps).mark_line().encode(
x='month(date):T',
y='mean(temp):Q'
)
Notice that by default the timeUnit output is a continuous quantity; if you would instead like it to be categorical, you can specify the ordinal (O) or nominal (N) type.
This can be useful when plotting a bar chart or other discrete chart type:
alt.Chart(temps).mark_bar().encode(
x='month(date):O',
y='mean(temp):Q'
)
Multiple time units can be combined within a single plot to yield interesting views of your data; for example, here we extract both the month and the day to give a profile of Seattle temperatures through the year:
alt.Chart(temps).mark_rect().encode(
alt.X('date(date):O', title='day'),
alt.Y('month(date):O', title='month'),
color='max(temp):Q'
).properties(
title="2010 Daily High Temperatures in Seattle (F)"
)
TimeUnit as a Transform¶
Other times it is convenient to specify a timeUnit as a top-level transform,
particularly when the value may be reused.
This can be done most conveniently using the Chart.transform_timeunit()
method. For example:
alt.Chart(temps).mark_line().encode(
alt.X('month:T', axis=alt.Axis(format='%b')),
y='mean(temp):Q'
).transform_timeunit(
month='month(date)'
)
Notice that because the timeUnit
is not part of the encoding channel here,
it is often necessary to add an axis formatter to ensure appropriate axis
labels.
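Because the transform gives the computed field a name, it can also be reused across several encoding channels; for example, this sketch reuses the month field in a tooltip (the '%B' tooltip format is just one possible choice):
alt.Chart(temps).mark_line().encode(
    alt.X('month:T', axis=alt.Axis(format='%b')),
    y='mean(temp):Q',
    tooltip=alt.Tooltip('month:T', format='%B')
).transform_timeunit(
    month='month(date)'
)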
Window Transform¶
The window transform performs calculations over sorted groups of data objects. These calculations include ranking, lead/lag analysis, and aggregates such as running sums and averages. Calculated values are written back to the input data stream, where they can be referenced in encodings.
For example, consider the following chart showing time spent on various activities during a day:
import altair as alt
import pandas as pd
activities = pd.DataFrame({'Activity': ['Sleeping', 'Eating', 'TV', 'Work', 'Exercise'],
'Time': [8, 2, 4, 8, 2]})
alt.Chart(activities).mark_bar().encode(
x='Time:Q',
y='Activity:N'
)
You might wish to plot these bars in units of percentage of total time rather than in units of hours. You can do this by combining a calculate transform with a window transform, using transform_window():
alt.Chart(activities).transform_window(
TotalTime='sum(Time)',
frame=[None, None]
).transform_calculate(
PercentOfTotal="100 * datum.Time / datum.TotalTime"
).mark_bar().encode(
x='PercentOfTotal:Q',
y='Activity:N'
)
In the window transform, we specify frame=[None, None], which indicates that the aggregation at each point is performed over the entire dataset.
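With the default frame of [None, 0] (everything up to and including the current row), the same aggregation instead becomes a running total over the sorted data; here is a brief sketch, where sorting by Activity name is an arbitrary choice made only to keep the cumulative order well defined:
alt.Chart(activities).transform_window(
    CumulativeTime='sum(Time)',
    sort=[alt.SortField(field='Activity', order='ascending')]
).mark_bar().encode(
    x='CumulativeTime:Q',
    y='Activity:N'
)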
Window transforms are quite flexible, and are not yet well documented within Altair. For more information on the arguments of the window transform, see WindowTransform, or see the Vega-Lite window transform examples.