Hello, Altair!

There are many Python libraries we can use to visualize data. Here, we will look at how to visually explore data using two popular libraries, pandas and Vega-Altair. By the end of this module, you will be able to:

What is Altair?

Altair is a declarative statistical visualization library for Python. It is based on Vega-Lite, a high-level grammar of interactive graphics.

By declarative, we mean that you provide a high-level specification of what you want the visualization to include, in terms of data, graphical marks, and encoding channels. You don’t have to implement the visualization using for-loops, low-level drawing commands, etc.

The key idea is that you declare links between data fields and visual encoding channels such as the x-axis, y-axis, and color. The other details of the plot are handled automatically. As a result, a surprising range of visualizations, from simple to sophisticated, can be created using a concise grammar.

Altair gives you a friendly Python API (Application Programming Interface) for generating visualization specifications, either in interactive environments like Jupyter Notebooks or in a regular Python file. The resulting charts are rendered in the web browser.

Image: example Altair figures. Credit: screenshot from Vega-Altair


Import Altair

To begin, we need to import the libraries needed to use their functions: pandas for dataframes and altair for visualization. We will assign short aliases pd for pandas and alt for altair that we will refer to throughout our code.

import pandas as pd
import altair as alt

Load Dataframe

Data in Altair is built around the pandas dataframe, which consists of a table of rows and data columns. These rows are often referred to as data items or entries, and columns as data attributes, fields, or variables.

As in the Data Analysis lecture notes, we load our dataset as a dataframe using a built-in function from pandas: read_csv(). Let’s read in the Hawks dataset from that lecture!

Note: If Altair is not installed yet, install it first with pip install altair

hawks = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/Stat2Data/Hawks.csv', index_col=0)
hawks.head()
preview of hawks dataframe
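If the URL above is not reachable, a small stand-in dataframe can be enough to follow along. The sketch below is only an assumption-laden substitute: it reuses the column names that appear in this module (Species, Wing, Weight, Tail), but the name hawks_demo and all values are made up for illustration.

```python
import pandas as pd

# Stand-in for the Hawks data: same column names as used in this module,
# but the values are made up for illustration.
hawks_demo = pd.DataFrame(
    {
        "Species": ["RT", "CH", "SS", "RT"],
        "Wing": [385.0, 244.0, 170.0, 376.0],
        "Weight": [920.0, 390.0, 105.0, 930.0],
        "Tail": [219, 200, 130, 221],
    }
)

# Rows are data items (one hawk each); columns are fields.
n_rows, n_cols = hawks_demo.shape
print(n_rows, n_cols)

# Column types matter later: Altair uses them to guess how to encode a field.
print(hawks_demo.dtypes)
```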

The Chart Object

In Python, objects are special variables that contain not only data but also their own methods, functions that can be used to manipulate that data.

The fundamental object in Altair is the Chart. We can create a Chart object from the Altair library by writing alt.Chart(). It takes a dataframe as its argument inside the parentheses, such as the Hawks dataset that we have stored in the variable hawks.

alt.Chart(hawks)
Chart object error message

Yet, defining a Chart object alone does not produce a chart. What else do we need to tell Altair to draw the data?

Marks and Channels

Once we have defined the Chart() object and passed it our hawks dataframe, we can now specify how we would like the data in the chart to be visualized.

We introduce the role of marks and channels as we start to specify charts with Altair. A mark is the geometric shape that we use to represent data elements in our graph. Channels are the visual styling/characteristics that we assign to each of our marks. This includes how and where we position the mark, or the color that we assign.

Here, we first indicate what kind of graphical mark we want to use to represent the data. We can set the mark attribute of the chart object using mark_* methods.

For example, we can show each data item as a short line segment using mark_tick().

alt.Chart(hawks).mark_tick()
tick marks without channels

Why is this code only showing us one vertical line? Here, the chart is drawing one tick per row in the dataset. They are plotted directly on top of each other because we have not specified positions for these ticks yet.

To visually separate the points, we can map encoding channels, or channels for short, to data columns in the dataset. For example, we could encode the field Wing using the X channel, which represents the x-axis or horizontal position of the ticks. To specify this, we use the encode() method and chain this function after the mark_tick() method.

The encode() method builds a mapping between visual encoding channels (such as X, Y, color, shape, size, etc.) and fields in the dataset, accessed by field name. Altair provides channel classes for defining encoding channels, e.g., alt.X("FieldName").

alt.Chart(hawks).mark_tick().encode(
    alt.X("Wing") 
)
tick marks with X channel

And now all of our tick marks are spread out along the x-axis of our chart, where each tick mark indicates a given hawk’s wing length. Through this simple visualization, we can also get a sense of how wing lengths are distributed among the hawks in our dataset.


Aggregating Data

Though we’ve separated the data by "Wing", we still have multiple data points overlapping (i.e., there are many birds with the same wing length and we cannot tease them apart easily in the previous graphic).

We can further separate these by using a Y encoding channel, which represents the y-axis or vertical position of the marks. We can map the Y channel to the total number of hawks with the same or similar wing lengths. To do so, we need to combine and aggregate our hawk records in some way. For this, let’s use bars, mark_bar, as our marks of choice.

alt.Chart(hawks).mark_bar().encode(
    alt.X("Wing"),
    alt.Y("count()")
)
bar marks with counts

The most straightforward way to aggregate this information is count(). This aggregation counts the number of hawks for every "Wing" value that appears in the dataset. Because there are so many distinct values, we can then group them into coarser wing-length ranges to make the distribution easier to see; this is called binning. Luckily, Altair has a parameter, bin=True, that we can add to the X channel to do this.

With the code below, we create a histogram, a type of bar chart that summarizes continuous data and conveniently shows its distribution.

alt.Chart(hawks).mark_bar().encode(
    alt.X("Wing", bin=True),
    alt.Y("count()")
)
histogram
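To make binning and counting concrete, here is a rough pandas equivalent of what the histogram computes. This is a sketch under stated assumptions: the wing lengths are made up, and the bin edges are chosen by hand, whereas Altair picks its own edges when bin=True is used.

```python
import pandas as pd

# Made-up wing lengths standing in for the "Wing" column.
wings = pd.Series([155, 160, 172, 198, 202, 240, 245, 251, 380, 395], name="Wing")

# Binning: assign each value to a range, here (150, 200], (200, 250], and so on.
binned = pd.cut(wings, bins=[150, 200, 250, 300, 350, 400])

# Aggregating: count how many values fall into each range, in range order.
counts = binned.value_counts().sort_index()
print(counts)
```

Each count corresponds to the height of one bar in the histogram; a range with no hawks gets a count of zero.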

Color and Interaction

We can also map the Y encoding channel to a completely different field, such as "Weight". This will allow us to see the spread of hawks according to their weight against their wing length. For this, let’s use circles, mark_circle, as our marks of choice.

alt.Chart(hawks).mark_circle().encode(
    alt.X("Wing"),
    alt.Y("Weight")
)
scatterplot

Through this visualization, we can see that there seem to be two or three clusters of data – what could be causing these clusters? Could it have something to do with the hawk species? We can check our hypothesis by color coding the data points by hawk species. We do this by linking the color channel with the "Species" data column.

alt.Chart(hawks).mark_circle().encode(
    alt.X("Wing"),
    alt.Y("Weight"),
    color="Species"
)
scatterplot color-coded by species

With this static graphic, we now have a sense of the overall trends between weight, wing length, and species in our dataset. We can still look at individual data points in detail, for example, by making the chart interactive. Altair offers an easy and convenient way to zoom into and pan around the data: interactive().

alt.Chart(hawks).mark_circle().encode(
    alt.X("Wing"),
    alt.Y("Weight"),
    color="Species"
).interactive()

We can also customize elements across the chart as a whole, for example, by using a color that better relates to the dataset (e.g., brown for hawks), by changing the opacity of marks to show the density of the data points, or simply to suit personal preference.

There are two ways to customize marks: inside mark_* directly or inside encode() through a channel such as color or opacity. Both of these code snippets will create the same chart!

alt.Chart(hawks).mark_circle(color="brown", opacity=0.2).encode(
    alt.X("Wing"),
    alt.Y("Weight")
)

alt.Chart(hawks).mark_circle().encode(
    alt.X("Wing"),
    alt.Y("Weight"),
    color=alt.value("brown"),
    opacity=alt.value(0.2)
)
scatterplot with custom color and opacity

Repeating Charts

With Altair, we can also combine multiple charts together in a visualization. We’ll show one example of a multi-view chart called a scatterplot matrix, used to visually show correlations in the data.

For example, if we wanted to visually analyze the weight, wing length and tail length in a large chart…

How would we do this manually? Create individual charts, stack them horizontally into rows, and then stack those rows vertically to form a grid.

base = alt.Chart().mark_circle().encode(
    color='Species'
)

chart = alt.vconcat(data=hawks)
for y_encoding in ['Weight', 'Wing', 'Tail']:
    row = alt.hconcat()
    for x_encoding in ['Weight', 'Wing', 'Tail']:
        row |= base.encode(x=x_encoding, y=y_encoding) #create individual charts and stack horizontally into rows
    chart &= row # stack rows vertically into columns
chart

Or…

We can use the repeat() method from Altair to specify a set of encodings for the row and column – the attributes/columns we want to visualize and find relationships between.

The repeater ties a channel to the row or column within a repeated chart: alt.repeat("column") is assigned to the X channel and alt.repeat("row") to the Y channel, so every chart in a given column shares its x-axis field and every chart in a given row shares its y-axis field.

alt.Chart(hawks).mark_circle().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='Species'
    ).repeat(
        row=["Weight", "Wing", "Tail"],
        column=["Weight", "Wing", "Tail"]
    )
scatterplot matrix

Saving the Chart

Once you have visualized your data, perhaps you would like to publish it somewhere on the web. Here we are assigning a Chart object to a variable chart and generating a stand-alone HTML document for it using the Chart.save() method:

chart = alt.Chart(hawks).mark_circle().encode(
    alt.X("Wing"),
    alt.Y("Weight"),
    color="Species"
)

chart.save("hawks.html") # or any file name/path of choice

You can open this HTML file in a web browser (e.g., Safari, Google Chrome) to view the visualization that you’ve made!