Pandas Dataframes
Introduction

[Header illustration designed by Nerdy Vogue]
Humanity generates an enormous amount of data daily. Every time you tap a screen, board a bus, pay for lunch, or just walk through the city centre, you are producing data that is stored somewhere, in some format. The most common way for us to picture data is as a data table: the rows represent observations and the columns represent attributes of those observations.
When it comes to working with data tables on a computer, Excel, Sheets or R are the first tools that come to mind. They are all decent, but none of them offers great scalability, strong community support, and versatile options for further work with the data all at the same time.
What if we want to analyse tabular data in Python? The answer is the Pandas package! It is suited to data in many different forms, such as:
- Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
- Ordered and unordered time series and panel data (the latter gave pandas its name)
- Arbitrary matrix data
- Any other form of observational / statistical data sets
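For instance, both arbitrary matrix data and a time series fit into a DataFrame. A minimal sketch (the dates, values and column names below are made up for illustration):
import numpy as np
import pandas as pd
dates = pd.date_range("2024-01-01", periods=4, freq="D")  # a daily time index
matrix = np.random.rand(4, 2)                             # arbitrary matrix data
ts_df = pd.DataFrame(matrix, index=dates, columns=["A", "B"])
print(ts_df)  # 4 rows labelled by date, 2 columns of random floats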
Basics
A data table in pandas is called a DataFrame. It is a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure. You can think of it as a dict, where the keys are variable names and the values are the data in the dataframe. Moreover, this is exactly how we can create a dataframe from scratch:
# load the package (if ModuleNotFoundError: pip install pandas)
import pandas as pd
## option 1: from a dictionary with keys as column names and values as column values
gene_data = {
    "Gene": [     # column name
        "BRCA1",  # column values
        "TP53",
        "MYC"
    ],
    "Level": [150, 320, 210]
}
df = pd.DataFrame(gene_data)
print(df, "\n\n", type(df)) # <class 'pandas.core.frame.DataFrame'>
# Gene Level
# 0 BRCA1 150
# 1 TP53 320
# 2 MYC 210
## option 2: from a 2D list of data plus a regular list with the headers
colnames = ["Gene", "Level"]
values = [
    ["BRCA1", 150],
    ["TP53", 320],
    ["MYC", 210]
]
df = pd.DataFrame(values, columns=colnames)
print(df, "\n\n", type(df)) # <class 'pandas.core.frame.DataFrame'>
# Gene Level
# 0 BRCA1 150
# 1 TP53 320
# 2 MYC 210
However, we usually work with dataframes that have already been gathered. They might be stored in different formats, the most popular being .csv, .xls and .json. One of the most famous training datasets is iris - let's play with it:
import csv
import requests
import io
csv_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
## option 1: load a CSV file from the internet, convert it to dict, pass to DataFrame
# Download the CSV file from Github
response = requests.get(csv_url) # Download the CSV file
response.raise_for_status() # optional: Ensure we got the file successfully
data_string = response.text
# Read the csv file just like we did last week!
reader = csv.DictReader(io.StringIO(data_string), delimiter=',', quotechar="'")
headers = reader.fieldnames
data = list(reader)
iris = pd.DataFrame(data)
print(iris.head(5)) # display the first N rows (here N=5)
## option 2: use the read_csv method (don't use it on the exam!)
iris2 = pd.read_csv(csv_url,
                    dtype={'sepal_length': float,
                           'sepal_width': float,
                           'petal_length': float,
                           'petal_width': float})  # , index_col=0
print(iris2.head())  # (5 is the default)
## both will return:
# sepal_length sepal_width petal_length petal_width species
# 0 5.1 3.5 1.4 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 3 4.6 3.1 1.5 0.2 setosa
# 4 5.0 3.6 1.4 0.2 setosa
iris = iris2  # keep the version with proper numeric dtypes
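The same interface covers the other formats mentioned above; writing is the mirror image of reading. A small sketch (the file names here are made up):
iris.to_csv("iris_copy.csv", index=False)        # write back to CSV
iris.to_json("iris_copy.json")                   # write to JSON
iris_from_json = pd.read_json("iris_copy.json")  # and read it back in
print(iris_from_json.head())
# Excel needs an extra engine installed (e.g. pip install openpyxl):
# iris.to_excel("iris_copy.xlsx", index=False)
# iris_from_excel = pd.read_excel("iris_copy.xlsx")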
And just like with dictionaries, we can create new columns in dataframes by assigning to a column name that did not exist in the dataframe before.
import random
print('Before appending new column:', iris.shape) # (150, 5) - 150 rows, 5 columns
random_values = [round(random.random(), 2) for i in range(len(iris))] # make sure that the length of the list is the same as the number of rows
# create new column
iris['nonsense_random_floats'] = random_values
print('After appending:', iris.shape) # (150, 6) - new column appeared
# delete existing column
iris = iris.drop(columns=['nonsense_random_floats'])
print('After dropping:', iris.shape) # (150, 5) - bye bye column!
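Assignment also works with whole-column arithmetic, so a new column can be derived from existing ones without an explicit loop. A small sketch (the petal_area column is made up for illustration):
iris['petal_area'] = iris['petal_length'] * iris['petal_width']  # element-wise product, no loop needed
print(iris[['petal_length', 'petal_width', 'petal_area']].head(3))
iris = iris.drop(columns=['petal_area'])  # drop it again so the examples below match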
Besides the .head() and .tail() methods that let you see the first and last N rows, we can also view the dataframe's columns and convert it back to numpy's ndarray:
print(iris.head(2), '\n\n') # display the first 2 rows
print(iris.tail(2), '\n\n') # display the last 2 rows
print(iris.columns, '\n\n') # display the column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species']
print(iris.index, '\n\n') # display the row indexes (just a range if not specified)
print(iris.to_numpy()[0:3]) # convert the DataFrame to a numpy array
Indexes and Subsets
Just like any data container in Python, dataframes give you access to particular pieces of data. However, due to the nature of pd.Series (the datatype of a 1D array from a pd.DataFrame - basically an np.ndarray with axis labels), we can use both keys (column names) and indexes (row labels) to access the data:
# extracting columns
species_column = iris['species'] # just like with dicts
species_column2 = iris.species # or like an attribute
assert species_column.equals(species_column2) # True
print(species_column, "\n\n", type(species_column)) # <class 'pandas.core.series.Series'>
# 0 setosa
# 1 setosa
# 2 setosa
# ...
# 149 virginica
# Name: species, dtype: object
# extracting rows
row = iris.loc[2] # 3rd obs/row (with index 2)
print(row, "\n\n", type(row)) # <class 'pandas.core.series.Series'>
# sepal_length 4.7
# sepal_width 3.2
# petal_length 1.3
# petal_width 0.2
# species setosa
# Name: 2, dtype: object
In both scenarios we got a pd.Series object - a 1D array. But the first print displayed the entire "species" column, whereas the second one output the third row of the dataframe.
If you want to edit a single cell's value, you can overwrite it with the assignment operator:
print(iris.iloc[5, 2]) # 1.7
iris.iat[5, 2] = 1.8 # change the value of the 6th row and 3rd column to 1.8
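For orientation: .loc and .at address cells by row label and column name, while .iloc and .iat address them by integer position; with the default RangeIndex the two coincide. A brief sketch:
print(iris.loc[2, 'petal_length'])  # by label: row label 2, column 'petal_length'
print(iris.iloc[2, 2])              # by position: 3rd row, 3rd column - the same cell here
iris.iat[5, 2] = 1.7                # restore the value we changed above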
We can also create a subset of a dataframe in two ways:
- by specifying which columns and rows should be selected as part of the dataframe (remember the double square brackets: we access the dataframe's subset with data[...], and the several column names inside are written as a list, e.g. ['col1', 'col2']);
- by filtering the dataframe with boolean masks, just like we did with numpy arrays: create a logical condition and pass it inside square brackets after the dataframe.
Example of filtering:

[Figure: filtering rows of a DataFrame (src: pandas tutorials)]
# creating a subset without filters
irises_sepals_data = iris[['sepal_length', 'sepal_width' , 'species']] # select only columns about sepals and species
half_of_irises_sepals_data = irises_sepals_data.sample(frac=0.5) # bonus: get a random 50% of observations
print(half_of_irises_sepals_data.head()) # display the first 5 rows
# sepal_length sepal_width species
# 127 6.1 3.0 virginica
# 84 5.4 3.0 versicolor
# 130 7.4 2.8 virginica
# 114 5.8 2.8 virginica
# 100 6.3 3.3 virginica...
# Filter rows where sepal_width is greater than 3
more_than_3_mask = iris['sepal_width'].astype(float) > 3 # filter rows by a condition
print(more_than_3_mask, '\n') # Boolean mask
# 0 True
# 1 False
# 2 True
# ...
# Keep only rows where the condition is True
print(iris[more_than_3_mask], '\n') # Filtered rows
# [67 rows x 5 columns]
# Or shortly in 1 line (reverse condition):
print(iris[iris['sepal_width'].astype(float) <= 3]) # Filtered rows with reverse condition
# [83 rows x 5 columns]
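Masks can be combined with & (and), | (or) and ~ (not); each condition needs its own parentheses, because these operators bind tighter than comparisons. A short sketch:
setosa_wide = iris[(iris['species'] == 'setosa') & (iris['sepal_width'] > 3.5)]  # both conditions must hold
print(setosa_wide.shape)
two_species = iris[iris['species'].isin(['setosa', 'virginica'])]  # mask from a list of allowed values
print(two_species.shape)  # (100, 5) - 50 rows per species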
Joining two dataframes
Sometimes we need to merge two sources of data to run a single analysis on a new dataframe. For this purpose, we can use:
- concat - simply sticks one series / dataframe onto another, just like with numpy arrays;
- merge - join operations similar to relational databases like SQL: it combines two tables based on a common key, aligning rows where the keys match.
Let’s split iris dataframe and then join it back together:
# split iris dataframe in two halves
iris1 = iris.iloc[:75] # first half
iris2 = iris.iloc[75:] # second half
# concatenate two dataframes
iris_combined = pd.concat([iris1, iris2], axis=0) # axis=0 means rows
# create a new Series mapping each species to a size label (how we'd describe the flowers by eye)
sizes = pd.Series({'virginica': 'big', 'versicolor': 'medium', 'setosa': 'small'}, name='size')
iris = iris.merge(sizes, left_on='species', right_index=True, how='left') # merge irises with new data by column 'species'
print(iris.sample(3)) # display random 3 rows
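For completeness, concat can also glue dataframes together side by side with axis=1, aligning rows by index. A small sketch that splits iris by columns just for illustration:
left_part = iris[['species']]        # one column
right_part = iris[['petal_length']]  # another column
side_by_side = pd.concat([left_part, right_part], axis=1)  # axis=1 means columns
print(side_by_side.head(2))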
Operations with Dataframes
DataFrame objects expose some attributes familiar from the np.ndarray class, so you can freely check the .size, .shape and .dtypes of the data.
When it comes to methods (pandas functions called on the dataframe, like data.head()), we get a bunch of new and useful functions to play with.
# get shape of the DataFrame
print("\nShape of DataFrame:", iris.shape) # (150, 5)
# get data types of each column
print("\nData types of each column:\n", iris.dtypes, sep = "")
# sepal_length float64
# sepal_width float64
# petal_length float64
# petal_width float64
# species object
But the biggest advantage is that you can calculate summary statistics of the dataframe right in place. For this purpose, you need to make sure that all columns of your dataframe have the correct datatype:
iris.info()
summary = iris.describe() # summary statistics (numeric columns only)
print("Summary statistics on iris data:\n", summary.transpose(), sep = "") # transpose for readability
# Summary statistics on iris data:
# count mean std min 25% 50% 75% max
# sepal_length 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9
# sepal_width 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4
# petal_length 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9
# petal_width 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5
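If a column arrived with the wrong datatype (e.g. numbers read in as text), the .astype() method converts it. A small sketch on a throwaway copy:
iris_str = iris.copy()
iris_str['sepal_length'] = iris_str['sepal_length'].astype(str)    # pretend the numbers came in as text
print(iris_str['sepal_length'].dtype)                              # object
iris_str['sepal_length'] = iris_str['sepal_length'].astype(float)  # convert back to numbers
print(iris_str['sepal_length'].dtype)                              # float64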
Grouping
These statistics can also be calculated for each species - that's what is called grouping, or group by. It usually refers to a process involving one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

[Figure: the split-apply-combine process (inspired by the pandas docs)]
grouped_medians = iris.groupby('species')['petal_length'].median() # group by species and calculate medians
print("Grouped Summary Statistics by Species:\n", grouped_medians)