Pandas Dataframes

Introduction

Humanity generates an enormous amount of data daily. Every time you tap a screen, board a bus, pay for lunch, or simply walk through the city centre, you're generating data that is stored somewhere in some format. The most natural way to picture such data is as a table: the rows represent observations and the columns represent attributes of those observations.

When it comes to working with data tables on a computer, Excel, Google Sheets or R are the first tools that come to mind. They are all decent, but none of them combines scalability, community support, and versatility for further analysis as well as Python does.
So what if we want to analyse tabular data in Python? The answer is the pandas package! It can read many formats, such as .csv, .xls and .json.

Basics

A data table in pandas is called a DataFrame. It is a 2-dimensional, size-mutable, potentially heterogeneous tabular data structure. You can think of it as a dict where the keys are column names and the values are the columns' data. Moreover, this is exactly how we can create a dataframe from scratch:

# load the package (if ModuleNotFoundError: pip install pandas)
import pandas as pd

## option 1: from a dictionary with key as column name and values as column values
gene_data = {
    "Gene": [       # column name
        "BRCA1",    # column values
        "TP53",
        "MYC"
    ],
    "Level": [150, 320, 210]
    }

df = pd.DataFrame(gene_data)
print(df, "\n\n", type(df)) #  <class 'pandas.core.frame.DataFrame'>
                            #     Gene  Level
                            # 0  BRCA1    150
                            # 1   TP53    320
                            # 2    MYC    210 


## option 2. from a 2D list of data plus a regular list with the headers
colnames = ["Gene", "Level"]
values = [
    ["BRCA1", 150],
    ["TP53", 320],
    ["MYC", 210]]
df = pd.DataFrame(values, columns = colnames)
print(df, "\n\n", type(df)) #  <class 'pandas.core.frame.DataFrame'>
                            #     Gene  Level
                            # 0  BRCA1    150
                            # 1   TP53    320
                            # 2    MYC    210

However, we usually work with already collected datasets. They may be stored in different formats, the most popular being .csv, .xls and .json. One of the most famous teaching datasets is iris - let's play with it:

import csv
import requests
import io
csv_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'

## option 1: load a CSV file from the internet, convert it to dict, pass to DataFrame

# Download the CSV file from Github
response = requests.get(csv_url) # Download the CSV file
response.raise_for_status()  # optional: Ensure we got the file successfully
data_string = response.text

# Read csv file just like we did it last week!
reader = csv.DictReader(io.StringIO(data_string), delimiter=',', quotechar="'")
headers = reader.fieldnames
data = list(reader)

iris = pd.DataFrame(data)
print(iris.head(5)) # display the first 5 rows

## option 2: use the read_csv method (don't use it on the exam!)

iris2 = pd.read_csv(csv_url,
                    dtype={'sepal_length': float,
                           'sepal_width': float,
                           'petal_length': float,
                           'petal_width': float,}) # , index_col=0
print(iris2.head()) # (5 is a default)

## both will return:

#    sepal_length  sepal_width  petal_length  petal_width species
# 0           5.1          3.5           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 2           4.7          3.2           1.3          0.2  setosa
# 3           4.6          3.1           1.5          0.2  setosa
# 4           5.0          3.6           1.4          0.2  setosa

iris = iris2

And just like with dictionaries, we can create new columns in a dataframe by assigning to a column name that didn't exist in the dataframe before.

import random
print('Before appending new column:', iris.shape) # (150, 5) - 150 rows, 5 columns
random_values = [round(random.random(), 2) for i in range(len(iris))] # make sure that the length of the list is the same as the number of rows

# create new column
iris['nonsense_random_floats'] = random_values
print('After appending:', iris.shape) # (150, 6) - new column appeared

# delete existing column
iris = iris.drop(columns=['nonsense_random_floats'])
print('After dropping:', iris.shape) # (150, 5) - bye bye column!
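New columns can also be computed from existing ones with vectorized arithmetic, no loop required. A minimal sketch on a toy dataframe (the petal_ratio column is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "petal_length": [1.4, 4.7, 6.0],
    "petal_width": [0.2, 1.4, 2.5],
})

# vectorized arithmetic: the whole column is computed element-wise at once
df["petal_ratio"] = df["petal_length"] / df["petal_width"]
print(df)
```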

Besides the .head() and .tail() methods that let you see the first and last N rows, we can also view a dataframe's columns and index, and convert it back to numpy's ndarray:

print(iris.head(2), '\n\n') # display the first 2 rows
print(iris.tail(2), '\n\n') # display the last 2 rows
print(iris.columns, '\n\n') # display the column names: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width','species']
print(iris.index, '\n\n')   # display the row indexes (just a range if not specified)
print(iris.to_numpy()[0:3]) # convert the DataFrame to a numpy array
Indexes and Subsets

Just like any data container in Python, dataframes give you access to individual pieces of data. Thanks to the nature of pd.Series (the datatype of a 1D array taken from a pd.DataFrame, basically an np.ndarray with axis labels), we can use both keys (column names) and indexes (row labels) to access data:

# extracting columns
species_column = iris['species']  # just like with dicts
species_column2 = iris.species    # or like an attribute
assert species_column.equals(species_column2)  # True
print(species_column, "\n\n", type(species_column))  # <class 'pandas.core.series.Series'>
                                                    # 0       setosa
                                                    # 1       setosa
                                                    # 2       setosa
                                                    # ...
                                                    # 149    virginica
                                                    # Name: species, dtype: object

# extracting rows
row = iris.loc[2]  # 3rd obs/row (with index 2)
print(row, "\n\n", type(row))  # <class 'pandas.core.series.Series'>
                               # sepal_length       4.7
                               # sepal_width        3.2
                               # petal_length       1.3
                               # petal_width        0.2
                               # species         setosa
                               # Name: 2, dtype: object

In both scenarios we got a pd.Series object - a 1D array. But in the first print we displayed the entire "species" column, whereas in the second one we output a single row of the dataframe (the one with index 2).

If you want to edit a single cell's value, you can overwrite it with the assignment operator:

print(iris.iloc[5, 2])  # 1.7
iris.iat[5, 2] = 1.8    # change the value of the 6th row and 3rd column to 1.8
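The .loc indexer also accepts a boolean mask on the left-hand side, which lets you overwrite every cell that matches a condition in one assignment. A small sketch on made-up data (setting setosa petal lengths to zero is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa"],
    "petal_length": [1.4, 6.0, 1.3],
})

# assign to all rows where the condition holds, in a single statement
df.loc[df["species"] == "setosa", "petal_length"] = 0.0
print(df)
```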

We can also create a subset of a dataframe in two ways: by selecting a subset of columns, or by filtering rows with a boolean mask.

Example of filtering:

[figure: how filter works (src: pandas tutorials)]

# creating a subset without filters
irises_sepals_data = iris[['sepal_length', 'sepal_width' , 'species']]  # select only columns about sepals and species
half_of_irises_sepals_data = irises_sepals_data.sample(frac=0.5)        # bonus: get a random 50% of observations
print(half_of_irises_sepals_data.head())                                # display the first 5 rows
#      sepal_length  sepal_width     species
# 127           6.1          3.0   virginica
# 84            5.4          3.0  versicolor
# 130           7.4          2.8   virginica
# 114           5.8          2.8   virginica
# 100           6.3          3.3   virginica

# Filter rows where sepal_width is greater than 3
more_than_3_mask = iris['sepal_width'].astype(float) > 3   # filter rows by a condition
print(more_than_3_mask, '\n')  # Boolean mask
# 0       True
# 1      False
# 2       True
# ...

# Keep only rows where the condition is True
print(iris[more_than_3_mask], '\n')  # Filtered rows
# # [67 rows x 5 columns] 

# Or shortly in 1 line (reverse condition):
print(iris[iris['sepal_width'].astype(float) <= 3])  # Filtered rows with reverse condition
# [83 rows x 5 columns]
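Conditions can also be combined with & (and) and | (or); each condition must be wrapped in parentheses because of operator precedence. A sketch on a toy dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
    "petal_length": [1.4, 1.4, 4.7, 6.1],
})

# both conditions must hold for a row to pass the combined mask
mask = (df["sepal_width"] > 3) & (df["petal_length"] < 2)
print(df[mask])  # only rows matching both conditions
```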
Joining two dataframes

Sometimes we need to merge two sources of data into a single dataframe for analysis. For this purpose, pandas offers pd.concat() for stacking dataframes and merge() for database-style joins.

Let’s split iris dataframe and then join it back together:

# split iris dataframe in two halves
iris1 = iris.iloc[:75]  # first half
iris2 = iris.iloc[75:] # second half

# concatenate two dataframes
iris_combined = pd.concat([iris1, iris2], axis=0)  # axis=0 means rows

# create a new Series mapping each species to a size label (a descriptive category we assign)
sizes = pd.Series({'virginica': 'big', 'versicolor': 'medium', 'setosa': 'small'}, name='size')
iris = iris.merge(sizes, left_on='species', right_index=True, how='left')  # merge irises with new data by column 'species' 
print(iris.sample(3))                                                      # display random 3 rows
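For completeness, here is a small sketch of merging two dataframes on a shared key column, showing how the how= argument decides which rows survive (the data is made up):

```python
import pandas as pd

left = pd.DataFrame({"species": ["setosa", "virginica"], "count": [50, 50]})
right = pd.DataFrame({"species": ["setosa", "versicolor"], "size": ["small", "medium"]})

# inner keeps only keys present in both frames
inner = left.merge(right, on="species", how="inner")
# left keeps every row of `left`; missing matches become NaN
left_join = left.merge(right, on="species", how="left")

print(inner)      # 1 row: setosa
print(left_join)  # 2 rows: virginica gets NaN in 'size'
```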
Operations with Dataframes

DataFrame objects expose several ndarray-like attributes, so you can freely check the .size, .shape and .dtypes of the data.

When it comes to methods (functions called on the dataframe itself, like data.head()), we get a bunch of new and useful tools to play with.

# get shape of the DataFrame
print("\nShape of DataFrame:", iris.shape)    # (150, 5)

# get data types of each column
print("\nData types of each column:\n", iris.dtypes, sep = "") 
# sepal_length    float64
# sepal_width     float64
# petal_length    float64
# petal_width     float64
# species          object

But the biggest advantage is that you can calculate summary statistics of the dataframe right away. For this, make sure that all columns of your dataframe have the correct datatype:

iris.info()
summary = iris.describe() # summary statistics
print("Summary statistics on iris data:\n", summary.transpose(), sep = "") # for readability

# Summary statistics on iris data:
#               count      mean       std  min  25%   50%  75%  max
# sepal_length  150.0  5.843333  0.828066  4.3  5.1  5.80  6.4  7.9
# sepal_width   150.0  3.057333  0.435866  2.0  2.8  3.00  3.3  4.4
# petal_length  150.0  3.758000  1.765298  1.0  1.6  4.35  5.1  6.9
# petal_width   150.0  1.199333  0.762238  0.1  0.3  1.30  1.8  2.5
Grouping

These statistics can also be calculated for each species - that's what's called grouping, or group by. This usually refers to a process involving one or more of the following steps: splitting the data into groups based on some criteria, applying a function to each group independently, and combining the results into a data structure.

[figure: how grouping works (inspired by pandas doc)]

grouped_medians = iris.groupby('species')['petal_length'].median()  # group by species and calculate medians
print("Grouped Summary Statistics by Species:\n", grouped_medians)
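Several statistics per group can be computed at once with .agg(). A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 6.0, 5.1],
})

# split by species, then compute several statistics per group in one call
stats = df.groupby("species")["petal_length"].agg(["mean", "min", "max"])
print(stats)
```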