Intro to Data Analysis

Now that we have had a brief introduction to both numpy and pandas, it is time to start our first data analysis project.

There are so many technologies out there that it is easy to feel overwhelmed if you expect yourself to remember every new method and function. The truth is: no one knows them all. Even data scientists with 900 light years of experience constantly google stuff or read the docs. Our goal here is to show you how numpy and pandas can be used to build a flowing narrative in a data science project, so you can return to these notes any time you need to refresh something.


The best definition of data analysis would be: trying to answer a question using data. And it really is just this, in a nutshell. But, as always, the devil is in the details. It is the specifics of the field you’re working in, and the generally accepted approaches to solving that kind of question in that particular field, that make data analysis so difficult to generalise - even for a definition.

Just as an illustration: all the examples below use data analysis to answer their questions, from daily tasks to space research:

Let’s meet our data

Today, we’re going to work with a dataset about hawks. The data was collected during bird capture sessions by students and faculty at Cornell College in Mount Vernon, Iowa, USA. The dataset includes birds of the following species: Red-tailed, Sharp-shinned, and Cooper’s hawks:

all three hawks species

(images from wikipedia and allaboutbirds.org)

import pandas as pd
import numpy as np
hawks = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/Stat2Data/Hawks.csv', index_col=0)
hawks.head()

First, a proper data analysis needs some kind of goal, a question to be answered. Usually it is dictated by the project you’re doing, such as your job or subject, and the analysis can be descriptive, diagnostic, or predictive. We’re going to investigate this question:

How do different hawk species differ in their physical characteristics?

To effectively analyze this question, we need to break it into smaller, manageable steps. This means:

  • In this research question, the dependent variable (a.k.a. response, predicted, explained, and all other synonyms) is the hawk species because we are examining how different species vary based on their physical characteristics.

  • Since Species is a categorical variable (one of three species), the analysis would likely involve comparisons between groups or even classification techniques.

  • All other variables will be explanatory / independent variables, which we will analyse to answer our question.
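A minimal sketch of how this split between the dependent variable and the explanatory variables can look in code. The tiny dataframe is invented for illustration (the numbers are not real measurements); only the column names mirror the hawks data.

```python
import pandas as pd

# Toy data, invented for illustration (not real measurements)
toy = pd.DataFrame({
    'Species': ['RT', 'SS', 'CH', 'RT'],
    'Wing':    [385.0, 180.0, 240.0, 390.0],
    'Weight':  [1100.0, 150.0, 420.0, 1050.0],
})

target = toy['Species']                  # dependent variable (categorical)
features = toy.drop(columns='Species')   # explanatory variables

print(target.unique())            # the groups we want to compare
print(features.columns.tolist())  # the traits we compare them on
```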

It is always important to understand what the data you’re going to analyse is about. Variable descriptions usually help with this (src):

| Column | Description |
| --- | --- |
| Month | 8=September to 12=December |
| Day | Date in the month |
| Year | Year: 1992-2003 |
| CaptureTime | Time of capture (HH:MM) |
| ReleaseTime | Time of release (HH:MM) |
| BandNumber | ID band code |
| Species | CH=Cooper’s, RT=Red-tailed, SS=Sharp-shinned |
| Age | A=Adult or I=Immature |
| Sex | F=Female or M=Male |
| Wing | Length (in mm) of primary wing feather from tip to wrist it attaches to |
| Weight | Body weight (in gram) |
| Culmen | Length (in mm) of the upper bill from the tip to where it bumps into the fleshy part of the bird |
| Hallux | Length (in mm) of the killing talon |
| Tail | Measurement (in mm) related to the length of the tail (invented at the MacBride Raptor Center) |
| StandardTail | Standard measurement of tail length (in mm) |
| Tarsus | Length of the basic foot bone (in mm) |
| WingPitFat | Amount of fat in the wing pit |
| KeelFat | Amount of fat on the breastbone (measured by feel) |
| Crop | Amount of material in the crop, coded from 1=full to 0=empty |

Overview of the data

The first step when working with any dataset is to get a high-level overview of its structure and content:

hawks.info()
# Index: 908 entries, 1 to 908
# Data columns (total 19 columns):
#  #   Column        Non-Null Count  Dtype  
# ---  ------        --------------  -----  
#  0   Month         908 non-null    int64  
#  1   Day           908 non-null    int64  
#  2   Year          908 non-null    int64  
#  3   CaptureTime   908 non-null    object 
#  4   ReleaseTime   907 non-null    object 
#  5   BandNumber    908 non-null    object 
#  6   Species       908 non-null    object 
#  7   Age           908 non-null    object 
#  8   Sex           332 non-null    object 
#  9   Wing          907 non-null    float64
#  10  Weight        898 non-null    float64
#  11  Culmen        901 non-null    float64
#  12  Hallux        902 non-null    float64
#  13  Tail          908 non-null    int64  
#  14  StandardTail  571 non-null    float64
#  15  Tarsus        75 non-null     float64
#  16  WingPitFat    77 non-null     float64
#  17  KeelFat       567 non-null    float64
#  18  Crop          565 non-null    float64
# dtypes: float64(9), int64(4), object(6)
# memory usage: 141.9+ KB

It’s crucial to explore summary statistics (like mean, median, and range) grouped by species to identify key differences in physical traits. This helps reveal patterns, such as whether one species tends to be larger or heavier than another.

We start by counting how many observations we have for each species. Note: one hawk can be observed several times.

# count how many times each bird (identified by its band number) was observed
print(hawks['BandNumber'].value_counts(), "\n")  # value_counts() already sorts in descending order

# leave only distinct rows by bandnumber
hawks_unique = hawks.drop_duplicates(subset=['BandNumber'], keep='last')  # keep last observation
# count by species 
print(hawks_unique['Species'].value_counts().sort_values())
# Species
# CH     69
# SS    261
# RT    577
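To make the de-duplication step easier to follow, here is a small sketch on a made-up capture log (the band numbers and weights are hypothetical). It shows the average number of captures per bird and what keep='last' retains, assuming the rows are stored in chronological order.

```python
import pandas as pd

# Toy capture log, invented for illustration: band 'A1' was caught twice
toy = pd.DataFrame({
    'BandNumber': ['A1', 'B2', 'A1', 'C3'],
    'Species':    ['RT', 'SS', 'RT', 'CH'],
    'Weight':     [1000.0, 150.0, 1080.0, 420.0],
})

captures_per_bird = toy['BandNumber'].value_counts()
print(captures_per_bird.mean())   # average number of captures per bird

# keep='last' retains the last row (in file order) for each band number
toy_unique = toy.drop_duplicates(subset=['BandNumber'], keep='last')
print(toy_unique)
```

With keep='first' the earlier capture of 'A1' would be kept instead, so the choice matters whenever measurements change between captures.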

Let’s look at some summary statistics by species - first for numeric variables, then for categorical ones. For simplicity, let’s convert some of the columns to different units.

hawks['Weight_kg'] = hawks['Weight'] / 1000 # convert weight to kg
hawks['Culmen_cm'] = hawks['Culmen'] / 10   # convert culmen to cm
hawks['Hallux_cm'] = hawks['Hallux'] / 10   # convert hallux to cm
hawks['Wing_cm'] = hawks['Wing'] / 10       # convert wing to cm

print(hawks[['Wing_cm', 'Weight_kg', 'Culmen_cm', 'Hallux_cm', 'Species']].groupby('Species').mean().round(2))
#          Wing_cm  Weight_kg  Culmen_cm  Hallux_cm
# Species                                          
# CH         24.41       0.42       1.76       2.28
# RT         38.33       1.09       2.70       3.20
# SS         18.49       0.15       1.15       1.50

Also, instead of manually deriving all the metrics we’re interested in, we can use the .describe() method:

print(hawks.describe())
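Since our question is about differences between species, the same idea can be applied per group. The sketch below uses invented weights (not values from the dataset) and computes the mean, median, and range for each species with groupby and agg.

```python
import pandas as pd

# Toy weights in grams, invented for illustration
toy = pd.DataFrame({
    'Species': ['RT', 'RT', 'RT', 'SS', 'SS', 'SS'],
    'Weight':  [1000.0, 1100.0, 1200.0, 100.0, 150.0, 170.0],
})

# several summary statistics per species in one call
summary = toy.groupby('Species')['Weight'].agg(['mean', 'median', 'min', 'max'])
summary['range'] = summary['max'] - summary['min']   # spread within each species
print(summary)
```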

Summary statistics are useful, but they don’t always tell the full story.

But what about categorical variables?

# print a random sample of 10 rows from the object-dtype (text) columns
print(hawks.select_dtypes(include='object').sample(10))

What conclusions can we make from here?
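A random sample only gives a first impression. One possible next step, sketched here on invented rows (not the real data), is to count categories with value_counts and to cross-tabulate two categorical columns with pd.crosstab.

```python
import pandas as pd

# Toy categorical data, invented for illustration
toy = pd.DataFrame({
    'Species': ['RT', 'RT', 'SS', 'SS', 'CH', 'RT'],
    'Age':     ['A', 'I', 'I', 'I', 'A', 'I'],
})

# share of each age class across all rows
print(toy['Age'].value_counts(normalize=True))

# number of birds for every Species x Age combination
table = pd.crosstab(toy['Species'], toy['Age'])
print(table)
```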

Data Cleaning

There are two things we can notice from the hawks.info() output above:

  1. our data has already been converted into the correct datatypes, which makes the data analysis process a bit easier.

    This is great news for us, because we don’t need to spend time inspecting the data column by column to decide which datatype each one should be converted into. However, it is common to run into situations where we do need to do this - see the pandas notes and the iris dataframe, where we had this issue.

  2. there are 908 observations in total, but some variables have missing values (NaN, meaning the cell doesn’t contain any value).

    The second point might be a problem. Dataframe operations generally skip missing data, but some of them raise errors or return unexpected results:

try:
    hawks['Weight_int'] = hawks['Weight'].astype('int')
except ValueError as e:
    print("IntCastingNaNError:", e)
# IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
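If integer values are really needed while some entries are missing, one option is pandas’ nullable integer dtype 'Int64' (capital I). This is a sketch on toy values, not a step the analysis below relies on; the values are rounded first so that they are whole numbers before the cast.

```python
import numpy as np
import pandas as pd

weights = pd.Series([1100.0, np.nan, 150.4])   # toy values, invented

# Plain 'int' cannot represent NaN, but the nullable 'Int64' dtype can:
# the missing entry becomes <NA> instead of raising an error.
weights_int = weights.round().astype('Int64')
print(weights_int)
```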

So, we need to deal with the missing values somehow. There are several strategies:

  1. leave it as it is :)

  2. dropping missing data - just deleting columns / rows entirely.

    • this works fine when a lot of values are missing for a certain feature/observation; otherwise you are just going to lose too much information.
  3. filling with fixed values - means, medians, whatever you think makes sense.

    • okay if the variation in the data is small and there are only a couple of missing values.
  4. interpolation - filling values by guessing (or predicting) from a range of observed data

  5. predicting missing values (e.g. with linear regression)
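Which strategy fits a column depends largely on how much of it is missing. The sketch below uses an invented dataframe to measure the share of NaN per column and then compares mean filling with linear interpolation. The 0.5 cut-off for dropping a column is an arbitrary assumption made for this example, not a rule.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'Weight': [1000.0, np.nan, 1200.0, 1400.0],
    'Tarsus': [np.nan, np.nan, np.nan, 80.0],
})

missing_share = toy.isna().mean()   # fraction of NaN in each column
print(missing_share)

# drop columns that are mostly empty (threshold chosen arbitrarily here)
dropped = toy.drop(columns=missing_share[missing_share > 0.5].index)

filled = dropped['Weight'].fillna(dropped['Weight'].mean())      # uses the column mean
interpolated = dropped['Weight'].interpolate(method='linear')    # uses neighbouring rows
print(filled.tolist(), interpolated.tolist())
```

The two filled values differ because the mean uses the whole column, while linear interpolation only looks at the rows directly before and after the gap.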

columns_to_fill = ['Weight', 'Wing', 'Tail']                    # important, with few NaN
columns_to_drop = ['Tarsus', 'WingPitFat']                      # less important, with many NaN
columns_to_interpolate = ['StandardTail', 'KeelFat', 'Crop']    # something in between

# Strategy 2: dropping
hawks = hawks.drop(columns=columns_to_drop)    # drop whole variables from the analysis
# or
bad_rows = hawks[hawks.isna().sum(axis=1) > 3].index    # observations with more than 3 missing values
hawks = hawks.drop(index=bad_rows)                      # drop those observations

# Strategy 4: interpolation
for column in columns_to_interpolate:
    hawks[column] = hawks[column].interpolate(method='linear')  # fill gaps from neighbouring rows

# Strategy 3: filling with a fixed value
for column in columns_to_fill:
    column_mean = hawks[column].mean()                 # mean of the whole column
    hawks[column] = hawks[column].fillna(column_mean)  # fill missing values with that mean

# A more appropriate way to fill with means is to use
# species-specific means, so we reload the raw data and start over
hawks = pd.read_csv('https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/refs/heads/master/csv/Stat2Data/Hawks.csv', index_col=0)

for column in columns_to_fill:
    species_means = hawks.groupby('Species')[column].transform('mean')  # species mean for each of the 908 rows
    hawks[column] = hawks[column].fillna(species_means)  # fill missing values with species-specific means

Correlations

Correlation measures both the strength and the direction of the linear relationship between two numeric variables. It ranges from -1 to 1: values close to 1 mean that the variables increase together, values close to -1 mean that one decreases as the other increases, and values close to 0 mean there is little or no linear relationship. Let’s look at some examples:
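As a small numerical illustration, the sketch below computes the Pearson correlation by hand on made-up arrays and checks it against np.corrcoef. The helper name pearson is introduced here just for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1        # increases perfectly with x
y_down = -3 * x + 10    # decreases perfectly as x increases

def pearson(a, b):
    """Pearson correlation: co-variation scaled by the spread of both variables."""
    a_c = a - a.mean()   # centre both variables
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

print(round(pearson(x, y_up), 2))             # perfect positive relationship
print(round(pearson(x, y_down), 2))           # perfect negative relationship
print(round(np.corrcoef(x, y_up)[0, 1], 2))   # numpy's built-in gives the same value
```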

Let’s calculate correlations between numeric variables in the dataframe:

# pairwise correlations between the float columns, rounded to 2 decimals
hawks.select_dtypes(include='float').corr().round(2)

There is a well-known principle in statistics: correlation does not imply causation. Even if two things change in a similar way, it doesn’t necessarily mean that one is causing the other. The two variables might both be driven by something else, or might just be moving together by coincidence. I attach my favourite example below, but you can see more funny stuff on https://tylervigen.com/spurious-correlations :

cage's correlation
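The "connected by something else" case can also be simulated. In this sketch all data is randomly generated and the variable names (temperature, ice_cream, sunburn) are hypothetical: two variables never influence each other, yet they correlate strongly because both depend on a third one.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
n = 1000

temperature = rng.normal(size=n)                     # hidden common cause
ice_cream = temperature + 0.3 * rng.normal(size=n)   # driven by temperature
sunburn = temperature + 0.3 * rng.normal(size=n)     # also driven by temperature

# strong correlation, although neither variable affects the other
r = np.corrcoef(ice_cream, sunburn)[0, 1]
print(round(r, 2))

# once the common cause is subtracted, only unrelated noise is left
r_resid = np.corrcoef(ice_cream - temperature, sunburn - temperature)[0, 1]
print(round(r_resid, 2))
```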

What’s next?

While all these statistics are mega interesting and informative, it’s hard to build a full understanding of what’s going on just by looking at numbers. Take these correlation matrices: we need to spend several minutes trying to extract insights that we can use to answer our research question, which takes a lot of cognitive effort and concentration, and might just be boring.

But don’t worry: next week’s lecture will be about data visualisation, and we’re gonna build cool plots that tell this story for you. Stay tuned!