Pandas in the Pandaemic (COVID-19 in Scotland statistics). Part 2
This is the second part of my exploration of the pandas package. Part 1 can be found here: https://www.vallka.com/blog/leaning-pandas-data-manipulation-and-visualization-covid-19-in-scotland-statistics/
All data are taken from the official site:

weekly-deaths-by-location-age-sex.xlsx
Let’s quickly repeat the data-loading process from Part 1:
import numpy as np
import pandas as pd

data_2021 = pd.read_excel('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-sex.xlsx',
                          sheet_name='Data', skiprows=4, skipfooter=2, usecols='A:F',
                          header=None, names=['week','location','sex','age','cause','deaths'])
data_2021['year'] = data_2021.week.str.slice(0,2).astype(int)
data_2021['week'] = data_2021.week.str.slice(3,5).astype(int)

data_1519 = pd.read_excel('https://www.nrscotland.gov.uk/files//statistics/covid19/weekly-deaths-by-location-age-group-sex-15-19.xlsx',
                          sheet_name='Data', skiprows=4, skipfooter=2, usecols='A:F',
                          header=None, names=['year','week','location','sex','age','deaths'])
data_1519['cause'] = 'Pre-COVID-19'

data = data_1519.copy()
data = data.append(data_2021, ignore_index=True)
data
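A side note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the concatenation above has to be written with pd.concat. A minimal equivalent:

# pd.concat is the modern replacement for the deprecated DataFrame.append
data = pd.concat([data_1519, data_2021], ignore_index=True)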
One note here. In Part 1 I was using a simple assignment of a DataFrame to make a copy. This is wrong. Simple assignment does not create a copy of a DataFrame; it just creates another reference to it, so all changes to the second DataFrame affect the original one. To make things even more interesting, pandas provides two versions of the copy() function, copy(deep=False) and copy(deep=True), with deep=True being the default. Here is some discussion about all three.
It is still not clear to me what the difference is between simple assignment and a shallow copy() (deep=False), nor do my results confirm that discussion. But a deep copy() (with the default parameter deep=True) seems to work as expected all the time.
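To see the difference for yourself, here is a minimal sketch on a toy DataFrame (df, ref, shallow and deep are illustrative names of my own, not from the article):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

ref = df                       # simple assignment: just another name for the same object
shallow = df.copy(deep=False)  # new object, but it shares the underlying data
deep = df.copy()               # deep=True is the default: a fully independent copy

ref.loc[0, 'a'] = 99           # changing via the reference changes df too
print(df.loc[0, 'a'])          # 99

deep.loc[1, 'a'] = -1          # changing the deep copy leaves df untouched
print(df.loc[1, 'a'])          # 2

# how the shallow copy reacts to in-place modification differs between
# pandas versions, which is probably why the discussion is confusing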
Let’s see the original data_1519 to ensure we didn’t modify it accidentally:
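Displaying the frame is enough for a quick check (this mirrors what presumably followed here in the original post):

# data_1519 should contain only the rows loaded from the 2015-2019 file,
# all with cause == 'Pre-COVID-19' and no 2021 data
data_1519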
Let’s quickly create the totals:
totals = data.groupby(['year','week']).agg({'deaths': np.sum})
totals.loc[15,'deaths_15'] = totals['deaths']
totals.loc[16,'deaths_16'] = totals['deaths']
totals.loc[17,'deaths_17'] = totals['deaths']
totals.loc[18,'deaths_18'] = totals['deaths']
totals.loc[19,'deaths_19'] = totals['deaths']
totals.loc[20,'deaths_20'] = totals['deaths']
totals.loc[21,'deaths_21'] = totals['deaths']
totals
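As an aside, the seven near-identical assignments above can be collapsed into a loop; this is only a sketch of the very same operation, not a change in the result:

# create one deaths_<year> column per year, filled only on that year's rows
for y in range(15, 22):
    totals.loc[y, f'deaths_{y}'] = totals['deaths']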
Let’s get rid of the multi-index and transform it into additional columns. I still think this is the quickest way of plotting the data in a single plot:
totals.reset_index(inplace=True)
totals

totals.plot(x='week',
            y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],
            title='Total deaths by week', figsize=(12,8))
Again, let’s look at the current year only. A couple of weeks have passed since I wrote Part 1. How many weeks are in this year now? And, a more generic question, how many weeks are in each year?
wpy = totals.groupby('year').agg({'week': np.max})
wpy['week']
totals[totals['week'] <= wpy['week'][21]].plot(x='week',
    y=['deaths_15','deaths_16','deaths_17','deaths_18','deaths_19','deaths_20','deaths_21'],
    title='Total deaths by week', figsize=(12,8))
Let’s move on and add cumulative values (or ‘running totals’). Thanks to pandas, this is very easy compared to SQL. (There are probably other ways to achieve the same result; I used the one that seemed simplest to me at this stage of learning.)
totals['cumdeaths_15'] = totals.groupby('year').agg({'deaths_15': np.cumsum})
totals['cumdeaths_16'] = totals.groupby('year').agg({'deaths_16': np.cumsum})
totals['cumdeaths_17'] = totals.groupby('year').agg({'deaths_17': np.cumsum})
totals['cumdeaths_18'] = totals.groupby('year').agg({'deaths_18': np.cumsum})
totals['cumdeaths_19'] = totals.groupby('year').agg({'deaths_19': np.cumsum})
totals['cumdeaths_20'] = totals.groupby('year').agg({'deaths_20': np.cumsum})
totals['cumdeaths_21'] = totals.groupby('year').agg({'deaths_21': np.cumsum})
totals

totals.plot(x='week',
            y=['cumdeaths_15','cumdeaths_16','cumdeaths_17','cumdeaths_18','cumdeaths_19','cumdeaths_20','cumdeaths_21'],
            title='Total deaths by week', figsize=(12,8))
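For reference, the same running totals can be built without the seven repeated lines by calling cumsum() on the grouped columns directly. A sketch of an alternative (cols and cum are names I introduce here), assuming it is run instead of, not after, the assignments above:

# cumulative sums of all deaths_* columns within each year, in one shot
cols = [f'deaths_{y}' for y in range(15, 22)]
cum = totals.groupby('year')[cols].cumsum()
cum.columns = [c.replace('deaths', 'cumdeaths') for c in cols]
totals = totals.join(cum)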
For year-wide values, 2020 is definitely the worst. On this graph we can also clearly see the difference in length between the years (in weeks), so it is not quite correct to compare 2020, with its 53 weeks, to 2019, with only 52. Which year is worse on this graph, 2015 or 2018?
Again, let’s take only the beginning of each year:
totals[totals['week'] <= wpy['week'][21]].plot(x='week',
    y=['cumdeaths_15','cumdeaths_16','cumdeaths_17','cumdeaths_18','cumdeaths_19','cumdeaths_20','cumdeaths_21'],
    title='Total deaths by week', figsize=(12,8))
Ok, what’s next? I think it would be interesting to look at a year’s worth of data counting back from the current date, that is, from March 2020 to March 2021, and compare it to the previous years. How to do this? Not obvious…
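One possible starting point (a rough sketch of my own, not the approach from the full article): number the weeks continuously across years and then slice off the most recent 52. The week_abs counter below is an assumption that ignores 53-week years for simplicity:

# continuous week number across all years (approximate: treats every year as 52 weeks)
totals['week_abs'] = (totals['year'] - 15) * 52 + totals['week']

# the 52 most recent weeks of data: roughly March 2020 to March 2021
last_year = totals[totals['week_abs'] > totals['week_abs'].max() - 52]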
Please read the rest of the article on my blog directly. I couldn't make Medium import the whole story…
Originally published at https://www.vallka.com/blog/pandas-in-the-pandaemic-covid-19-in-scotland-statistics-part-2/