This section provides a brief introduction to pandas. The pandas library is a key component for doing data science in Python for a couple of reasons. Most importantly, it provides two data types, the series and the data frame, that allow you to store and manipulate data in a way that is useful for management and analysis. Second, pandas is incredibly useful for importing and exporting data in a wide variety of formats. Finally, it makes descriptive analysis, including both summary statistics and visualizations, straightforward. The rest of this section introduces the main capabilities of pandas relevant to data analysis.
Most of the things that you will want to do in Python require importing libraries. By convention, pandas is imported as pd. Additionally, we enable pandas graphics to be displayed within the notebook with %matplotlib inline.
%matplotlib inline
import pandas as pd
Reading data
In the summer of 2018, the Washington Post produced a report on murder clearance rates by city. They also released the data they collected on GitHub as a csv file. We can create a new dataframe, called df, using the pandas read_csv method.
df = pd.read_csv('homicide.csv')
By default, read_csv assumes a comma separator, but this can be customized by adding the sep= parameter. So a pipe-delimited version of the file could be read with:
df = pd.read_csv('data/homicide.psv', sep='|')
Additionally, read_csv can read files directly from the web.
csv_url = 'https://raw.githubusercontent.com/washingtonpost/data-homicides/master/homicide-data.csv'
df = pd.read_csv(csv_url)
Pandas has a suite of methods for importing a variety of file formats, discussed in the Bonus_File_Reading section.
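The other readers follow the same pattern as read_csv. As a sketch, here is a hypothetical example using read_json on an in-memory JSON string (the data is made up for illustration):

```python
import pandas as pd
from io import StringIO

# Hypothetical in-memory JSON data, for illustration only
json_data = StringIO('[{"city": "Albuquerque", "state": "NM"}]')

# read_json accepts a path, URL, or file-like object, just like read_csv
df = pd.read_json(json_data)
print(df)
```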
Learning about your dataframe
After loading a dataframe, best practice is to get a sense of the data with the head, info, and describe methods. head shows the first five rows of the dataframe.
df.head()
| | uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alb-000001 | 20100504 | GARCIA | JUAN | Hispanic | 78 | Male | Albuquerque | NM | 35.095788 | -106.538555 | Closed without arrest |
| 1 | Alb-000002 | 20100216 | MONTOYA | CAMERON | Hispanic | 17 | Male | Albuquerque | NM | 35.056810 | -106.715321 | Closed by arrest |
| 2 | Alb-000003 | 20100601 | SATTERFIELD | VIVIANA | White | 15 | Female | Albuquerque | NM | 35.086092 | -106.695568 | Closed without arrest |
| 3 | Alb-000004 | 20100101 | MENDIOLA | CARLOS | Hispanic | 32 | Male | Albuquerque | NM | 35.078493 | -106.556094 | Closed by arrest |
| 4 | Alb-000005 | 20100102 | MULA | VIVIAN | White | 72 | Female | Albuquerque | NM | 35.130357 | -106.580986 | Closed without arrest |
In addition to the data in the csv file, an index has been created to identify each row. By default, this is an integer starting with 0.
If the dataset is wide, middle columns will not be displayed. Also, if text fields are long, only the first few characters will be shown. These can both be adjusted using pandas display settings.
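For example, the display options can be relaxed with set_option; the specific values below are illustrative choices, not defaults you must use:

```python
import pandas as pd

# Show all columns instead of truncating wide dataframes
pd.set_option('display.max_columns', None)

# Show up to 100 characters of each text field
pd.set_option('display.max_colwidth', 100)
```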
info can be used to explore the data types and the number of non-missing cases for each variable.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52179 entries, 0 to 52178
Data columns (total 12 columns):
uid 52179 non-null object
reported_date 52179 non-null int64
victim_last 52179 non-null object
victim_first 52179 non-null object
victim_race 52179 non-null object
victim_age 52179 non-null object
victim_sex 52179 non-null object
city 52179 non-null object
state 52179 non-null object
lat 52119 non-null float64
lon 52119 non-null float64
disposition 52179 non-null object
dtypes: float64(2), int64(1), object(9)
memory usage: 4.8+ MB
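A related way to count missing values per column is to chain isnull and sum; the sketch below uses a tiny made-up dataframe rather than the homicide data:

```python
import pandas as pd
import numpy as np

# Toy dataframe with one missing latitude, for illustration only
toy = pd.DataFrame({'lat': [35.1, np.nan], 'uid': ['a', 'b']})

# isnull() marks missing cells as True; sum() counts them per column
print(toy.isnull().sum())
```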
describe provides summary statistics for all the numeric variables.
df.describe()
| | reported_date | lat | lon |
|---|---|---|---|
| count | 5.217900e+04 | 52119.000000 | 52119.000000 |
| mean | 2.013090e+07 | 37.026786 | -91.471094 |
| std | 1.123420e+06 | 4.348647 | 13.746378 |
| min | 2.007010e+07 | 25.725214 | -122.507779 |
| 25% | 2.010032e+07 | 33.765203 | -95.997198 |
| 50% | 2.012122e+07 | 38.524973 | -87.710286 |
| 75% | 2.015091e+07 | 40.027627 | -81.755909 |
| max | 2.015111e+08 | 45.051190 | -71.011519 |
The column headers can be extracted using keys.
df.keys()
Index(['uid', 'reported_date', 'victim_last', 'victim_first', 'victim_race',
'victim_age', 'victim_sex', 'city', 'state', 'lat', 'lon',
'disposition'],
dtype='object')
If you want to look at the bottom of the dataframe, you can use tail. Both head and tail allow you to change the number of rows displayed from the default of five.
df.tail(3)
| | uid | reported_date | victim_last | victim_first | victim_race | victim_age | victim_sex | city | state | lat | lon | disposition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52176 | Was-001382 | 20161114 | WASHINGTON | WILLIE | Black | 23 | Male | Washington | DC | 38.828025 | -77.002511 | Open/No arrest |
| 52177 | Was-001383 | 20161130 | BARNES | MARCUS | Black | 24 | Male | Washington | DC | 38.820476 | -77.008640 | Open/No arrest |
| 52178 | Was-001384 | 20160901 | JACKSON | KEVIN | Black | 17 | Male | Washington | DC | 38.866689 | -76.982409 | Closed by arrest |
Working with variables
We can get some descriptive statistics on a single variable using describe. The dataframe can be subset to just this series by placing the variable name in brackets.
df['victim_race'].describe()
count 52179
unique 6
top Black
freq 33361
Name: victim_race, dtype: object
In the 52,179 cases, there are six different values for victim_race. The most common of these is Black, which occurs with a frequency of 33,361.
The value_counts method reports frequencies for all values in descending order.
df['victim_race'].value_counts()
Black 33361
Hispanic 6901
White 6333
Unknown 4199
Other 700
Asian 685
Name: victim_race, dtype: int64
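value_counts can also report proportions rather than raw counts via its normalize parameter; the sketch below uses a made-up series rather than the homicide data:

```python
import pandas as pd

# Toy series, for illustration only
races = pd.Series(['Black', 'Black', 'White', 'Hispanic'])

# normalize=True converts the counts to proportions that sum to 1
print(races.value_counts(normalize=True))
```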
For numeric variables, describe reports information about the distribution of the variable.
df['lat'].describe()
count 52119.000000
mean 37.026786
std 4.348647
min 25.725214
25% 33.765203
50% 38.524973
75% 40.027627
max 45.051190
Name: lat, dtype: float64
Applied to the entire dataframe, describe reports the distribution of all of the numeric variables.
df.describe()
| | reported_date | lat | lon |
|---|---|---|---|
| count | 5.217900e+04 | 52119.000000 | 52119.000000 |
| mean | 2.013090e+07 | 37.026786 | -91.471094 |
| std | 1.123420e+06 | 4.348647 | 13.746378 |
| min | 2.007010e+07 | 25.725214 | -122.507779 |
| 25% | 2.010032e+07 | 33.765203 | -95.997198 |
| 50% | 2.012122e+07 | 38.524973 | -87.710286 |
| 75% | 2.015091e+07 | 40.027627 | -81.755909 |
| max | 2.015111e+08 | 45.051190 | -71.011519 |
Missing from the list is victim_age. read_csv is quite good at inferring data types, so there must be some cases with non-numeric values.
df['victim_age'].describe()
count 52179
unique 102
top Unknown
freq 2999
Name: victim_age, dtype: object
There are 52,179 cases, including 102 unique values. The most frequent of these is “Unknown”.
We can extract all of the values using values. Since the result is quite long, pandas will only display the head and tail.
df['victim_age'].values
array(['78', '17', '15', ..., '23', '24', '17'], dtype=object)
This is not super useful by itself, but it can be stored as a list-like object for subsequent analysis.
age_values = df['victim_age'].values
print(len(age_values))
print(age_values[:3])
52179
['78' '17' '15']
Since the ages are printed with single quotation marks around them, we know that Python considers them strings, and not numbers. We can confirm this.
first_age = age_values[0]
type(first_age)
str
Conversions
We can create a new variable, victim_age_numeric, using the pandas to_numeric method. A coerce value for the errors parameter results in a missing value (“NaN”) wherever a non-numeric string is encountered; in this case, it will be for the “Unknown” cases. Excluding the errors option would result in an error for variables like victim_age that contain a mix of strings and numbers.
df['victim_age_numeric'] = pd.to_numeric(df['victim_age'], errors='coerce')
describe on the new variable shows we are down to 49,180 cases, with a median victim age of 28.
df['victim_age_numeric'].describe()
count 49180.000000
mean 31.801220
std 14.418692
min 0.000000
25% 22.000000
50% 28.000000
75% 40.000000
max 102.000000
Name: victim_age_numeric, dtype: float64
We can also display the contents of the new variable directly, or at least the top and bottom 30 values.
df['victim_age_numeric']
0 78.0
1 17.0
2 15.0
3 32.0
4 72.0
5 91.0
6 52.0
7 52.0
8 56.0
9 43.0
10 20.0
11 NaN
12 46.0
13 16.0
14 54.0
15 43.0
16 52.0
17 52.0
18 22.0
19 49.0
20 15.0
21 25.0
22 20.0
23 47.0
24 88.0
25 27.0
26 36.0
27 47.0
28 1.0
29 58.0
...
52149 46.0
52150 24.0
52151 35.0
52152 31.0
52153 18.0
52154 40.0
52155 37.0
52156 26.0
52157 3.0
52158 30.0
52159 25.0
52160 19.0
52161 26.0
52162 25.0
52163 27.0
52164 22.0
52165 44.0
52166 25.0
52167 38.0
52168 22.0
52169 22.0
52170 25.0
52171 35.0
52172 37.0
52173 20.0
52174 29.0
52175 19.0
52176 23.0
52177 24.0
52178 17.0
Name: victim_age_numeric, Length: 52179, dtype: float64
Plots
pandas also has plotting capabilities, such as histograms (hist) and a correlation matrix (scatter_matrix). These are essential tools for understanding the data.
df['victim_age_numeric'].hist()
<matplotlib.axes._subplots.AxesSubplot at 0x11ad6f630>
By default, plots in a notebook will include a line like <matplotlib.axes._subplots.AxesSubplot at 0x11f4a7a90> above the figure. This is annoying; appending a semicolon to the plotting command suppresses it.
Plots of individual variables, or series in pandas terminology, are attributes of the data type. That is, you start with the thing you want plotted, in this case df['victim_age_numeric'], and append what you want to do, such as .hist(). A second type of plot, such as a scatter plot, is a method of the dataframe.
df.plot.scatter(x='lon', y='lat')
<matplotlib.axes._subplots.AxesSubplot at 0x11b5cd470>
You can look at the other dataframe plotting methods on the helpful pandas visualization page. Alternatively, pressing tab after df.plot. also reveals your options.
Want to know about hexbin? Again, the help page on the web is useful, but appending a question mark to the end of the command will bring up the documentation.
<img src="images/docstring.png" width = "80%" align="left"/>
A third group of plots are part of the pandas plotting library. In these cases, the thing you want plotted is the first, or only, parameter passed, as is the case with the correlation matrix.
pd.plotting.scatter_matrix(df)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11e5f7b70>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11e6216d8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11eecfd68>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11b6d3438>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x11b6f9ac8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11b6f9b00>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11b737828>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11b75feb8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x11b813588>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f456c18>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f48a2e8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f4b0978>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x11f4e1048>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f5086d8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f532d68>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f562438>]],
dtype=object)
You can also create subplots using the by option. Note that by accepts a series, or dataframe column, rather than a column name.
df['victim_age_numeric'].hist(by = df['victim_sex'],
bins = 20)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11f86a668>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11f3862b0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x11f3a9860>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11e8c3ef0>]],
dtype=object)
By default, by produces separate x and y scales for each subgraph. This is why there appears to be a relatively large number of deaths of very young females. The numbers for men and women at this age are comparable, but the very large number of male deaths in their 20s results in very different x scales for the graphs. This behavior can be changed with the sharex and sharey options.
df['victim_age_numeric'].hist(by = df['victim_sex'],
bins = 20,
sharex = True,
sharey = True)
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11e8760b8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x12057a940>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x11e735d30>,
<matplotlib.axes._subplots.AxesSubplot object at 0x11e767160>]],
dtype=object)
Finally, the results of value counts of a variable can also be plotted.
df['victim_race'].value_counts().plot(kind = 'barh')
<matplotlib.axes._subplots.AxesSubplot at 0x11f1748d0>
Other descriptives
Pandas also has a method for producing crosstabs.
pd.crosstab(df['victim_race'], df['disposition'])
Note that since this is a pandas method, and not one of a specific dataframe, you need to be explicit about which dataframe each variable is coming from. That is why the first parameter is not 'victim_race' but df['victim_race'].
normalize can be used to display percentages instead of frequencies. A value of index normalizes by row, columns by column, and all by all values.
pd.crosstab(df['victim_race'], df['disposition'], normalize='index')
| victim_race | Closed by arrest | Closed without arrest | Open/No arrest |
|---|---|---|---|
| Asian | 0.598540 | 0.105109 | 0.296350 |
| Black | 0.463475 | 0.044843 | 0.491682 |
| Hispanic | 0.482829 | 0.058108 | 0.459064 |
| Other | 0.522857 | 0.042857 | 0.434286 |
| Unknown | 0.510836 | 0.076209 | 0.412955 |
| White | 0.625138 | 0.095216 | 0.279646 |
Since this returns a dataframe, it can be saved or plotted.
cross_tab = pd.crosstab(df['victim_race'], df['disposition'], normalize='index')
cross_tab
| victim_race | Closed by arrest | Closed without arrest | Open/No arrest |
|---|---|---|---|
| Asian | 0.598540 | 0.105109 | 0.296350 |
| Black | 0.463475 | 0.044843 | 0.491682 |
| Hispanic | 0.482829 | 0.058108 | 0.459064 |
| Other | 0.522857 | 0.042857 | 0.434286 |
| Unknown | 0.510836 | 0.076209 | 0.412955 |
| White | 0.625138 | 0.095216 | 0.279646 |
cross_tab.to_csv('data/crosstab.csv')
In order to highlight meaningful characteristics of the data, you can sort before plotting.
cross_tab.sort_values(by='Closed by arrest')
cross_tab.sort_values(by='Closed by arrest').plot(kind = 'barh',
title = 'Disposition by race')
Subsets
Similar to a list, a dataframe or series can be sliced to subset the data being shown. For example, df[:2] will return the first two rows of the dataframe. (This is identical to df.head(2).)
df[:2]
df.head(2)
This also works for specific columns.
df['reported_date'][:3]
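Bracket slicing works, but pandas also offers explicit position-based (iloc) and label-based (loc) selection. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy dataframe, for illustration only
dates = pd.DataFrame({'reported_date': [20100504, 20100216, 20100601]})

# iloc selects by position, equivalent to dates['reported_date'][:2]
print(dates['reported_date'].iloc[:2])

# loc selects by index label (here the default integer index)
print(dates.loc[0, 'reported_date'])
```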
Dates
A new variable can be created from reported_date that pandas understands as a date variable using the to_datetime method. The format is %Y%m%d because the original date is in the “YYYYMMDD” format, and coerce places missing values where the data cannot be translated, rather than stopping the variable creation completely.
df['date'] = pd.to_datetime(df['reported_date'],
format='%Y%m%d',
errors='coerce')
df['date'][:3]
From the new series, we can extract specific elements, such as the year.
df['year'] = df['date'].dt.year
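The dt accessor exposes the other components of a date as well; a small sketch with made-up dates:

```python
import pandas as pd

# Toy date strings in YYYYMMDD format, for illustration only
dates = pd.to_datetime(pd.Series(['20100504', '20100216']), format='%Y%m%d')

# The .dt accessor works element-wise on a datetime series
print(dates.dt.year)   # both 2010
print(dates.dt.month)  # 5 and 2
```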
As before, value_counts and plots can give some sense of the distribution of the values.
df['year'].value_counts()
value_counts returns a pandas series with an index equal to the original values, in this case the year, and the series values based on the frequency. Since years have an order, it makes sense to sort by the index before plotting them.
df['year'].value_counts().sort_index(ascending = False).plot(kind='barh')
crosstab can also group based on more than one variable for the x or y axis. In that case, you pass a list rather than a single variable or series. To make this clearer, you can create the lists before creating the crosstab.
y_vars = [df['state'], df['city']]
x_vars = df['year']
pd.crosstab(y_vars, x_vars)
crosstab returns a dataframe with the column and index names from the values in the original dataset. Since a list was passed, the dataframe has a MultiIndex. This can be useful for cases where you have nested data, like cities within states or annual data on multiple countries.
pd.crosstab(y_vars, x_vars).index.names
By default, the index is a series that starts with 0. If your data has unique identifiers, however, it is helpful to use that as the index, especially if you intend to merge your data with other data sources. In this dataframe, each row has a unique value for uid.
df.set_index('uid', inplace=True)
df[:5]
Subsetting
You can view a subset of a dataframe based on the value of a column.
Let’s say that you wanted to look at the cases where the victim’s first name was “Juan”. You could create a new series which is either True or False for each case.
df['victim_first'] == 'JUAN'
You could store this new true/false series. If you placed this in brackets after the name of the dataframe, pandas would display only the rows with a True value.
is_juan = df['victim_first'] == 'JUAN'
df[is_juan]
More commonly, the two statements are combined.
df[df['victim_first'] == 'JUAN']
With this method of subsetting, pandas doesn’t return a new dataframe, but rather a view that hides some of the rows. So if you want to create a new dataframe based on this subset, you need to append copy() to the end.
new_df = df[df['victim_first'] == 'JUAN'].copy()
new_df.head()
As this selection method returns a dataframe, it can be stored. The following creates two dataframes, one with just the 2017 cases and one with just the 2016 cases.
df_2017 = df[df['year'] == 2017].copy()
df_2016 = df[df['year'] == 2016].copy()
df_2017['year'].value_counts()
df_2016['year'].value_counts()
value_counts confirms that the correct cases were grabbed.
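Conditions can also be combined in a single subset using & (and) and | (or), with parentheses around each condition. A sketch on made-up data:

```python
import pandas as pd

# Toy dataframe, for illustration only
cases = pd.DataFrame({'year': [2016, 2017, 2017],
                      'victim_sex': ['Male', 'Female', 'Male']})

# Each condition needs its own parentheses when combined with &
subset = cases[(cases['year'] == 2017) & (cases['victim_sex'] == 'Male')]
print(len(subset))
```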
Alternatively you may want to limit your dataset by column. In this case, you create a list of the columns you want. This list is also placed in brackets after the dataframe name.
More subsets
columns_to_keep = ['victim_last', 'victim_first', 'victim_race', 'victim_age', 'victim_sex']
df[columns_to_keep]
As before, you can use copy to create a new dataset.
victim_df = df[columns_to_keep].copy()
victim_df.head()
As with the row selection, you don’t need to store the column names in a list first. By convention, these two steps are combined. However, combining the steps does create an awkward pair of double brackets.
place_df = df[['city', 'state', 'lat', 'lon']].copy()
place_df.head()
Merging
There are several different ways to combine datasets. The most straightforward is to merge two datasets that share a common key. To merge place_df with victim_df, for example, you can use the dataframe merge method.
merged_df = place_df.merge(victim_df, left_index=True, right_index=True)
merged_df.head()
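merge also takes a how parameter ('inner', 'left', 'right', or 'outer') that controls which keys survive the merge. A minimal sketch with made-up keys modeled on the uid index:

```python
import pandas as pd

# Toy frames sharing an index, for illustration only
left = pd.DataFrame({'city': ['Albuquerque', 'Washington']},
                    index=['Alb-000001', 'Was-001382'])
right = pd.DataFrame({'victim_age': [78, 23]},
                     index=['Alb-000001', 'Was-001382'])

# how='inner' (the default) keeps only keys present in both frames
merged = left.merge(right, left_index=True, right_index=True, how='inner')
print(merged)
```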
Stacking dataframes
Dataframes can also be combined by stacking them. concat takes a list of dataframes and, by default, stacks them vertically; passing axis = 1 places them side by side instead.
df_2016 = df[df['year'] == 2016]
len(df_2016)
recent_df = pd.concat([df_2017, df_2016])
len(recent_df)
pd.concat([victim_df, place_df], axis = 1)
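To make the axis behavior concrete, a small sketch with made-up frames:

```python
import pandas as pd

# Toy frames, for illustration only
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# axis=0 (the default) stacks rows; axis=1 places the frames side by side
stacked = pd.concat([a, b])
side_by_side = pd.concat([a, b], axis=1)
print(stacked.shape, side_by_side.shape)
```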
New variables can also be created from existing columns with standard arithmetic operators. An approximate birth year, for example, is the year the case was reported minus the victim's age.
df['birth_year'] = df['year'] - df['victim_age_numeric']
df['birth_year'].describe()
Comparison operators return a true/false series. Note that the comparison is against victim_age_numeric; the original victim_age column is stored as strings, so comparing it to a number would fail. The mean of a boolean series is the proportion of True values.
df['minor'] = df['victim_age_numeric'] <= 18
df['minor'][:10]
df['minor'].mean()
Python strings have built-in methods, such as title, which capitalizes the first letter of each word. We can wrap this method in a function of our own.
'Back to some pandas string manipulation fun.'.title()
def title_case(text):
    return text.title()
title_case('JUAN')
The apply magic
apply calls a function once for each element of a series and collects the results. Here we apply title_case to the victim_first column.
df['victim_first2'] = df['victim_first'].apply(title_case)
df['victim_first2'].head(10)
df[['victim_first', 'victim_first2']].head(10)
Exercise
Write a function that extracts the last name from the name field on your Titanic dataframe. Create a new variable called Family Name to store the results. What is the most common family name?
Working on more than one column
When a new variable depends on more than one column, apply can be called on the whole dataframe with axis=1, so that the function receives an entire row at a time.
def victim_name(row):
    first_name = row['victim_first']
    last_name = row['victim_last']
    name = last_name + ', ' + first_name
    name = title_case(name)
    return name
df.apply(victim_name, axis=1)
df['victim_name'] = df.apply(victim_name, axis=1)
df.head()
Bonus: Encodings
Text files may also include special characters, such as 😎, which can be written as \U0001f60e.
print('\U0001f60e')
In this case, you might need to set the encoding parameter. With most csv files, unfortunately, this is usually a process of trial and error:
csv_url = 'https://raw.githubusercontent.com/washingtonpost/data-homicides/master/homicide-data.csv'
df = pd.read_csv(csv_url)
df = pd.read_csv(csv_url, encoding='iso-8859-1')
In this case, the csv file was not encoded in utf-8, but in iso-8859-1, also known as latin1. According to someone on the internet, the most frequently used encodings on the web are:
- UTF-8 (92% of web pages) - default for Python 3.x
- ISO-8859-1 (4%) (encoding = iso-8859-1 or encoding = latin1)
- Windows-1251 (1%) - used for Cyrillic script (encoding = windows-1251)
- Windows-1252 (<1%) - old fork of ISO-8859-1 (encoding = windows-1252)
- Shift JIS (<1%) - Japanese characters (encoding = shift-jis)
You might need to Google around for other language-specific encodings.
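One way to automate the trial and error is a small helper that tries candidate encodings in order. This is only a sketch, not a robust detector (a library like chardet does this properly), and note that iso-8859-1 will decode any byte sequence, so order matters:

```python
def sniff_encoding(path, encodings=('utf-8', 'iso-8859-1', 'windows-1251',
                                    'windows-1252', 'shift-jis')):
    """Return the first candidate encoding that decodes the whole file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeDecodeError:
            continue
    return None
```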