If you click the Interactive button above, you can run the code on this page and complete each of the exercises. If you click the Notebook button, you will be redirected to an interactive Jupyter notebook. Both of these options are powered by Binder.

The answers to each of the exercises can be revealed by selecting the plus button on the right side of the page below each exercise.

Python 101

This section introduces some of the most relevant aspects of working with Python for social scientists. This includes the different data types available and ways to modify them.

In 2009, as part of his first State of the Union address, President Barack Obama said:

Let us invest in our people without leaving them a mountain of debt

To store that in Python, create a new variable called sentence:

sentence =  'Let us invest in our people without leaving them a mountain of debt.'

The text is surrounded by a single quote (') on each side. To make sure that you typed the sentence correctly, you can type sentence:

sentence
'Let us invest in our people without leaving them a mountain of debt.'

You can get almost the same response using the print function:

print(sentence)
Let us invest in our people without leaving them a mountain of debt.

The only difference is that the first response was wrapped in single quotes and the second wasn’t. As a side note, the single quotes are not there because you typed them. If you had used double quotes, Python would still display single quotes.

sentence =  "Let us invest in our people without leaving them a mountain of debt."

sentence
'Let us invest in our people without leaving them a mountain of debt.'

In addition to ' and ", strings can also be marked with triple quotes ('''). This last option is particularly useful when your text contains contractions or quotation marks.

new_sentence = '''Let's invest in our people without leaving them a mountain of debt.'''
print(new_sentence)
Let's invest in our people without leaving them a mountain of debt.
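
As a quick check, the three quoting styles all produce ordinary strings, so the choice is only a matter of convenience. A minimal sketch follows; the variable names single, double, triple, and quote, and the example text, are made up for illustration.

```python
# The same text written with three different quoting styles.
single = 'Let us invest in our people.'
double = "Let us invest in our people."
triple = '''Let us invest in our people.'''

# The quotes mark where the string starts and ends; they are not part of it.
print(single == double == triple)

# Triple quotes let a string contain both ' and " without extra effort.
quote = '''She said, "Let's eat."'''
print(quote)
```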

Your turn

Create a new string called food that is a sentence about your most recent meal. Display the contents of your new string.

 
food = 'My standard lunch is a veggie burrito.'
print(food)
My standard lunch is a veggie burrito.

Strings

Python has a few tools for manipulating text, such as lower for making the string lower-case.

sentence.lower()
'let us invest in our people without leaving them a mountain of debt.'

This did not alter the original string, however.

sentence
'Let us invest in our people without leaving them a mountain of debt.'

In Python, strings are immutable, meaning that once created, they cannot be altered in place. We could store the results in a new variable.

lower_sentence = sentence.lower()

lower_sentence
'let us invest in our people without leaving them a mountain of debt.'
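
To see what immutable means in practice, the sketch below tries to change the first character of the string directly. It uses try and except, which have not been introduced yet and appear here only so that the error is printed instead of stopping the code.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# Trying to change one character in place fails because strings are immutable.
try:
    sentence[0] = 'l'
except TypeError as error:
    print('TypeError:', error)

# Methods such as lower() build and return a new string instead.
lower_sentence = sentence.lower()
print(sentence[0], lower_sentence[0])
```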

Your turn

Create a new, lowercase version of your food string.

 

lower_food = food.lower()

print(lower_food)

We can also replace words within the string.

sentence.replace("people", "citizens")
'Let us invest in our citizens without leaving them a mountain of debt.'

replace can also be used to remove text by leaving the replacement quotation marks empty.

sentence.replace(".", "")
'Let us invest in our people without leaving them a mountain of debt'

As before, this does not alter the original string. If you wanted to save the string edits, you would need to create a new variable.

edited_sentence = sentence.lower()
print(edited_sentence)
let us invest in our people without leaving them a mountain of debt.

If you were doing a series of manipulations, you could reuse a variable name, although it is best practice to keep a version of the original string in case you ever need to go back to it.

edited_sentence = sentence.lower()
print(edited_sentence)

edited_sentence = edited_sentence.replace(".", "")
print(edited_sentence)
let us invest in our people without leaving them a mountain of debt.
let us invest in our people without leaving them a mountain of debt

You can also stack multiple transformations together, although combining too many may make your code harder to follow.

sentence.replace(".", "").lower()
'let us invest in our people without leaving them a mountain of debt'

Your turn

Create a new string called boring that removes the exclamation marks and capitalization from the sentence “Way to go!!!”.

 
boring = "Way to go!!!".lower().replace('!', '')

print(boring)

Slicing

If you had a very long text, such as the entire text of the State of the Union, you might only want to look at the first few characters. In Python, this is done by slicing.

sentence
'Let us invest in our people without leaving them a mountain of debt.'
sentence[0:20]
'Let us invest in our'

A slice is signaled with brackets ([]). The first number is the starting position, where 0 indicates the beginning. This is followed by a colon (:) and then the end position, which, in this case, is 20. The character at the end position itself is not included. Note that this is splitting on characters, not words.

Here is a section from the middle of the string:

sentence[20:32]
' people with'

For convenience, if you omit the number before the colon, it defaults to the beginning of the string.

sentence[:40]
'Let us invest in our people without leav'

Omitting the second number defaults to the end.

sentence[40:]
'ing them a mountain of debt.'

Finally, negative numbers are interpreted as distance from the end of the string.

sentence[-20:]
' a mountain of debt.'
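
The slicing rules above can be pulled together in one short sketch. The variable names first_word, last_word, and middle are chosen just for this example.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

first_word = sentence[:3]    # characters at positions 0, 1, and 2
last_word = sentence[-5:]    # the final five characters
middle = sentence[7:13]      # start at position 7, stop before position 13

print(first_word)
print(last_word)
print(middle)

# Because the end position is excluded, a slice's length is end minus start.
print(len(middle) == 13 - 7)
```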

Your turn

Create a new string called s that contains “The weather is hot and humid today.” Find the slices for each of the following:

  • “The w”
  • “today.”
  • “hot and humid”
s = 'The weather is hot and humid today.'

print(s[:5])
print(s[-6:])
print(s[15:28])
The w
today.
hot and humid

Numbers

We can also count the number of characters in a string with the len function.

len(sentence)
68

In this case, Python returned an integer instead of a string. This can also be stored in a variable.

sentence_length = len(sentence)
sentence_length
68

Your turn

What is the length of the question “How many dogs do you own?” Store it in a variable called sl.

 
question = 'How many dogs do you own?'
sl = len(question)
print(sl)

Since the length of a string is a number, we can do standard math operations with it.

print(sentence_length * 3)
204

print(sentence_length / 2)
34.0

print(sentence_length + sentence_length)
136

Your turn

What is one-third of sl?

sl/3

As with strings, these can be saved in new variables.

double_length = sentence_length + sentence_length

print(double_length)
136

These same operators also work with strings.

print(sentence * 2)
Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.

print(sentence + sentence)
Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.

The operators can’t be used to combine different data types, however.

print("The sentence was " + sentence_length + "characters.")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-63-9041347a8e39> in <module>
----> 1 print("The sentence was " + sentence_length + "characters.")

TypeError: can only concatenate str (not "int") to str

Conveniently, the str function will convert an integer to a string.

print("The sentence was " + str(sentence_length) + " characters.")
The sentence was 68 characters.

I manually had to include the spaces before and after sentence_length. Otherwise, it all is smushed together.

print("The sentence was" + str(sentence_length) + "characters.")
The sentence was68characters.
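
An alternative that is not covered above is an f-string, available in Python 3.6 and later, which handles the conversion to text for you. A minimal sketch, where the variable name message is chosen for illustration:

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'
sentence_length = len(sentence)

# The f before the opening quote turns on formatting. Anything inside
# curly braces is evaluated and converted to text automatically.
message = f'The sentence was {sentence_length} characters.'
print(message)
```

The spaces still have to be typed, but they sit inside one string, which makes missing ones easier to spot.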

Your turn

Print The length of the word "hippopotamus" is [x]. where [x] is the length of the word hippopotamus.

 
l = len('hippopotamus')

print('The length of the word "hippopotamus" is ' + str(l) + '.')

Lists

You can also split the sentence into a series of strings. By default, this splits based on spaces and other whitespace characters such as a line break (\n) or tab character (\t).

print(sentence.split())
['Let', 'us', 'invest', 'in', 'our', 'people', 'without', 'leaving', 'them', 'a', 'mountain', 'of', 'debt.']
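
To illustrate the default behavior, the sketch below splits a string that mixes spaces, a tab, and a line break. The string messy is made up for this example.

```python
# A made-up string containing a tab (\t), a line break (\n), and extra spaces.
messy = 'Let us\tinvest\nin our   people'

# With no arguments, split treats any run of whitespace as one separator.
words = messy.split()
print(words)
```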

What is returned here is a third data type (the first two were strings and integers) called a list. A list is enclosed in brackets ([]) and the items are separated by commas. In this case each item is in quotation marks because they are all strings. Items in a list, however, can be of any sort.

my_list = ['Speeches', 7, 'Data']
my_list
['Speeches', 7, 'Data']

While len returned the number of characters in a string, it returns the number of items in a list.

len(my_list)
3
sentence_length = len(sentence.split())
sentence_length
13

In the second example, the list created by sentence.split() is not saved in any way; only its length.

Your turn

Create a list called ate that includes at least three things you ate today. Use len to count the number of items in the list.

 
ate = ['apple', 'dosa', 'pizza slice']

print(len(ate))

Like strings, lists can also be sliced. The first three items of a list:

words = sentence.split()
print(words[:3])
['Let', 'us', 'invest']

We can also extract specific items from a list by their position. As with strings, positions in Python are counted starting from 0.

words[0]
'Let'

The third word:

words[2]
'invest'

The fifth word from the end:

words[-5]
'them'

The last two words:

words[-2:]
['of', 'debt.']

Your turn

Display the first two items of your ate list. What is the last item?

 
print(ate[:2])
print(ate[-1])
['apple', 'dosa']
pizza slice

Slicing a list returns a list. If you ask for the first three items, you will get a list made up of those items. In contrast, if you request a specific location, such as words[2], Python returns the specific object stored in that place, which may be a string, a number, or even an entire list.

Unlike strings, lists are mutable. That means that we can remove things from a list or, as is more frequently the case in text analysis, add things to it. Adding is done with append.

male_words = ['his', 'him', 'father']
male_words.append('brother')
print(male_words)
['his', 'him', 'father', 'brother']

Since append is changing male_words, we do not want to use an =. The Python interpreter is editing our original list but not returning anything.

not_going_to_work = male_words.append('brother')
print(not_going_to_work)
None

Lists can also be combined using +.

gendered_words = male_words + ['her', 'she', 'mother']
print(gendered_words)
['his', 'him', 'father', 'brother', 'brother', 'her', 'she', 'mother']
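
The difference between + and append can be seen side by side. This is a small sketch that starts from a fresh copy of the list; the names longer and result are chosen for illustration.

```python
male_words = ['his', 'him', 'father']

# + builds a new list and leaves male_words untouched.
longer = male_words + ['brother']
print(male_words)
print(longer)

# append changes male_words in place and returns None.
result = male_words.append('brother')
print(male_words)
print(result)
```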

As noted above, the items in a list can include a variety of data types. This includes lists.

gendered_lists = [ male_words ,  ['her', 'she', 'mother'] ]

Note the two closing brackets next to each other. The first closes the list that ends with ‘mother’ while the second closes our gendered_lists.

len(gendered_lists)
2

gendered_lists has a length of two because it contains just two items, each a list of varying lengths.

print(gendered_lists)
[['his', 'him', 'father', 'brother', 'brother'], ['her', 'she', 'mother']]
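
Items inside a nested list can be reached by using two sets of brackets, one after the other. A minimal sketch, using a shortened version of the lists above and the made-up name female_words:

```python
gendered_lists = [['his', 'him', 'father'], ['her', 'she', 'mother']]

# The first index picks one of the inner lists.
female_words = gendered_lists[1]
print(female_words)

# A second index picks an item from within that inner list.
print(gendered_lists[1][0])
print(gendered_lists[0][-1])
```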

Your turn

Add three more items to your ate list. Use append for one. For the other two, place them in a list and then combine the two lists.

# solution
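
One possible solution, assuming the ate list from the earlier exercise is the list being extended. The three extra foods are placeholders.

```python
ate = ['apple', 'dosa', 'pizza slice']

# Add a single item in place with append.
ate.append('banana')

# Put the other two items in their own list and combine the two lists.
ate = ate + ['rice', 'tea']

print(ate)
```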

Dictionaries

A fourth useful data type is a dictionary. A dictionary is like a list in that it holds multiple items. The items in a list can be identified by their position in the list. In contrast, the values in a dictionary are associated with a keyword. The analogy here is to a physical dictionary, which has a list of unique words, and each word has a definition. In this case, the entries are called keys, and the definitions, which can be any data type, are called values.

Alternatively, you can think of a dictionary as a single row of data from a dataset, where the keys are the variable names.

respondent = {'sex'   : "female",
              'abany' : 1,
              'educ'  : 'College'}

Dictionaries are surrounded by curly brackets ({}). Each entry is a pair consisting of the key, followed by a colon and then the value. Keys are usually strings, although any immutable value, such as a number, also works. As in a list, entries are separated by commas.

You can access the contents of a dictionary by enclosing the key in brackets ([]).

respondent['sex']
'female'

If the key is not in the dictionary, you will get a KeyError.

respondent['gender']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-10-4da2e0df882d> in <module>
----> 1 respondent['gender']

KeyError: 'gender'

You can inspect all the keys in a dictionary, in case you forgot or someone else made it.

respondent.keys()
dict_keys(['sex', 'abany', 'educ'])
len(respondent.keys())
3
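
To avoid a KeyError when you are unsure whether a key exists, you can test for it with in or look it up with the dictionary's get method. A short sketch; the fallback text 'not asked' is made up for illustration.

```python
respondent = {'sex': 'female', 'abany': 1, 'educ': 'College'}

# in checks whether a key is present without raising an error.
print('gender' in respondent)
print('sex' in respondent)

# get returns None for a missing key, or a fallback value if one is supplied.
print(respondent.get('gender'))
print(respondent.get('gender', 'not asked'))
```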

Dictionaries are mutable, so we can change the value of existing keys, remove keys, or add new ones.

respondent['race'] = 'Black'

print(respondent)
{'sex': 'female', 'abany': 1, 'educ': 'College', 'race': 'Black'}

respondent['abany'] = 'Yes'

print(respondent)
{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black'}

Your turn

Add a new key to the dictionary called age with a value of 37. Confirm that you did it correctly by displaying the value of age.

 
respondent['age'] = 37

print(respondent['age'])

As noted above, the values can be any data type, while the keys are typically strings. You could add the ages of the respondent’s children as a list.

respondent['children ages'] = [3, 5, 10]

print(respondent)
{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black', 'age': 37, 'children ages': [3, 5, 10]}
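
Once a list is stored as a value, it can be retrieved by its key and then treated like any other list. A minimal sketch with a shortened version of the dictionary and the made-up variable name ages:

```python
respondent = {'sex': 'female', 'age': 37, 'children ages': [3, 5, 10]}

# Look up the list by its key, then work with it as usual.
ages = respondent['children ages']
print(len(ages))

# The key lookup and the list position can also be written in one step.
print(respondent['children ages'][0])
```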

Spaces

Within the Python community, there are strong norms about how code should be written. Many of these are centered around making code readable, both by others and by your future self. As a trivial example, 2+2 is allowed, but is almost always written 2 + 2. Likewise, I defined my respondent dictionary with plenty of white space in order to maximize readability.

respondent = {'sex'   : "female",
              'abany' : 1,
              'educ'  : 'College'}

This is identical to:

respondent={'sex':'female','abany':1,'educ':'College'}

but putting it all on one line obscures the logic of the dictionary. In this case, what is a key and what is a value is quite clear in the first version, while distinguishing between the two is more problematic in the single-line version.

r2 = {'sex':'male',   'abany':1, 'educ':'College'     }
r3 = {'sex':'female', 'abany':0, 'educ':'High School' }
r4 = {'sex':'male',   'abany':0, 'educ':'Some College'}
respondents = [respondent, r2, r3, r4]
respondents
[{'sex': 'female', 'abany': 1, 'educ': 'College'},
 {'sex': 'male', 'abany': 1, 'educ': 'College'},
 {'sex': 'female', 'abany': 0, 'educ': 'High School'},
 {'sex': 'male', 'abany': 0, 'educ': 'Some College'}]

This now looks a lot like the common data format JSON!
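
The resemblance to JSON can be checked with the json module from Python's standard library. This is a sketch rather than part of the original lesson: it converts a shortened list of dictionaries to JSON text and back again.

```python
import json

respondents = [{'sex': 'female', 'abany': 1, 'educ': 'College'},
               {'sex': 'male',   'abany': 0, 'educ': 'Some College'}]

# dumps turns the list of dictionaries into a JSON-formatted string.
text = json.dumps(respondents, indent=2)
print(text)

# loads turns JSON text back into Python lists and dictionaries.
restored = json.loads(text)
print(restored == respondents)
```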

Loops

A for loop runs the indented lines once for each item in a list. On each pass, the current item is stored in the variable named after for, such as person in the first example.

for person in respondents:
    print(person['educ'])
College
College
High School
Some College

for item in [1,2,'bobcat']:
    print(item)
1
2
bobcat
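
Loops are often used to build up a result one item at a time. The sketch below counts how many respondents have an abany value of 1 by adding the values together; the variable name total is chosen for illustration.

```python
respondents = [{'sex': 'female', 'abany': 1, 'educ': 'College'},
               {'sex': 'male',   'abany': 1, 'educ': 'College'},
               {'sex': 'female', 'abany': 0, 'educ': 'High School'},
               {'sex': 'male',   'abany': 0, 'educ': 'Some College'}]

# Start at zero and add each respondent's abany value in turn.
total = 0
for person in respondents:
    total = total + person['abany']

print(total)
```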

Your turn

Loop over the items in your ate list. For each item, print its length.

 
# Answer
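
One possible answer, assuming the ate list from the earlier exercise is the list to loop over:

```python
ate = ['apple', 'dosa', 'pizza slice']

# len gives the number of characters in each string, spaces included.
for item in ate:
    print(len(item))
```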

Functions

For those who come from a Stata or R background, one of the more striking aspects of Python code is the frequency of user-defined functions. They are deployed not just for things where you think there should be a function, like counting words in a sentence, but also for highly custom situations, such as scraping the contents of a particular web page. This style of programming, with many small functions, tends to make code more readable and easier to debug than code written in a more traditional social science style.

A standard function has three parts. First the function is named and defined. Subsequent line or lines actually do the thing. Finally the results are returned.

A trivial function that returns the string Hello! might look like:

def make_hello():
    word = 'Hello!'
    return word

Of note, def signals that you are defining a function. This is followed by the name of the function, in this case make_hello. Since this function doesn’t take any arguments, such as a variable to modify or any options, the name is followed by empty parentheses (). The first line ends with a colon.

All subsequent lines are indented. The second line creates a new string variable called word which contains Hello!. The third and final line of the function returns the value stored in word.

make_hello()
'Hello!'

More commonly in text analysis, a user-defined function modifies an existing string. In this case, the variable name that will be used within the function is established within the parentheses on the opening line.

A second trivial function takes a text string and returns an all-caps version.

def scream(text):
    text_upper = text.upper()
    return text_upper
scream('Hi there!')
'HI THERE!'

The text and text_upper variables only exist within the function. That means that you can pass a variable not called text to the function.

scream(sentence)
'LET US INVEST IN OUR PEOPLE WITHOUT LEAVING THEM A MOUNTAIN OF DEBT.'
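
The word-counting case mentioned at the start of this section can be written the same way, by combining split and len inside a function. The name count_words is made up for this sketch.

```python
def count_words(text):
    '''Returns the number of whitespace-separated words in text.'''
    words = text.split()
    return len(words)

sentence = 'Let us invest in our people without leaving them a mountain of debt.'
print(count_words(sentence))
print(count_words('Hi there!'))
```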

It also means that everything but the returned text disappears.

text_upper
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-37-9374ae9251ac> in <module>
----> 1 text_upper

NameError: name 'text_upper' is not defined

It is a good idea to include a short description, called a docstring, as the first line within the function. This is helpful for other people reading your code and for you when you return to your own code days or months later.

def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper
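
The description is more than decoration: Python stores it with the function, where it can be read back through the function's __doc__ attribute. A brief sketch:

```python
def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper

# The description is stored alongside the function itself.
print(scream.__doc__)
```

In a notebook, help(scream) displays the same text along with the function's name and arguments.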

Your turn

Make a function called whisper that replaces all exclamation marks with a period and returns a lower case version of a string. Test it out.

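def whisper(text):
    '''Returns a lower-case version of text with exclamation marks replaced by periods.'''
    quiet_text = text.replace('!', '.').lower()
    return quiet_text

whisper('Way to go!!!')
'way to go...'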