Python 101
This section introduces some of the most relevant aspects of working with Python for social scientists. This includes the different data types available and ways to modify them.
In 2009, as part of his first State of the Union address, President Barack Obama said:
Let us invest in our people without leaving them a mountain of debt
To store that in Python, create a new variable called sentence
sentence = 'Let us invest in our people without leaving them a mountain of debt.'
The text is surrounded by a single quote (') on each side.
To make sure that you typed the sentence correctly, you can type sentence:
sentence
'Let us invest in our people without leaving them a mountain of debt.'
You can get almost the same response using the print function:
print(sentence)
Let us invest in our people without leaving them a mountain of debt.
The only difference is that the first response was wrapped in single quotes and the second wasn’t. As a side note, the single quotes did not appear because you typed them. If you had used double quotes, Python would still display single quotes.
sentence = "Let us invest in our people without leaving them a mountain of debt."
sentence
'Let us invest in our people without leaving them a mountain of debt.'
In addition to ' and ", strings can also be marked with '''. This last one is particularly useful when your text contains contractions or quotation marks.
new_sentence = '''Let's invest in our people without leaving them a mountain of debt.'''
print(new_sentence)
Let's invest in our people without leaving them a mountain of debt.
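As a small illustration of when each quote style helps, here is a sketch using made-up example strings (they are not from the speech). Double quotes are enough when the text only contains an apostrophe, while triple quotes handle text that has both kinds of quotation mark.

```python
# Double quotes let the string contain an apostrophe.
contraction = "Don't walk."

# Triple quotes let the string contain both apostrophes and double quotes.
sign = '''The sign said "Don't walk."'''

print(contraction)
print(sign)
```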
Your turn
Create a new string called food that is a sentence about your most recent meal. Display the contents of your new string.
My standard lunch is a veggie burrito.
Strings
Python has a few tools for manipulating text, such as lower for making the string lower-case.
sentence.lower()
'let us invest in our people without leaving them a mountain of debt.'
This did not alter the original string, however.
sentence
'Let us invest in our people without leaving them a mountain of debt.'
In Python, strings are immutable, meaning that once created, they cannot be altered in place. We can, however, store the result in a new variable.
lower_sentence = sentence.lower()
lower_sentence
'let us invest in our people without leaving them a mountain of debt.'
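One way to see immutability directly is to try changing a single character. The sketch below assumes you are comfortable seeing try and except, which are not covered in this section; they are only used here to catch the error so the code keeps running. The variable changed_in_place is a name made up for this example.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# Attempting to overwrite the first character raises a TypeError,
# because a string cannot be edited in place.
try:
    sentence[0] = 'l'
    changed_in_place = True
except TypeError:
    changed_in_place = False

print(changed_in_place)

# The original string is untouched.
print(sentence)
```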
Your turn
Create a new, lowercased version of your food string.
We can also replace words within the string.
sentence.replace("people", "citizens")
'Let us invest in our citizens without leaving them a mountain of debt.'
replace can also be used to remove text by leaving the replacement quotation marks empty.
sentence.replace(".", "")
'Let us invest in our people without leaving them a mountain of debt'
As before, this does not alter the original string. If you wanted to save the string edits, you would need to create a new variable.
edited_sentence = sentence.lower()
print(edited_sentence)
let us invest in our people without leaving them a mountain of debt.
If you were doing a series of manipulations, you could reuse a variable name, although it is best practice to keep a version of the original string in case you ever need to go back to it.
edited_sentence = sentence.lower()
print(edited_sentence)
edited_sentence = edited_sentence.replace(".", "")
print(edited_sentence)
let us invest in our people without leaving them a mountain of debt.
let us invest in our people without leaving them a mountain of debt
You can also stack multiple transformations together, although combining too many may make your code harder to follow.
sentence.replace(".", "").lower()
'let us invest in our people without leaving them a mountain of debt'
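lower and replace are only two of the string tools available. As a brief sketch, here are three more that tend to come up when cleaning text: upper, strip, and startswith. The messy string is a made-up example with extra whitespace around it.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# upper is the counterpart of lower.
print(sentence.upper())

# strip removes whitespace (spaces, tabs, line breaks) from both ends.
messy = '   Let us invest in our people.  \n'
cleaned = messy.strip()
print(cleaned)

# startswith answers a yes/no question about the beginning of the string.
print(sentence.startswith('Let'))
```

Like lower and replace, none of these change the original string.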
Your turn
Create a new string called boring that removes the exclamation marks and capitalization from the sentence “Way to go!!!”.
Slicing
If you had a very long text, such as the entire text of the State of the Union, you might only want to look at the first few characters. In Python, this is called slicing.
sentence
'Let us invest in our people without leaving them a mountain of debt.'
sentence[0:20]
'Let us invest in our'
A slice is signaled with brackets ([]). The first number is the starting position, where 0 indicates the beginning. This is followed by a colon (:) and then the end position, which, in this case, is 20. The character at the end position itself is not included. Note that this is splitting on characters, not words.
Here is a section from the middle of the string:
sentence[20:32]
' people with'
For convenience, if you omit the number before the colon, it defaults to the beginning of the string.
sentence[:40]
'Let us invest in our people without leav'
Omitting the second number defaults to the end.
sentence[40:]
'ing them a mountain of debt.'
Finally, negative numbers are interpreted as distance from the end of the string.
sentence[-20:]
' a mountain of debt.'
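Two further slicing behaviors are worth a quick sketch: negative numbers can be used for both the start and the end position, and an end position past the end of the string is quietly treated as the end. The variable names last_word and tail are made up for this example.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# Start five characters from the end, stop one character from the end.
# This drops the final period.
last_word = sentence[-5:-1]
print(last_word)

# An end position beyond the length of the string does not cause an error.
tail = sentence[60:500]
print(tail)
```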
Your turn
Create a new string called s that contains “The weather is hot and humid today.”
Find the slices for each of the following:
- “The w”
- “today.”
- “hot and humid”
The w
today.
hot and humid
Numbers
We can also count the number of characters in a string with the len function.
len(sentence)
68
In this case, Python returned an integer instead of a string. This can also be stored in a variable.
sentence_length = len(sentence)
sentence_length
68
Your turn
What is the length of “How many dogs do you own?”? Store it in a variable called sl.
Since the length of a string is a number, we can do standard math operations with it.
print(sentence_length * 3)
204
print(sentence_length / 2)
34.0
print(sentence_length + sentence_length)
136
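Notice that dividing with / returned 34.0 rather than 34. As a short sketch of what is going on: / always produces a decimal number (a float), while // divides and drops any remainder, and % gives the remainder itself.

```python
sentence_length = 68

half = sentence_length / 2          # regular division returns a float
half_whole = sentence_length // 2   # floor division returns an integer
remainder = sentence_length % 3     # what is left over after dividing by 3

print(half, type(half))
print(half_whole, type(half_whole))
print(remainder)
```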
Your turn
What is one-third the length of sl?
As with strings, these can be saved in new variables.
double_length = sentence_length + sentence_length
print(double_length)
136
The + and * operators also work with strings.
print(sentence * 2)
Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.
print(sentence + sentence)
Let us invest in our people without leaving them a mountain of debt.Let us invest in our people without leaving them a mountain of debt.
The + operator can’t be used to combine different data types, however.
print("The sentence was " + sentence_length + "characters.")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-63-9041347a8e39> in <module>
----> 1 print("The sentence was " + sentence_length + "characters.")
TypeError: can only concatenate str (not "int") to str
Conveniently, the str function will convert an integer to a string.
print("The sentence was " + str(sentence_length) + " characters.")
The sentence was 68 characters.
I had to manually include the spaces before and after sentence_length. Otherwise, it is all smushed together.
print("The sentence was" + str(sentence_length) + "characters.")
The sentence was68characters.
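An alternative that avoids both the str call and the manual spacing is a formatted string, or f-string, available in Python 3.6 and later. This is a sketch of the same message; the variable name message is made up for the example.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'
sentence_length = len(sentence)

# The f before the opening quote lets you place a variable inside {}.
# Python converts the integer to text for you.
message = f"The sentence was {sentence_length} characters."
print(message)
```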
Your turn
Print The length of the word "hippopotamus" is [x]. where [x] is the length of the word hippopotamus.
Lists
You can also split the sentence into a series of strings. By default, this splits based on spaces and other whitespace characters such as a line break (\n) or tab character (\t).
print(sentence.split())
['Let', 'us', 'invest', 'in', 'our', 'people', 'without', 'leaving', 'them', 'a', 'mountain', 'of', 'debt.']
What is returned here is a third data type (the first two were strings and integers) called a list. A list is enclosed in brackets ([]) and the items are separated by commas. In this case each item is in quotation marks because they are all strings. Items in a list, however, can be of any type.
my_list = ['Speeches', 7, 'Data']
my_list
['Speeches', 7, 'Data']
While len returned the number of characters in a string, it returns the number of items in a list.
len(my_list)
3
sentence_length = len(sentence.split())
sentence_length
13
In the second example, the list created by sentence.split() is not saved in any way; only its length is.
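split can also break a string on something other than whitespace, and join goes in the opposite direction, turning a list of strings back into one string. The sketch below uses made-up names (pieces, rejoined, header) and a made-up comma-separated line for illustration.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'

# Split on whitespace, then glue the pieces back together with single spaces.
pieces = sentence.split()
rejoined = ' '.join(pieces)
print(rejoined == sentence)

# Passing a separator to split breaks the string wherever it appears.
header = 'sex,abany,educ'
columns = header.split(',')
print(columns)
```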
Your turn
Create a list called ate that includes at least three things you ate today. Use len to count the number of items in the list.
Like strings, lists can also be sliced. The first three items of a list:
words = sentence.split()
print(words[:3])
['Let', 'us', 'invest']
We can also extract specific items from a list by their position. As with strings, positions in a list start at 0.
words[0]
'Let'
The third word:
words[2]
'invest'
The fifth word from the end:
words[-5]
'them'
The last two words:
words[-2:]
['of', 'debt.']
Your turn
Display the first two items of your ate list. What is the last item?
['apple', 'dosa']
pizza slice
Slicing a list returns a list. If you ask for the first three items, you will get a list made up of those items. In contrast, if you request a specific location, such as words[2], Python returns the specific object stored in that place, which may be a string, a number, or even an entire list.
Unlike strings, lists are mutable. That means that we can remove things from a list or, as is more frequently the case in text analysis, add things to it. Adding is done with append.
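The difference between slicing and requesting a single position is easiest to see by asking for "just the first item" both ways. This is a small sketch; type is a built-in function that reports what kind of object something is.

```python
sentence = 'Let us invest in our people without leaving them a mountain of debt.'
words = sentence.split()

# A slice always gives back a list, even when it holds only one item.
first_as_slice = words[:1]
print(first_as_slice, type(first_as_slice))

# A single position gives back the item itself, here a string.
first_as_item = words[0]
print(first_as_item, type(first_as_item))
```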
male_words = ['his', 'him', 'father']
male_words.append('brother')
print(male_words)
['his', 'him', 'father', 'brother']
Since append is changing male_words itself, we do not want to assign the result with an =. The Python interpreter is editing our original list but not returning anything.
not_going_to_work = male_words.append('brother')
print(not_going_to_work)
None
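Removing items works in a similar in-place way. As a sketch, remove deletes the first item that matches a value, and pop deletes the last item and hands it back. The list demo_words is a separate, made-up copy so that the running male_words example is not affected.

```python
demo_words = ['his', 'him', 'father', 'brother', 'brother']

# remove edits the list in place and deletes only the first match.
demo_words.remove('brother')
print(demo_words)

# pop also edits the list in place, but unlike append and remove
# it returns something: the item it took off the end.
last = demo_words.pop()
print(last)
print(demo_words)
```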
Lists can also be combined using +.
gendered_words = male_words + ['her', 'she', 'mother']
print(gendered_words)
['his', 'him', 'father', 'brother', 'brother', 'her', 'she', 'mother']
As noted above, the items in a list can include a variety of data types. This includes lists.
gendered_lists = [ male_words , ['her', 'she', 'mother'] ]
Note the two closing brackets next to each other. The first closes the list that ends with ‘mother’ while the second closes our gendered_lists.
len(gendered_lists)
2
gendered_lists has a length of two because it contains just two items, each a list of varying lengths.
print(gendered_lists)
[['his', 'him', 'father', 'brother', 'brother'], ['her', 'she', 'mother']]
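To reach an item inside one of the inner lists, you can use two sets of brackets in a row: the first picks the inner list and the second picks the item within it. This sketch uses a shortened, made-up version of the list so that it runs on its own.

```python
gendered_lists = [['his', 'him', 'father'], ['her', 'she', 'mother']]

# The second item of the outer list is itself a list.
second_list = gendered_lists[1]
print(second_list)

# Chain a second position to get an item from that inner list.
first_of_second = gendered_lists[1][0]
print(first_of_second)

# len works on the inner lists too.
print(len(gendered_lists[0]))
```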
Your turn
Add three more items to your ate list. Use append for one.
For the other two, place them in a list and then combine the two lists.
Dictionaries
A fourth useful data type is a dictionary. A dictionary is like a list in that it holds multiple items. The items in a list can be identified by their position in the list. In contrast, the values in a dictionary are associated with a keyword. The analogy here is to a physical dictionary, which has a list of unique words, each with a definition. In a Python dictionary, the entries are called keys, and the definitions, which can be any data type, are called values.
Alternatively, you can think of a dictionary as a single row of data from a dataset, where the keys are the variable names.
respondent = {'sex' : "female",
'abany' : 1,
'educ' : 'College'}
Dictionaries are surrounded by curly brackets ({}). Each entry is a pair consisting of the key, which is usually a string, followed by a colon and then the value. As in a list, entries are separated by commas.
You can access the contents of a dictionary by enclosing the key in brackets ([]).
respondent['sex']
'female'
If the key is not in the dictionary, you will get a KeyError.
respondent['gender']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-10-4da2e0df882d> in <module>
----> 1 respondent['gender']
KeyError: 'gender'
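If you are not sure whether a key exists, there are two common ways to avoid the KeyError. This sketch shows the in keyword, which checks for a key, and the get method, which returns None (or a fallback value you supply) when the key is missing. The fallback text 'not asked' is made up for the example.

```python
respondent = {'sex': 'female', 'abany': 1, 'educ': 'College'}

# in checks whether a key is present.
has_gender = 'gender' in respondent
print(has_gender)

# get returns None instead of raising an error for a missing key.
missing = respondent.get('gender')
print(missing)

# A second argument to get is used as the fallback value.
fallback = respondent.get('gender', 'not asked')
print(fallback)

# For keys that exist, get behaves like brackets.
print(respondent.get('sex'))
```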
You can inspect all the keys in a dictionary, in case you forgot or someone else made it.
respondent.keys()
dict_keys(['sex', 'abany', 'educ'])
len(respondent.keys())
3
Dictionaries are mutable, so we can change the value of existing keys, remove keys, or add new ones.
respondent['race'] = 'Black'
print(respondent)
{'sex': 'female', 'abany': 1, 'educ': 'College', 'race': 'Black'}
respondent['abany'] = 'Yes'
print(respondent)
{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black'}
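Adding and changing keys are shown above; removing a key can be sketched as follows. del deletes an entry outright, while pop deletes it and returns its value. The example works on a copy, named trimmed here for illustration, so the respondent dictionary used in the rest of this section stays intact.

```python
respondent = {'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black'}

# dict() makes a separate copy, so edits to trimmed do not touch respondent.
trimmed = dict(respondent)

# del removes the key and its value.
del trimmed['race']

# pop removes the key and hands back the value that was stored there.
educ = trimmed.pop('educ')

print(educ)
print(trimmed)
print(respondent)
```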
Your turn
Add a new key to the dictionary called age with a value of 37. Confirm that you did it correctly by displaying the value of age.
While the keys here are all strings, the values can be any data type. You could add the ages of the respondent’s children as a list.
respondent['children ages'] = [3, 5, 10]
print(respondent)
{'sex': 'female', 'abany': 'Yes', 'educ': 'College', 'race': 'Black', 'age': 37, 'children ages': [3, 5, 10]}
Spaces
Within the Python community, there are strong norms about how code should be written. Many of these are centered around making code readable, both by others and by your future self. As a trivial example, 2+2 is allowed, but it is almost always written 2 + 2. Likewise, I defined my respondent dictionary with plenty of white space in order to maximize readability.
respondent = {'sex' : "female",
'abany' : 1,
'educ' : 'College'}
This is identical to:
respondent={'sex':'female','abany':1,'educ':'College'}
but putting it all on one line obscures the logic of the dictionary. In this case, what is a key and what is a value is quite clear in the first version, while distinguishing between the two is more problematic in the single-line version.
r2 = {'sex':'male', 'abany':1, 'educ':'College' }
r3 = {'sex':'female', 'abany':0, 'educ':'High School' }
r4 = {'sex':'male', 'abany':0, 'educ':'Some College'}
respondents = [respondent, r2, r3, r4]
respondents
[{'sex': 'female', 'abany': 1, 'educ': 'College'},
{'sex': 'male', 'abany': 1, 'educ': 'College'},
{'sex': 'female', 'abany': 0, 'educ': 'High School'},
{'sex': 'male', 'abany': 0, 'educ': 'Some College'}]
This now looks a lot like the common data format JSON!
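The resemblance is close enough that Python's built-in json module can convert a list of dictionaries like this one to JSON text and back. The following is a minimal sketch; the variable names as_text and round_trip are made up for the example.

```python
import json

respondents = [
    {'sex': 'female', 'abany': 1, 'educ': 'College'},
    {'sex': 'male', 'abany': 1, 'educ': 'College'},
    {'sex': 'female', 'abany': 0, 'educ': 'High School'},
    {'sex': 'male', 'abany': 0, 'educ': 'Some College'},
]

# dumps turns the Python objects into a single JSON-formatted string.
as_text = json.dumps(respondents, indent=2)
print(as_text)

# loads reads JSON text and rebuilds the lists and dictionaries.
round_trip = json.loads(as_text)
print(round_trip == respondents)
```

One visible difference is that JSON always uses double quotes around strings.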
Loops
A for loop runs the indented code once for each item in a list.
for person in respondents:
    print(person['educ'])
College
College
High School
Some College
for item in [1, 2, 'bobcat']:
    print(item)
1
2
bobcat
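Loops are often combined with append to build up a new list one item at a time. As a sketch, this collects each respondent's education level into a list; educ_levels is a name made up for the example.

```python
respondents = [
    {'sex': 'female', 'abany': 1, 'educ': 'College'},
    {'sex': 'male', 'abany': 1, 'educ': 'College'},
    {'sex': 'female', 'abany': 0, 'educ': 'High School'},
    {'sex': 'male', 'abany': 0, 'educ': 'Some College'},
]

# Start with an empty list, then add one value per trip through the loop.
educ_levels = []
for person in respondents:
    educ_levels.append(person['educ'])

print(educ_levels)
```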
Your turn
Loop over the items in your ate list. For each item, print its length.
Functions
For those who come from a Stata or R background, one of the more striking aspects of Python code is the frequency of user-defined functions. They are deployed not just for things where you think there should be a function, like counting words in a sentence, but also for highly custom situations, such as scraping the contents of a particular web page. This style of programming, with many small functions, tends to make code more readable and easier to debug than code written in a more traditional social science style.
A standard function has three parts. First, the function is named and defined. The subsequent line or lines do the actual work. Finally, the results are returned.
A trivial function that returns the string Hello! might look like:
def make_hello():
    word = 'Hello!'
    return word
Of note, def signals that you are defining a function. This is followed by the name of the function, in this case make_hello. Since this function doesn’t take any arguments, such as a variable to modify or any options, the name is followed by an empty (). The first line ends with a colon.
All subsequent lines are indented. The second line creates a new string variable called word which contains Hello!. The third and final line of the function returns the value stored in word.
make_hello()
'Hello!'
More commonly in text analysis, a user-defined function modifies an existing string. In this case, the variable name that will be used within the function is established within the parentheses on the opening line.
A second trivial function takes a text string and returns an all-caps version.
def scream(text):
    text_upper = text.upper()
    return text_upper
scream('Hi there!')
'HI THERE!'
The text and text_upper variables only exist within the function. That means that you can pass a variable not called text to the function.
scream(sentence)
'LET US INVEST IN OUR PEOPLE WITHOUT LEAVING THEM A MOUNTAIN OF DEBT.'
It also means that everything but the returned text disappears.
text_upper
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-37-9374ae9251ac> in <module>
----> 1 text_upper
NameError: name 'text_upper' is not defined
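The same separation works in the other direction: a variable outside the function is not changed just because the function uses the same name internally. In this sketch, the outside variable text and the name result are made up for the example.

```python
def scream(text):
    text_upper = text.upper()
    return text_upper

# A variable named text that lives outside the function.
text = 'outside'

# Inside the call, text temporarily refers to 'Hi there!'.
result = scream('Hi there!')

print(result)
# The outside variable still holds its original value.
print(text)
```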
It is a good idea to include a short description, called a docstring, as the first line inside the function. This is helpful for other people reading your code and for you when you return to your own code days or months later.
def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper
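A string placed as the first statement of a function is more than a note to the reader: Python stores it with the function. As a brief sketch, you can read it back through the function's __doc__ attribute, and the built-in help function displays it as well.

```python
def scream(text):
    '''Returns an all-caps version of text string.'''
    text_upper = text.upper()
    return text_upper

# The description travels with the function.
print(scream.__doc__)

# The function itself behaves exactly as before.
print(scream('Hi there!'))
```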
Your turn
Make a function called whisper that replaces all exclamation marks with a period and returns a lower case version of a string. Test it out.