Counting Words
This section introduces the basics of all text analysis: turning words into numbers.
Begin by enabling plots to be displayed in the Jupyter notebook and importing the pandas library.
%matplotlib inline
import pandas as pd
The sample texts for this section are presidential State of the Union addresses. These were collected by Constantine Lignos and made available on his GitHub page.
Since CSV files aren’t well suited to holding long text fields with line breaks, quotation marks, and commas (which might be interpreted by a CSV reader as the start of a new field), I stored the addresses in a JSON file. This can be read by pandas and converted into a data frame.
sotu_df = pd.read_json('data/sotu.json')
As before, info, head and tail give a sense of the data.
sotu_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 230 entries, 0 to 229
Data columns (total 3 columns):
sotu_president 230 non-null object
sotu_text 230 non-null object
sotu_year 230 non-null int64
dtypes: int64(1), object(2)
memory usage: 7.2+ KB
There are three columns with no missing data. sotu_year is the only numeric field.
sotu_df.head()
| | sotu_president | sotu_text | sotu_year |
|---|---|---|---|
| 0 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1790 |
| 1 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1791 |
| 2 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1792 |
| 3 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1793 |
| 4 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1794 |
A specific row can be referenced using its index through iloc.
sotu_df.iloc[227]
sotu_president Barack_Obama
sotu_text Mr. Speaker, Mr. Vice President, Members of Co...
sotu_year 2016
Name: 227, dtype: object
Here’s what the first 800 characters of the sotu_text field look like for that row.
print(sotu_df.iloc[227]['sotu_text'][:800])
Mr. Speaker, Mr. Vice President, Members of Congress, my fellow
Americans:
Tonight marks the eighth year I have come here to report on the State of
the Union. And for this final one, I am going to try to make it shorter.
I know some of you are antsy to get back to Iowa.
I also understand that because it is an election season, expectations for
what we will achieve this year are low. Still, Mr. Speaker, I appreciate
the constructive approach you and the other leaders took at the end of
last year to pass a budget and make tax cuts permanent for working
families. So I hope we can work together this year on bipartisan
priorities like criminal justice reform, and helping people who are
battling prescription drug abuse. We just might surprise the cynics
again.
But tonight, I want to go easy on
Word Count
One of the simplest things one might want to know about a text is how long it is. This could be useful for basic descriptive questions, such as “Have State of the Union addresses increased in length over time?” Word counts are also useful for normalizing texts: a 200-word essay with 10 exclamation marks is quite different from a 20,000-word essay with 10 exclamation marks.
Unfortunately, while there are many useful libraries for text analysis in Python, there isn’t a simple way to use any of them to get word counts. Luckily, it is fairly straightforward to build one using Python’s string methods.
Start by storing a sample sentence as a string called sentence.
sentence = "Our allies will find that America is once again ready to lead."
By default, the string split method divides a string by spaces, tabs, and line breaks. It returns a list.
sentence.split()
['Our',
'allies',
'will',
'find',
'that',
'America',
'is',
'once',
'again',
'ready',
'to',
'lead.']
The built-in function len returns the number of items in a list. Combining the two creates a one-line word counter.
len(sentence.split())
12
Since we might want to use this word-counting technique in a variety of circumstances, it is convenient to place it in a small function.
def word_count(text_string):
    return len(text_string.split())
Since you might read your own code months later and think, “What was I trying to do?”, it is best practice to document your functions.
def word_count(text_string):
    '''Calculate the number of words in a string'''
    return len(text_string.split())
word_count(sentence)
12
It worked on the simple example, but it is helpful to check the function's robustness using a more complex example, such as one that includes tabs and line breaks.
tricky_sentence = 'This\thas tabs and\nline breaks.'
print(tricky_sentence)
This has tabs and
line breaks.
word_count(tricky_sentence)
6
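Note that `split` keeps punctuation attached to its neighboring word, which is fine for counting but worth knowing about. A quick self-contained check (re-defining `word_count` so the snippet runs on its own, with a made-up sentence):

```python
def word_count(text_string):
    '''Calculate the number of words in a string'''
    return len(text_string.split())

# Punctuation stays glued to the adjacent token, so 'Stop!' is one "word".
s = 'Stop! Stop, I say.'
print(s.split())      # ['Stop!', 'Stop,', 'I', 'say.']
print(word_count(s))  # 4
```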
Now that the function exists, it can be applied to the text variable, sotu_text, to create a new variable with the number of words in each address.
sotu_df['sotu_word_count'] = sotu_df['sotu_text'].apply(word_count)
describe, hist, and scatter can provide some information on the new variable.
sotu_df['sotu_word_count'].describe()
count 230.000000
mean 7729.513043
std 5417.866953
min 1372.000000
25% 4018.250000
50% 6149.500000
75% 9517.500000
max 33564.000000
Name: sotu_word_count, dtype: float64
sotu_df['sotu_word_count'].hist(bins = 25)
(figure: histogram of word counts across the 230 addresses)
sotu_df.plot.scatter(y='sotu_word_count',
x='sotu_year')
(figure: scatter plot of address word count by year)
The scatter plot suggests there were two periods of increasing length, although address word count stabilized somewhat in the early part of the 20th century. There are also two historical anomalies, around 1946 and 1981.
The sort_values method combined with head can list the shortest and longest addresses.
sotu_df.sort_values(by='sotu_word_count').head(10)
| | sotu_president | sotu_text | sotu_year | sotu_word_count |
|---|---|---|---|---|
| 10 | John_Adams | Gentlemen of the Senate and Gentlemen of the H... | 1800 | 1372 |
| 0 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1790 | 1403 |
| 9 | John_Adams | Gentlemen of the Senate and Gentlemen of the H... | 1799 | 1505 |
| 184 | Richard_Nixon | To the Congress of the United States:\n\nThe t... | 1973 | 1655 |
| 19 | James_Madison | Fellow-Citizens of the Senate and House of Rep... | 1809 | 1831 |
| 3 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1793 | 1965 |
| 5 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1795 | 1986 |
| 7 | John_Adams | Gentlemen of the Senate and Gentlemen of the H... | 1797 | 2057 |
| 2 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1792 | 2098 |
| 14 | Thomas_Jefferson | The Senate and House of Representatives of the... | 1804 | 2101 |
sotu_df.sort_values(by='sotu_word_count', ascending = False).head(10)
| | sotu_president | sotu_text | sotu_year | sotu_word_count |
|---|---|---|---|---|
| 192 | Jimmy_Carter | To the Congress of the United States:\n\nThe S... | 1981 | 33564 |
| 155 | Harry_S._Truman | To the Congress of the United States:\n\nA qua... | 1946 | 27722 |
| 117 | Theodore_Roosevelt | To the Senate and House of Representatives:\n\... | 1907 | 27382 |
| 122 | William_H._Taft | PART I\n\nTo the Senate and House of Represent... | 1912 | 25150 |
| 115 | Theodore_Roosevelt | To the Senate and House of Representatives:\n\... | 1905 | 25033 |
| 121 | William_H._Taft | PART I\n\nThis message is the first of several... | 1911 | 23704 |
| 116 | Theodore_Roosevelt | To the Senate and House of Representatives:\n\... | 1906 | 23575 |
| 58 | James_Polk | Fellow-Citizens of the Senate and of the House... | 1848 | 21292 |
| 108 | William_McKinley | To the Senate and House of Representatives:\n\... | 1898 | 20208 |
| 95 | Grover_Cleveland | To the Congress of the United States:\n\nYour ... | 1885 | 19746 |
Presidents Carter and Truman are the culprits.
Your turn
In the data folder, there is a file called "trump_ge_speeches.json" which contains the general election speeches of Donald Trump. What is the average (median) number of words in one of his speeches?

#### Word frequencies

Word frequencies are the backbone of almost all text analysis. From topic models to text classification, counting how often certain words occur is a critical step in quantifying texts. While it is certainly possible to compute word frequencies using your own functions, that is usually unnecessary, as many Python libraries can compute word frequencies for you.

A word-frequency tool turns a sentence like:

```America is strong, America is proud, and America is free.```

into something like:

| america | and | free | is | proud | strong |
|:-------:|-----|------|----|-------|--------|
| 3 | 1 | 1 | 3 | 1 | 1 |

or, alternatively, into a long format:

|word|freq|
|---|---|
|america|3|
|and|1|
|free|1|
|is|3|
|proud|1|
|strong|1|

If you don't really care about which specific words are in a text, but are mostly using them for subsequent statistical analysis, you'll likely want the first, wide format. Here each text is a row and each word a variable. This is the modal format for how text is interpreted as numbers. Alternatively, if you want to know which specific words, or types of words, are most common, you might favor the second, long format. In either case, note that the ordering of words in the original sentence is lost.

With few notable exceptions, analysts take what is called a bag-of-words approach. This simplifying assumption, that word order doesn't really matter, has two things going for it. First, it is computationally much easier to assume that the order of words in a sentence doesn't matter. Second, the results, as you will see, are often pretty solid. Fields like sociology, which is premised on the idea that individuals are shaped by their surroundings, frequently analyze individual-level survey data to great success. Bag-of-words is like that: we know context matters, but modeling strategies that ignore it can still provide fairly good estimates.

The scikit-learn library includes a fairly flexible tool for computing word frequencies. In data science lingo, variables are 'features', so `CountVectorizer` is categorized as a tool for text feature extraction.

{:.input_area}
```python
from sklearn.feature_extraction.text import CountVectorizer
```

First, we set the parameters. This is where we assign rules, such as whether or not all words should be converted to lower case and whether very rare or very common words should be excluded. By default, `CountVectorizer` includes all words, strips punctuation, and converts to lower case. To keep these defaults, we don't need to state them, but code is often more readable when it is explicit.

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase     = True,
                             token_pattern = r'(?u)\b\w+\b',
                             max_df        = 1.0,
                             min_df        = 0.0)
```

We start off with some sample sentences.

{:.input_area}
```python
sentences = ['America is strong and America is proud.',
             'America is free.']
```

First, we build the vocabulary with `fit`. Critically, this does not produce any word counts; it only builds the list of words that will be counted in subsequent passes. The `vectorizer` expects a list-like object (think of a variable with values for each case).

{:.input_area}
```python
vectorizer.fit(sentences)
```

{:.output .output_data_text}
```
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
        vocabulary=None)
```

After `fit`, the `vectorizer` returns the full set of parameters used to build the vocabulary list. The vocabulary itself is in `get_feature_names`.
{:.input_area}
```python
vectorizer.get_feature_names()
```

{:.output .output_data_text}
```
['america', 'and', 'free', 'is', 'proud', 'strong']
```

The second step is to `transform` a group of texts into an array based on the rules and vocabulary of the `vectorizer`.

{:.input_area}
```python
wf_array = vectorizer.transform(sentences)
```

Rather than a standard array/dataframe, `transform` returns a sparse matrix. This is because most texts will not include most words, so a normal dataset would be filled with zeroes, which would be fairly inefficient. In contrast, a sparse matrix stores only the cells with non-zero values: each row gives the coordinates of a cell (row, column) followed by its value. In practice, you rarely need to look at one, but because you often need to convert them to normal arrays, it is useful to know they exist.

{:.input_area}
```python
print(wf_array)
```

{:.output .output_stream}
```
  (0, 0)	2
  (0, 1)	1
  (0, 3)	2
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 2)	1
  (1, 3)	1
```

Sparse matrices can be converted to dense ones with the `todense` method.

{:.input_area}
```python
wf_array.todense()
```

{:.output .output_data_text}
```
matrix([[2, 1, 0, 2, 1, 1],
        [1, 0, 1, 1, 0, 0]])
```

This can be read by pandas and converted into a data frame.

{:.input_area}
```python
pd.DataFrame(wf_array.todense())
```
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 |
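The integer column names above aren't informative; the vocabulary learned by the vectorizer can be passed as the `columns` argument to name them, which is presumably how the next table was produced. A self-contained sketch (`fit_transform` combines the two steps; the `wf_df` name is my own):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['America is strong and America is proud.',
             'America is free.']

vectorizer = CountVectorizer()
wf_array = vectorizer.fit_transform(sentences)  # fit + transform in one step

try:  # newer scikit-learn versions rename this method
    names = vectorizer.get_feature_names_out()
except AttributeError:
    names = vectorizer.get_feature_names()

# Use the vocabulary as column names.
wf_df = pd.DataFrame(wf_array.toarray(), columns=names)
print(wf_df)
```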
| | america | and | free | is | proud | strong |
|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| | america | and | free | is | proud | strong | sentence |
|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | 0 | 2 | 1 | 1 | America is strong and America is proud. |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | America is free. |
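The last table also carries a sentence column alongside the counts. Assuming it was made by assigning the original list of sentences back onto the frequency dataframe (the simplest route), a sketch:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['America is strong and America is proud.',
             'America is free.']

vectorizer = CountVectorizer()
wf_array = vectorizer.fit_transform(sentences)

try:  # newer scikit-learn versions rename this method
    names = vectorizer.get_feature_names_out()
except AttributeError:
    names = vectorizer.get_feature_names()

wf_df = pd.DataFrame(wf_array.toarray(), columns=names)
wf_df['sentence'] = sentences  # keep the source text alongside its counts
print(wf_df)
```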
Your turn
Use a vectorizer and dataframe to find the most common word in these sentences.
{:.input_area}
```python
seuss_sen = ['This one has a little star.',
             'This one has a little car.',
             'Say!',
             'What a lot of fish there are.']
```
We can now fit our vectorizer on a different set of data, the State of the Union addresses. Fortunately, scikit-learn functions can read data directly from a pandas dataframe.
{:.input_area}
```python
vectorizer.fit(sotu_df['sotu_text'])
```
{:.output .output_data_text}
```
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=0.0,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w+\\b', tokenizer=None,
vocabulary=None)
```
This rebuilt our vocabulary list, which is now much longer.
{:.input_area}
```python
len(vectorizer.get_feature_names())
```
{:.output .output_data_text}
```
24989
```
We can use a slice to examine an arbitrary section of the vocabulary.
{:.input_area}
```python
vectorizer.get_feature_names()[12501:12510]
```
{:.output .output_data_text}
```
['installing',
'installment',
'installments',
'instance',
'instances',
'instant',
'instantaneous',
'instantly',
'instead']
```
As before, we next `transform` our corpus into an array using the `vectorizer`. Note that while we are fitting and transforming on the same data, this doesn't have to be the case.
{:.input_area}
```python
frequency_array = vectorizer.transform(sotu_df['sotu_text'])
```
{:.input_area}
```python
frequency_array
```
{:.output .output_data_text}
```
<230x24989 sparse matrix of type '<class 'numpy.int64'>'
with 400529 stored elements in Compressed Sparse Row format>
```
`frequency_array` has 400,529 cells that are not empty, which works out to about 7% of the 5,747,470 possible cells (230 speeches coded for the frequency of 24,989 words).
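The 7% figure is just the ratio of stored elements to total cells (on a scipy sparse matrix, that is `nnz` divided by the product of the shape). Verified with plain arithmetic on the numbers above:

```python
stored = 400_529          # non-empty cells reported by the sparse matrix
rows, cols = 230, 24_989  # speeches x vocabulary size
density = stored / (rows * cols)
print(f'{density:.1%}')   # 7.0%
```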
As before, this array can be turned into a data frame. This time, we supply an index from the index of the original data frame (`sotu_df`) in case we want to link records later on.
{:.input_area}
```python
word_freq_df = pd.DataFrame(frequency_array.toarray(),
columns = vectorizer.get_feature_names(),
index = sotu_df.index)
```
{:.input_area}
```python
word_freq_df.head()
```
| | 0 | 00 | 000 | 0000 | 0001 | 001 | 002 | 003 | 004 | 005 | ... | zimbabwe | zimbabwean | zinc | zion | zollverein | zone | zones | zoological | zooming | zuloaga |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 24989 columns
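The table below combines the original metadata columns with word-frequency columns. (Note it has only 10,910 word columns, so a more restrictive vectorizer was evidently used; that code is not shown here.) The general pattern is a `concat` on the shared index; a toy sketch with made-up stand-in data:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for sotu_df; the real data comes from data/sotu.json.
df = pd.DataFrame({'sotu_year': [1790, 1791],
                   'sotu_text': ['America is strong and America is proud.',
                                 'America is free.']})

vectorizer = CountVectorizer()
freq = vectorizer.fit_transform(df['sotu_text'])

try:  # newer scikit-learn versions rename this method
    names = vectorizer.get_feature_names_out()
except AttributeError:
    names = vectorizer.get_feature_names()

word_freq_df = pd.DataFrame(freq.toarray(), columns=names, index=df.index)

# Joining on the shared index lines each row of counts up with its speech.
combined_df = pd.concat([df, word_freq_df], axis=1)
print(combined_df.columns.tolist())
```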
| | sotu_president | sotu_text | sotu_year | sotu_word_count | 00 | 000 | 002 | 007 | 009 | 01 | ... | youthful | youths | yukon | zeal | zealand | zealous | zealously | zero | zone | zones |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1790 | 1403 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1791 | 2303 | 0 | 5 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1792 | 2098 | 0 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1793 | 1965 | 0 | 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | George_Washington | Fellow-Citizens of the Senate and House of Rep... | 1794 | 2915 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 10910 columns
Your turn
What were the most common words used by Donald Trump in his speeches?

An alternate strategy to removing words entirely is to construct weights based on how frequently a word occurs in a particular document compared to how frequently it appears in other documents. A word like "of" would score low in every document, because it is commonly found and used in similar proportions everywhere. A word like "America" might be used in every text, but some addresses might use it much more frequently, in which case it would score high in just those instances. Finally, the highest scores would be associated with a word like "terrorism", which is not only rare overall, but appears frequently in the few addresses that include it.

The most common algorithm for this sort of word weight is called term frequency-inverse document frequency, or tf-idf. The first factor, term frequency, is how often a word occurs in a document, divided by the number of words in the document. This is multiplied by the inverse document frequency: the natural log of the total number of documents divided by the number of documents containing the term.

Scikit-learn's `TfidfVectorizer` can be used to compute tf-idfs with syntax identical to `CountVectorizer`.

{:.input_area}
```python
from sklearn.feature_extraction.text import TfidfVectorizer
```

{:.input_area}
```python
tfidf_vectorizer = TfidfVectorizer()
```

Returning to our two sample sentences:

{:.input_area}
```python
tfidf_vectorizer.fit(sentences)
tfidf_array = tfidf_vectorizer.transform(sentences)
```

{:.input_area}
```python
pd.DataFrame(tfidf_array.todense(),
             columns = tfidf_vectorizer.get_feature_names())
```
| | america | and | free | is | proud | strong |
|---|---|---|---|---|---|---|
| 0 | 0.535941 | 0.376623 | 0.000000 | 0.535941 | 0.376623 | 0.376623 |
| 1 | 0.501549 | 0.000000 | 0.704909 | 0.501549 | 0.000000 | 0.000000 |
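To make the formula concrete, here is a hand-rolled classic tf-idf. (Scikit-learn's `TfidfVectorizer` smooths the idf term and L2-normalizes each row, so the numbers in the table above differ from this plain version.)

```python
import math

docs = [['america', 'is', 'strong', 'and', 'america', 'is', 'proud'],
        ['america', 'is', 'free']]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)         # share of the document's words
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return tf * math.log(len(docs) / df)    # idf = ln(N / df)

print(tf_idf('free', docs[1], docs))     # rare word, positive weight
print(tf_idf('america', docs[1], docs))  # in every document: idf = ln(1) = 0
```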
Your turn
Create a tf-idf dataframe from the Trump speeches.

The `sum` method that was used with word frequencies to find the most important words is less informative here. Tf-idf is meant to identify important words within a text, rather than the most important words in an entire corpus, so the meaning of summing the values across multiple texts is not apparent. However, we can use the scores to identify the most uniquely informative words within a given text.

The most useful way to analyze the most informative words (based on tf-idf) is to switch the dataframe from wide, where each case is a speech, to long, where each row is a word-case pair. For our two-sentence example, the result would look like this:

| sentence | word | value |
|-|-|-|
| 0 | america | 0.535941 |
| 0 | is | 0.535941 |
| 0 | and | 0.376623 |
| 0 | proud | 0.376623 |
| 0 | strong | 0.376623 |
| 0 | free | 0.000000 |
| 1 | free | 0.704909 |
| 1 | america | 0.501549 |
| 1 | is | 0.501549 |
| 1 | and | 0.000000 |
| 1 | proud | 0.000000 |
| 1 | strong | 0.000000 |

The first sentence is about what "america" "is", while the second is about being "free".

In pandas, the `melt` function is used to convert a dataframe from wide to long. The first parameter is the dataframe. `id_vars` is the variable or variables that will be used to identify the cases; in this case, that is `sotu_year`, which uniquely identifies each row. `value_vars` are the variables to be transposed; here, we only want the word variables, which are still stored in the `tfidf_vectorizer.get_feature_names()` list.

{:.input_area}
```python
df_long = pd.melt(df_combined,
                  id_vars    = 'sotu_year',
                  value_vars = tfidf_vectorizer.get_feature_names())
```

{:.input_area}
```python
df_long.head()
```
| | sotu_year | variable | value |
|---|---|---|---|
| 0 | 1790 | 00 | 0.0 |
| 1 | 1791 | 00 | 0.0 |
| 2 | 1792 | 00 | 0.0 |
| 3 | 1793 | 00 | 0.0 |
| 4 | 1794 | 00 | 0.0 |
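The same wide-to-long reshaping can be checked on a toy frame (the names and numbers here are illustrative, not from the corpus):

```python
import pandas as pd

wide = pd.DataFrame({'sotu_year': [1790, 1791],
                     'america':   [0.54, 0.50],
                     'free':      [0.00, 0.70]})

long_df = pd.melt(wide, id_vars='sotu_year',
                  value_vars=['america', 'free'])
print(long_df)  # four rows: one per (year, word) pair
```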
| variable | sotu_year | value |
|---|---|---|
| 00 | 1880 | 0.020108 |
| 00 | 1881 | 0.306885 |
| 00 | 1882 | 0.452874 |
| 00 | 1883 | 0.458578 |
| 00 | 1884 | 0.015407 |