Classification with scikit learn

Faking it as a Data Scientist

Supervised learning / Classification
Logistic Regression
Naive Bayes
SVM
Ensemble/Random Forest
Unsupervised learning
PCA
Topic Modeling
Features
Train/test

Load the data
Tranform variables
Select model and parameters
Train the model
Test the model
GOTO 2

Input our data from before

In [4]:

upworthy_titles = open('upworthy_titles.txt', "r").read()

In [5]:

upworthy_titles[:500]

Out[5]:

"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\nOne Of The Biggest Lies We\xe2\x80\x99re Encouraged To Tell Ourselves About Our Value To Society Is Right Here\nHe Was About To Take His Own Life \xe2\x80\x94 Until A Man Stopped Him. Here He Meets Him Face To Face Again.\nWhen I Was A Kid, An Ad Aired On"

I hate unicode

20 minutes of Googling later

In [6]:

upworthy_titles = open('upworthy_titles.txt', "r").read()

upworthy_titles = upworthy_titles.decode('utf-8')
upworthy_titles = upworthy_titles.encode('ascii', 'ignore')

In [7]:

print len(upworthy_titles)

upworthy_titles[:500]

Out[7]:

"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\nOne Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here\nHe Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.\nWhen I Was A Kid, An Ad Aired On TV Th"

In [8]:

upworthy_titles = upworthy_titles.splitlines()

print len(upworthy_titles)

upworthy_titles[:20]

Out[8]:

["5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her",
 "The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'",
 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins',
 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here',
 'He Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.',
 "When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.",
 "These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means",
 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?',
 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know',
 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.',
 "If You Ever Wanted The Great Novels Explained To You By Your Thug Friend, Now's Your Chance",
 "Seems Like Any Other High School, Right? So What's The Major Thing Missing From These Pictures?",
 'Every Person On Earth Is Supposed To Have These. But What Are They Exactly?',
 'Jon Stewart Delivers One Of The Best Interviews In Recent Memory',
 'If You Give A Bride A Beautiful Set Of Bone China, You Set Her Table For A Day',
 'A High School In A Poor Neighborhood Closed Down. These Folks Reopened It And Kicked Some Butt. How?',
 'The Super Bowl Ad That You Should See If You Think Little Girls Can Become Epic Rocket Scientists',
 "The NFL Would Never Let This Ad Air On The Super Bowl, So We're Gonna Show You It. It's Important.",
 'The Simple Way A Developing Country Managed To Decrease The Number Of Babies Who Died By Almost 30%',
 'When Theres Nothing Scarier Than A Room Full Of White Men Laughing']

In [9]:

times_titles = open('times_titles.txt', 'rb').read()

times_titles = times_titles.decode('utf-8')
times_titles = times_titles.encode('ascii', 'ignore')

times_titles = times_titles.splitlines()

I might think about wrapping this in a function

In [10]:

def import_titles(filename):
    ''' Imports text file, 
        cleans up some unicode mess, and 
        splits by line. '''
    titles = open(filename, 'rb').read()

    titles = titles.decode('utf-8')
    titles = titles.encode('ascii', 'ignore')

    return times_titles.splitlines()

What we want the data to look like next

upworthy	titles
1	He Was About To Take His Own Life — Until A Man Stopped Him. Here He Meets Him Face To Face Again
0	CVS Pharmacy to Stop Selling Tobacco Goods by October
0	Twitter’s Share Price Falls After It Reports a Loss
1	A 16-Year-Old Explains Why Everything You Thought You Knew About Beauty May Be Wrong. With Math.

In [11]:

#Set up placehold lists
upworthy = []
titles   = []

#Go through all the upworthy articles
for title in upworthy_titles:
    #add title to title list
    titles.append(title)
    #add 1 to Y list
    upworthy.append(1)

for title in times_titles:
    titles.append(title)
    upworthy.append(0)

In [12]:

print len(upworthy)
print upworthy[:5]

1193
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [13]:

print len(titles)
print titles[:5]

1193
["5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her", "The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'", 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins', 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here', 'He Was About To Take His Own Life  Until A Man Stopped Him. Here He Meets Him Face To Face Again.', "When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.", "These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means", 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?', 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know', 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.']

I don't like that the X and the Y are in two unrelated lists. Matches cases depends on the sort order, so don't mess that up.

What we want the data to look like next

upworthy	stop	man	obama	explain	you	nato	debate	industry	believe
1	0	1	0	0	2	0	0	0	1
0	1	0	0	0	0	1	1	0	0
0	0	0	1	1	0	0	0	1	0
1	0	0	0	0	1	0	1	0	0

In [14]:

from sklearn.feature_extraction.text import CountVectorizer

In [83]:

vectorizer = CountVectorizer(lowercase     = True ,
                             min_df        = 2 ,
                             max_df        = .5 ,
                             ngram_range   = (1, 1),
                             stop_words    = 'english', 
                             strip_accents = 'unicode')

In [84]:

vectorizer.fit(titles)

Out[84]:

CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents='unicode', token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [85]:

X = vectorizer.transform(titles)

In [86]:

from sklearn.naive_bayes import MultinomialNB

In [87]:

clf = MultinomialNB()

In [88]:

clf.fit(X,upworthy)

Out[88]:

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [89]:

y_hat = clf.predict(X)

In [90]:

y_hat[:20]

Out[90]:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

What % were correctly predicted?

In [91]:

clf.score(X, upworthy)

Out[91]:

0.91701592623637884

Better fit statistics

In [92]:

from sklearn import metrics

In [93]:

metrics.confusion_matrix(upworthy, y_hat)

Out[93]:

array([[416,  84],
       [ 15, 678]])

Sociologists care about coeficients, but they are a tucked away.

In [94]:

clf.coef_

Out[94]:

array([[-6.71921442, -6.31374931, -7.81782671, ..., -7.12467953,
        -7.4123616 , -7.4123616 ]])

In [95]:

len(clf.coef_[0])

Out[95]:

In [96]:

len(vectorizer.vocabulary_)

Out[96]:

I know what we can do.

In [97]:

coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])
coefficients = {h[0] : h[1] for h in coefficients}

for word in sorted(coefficients, key = coefficients.get, reverse=True)[:20]:
    print word,coefficients[word]

nocturnalist -4.55973017302
rest -4.63977288069
columnist -4.70431140183
debate -5.01446633014
drunk -5.01446633014
biggest -5.01446633014
industry -5.14367806162
nato -5.17876938143
severe -5.17876938143
need -5.2151370256
rebels -5.2151370256
disposing -5.29209806673
weeks -5.33292006125
learning -5.33292006125
head -5.33292006125
bump -5.37547967567
abortions -5.46645145388
world -5.46645145388
million -5.51524161805
000 -5.51524161805

See anything different in this code?

In [98]:

coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])
coefficients = {h[0] : h[1] for h in coefficients}

for word in sorted(coefficients, key = coefficients.get, reverse=False)[:20]:
    print word,coefficients[word]

ron -8.5109738916
leads -8.5109738916
glimpse -8.5109738916
young -8.5109738916
james -8.5109738916
detention -8.5109738916
fat -8.5109738916
cool -8.5109738916
dating -8.5109738916
immediately -8.5109738916
die -8.5109738916
lapse -8.5109738916
pretend -8.5109738916
second -8.5109738916
street -8.5109738916
7th -8.5109738916
saved -8.5109738916
goes -8.5109738916
seeks -8.5109738916
box -8.5109738916

Data scientists care about over fitting. We probably should too.

In [99]:

from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, upworthy, test_size=0.3)

In [100]:

clf.fit(X_train, y_train)

y_test_pred = clf.predict(X_test)

In [101]:

y_train_pred = clf.predict(X_train)

print "classification accuracy:", metrics.accuracy_score(y_train, y_train_pred)
metrics.confusion_matrix(y_train, y_train_pred)

classification accuracy: 0.925748502994

Out[101]:

array([[303,  57],
       [  5, 470]])

In [102]:

y_test_pred = clf.predict(X_test)

print "classification accuracy:", metrics.accuracy_score(y_test, y_test_pred)
metrics.confusion_matrix(y_test, y_test_pred)

classification accuracy: 0.821229050279

Out[102]:

array([[ 99,  41],
       [ 23, 195]])

In [103]:

from sklearn import cross_validation
import numpy as np

In [104]:

cross_validation.cross_val_score(clf, X, np.array(upworthy),  cv=10)

Out[104]:

array([ 0.81666667,  0.81666667,  0.86666667,  0.79831933,  0.81512605,
        0.83193277,  0.84033613,  0.79831933,  0.77310924,  0.85714286])

In [105]:

np.mean(cross_validation.cross_val_score(clf, X, np.array(upworthy),  cv=10))

Out[105]:

0.8214285714285714

Do we do any better by looking at bigrams? Go back up and change the values for CountVectorizer.

What about a new headline?

In [106]:

soc_title = 'Educational Assortative Mating and Earnings Inequality in the United States'

In [107]:

soc_title_vector = vectorizer.transform([soc_title])

In [110]:

clf.predict_proba(soc_title_vector)

Out[110]:

array([[ 0.94490128,  0.05509872]])

In [111]:

gawker_title = 'Shocking Footage Aired of Police Shooting Face-Eating Nude Man'
gawker_title_vector = vectorizer.transform([gawker_title])
clf.predict_proba(gawker_title_vector)

Out[111]:

array([[ 0.38060745,  0.61939255]])

Your turn. Take the file abstracts.csv has 3,232 abstracts, divided evenly between those that were published in sociology journals and those that were presented at the ASAs.

After importing the data, develop a model that categorizes each, without overfitting.

Note: Copy and pasting code is your friend!!!

Make a prediction for this out of sample abstract:

This article investigates how changes in educational assortative mating affected the growth in earnings inequality among households in the United States between the late 1970s and early 2000s. The authors find that these changes had a small, negative effect on inequality: there would have been more inequality in earnings in the early 2000s if educational assortative mating patterns had remained as they were in the 1970s. Given the educational distribution of men and women in the United States, educational assortative mating can have only a weak impact on inequality, and educational sorting among partners is a poor proxy for sorting on earnings.