Classification with scikit learn
Ensemble/Random Forest
Train/test
Input our data from before
upworthy_titles = open('upworthy_titles.txt', "r").read()
upworthy_titles[:500]
"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\nOne Of The Biggest Lies We\xe2\x80\x99re Encouraged To Tell Ourselves About Our Value To Society Is Right Here\nHe Was About To Take His Own Life \xe2\x80\x94 Until A Man Stopped Him. Here He Meets Him Face To Face Again.\nWhen I Was A Kid, An Ad Aired On"
I hate unicode
20 minutes of Googling later
upworthy_titles = open('upworthy_titles.txt', "r").read()
upworthy_titles = upworthy_titles.decode('utf-8')
upworthy_titles = upworthy_titles.encode('ascii', 'ignore')
print len(upworthy_titles)
upworthy_titles[:500]
56210
"5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her\nThe Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'\nWhen A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins\nOne Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here\nHe Was About To Take His Own Life Until A Man Stopped Him. Here He Meets Him Face To Face Again.\nWhen I Was A Kid, An Ad Aired On TV Th"
upworthy_titles = upworthy_titles.splitlines()
print len(upworthy_titles)
upworthy_titles[:20]
693
["5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her", "The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'", 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins', 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here', 'He Was About To Take His Own Life Until A Man Stopped Him. Here He Meets Him Face To Face Again.', "When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.", "These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means", 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?', 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know', 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.', "If You Ever Wanted The Great Novels Explained To You By Your Thug Friend, Now's Your Chance", "Seems Like Any Other High School, Right? So What's The Major Thing Missing From These Pictures?", 'Every Person On Earth Is Supposed To Have These. But What Are They Exactly?', 'Jon Stewart Delivers One Of The Best Interviews In Recent Memory', 'If You Give A Bride A Beautiful Set Of Bone China, You Set Her Table For A Day', 'A High School In A Poor Neighborhood Closed Down. These Folks Reopened It And Kicked Some Butt. How?', 'The Super Bowl Ad That You Should See If You Think Little Girls Can Become Epic Rocket Scientists', "The NFL Would Never Let This Ad Air On The Super Bowl, So We're Gonna Show You It. It's Important.", 'The Simple Way A Developing Country Managed To Decrease The Number Of Babies Who Died By Almost 30%', 'When Theres Nothing Scarier Than A Room Full Of White Men Laughing']
times_titles = open('times_titles.txt', 'rb').read()
times_titles = times_titles.decode('utf-8')
times_titles = times_titles.encode('ascii', 'ignore')
times_titles = times_titles.splitlines()
I might think about wrapping this in a function
def import_titles(filename):
''' Imports text file,
cleans up some unicode mess, and
splits by line. '''
titles = open(filename, 'rb').read()
titles = titles.decode('utf-8')
titles = titles.encode('ascii', 'ignore')
return times_titles.splitlines()
What we want the data to look like next
upworthy | titles |
---|---|
1 | He Was About To Take His Own Life — Until A Man Stopped Him. Here He Meets Him Face To Face Again |
0 | CVS Pharmacy to Stop Selling Tobacco Goods by October |
0 | Twitter’s Share Price Falls After It Reports a Loss |
1 | A 16-Year-Old Explains Why Everything You Thought You Knew About Beauty May Be Wrong. With Math. |
#Set up placehold lists
upworthy = []
titles = []
#Go through all the upworthy articles
for title in upworthy_titles:
#add title to title list
titles.append(title)
#add 1 to Y list
upworthy.append(1)
for title in times_titles:
titles.append(title)
upworthy.append(0)
print len(upworthy)
print upworthy[:5]
1193 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print len(titles)
print titles[:5]
1193 ["5 Reasons Why My Girlfriend Thinks She's Not Beautiful Enough, No Matter What Anyone Tells Her", "The Perfect Reply A Girl Can Give To The Question 'What's Your Favorite Position?'", 'When A Simple Idea Like This Actually Works And It Helps People Get By, Everybody Wins', 'One Of The Biggest Lies Were Encouraged To Tell Ourselves About Our Value To Society Is Right Here', 'He Was About To Take His Own Life Until A Man Stopped Him. Here He Meets Him Face To Face Again.', "When I Was A Kid, An Ad Aired On TV That I Didn't Fully Get. Now, I Want Us All To Watch It Again.", "These Parents Think It Might Be A Phase, But They Just Don't Understand What It Means", 'How China Deals With Internet-Addicted Teens Is Kind Of Shocking. And Maybe A Good Idea?', 'If You Want A Successful Long-Term Relationship (Of Any Kind), Here Are 3 Invaluable Things To Know', 'How Do You Know If You Have Depression? Hear This Woman Explain How She Found Out.']
I don't like that the X and the Y are in two unrelated lists. Matches cases depends on the sort order, so don't mess that up.
What we want the data to look like next
upworthy | stop | man | obama | explain | everything | you | nato | debate | industry | believe |
---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 1 |
0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(lowercase = True ,
min_df = 2 ,
max_df = .5 ,
ngram_range = (1, 1),
stop_words = 'english',
strip_accents = 'unicode')
vectorizer.fit(titles)
CountVectorizer(analyzer=u'word', binary=False, charset=None, charset_error=None, decode_error=u'strict', dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content', lowercase=True, max_df=0.5, max_features=None, min_df=2, ngram_range=(1, 1), preprocessor=None, stop_words='english', strip_accents='unicode', token_pattern=u'(?u)\\b\\w\\w+\\b', tokenizer=None, vocabulary=None)
X = vectorizer.transform(titles)
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(X,upworthy)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
y_hat = clf.predict(X)
y_hat[:20]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
What % were correctly predicted?
clf.score(X, upworthy)
0.91701592623637884
Better fit statistics
from sklearn import metrics
metrics.confusion_matrix(upworthy, y_hat)
array([[416, 84], [ 15, 678]])
Sociologists care about coeficients, but they are a tucked away.
clf.coef_
array([[-6.71921442, -6.31374931, -7.81782671, ..., -7.12467953, -7.4123616 , -7.4123616 ]])
len(clf.coef_[0])
1192
len(vectorizer.vocabulary_)
1192
I know what we can do.
coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])
coefficients = {h[0] : h[1] for h in coefficients}
for word in sorted(coefficients, key = coefficients.get, reverse=True)[:20]:
print word,coefficients[word]
nocturnalist -4.55973017302 rest -4.63977288069 columnist -4.70431140183 debate -5.01446633014 drunk -5.01446633014 biggest -5.01446633014 industry -5.14367806162 nato -5.17876938143 severe -5.17876938143 need -5.2151370256 rebels -5.2151370256 disposing -5.29209806673 weeks -5.33292006125 learning -5.33292006125 head -5.33292006125 bump -5.37547967567 abortions -5.46645145388 world -5.46645145388 million -5.51524161805 000 -5.51524161805
See anything different in this code?
coefficients = zip(vectorizer.vocabulary_,clf.coef_[0])
coefficients = {h[0] : h[1] for h in coefficients}
for word in sorted(coefficients, key = coefficients.get, reverse=False)[:20]:
print word,coefficients[word]
ron -8.5109738916 leads -8.5109738916 glimpse -8.5109738916 young -8.5109738916 james -8.5109738916 detention -8.5109738916 fat -8.5109738916 cool -8.5109738916 dating -8.5109738916 immediately -8.5109738916 die -8.5109738916 lapse -8.5109738916 pretend -8.5109738916 second -8.5109738916 street -8.5109738916 7th -8.5109738916 saved -8.5109738916 goes -8.5109738916 seeks -8.5109738916 box -8.5109738916
Data scientists care about over fitting. We probably should too.
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, upworthy, test_size=0.3)
clf.fit(X_train, y_train)
y_test_pred = clf.predict(X_test)
y_train_pred = clf.predict(X_train)
print "classification accuracy:", metrics.accuracy_score(y_train, y_train_pred)
metrics.confusion_matrix(y_train, y_train_pred)
classification accuracy: 0.925748502994
array([[303, 57], [ 5, 470]])
y_test_pred = clf.predict(X_test)
print "classification accuracy:", metrics.accuracy_score(y_test, y_test_pred)
metrics.confusion_matrix(y_test, y_test_pred)
classification accuracy: 0.821229050279
array([[ 99, 41], [ 23, 195]])
from sklearn import cross_validation
import numpy as np
cross_validation.cross_val_score(clf, X, np.array(upworthy), cv=10)
array([ 0.81666667, 0.81666667, 0.86666667, 0.79831933, 0.81512605, 0.83193277, 0.84033613, 0.79831933, 0.77310924, 0.85714286])
np.mean(cross_validation.cross_val_score(clf, X, np.array(upworthy), cv=10))
0.8214285714285714
Do we do any better by looking at bigrams? Go back up and change the values for CountVectorizer.
What about a new headline?
soc_title = 'Educational Assortative Mating and Earnings Inequality in the United States'
soc_title_vector = vectorizer.transform([soc_title])
clf.predict_proba(soc_title_vector)
array([[ 0.94490128, 0.05509872]])
gawker_title = 'Shocking Footage Aired of Police Shooting Face-Eating Nude Man'
gawker_title_vector = vectorizer.transform([gawker_title])
clf.predict_proba(gawker_title_vector)
array([[ 0.38060745, 0.61939255]])
Your turn. Take the file abstracts.csv
has 3,232 abstracts, divided evenly between those that were published in sociology journals and those that were presented at the ASAs.
After importing the data, develop a model that categorizes each, without overfitting.
Note: Copy and pasting code is your friend!!!
Make a prediction for this out of sample abstract:
This article investigates how changes in educational assortative mating affected the growth in earnings inequality among households in the United States between the late 1970s and early 2000s. The authors find that these changes had a small, negative effect on inequality: there would have been more inequality in earnings in the early 2000s if educational assortative mating patterns had remained as they were in the 1970s. Given the educational distribution of men and women in the United States, educational assortative mating can have only a weak impact on inequality, and educational sorting among partners is a poor proxy for sorting on earnings.