Groups All Day

Group numbers are in data/groups.json. Find your group. Move tables and chairs so that folks are not in rows and no one has to turn around to see the board.

Start a new notebook where you will do your work for today. Make the first cell a markdown cell and put a title or notes in there. The second cell can include your import statements.
### Text Classification
{:.input_area}
```python
%matplotlib inline
import pandas as pd
```
{:.input_area}
```python
# Data source: https://www.kaggle.com/zynicide/wine-reviews/data
wine_df = pd.read_csv('data/wine_reviews.csv')
```
Your turn

What's in the dataset?

{:.input_area}
```python
wine_df['points'].value_counts()
```

{:.output .output_data_text}
```
87     16933
86     12600
91     11359
92      9613
85      9530
93      6489
84      6480
94      3758
83      3025
82      1836
95      1535
81       692
96       523
80       397
97       229
98        77
99        33
100       19
Name: points, dtype: int64
```

{:.input_area}
```python
wine_df['description'][:5]
```

{:.output .output_data_text}
```
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
3    Pineapple rind, lemon pith and orange blossom ...
4    Much like the regular bottling from 2012, this...
Name: description, dtype: object
```

{:.input_area}
```python
pd.set_option('display.max_colwidth', 120)
```

{:.input_area}
```python
wine_df['description'][:5]
```

{:.output .output_data_text}
```
0    Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripen...
1    This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red be...
2    Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity...
3    Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of ...
4    Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal ...
Name: description, dtype: object
```

{:.input_area}
```python
wine_df.head()
```
 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | rating
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Italy | Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripen... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | Low |
1 | Portugal | This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red be... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | Low |
2 | US | Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | Low |
3 | US | Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore) | Riesling | St. Julian | Low |
4 | US | Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal ... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley) | Pinot Noir | Sweet Cheeks | Low |
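
The cells that build the document-term matrix and fit the classifier are missing from these notes, but the prediction cells further below rely on them. A minimal sketch consistent with the names used there (`review_word_counts`, `nb_classifier`) and the tools used later in this section might be:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn each description into a vector of word counts
vectorizer = CountVectorizer(lowercase=True)
review_word_counts = vectorizer.fit_transform(wine_df['description'])

# Fit a naive Bayes classifier to predict the High/Low rating
nb_classifier = MultinomialNB()
nb_classifier.fit(review_word_counts, wine_df['rating'])
```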
Your turn

New groups! Load up the group spreadsheet and find your group. Working in your group, use the "ge_speeches.json" file to determine the most distinguishing words used by Hillary Clinton and Donald Trump during the 2016 election. Do this in a new notebook!

{:.input_area}
```python
nb_classifier.predict(review_word_counts)
```

{:.output .output_data_text}
```
array(['Low', 'High', 'Low', ..., 'Low', 'High', 'High'], dtype='<U4')
```

{:.input_area}
```python
wine_df['prediction'] = nb_classifier.predict(review_word_counts)
```

{:.input_area}
```python
pd.crosstab(wine_df['rating'], wine_df['prediction'])
```
rating \ prediction | High | Low
---|---|---
High | 26996 | 6639
Low | 7884 | 43609
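
The probability table below has no input cell in these notes; a sketch consistent with its shape (one row per review, one column per class, in `classes_` order) might be:

```python
import pandas as pd

# Predicted class probabilities for the first five reviews
pd.DataFrame(nb_classifier.predict_proba(review_word_counts[:5]),
             columns=nb_classifier.classes_)
```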
 | High | Low
---|---|---
0 | 0.058370 | 0.941630 |
1 | 0.818990 | 0.181010 |
2 | 0.007330 | 0.992670 |
3 | 0.009965 | 0.990035 |
4 | 0.017779 | 0.982221 |
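
The two tables that follow appear to be the reviews the model is most confident are "High" and most confident are "Low"; the producing cells are missing, but a hypothetical reconstruction might look like:

```python
# Probability that each review is rated High, per the fitted classifier
high_col = list(nb_classifier.classes_).index('High')
wine_df['prob_high'] = nb_classifier.predict_proba(review_word_counts)[:, high_col]

# Most confidently High (flip to ascending=True for the most confidently Low)
wine_df.sort_values('prob_high', ascending=False)[['description', 'points']].head(15)
```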
 | description | points
---|---|---
79481 | From a beautifully exposed southwest facing vineyard with views of the Pyrenees, this is a serious and impressive wi... | 96 |
67748 | Dark, rich mountain blueberry and blackberry form the core of this classically delicious Napa Valley wine from an es... | 93 |
5129 | A blend of 28% Cabernet Franc, 23% Cabernet Sauvignon, 21% Malbec, 18% Petit Verdot and 10% Merlot, this is a big, b... | 94 |
25463 | A blend of 28% Cabernet Franc, 23% Cabernet Sauvignon, 21% Malbec, 18% Petit Verdot and 10% Merlot, this is a big, b... | 94 |
64570 | A superb wine from a great year, this is powerful and structured, with great acidity and solid, pronounced fruits. L... | 96 |
49526 | This Ferreirinha Douro Superior wine is made in exceptional years. The 2007 is the 16th vintage since 1960 (the prev... | 97 |
58945 | A blend of 57% Cabernet Sauvignon, 14% Merlot, 13% Malbec, 11% Cabernet Franc and 5% Petit Verdot, this stunning, we... | 93 |
3958 | This is an enormous Cabernet, as packed with intensity and power as anything in Napa Valley. The vineyard is Von Str... | 95 |
5115 | A proprietary blend of 57% Merlot, 35% Cabernet Sauvignon and 8% Petit Verdot, all homegrown, this is dense and powe... | 94 |
36344 | Mature dark-skinned berry, leather, underbrush and dark spice are some of the aromas that emerge on this fantastic r... | 97 |
68457 | After a great inaugural 2010 vintage, this is another impressive Cabernet from this winery. The vineyard is in the P... | 95 |
41876 | This is a bold and powerful yet layered and nuanced wine, with hints of green peppercorn, tobacco leaf, cigar box an... | 93 |
73422 | A beautifully dense, ripe wine, its intense acidity balanced by an opulent structure and gorgeous fruits. The textur... | 97 |
77340 | From one of the top estates in Cahors, this complex, dense wine is both structured and packed with great fruit. At t... | 94 |
70514 | This comes from 30-year-old vines in the Quinta da Manoella, the family vineyard of Jorge Borges, one of the winemak... | 91 |
 | description | points
---|---|---
26592 | Lemon citrus, toast, white flowers—the lead on this wine is feminine and light and, as the name suggests, feels like... | 84 |
39450 | Reasonably accurate on the nose for Leyda Sauvignon Blanc, but also a little pickled smelling. Feels chunky and a li... | 85 |
63660 | Apple and mineral aromas are basic but clean, and the palate is fresh and lithe, with little to no extra weight. Fla... | 86 |
16200 | Lively aromas of grapefruit, white flowers and mineral lead into a light, fruity but rather simple palate that offer... | 86 |
30861 | Simple but solid apple and nectarine aromas are straight forward. This feels round and easy, without much acid-based... | 86 |
21333 | Simple but solid apple and nectarine aromas are straight forward. This feels round and easy, without much acid-based... | 86 |
52219 | Fruity on the nose, with a friendly mix of pineapple, apple, melon and powdered sugar aromas. Feels smooth and round... | 87 |
43618 | A medium-bodied Bordeaux blend with sweet aromas of cherry pie and a hint of fresh sage and tarragon. Simple and str... | 85 |
66311 | Dry, mild, dusty berry aromas are simple but correct for the variety. This feels scattered across the palate, with s... | 86 |
58458 | Slightly stalky, roasted aromas of earthy plum and berry lean in the direction of compost. This feels light and some... | 87 |
13944 | This straightforward Verdicchio opens with subdued aromas of stone fruit and citrus. The palate is a bit lean but of... | 86 |
44783 | Simple green-apple aromas are innocuous. This is fresh, easy and light on the palate. Flavors of apple and sweet gre... | 87 |
65070 | Citrus and peach aromas are basic but clean. The palate starts out fresh and easy before losing some steam, but alon... | 85 |
84252 | This easy-drinking white marries citrus and green pear with just a hint of herbal spice. Simple and straightforward,... | 87 |
54423 | Stylistically, this is a racy, no-oak wine that smells of pineapple and tastes of citrus and passion fruit. The mout... | 83 |
Your turn

In your groups, how well do your models fit? What is the most Trumpish Trump speech? What is the least?

### What about overfitting?

{:.input_area}
```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the reviews for testing
train, test = train_test_split(wine_df, test_size=0.2)
```

{:.input_area}
```python
len(train)
```

{:.output .output_data_text}
```
68102
```

{:.input_area}
```python
len(test)
```

{:.output .output_data_text}
```
17026
```

{:.input_area}
```python
from sklearn.feature_extraction.text import CountVectorizer  # import not shown in the original cell

vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(3, 3),
                             max_df=1.00,
                             min_df=.05,
                             max_features=None)
vectorizer.fit(train['description'])
vectorizer.get_feature_names()
```

{:.output .output_data_text}
```
['on the finish', 'on the nose', 'on the palate']
```

{:.input_area}
```python
X_train = vectorizer.transform(train['description'])
```

{:.input_area}
```python
nb_classifier.fit(X_train, train['rating'])
```

{:.output .output_data_text}
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

{:.input_area}
```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

{:.input_area}
```python
print(accuracy_score(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
0.8385216293207248
```

{:.input_area}
```python
print(confusion_matrix(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
[[21780  4975]
 [ 6022 35325]]
```

{:.input_area}
```python
print(classification_report(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
             precision    recall  f1-score   support

       High       0.78      0.81      0.80     26755
        Low       0.88      0.85      0.87     41347

avg / total       0.84      0.84      0.84     68102
```

Precision: the percentage of selected items that are correct (TP / (TP + FP)).

Recall: the percentage of correct items that are selected (TP / (TP + FN)).

{:.input_area}
```python
test_wf = vectorizer.transform(test['description'])
test_prediction = nb_classifier.predict(test_wf)
```

{:.input_area}
```python
print(accuracy_score(test['rating'], test_prediction))
```

{:.output .output_stream}
```
0.8360742393985668
```

{:.input_area}
```python
print(classification_report(test['rating'], test_prediction))
```

{:.output .output_stream}
```
             precision    recall  f1-score   support

       High       0.79      0.81      0.80      6880
        Low       0.87      0.85      0.86     10146

avg / total       0.84      0.84      0.84     17026
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(1, 1),
                             stop_words='english',
                             max_df=.60,
                             min_df=5,
                             max_features=None)
```

{:.input_area}
```python
vectorizer.fit(train['description'])
X_train = vectorizer.transform(train['description'])
nb_classifier.fit(X_train, train['rating'])
```

{:.output .output_data_text}
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

{:.input_area}
```python
print(accuracy_score(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
0.8925435376347244
```

{:.input_area}
```python
print(accuracy_score(test['rating'], nb_classifier.predict(vectorizer.transform(test['description']))))
```

{:.output .output_stream}
```
0.8863502877951368
```
Your turn

What happens to your model if you change some of the parameters for your vectorizer? Be sure to split the data between train and test!

{:.input_area}
```python
from sklearn.linear_model import LogisticRegression
```

{:.input_area}
```python
ln_classifier = LogisticRegression()
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range=(1, 2),
                             stop_words='english',
                             max_df=.80,
                             min_df=.005,
                             max_features=None)
vectorizer.fit(train['description'])
print(len(vectorizer.get_feature_names()))
ln_classifier.fit(vectorizer.transform(train['description']), train['rating'])
```

{:.output .output_stream}
```
913
```

{:.output .output_data_text}
```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

{:.input_area}
```python
print(accuracy_score(train['rating'], ln_classifier.predict(vectorizer.transform(train['description']))))
```

{:.output .output_stream}
```
0.9051863381398491
```

{:.input_area}
```python
print(accuracy_score(test['rating'], ln_classifier.predict(vectorizer.transform(test['description']))))
```

{:.output .output_stream}
```
0.8981557617761071
```

### What about a different model?
Your turn

What is the out-of-sample accuracy of a logistic regression model on your data? What does a k-nearest neighbors model for your speech dataset look like? How does the accuracy compare?

{:.input_area}
```python
from sklearn.neighbors import KNeighborsClassifier  # import not shown in the original cell

knn_classifier = KNeighborsClassifier(n_neighbors=15)
```
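
The cell above only constructs the classifier. A hedged sketch of fitting and scoring it on the earlier train/test split (reusing `X_train` and `vectorizer` from the overfitting section) might be:

```python
# Fit on the training word counts, then score out of sample
knn_classifier.fit(X_train, train['rating'])
X_test = vectorizer.transform(test['description'])
print(accuracy_score(test['rating'], knn_classifier.predict(X_test)))
```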
### But what's the best-fitting model?
{:.input_area}
```python
from sklearn.model_selection import GridSearchCV
```
{:.input_area}
```python
parameters = {'n_neighbors': (2, 3, 4)}
```
{:.input_area}
```python
grid = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
```
{:.input_area}
```python
grid.fit(review_word_counts,
         wine_df['rating'])
```
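
After fitting, the winning parameter setting and its mean cross-validated accuracy can be read off the grid (these inspection cells aren't in the original notes; `best_params_` and `best_score_` are standard `GridSearchCV` attributes):

```python
print(grid.best_params_)  # the n_neighbors value that scored best
print(grid.best_score_)   # mean cross-validated accuracy of that setting
```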
{:.input_area}
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
```
{:.input_area}
```python
from sklearn.feature_extraction.text import CountVectorizer  # imports not shown
from sklearn.naive_bayes import MultinomialNB                # in the original cell

pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])
```
{:.input_area}
```python
# Pipeline parameters are addressed as <step name>__<parameter name>
parameters = {'vectorizer__max_df': (.2, .4),
              'vectorizer__min_df': (100, 150)}
```
{:.input_area}
```python
grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs=-1,
                           cv=3,
                           verbose=1)
```
{:.input_area}
```python
grid_search.fit(wine_df['description'],
                wine_df['rating'])
```
{:.input_area}
```python
grid_search.best_score_
```
{:.input_area}
```python
grid_search.best_estimator_
```
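
To compare every parameter combination rather than just the winner, the full cross-validation results can be inspected (`cv_results_` is standard `GridSearchCV` API; this cell is not in the original notes):

```python
import pandas as pd

# One row per parameter combination, ranked by mean test-fold accuracy
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score', 'rank_test_score']].sort_values('rank_test_score')
```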
{:.input_area}
```python
parameters = {'vectorizer__max_df': [.1, .15, .2, .25],
              'vectorizer__min_df': [25, 50, 100],
              'vectorizer__stop_words': [None, 'english'],
              'vectorizer__ngram_range': [(1, 1), (1, 2)]}
```
{:.input_area}
```python
grid_search = GridSearchCV(pipeline,
                           parameters,
                           n_jobs=-1,
                           cv=5,
                           verbose=1)
```
{:.input_area}
```python
# wine_df_extremes is not defined in these notes; presumably a subset of
# reviews with only very high and very low point scores
grid_search.fit(wine_df_extremes['description'],
                wine_df_extremes['rating'])
```
{:.input_area}
```python
grid_search.best_score_
```

{:.input_area}
```python
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
```
{:.input_area}
```python
from sklearn.feature_extraction.text import TfidfVectorizer
```
{:.input_area}
```python
tf_vector = TfidfVectorizer(lowercase=True,
                            ngram_range=(1, 1),
                            stop_words='english',
                            max_df=.60,
                            min_df=.05,
                            max_features=None)
```
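
Unlike raw counts, tf-idf downweights words that show up in many documents. With scikit-learn's defaults (`smooth_idf=True`, `norm='l2'`), each term count tf is scaled to roughly tf × (ln((1 + n) / (1 + df)) + 1), where n is the number of documents and df is the number of documents containing the term, and each document vector is then length-normalized.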
{:.input_area}
```python
tf_vector.fit(wine_df['description'])
```
{:.input_area}
```python
len(tf_vector.get_feature_names())
```
{:.input_area}
```python
review_tf = tf_vector.transform(wine_df['description'])
```
{:.input_area}
```python
wine_df['rating']
```
{:.input_area}
```python
knn_classifier.fit(review_tf, wine_df['rating'])
```
{:.input_area}
```python
knn_prediction = knn_classifier.predict(review_tf)
```
{:.input_area}
```python
# knn_prediction covers the full dataset, so score it against the full
# ratings (in-sample accuracy), not the test split
print(accuracy_score(wine_df['rating'], knn_prediction))
```
{:.input_area}
```python
print(classification_report(wine_df['rating'], knn_prediction))
```
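
Because `knn_prediction` was generated from the full dataset rather than a held-out split, the scores above are in-sample. A hedged sketch of an out-of-sample check, assuming the `train`/`test` split from earlier, might be:

```python
# Fit the tf-idf vectorizer and classifier on the training split only
tf_vector.fit(train['description'])
knn_classifier.fit(tf_vector.transform(train['description']), train['rating'])

# Score on the held-out test split
test_tf = tf_vector.transform(test['description'])
print(accuracy_score(test['rating'], knn_classifier.predict(test_tf)))
```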
Your turn