### Groups All Day

Group numbers are in data/groups.json. Find your group. Move tables and chairs so that folks are not in a row and no one has to turn around to see the board. Start a new notebook where you will do your work for today. Make the first cell a markdown cell and put a title or notes in there. The second cell can include your import statements.

### Text Classification

{:.input_area}
```python
%matplotlib inline

import pandas as pd
```

{:.input_area}
```python
# https://www.kaggle.com/zynicide/wine-reviews/data
wine_df = pd.read_csv('data/wine_reviews.csv')
```

### Your turn

What's in the dataset?

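If you're not sure where to start, a few generic pandas calls give a quick lay of the land. This is a minimal sketch that assumes nothing beyond the `points` column used below.

{:.input_area}
```python
# Quick ways to look around a newly loaded DataFrame.
print(wine_df.shape)             # (rows, columns)
print(wine_df.columns.tolist())  # column names
wine_df['points'].describe()     # summary stats for the numeric score column
```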
{:.input_area}
```python
wine_df['points'].value_counts()
```

{:.output .output_data_text}
```
87     16933
86     12600
91     11359
92      9613
85      9530
93      6489
84      6480
94      3758
83      3025
82      1836
95      1535
81       692
96       523
80       397
97       229
98        77
99        33
100      19
Name: points, dtype: int64
```

{:.input_area}
```python
wine_df['description'][:5]
```

{:.output .output_data_text}
```
0    Aromas include tropical fruit, broom, brimston...
1    This is ripe and fruity, a wine that is smooth...
2    Tart and snappy, the flavors of lime flesh and...
3    Pineapple rind, lemon pith and orange blossom ...
4    Much like the regular bottling from 2012, this...
Name: description, dtype: object
```

![google_search.png](images/google_search.png)

{:.input_area}
```python
pd.set_option('display.max_colwidth', 120)
```

{:.input_area}
```python
wine_df['description'][:5]
```

{:.output .output_data_text}
```
0    Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripen...
1    This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red be...
2    Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity...
3    Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of ...
4    Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal ...
Name: description, dtype: object
```

{:.input_area}
```python
wine_df.head()
```

|   | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | rating |
|---|---------|-------------|-------------|--------|-------|----------|----------|----------|-------------|-----------------------|-------|---------|--------|--------|
| 0 | Italy | Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripen... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia | Low |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth while still structured. Firm tannins are filled out with juicy red be... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos | Low |
| 2 | US | Tart and snappy, the flavors of lime flesh and rind dominate. Some green pineapple pokes through, with crisp acidity... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm | Low |
| 3 | US | Pineapple rind, lemon pith and orange blossom start off the aromas. The palate is a bit more opulent, with notes of ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan Shore) | Riesling | St. Julian | Low |
| 4 | US | Much like the regular bottling from 2012, this comes across as rather rough and tannic, with rustic, earthy, herbal ... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot Noir (Willamette Valley) | Pinot Noir | Sweet Cheeks | Low |
### Turning words into features

{:.input_area}
```python
from sklearn.feature_extraction.text import CountVectorizer
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase = True,
                             ngram_range = (1,1),
                             stop_words = 'english',
                             max_df = .60,
                             min_df = .01,
                             max_features = None)
```

{:.input_area}
```python
vectorizer.fit(wine_df['description'])
```

{:.output .output_data_text}
```
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.6, max_features=None, min_df=0.01,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None,
        vocabulary=None)
```

{:.input_area}
```python
len(vectorizer.get_feature_names())
```

{:.output .output_data_text}
```
400
```

{:.input_area}
```python
review_word_counts = vectorizer.transform(wine_df['description'])
```

{:.input_area}
```python
from sklearn.naive_bayes import MultinomialNB
```

{:.input_area}
```python
nb_classifier = MultinomialNB()
```

{:.input_area}
```python
nb_classifier.fit(review_word_counts, wine_df['rating'])
```

{:.output .output_data_text}
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

{:.input_area}
```python
nb_classifier.coef_[0]
```

{:.output .output_data_text}
```
array([ -7.33467081, -7.24590289, -7.19142933, -7.20616256, -7.38885524, -8.18761044, -8.89070795, -11.09798286,
        -7.20061213, -7.24397425, -6.69237287, -4.00593826, -6.70465667, -7.24397425, -7.25950854, -6.263145,
        -6.67477066, -6.92639671, -6.56148092, -6.14471991, -6.39337878, -6.09342661, -7.16437628, -4.84345499,
        -6.96645929, -6.27185332, -7.04067957, -3.78998822, -6.93484747, -6.76128647, -5.86105745, -7.28728811,
        -6.39585097, -7.30147274, -6.26459113, -5.88949074, -6.73083916, -7.12085331, -8.99606846, -9.17839002,
        -6.77928336, -4.62605818, -6.80255446, -7.28127003, -6.45534174, -5.38365017, -5.85864374, -4.66095875,
        -6.82131674, -5.29641393, -6.28284647, -5.05147071, -7.33256333, -7.57162234, -6.77807344, -5.25233262,
        -6.4614902,  -7.22300113, -7.28127003, -6.99910314, -6.93202261, -6.6704181,  -5.11679824, -6.92639671,
        -6.60120809, -7.02817941, -6.90970648, -5.45898136, -6.25953879, -6.94337025, -6.60019543, -6.97231583,
        -6.67368074, -6.92639671, -8.24885406, -5.35585418, -5.83765061, -6.17138815, -4.27578547, -7.14499242,
        -5.76966548, -6.77084449, -6.78779405, -4.92438617, -6.89600763, -7.57429971, -5.50665672, -6.73894574,
        -6.66392457, -7.36248009, -6.42855828, -6.28505968, -6.03022282, -6.31276643, -6.62064605, -7.24783526,
        -6.96063686, -6.76605406, -7.01583357, -7.06133161, -6.71029049, -6.71937114, -5.85479388, -4.80154344,
        -7.04698874, -5.67423501, -6.93909978, -6.87981295, -5.63530735, -7.44143879, -6.60934649, -6.80007922,
        -6.5595354,  -6.69793772, -7.28728811, -7.23057692, -7.82255487, -6.81376954, -6.74594694, -5.74741811,
        -4.30921589, -6.14087621, -7.07425586, -4.5923347,  -6.65319485, -6.59012486, -5.8944758,  -5.40823784,
        -6.27696842, -7.73382675, -6.93061316, -6.48106378, -7.04698874, -6.99158996, -7.37558071, -7.10061653,
        -7.26343012, -5.86882072, -5.67865535, -6.2459526,  -3.93059431, -6.04579856, -7.30761401, -6.00367509,
        -5.63878427, -6.97820686, -3.2928417,  -6.26459113, -6.47747633, -6.96938327, -8.22287857, -6.49372218,
        -6.8520884,  -7.59598078, -6.53363467, -7.02507863, -6.81251719, -7.17510984, -4.62241008, -6.51023631,
        -3.71517988, -6.61755166, -5.08005344, -4.95653214, -7.05015831, -6.33580892, -7.05015831, -7.01583357,
        -5.10606387, -6.59514747, -6.14664731, -6.52327188, -6.65747293, -4.95888231, -7.0066732,  -6.93626289,
        -7.71513462, -6.31885937, -6.23114795, -5.97077739, -5.341953,   -6.24808555, -6.13069801, -5.71641117,
        -6.0574946,  -6.2283527,  -6.82384519, -8.7337042,  -7.71205295, -6.85469596, -7.23821054, -6.52514805,
        -6.20695044, -6.99910314, -5.54467098, -5.76131929, -7.86825702, -7.3516925,  -5.98599507, -7.40910341,
        -6.09770404, -6.30144128, -5.36670314, -7.73382675, -6.37867263, -4.70226606, -6.0983166,  -5.25602072,
        -5.78612127, -7.55834193, -5.70146693, -6.07052025, -6.74010921, -7.00212429, -8.26476952, -6.94765903,
        -7.48706495, -6.41165171, -6.702412,   -6.84948762, -6.85339133, -6.96792021, -5.44485592, -5.80559211,
        -5.85095879, -7.36465164, -6.61344066, -6.17535117, -7.07425586, -6.85339133, -6.92639671, -6.90695165,
        -7.09396094, -5.56891459, -5.8830471,  -6.99458846, -7.02817941, -6.61652233, -6.28653788, -7.16082388,
        -6.39998491, -4.75027528, -5.67183212, -4.66518893, -4.6986319,  -6.2459526,  -6.85469596, -5.05297755,
        -7.28327204, -7.25365487, -7.0036383,  -5.93631864, -8.6675644,  -5.76526407, -6.78657379, -7.55570688,
        -6.25953879, -3.97101903, -5.3208936,  -5.51825303, -7.03597368, -5.50665672, -6.63626323, -6.70690639,
        -7.36465164, -7.05015831, -6.01432243, -6.79884388, -5.62380415, -6.54794106, -6.37059504, -4.88046989,
        -7.05972776, -7.64080621, -7.79878465, -8.18761044, -7.66987803, -7.75287494, -6.19203491, -6.60019543,
        -7.53229879, -7.87547727, -7.83988632, -6.87847515, -7.48706495, -7.86825702, -6.84689358, -6.3196236,
        -6.97673084, -7.26934146, -5.31303942, -6.78779405, -5.67543862, -4.40533532, -6.20695044, -5.30082044,
        -6.4676767,  -6.84689358, -6.93768033, -4.53233384, -7.40005357, -6.50194516, -7.27927203, -6.48827752,
        -6.19608077, -6.6682489,  -6.61755166, -7.35815108, -6.83146911, -5.62724124, -6.37220535, -6.47301005,
        -7.69375976, -7.36900895, -6.39255607, -6.55468808, -6.56538337, -5.28439522, -6.81502345, -5.348011,
        -7.64367565, -6.95629218, -5.53940133, -6.67586177, -6.82511181, -6.22974935, -5.86009127, -4.62115037,
        -6.34989366, -6.71595622, -6.83530298, -6.27185332, -4.84170949, -6.91246891, -5.67183212, -6.32115382,
        -6.53933269, -6.05573145, -6.54794106, -5.97239682, -6.8430151,  -5.68755508, -6.96354384, -6.63941615,
        -7.56098394, -4.6830347,  -6.39915675, -6.01883966, -6.88249391, -6.31428619, -5.67183212, -7.0051546,
        -4.37644764, -5.58382124, -6.64046933, -5.76000781, -7.25755351, -5.23641994, -6.92080229, -6.11437624,
        -8.05779882, -6.68243353, -5.98818805, -6.60628682, -6.76844641, -6.37301147, -6.80255446, -6.51023631,
        -5.30496925, -7.25171123, -6.0307954,  -7.30556273, -5.15566081, -7.01583357, -7.00819411, -8.2973811,
        -7.43209292, -7.15024177, -7.79209566, -6.78292193, -7.62376034, -6.55759365, -7.55570688, -6.50194516,
        -6.60934649, -6.37705188, -7.27727801, -6.94765903, -5.01659451, -6.52420953, -3.2604285,  -7.35384073,
        -6.92499517, -5.85383373, -7.07263117, -6.63836407, -6.35147095, -6.36098728, -6.58214078, -6.49645566])
```

{:.input_area}
```python
coefficients = pd.Series(nb_classifier.coef_[0], index = vectorizer.get_feature_names())
```

{:.input_area}
```python
nb_classifier.classes_
```

{:.output .output_data_text}
```
array(['High', 'Low'], dtype='<U4')
```

{:.input_area}
```python
coefficients.sort_values()[:20]
```

{:.output .output_data_text}
```
2022          -11.097983
beautifully    -9.178390
beautiful      -8.996068
2020           -8.890708
impressive     -8.733704
opulent        -8.667564
velvety        -8.297381
lovely         -8.264770
cellar         -8.248854
focused        -8.222879
potential      -8.187610
2019           -8.187610
tightly        -8.057799
producer       -7.875477
layered        -7.868257
purple         -7.868257
provide        -7.839886
develop        -7.822555
polished       -7.798785
vines          -7.792096
dtype: float64
```

{:.input_area}
```python
coefficients.sort_values(ascending=False)[:20]
```

{:.output .output_data_text}
```
wine      -3.260428
flavors   -3.292842
fruit     -3.715180
aromas    -3.789988
finish    -3.930594
palate    -3.971019
acidity   -4.005938
cherry    -4.275785
drink     -4.309216
tannins   -4.376448
red       -4.405335
ripe      -4.532334
dry       -4.592335
soft      -4.621150
fresh     -4.622410
berry     -4.626058
black     -4.660959
notes     -4.665189
sweet     -4.683035
oak       -4.698632
dtype: float64
```
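Before the group exercise, it can help to see on a tiny made-up corpus exactly what `CountVectorizer` builds: one column per vocabulary word, one row of counts per document. This is an illustrative sketch with two invented reviews, not part of the wine analysis.

{:.input_area}
```python
# Toy example: two invented reviews, English stop words removed.
toy_vectorizer = CountVectorizer(stop_words = 'english')
toy_counts = toy_vectorizer.fit_transform(['Bright cherry and plum aromas.',
                                           'Soft plum finish.'])
print(toy_vectorizer.get_feature_names())  # the learned vocabulary
print(toy_counts.toarray())                # one row of counts per review
```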

### Your turn

New groups! Load up the group spreadsheet and find your group. Working in your group, use the "ge_speeches.json" file to determine the most distinguishing words used by Hillary Clinton and Donald Trump during the 2016 election. Do this in a new notebook!

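A possible starting point, assuming the JSON file holds one record per speech with `text` and `speaker` fields. Those column names are guesses, so inspect the file first and adjust to match its real structure.

{:.input_area}
```python
# Starter sketch -- the 'text' and 'speaker' column names are assumptions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

speeches_df = pd.read_json('data/ge_speeches.json')
speech_vectorizer = CountVectorizer(stop_words = 'english', max_df = .60, min_df = .01)
speech_counts = speech_vectorizer.fit_transform(speeches_df['text'])
speech_nb = MultinomialNB()
speech_nb.fit(speech_counts, speeches_df['speaker'])
speech_coefficients = pd.Series(speech_nb.coef_[0],
                                index = speech_vectorizer.get_feature_names())
speech_coefficients.sort_values()[:20]  # distinguishing words, as above
```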
{:.input_area}
```python
nb_classifier.predict(review_word_counts)
```

{:.output .output_data_text}
```
array(['Low', 'High', 'Low', ..., 'Low', 'High', 'High'], dtype='<U4')
```

{:.input_area}
```python
wine_df['prediction'] = nb_classifier.predict(review_word_counts)
```

{:.input_area}
```python
pd.crosstab(wine_df['rating'], wine_df['prediction'])
```

| rating \ prediction | High | Low |
|---------------------|------:|------:|
| High | 26996 | 6639 |
| Low | 7884 | 43609 |
{:.input_area}
```python
nb_classifier.predict_proba(review_word_counts)
```

{:.output .output_data_text}
```
array([[0.05837039, 0.94162961],
       [0.8189896 , 0.1810104 ],
       [0.0073304 , 0.9926696 ],
       ...,
       [0.31987064, 0.68012936],
       [0.92064944, 0.07935056],
       [0.65113375, 0.34886625]])
```

{:.input_area}
```python
predict_df = pd.DataFrame(nb_classifier.predict_proba(review_word_counts),
                          columns=nb_classifier.classes_)
```

{:.input_area}
```python
predict_df.head()
```
|   | High | Low |
|---|------|-----|
| 0 | 0.058370 | 0.941630 |
| 1 | 0.818990 | 0.181010 |
| 2 | 0.007330 | 0.992670 |
| 3 | 0.009965 | 0.990035 |
| 4 | 0.017779 | 0.982221 |
{:.input_area}
```python
wine_df_prediction = pd.concat([wine_df, predict_df], axis = 1)
```

{:.input_area}
```python
wine_df_prediction.sort_values('High', ascending=False)[['description','points']].head(15)
```
|   | description | points |
|---|-------------|--------|
| 79481 | From a beautifully exposed southwest facing vineyard with views of the Pyrenees, this is a serious and impressive wi... | 96 |
| 67748 | Dark, rich mountain blueberry and blackberry form the core of this classically delicious Napa Valley wine from an es... | 93 |
| 5129 | A blend of 28% Cabernet Franc, 23% Cabernet Sauvignon, 21% Malbec, 18% Petit Verdot and 10% Merlot, this is a big, b... | 94 |
| 25463 | A blend of 28% Cabernet Franc, 23% Cabernet Sauvignon, 21% Malbec, 18% Petit Verdot and 10% Merlot, this is a big, b... | 94 |
| 64570 | A superb wine from a great year, this is powerful and structured, with great acidity and solid, pronounced fruits. L... | 96 |
| 49526 | This Ferreirinha Douro Superior wine is made in exceptional years. The 2007 is the 16th vintage since 1960 (the prev... | 97 |
| 58945 | A blend of 57% Cabernet Sauvignon, 14% Merlot, 13% Malbec, 11% Cabernet Franc and 5% Petit Verdot, this stunning, we... | 93 |
| 3958 | This is an enormous Cabernet, as packed with intensity and power as anything in Napa Valley. The vineyard is Von Str... | 95 |
| 5115 | A proprietary blend of 57% Merlot, 35% Cabernet Sauvignon and 8% Petit Verdot, all homegrown, this is dense and powe... | 94 |
| 36344 | Mature dark-skinned berry, leather, underbrush and dark spice are some of the aromas that emerge on this fantastic r... | 97 |
| 68457 | After a great inaugural 2010 vintage, this is another impressive Cabernet from this winery. The vineyard is in the P... | 95 |
| 41876 | This is a bold and powerful yet layered and nuanced wine, with hints of green peppercorn, tobacco leaf, cigar box an... | 93 |
| 73422 | A beautifully dense, ripe wine, its intense acidity balanced by an opulent structure and gorgeous fruits. The textur... | 97 |
| 77340 | From one of the top estates in Cahors, this complex, dense wine is both structured and packed with great fruit. At t... | 94 |
| 70514 | This comes from 30-year-old vines in the Quinta da Manoella, the family vineyard of Jorge Borges, one of the winemak... | 91 |
{:.input_area}
```python
wine_df_prediction.sort_values('Low', ascending=False)[['description','points']].head(15)
```
|   | description | points |
|---|-------------|--------|
| 26592 | Lemon citrus, toast, white flowers—the lead on this wine is feminine and light and, as the name suggests, feels like... | 84 |
| 39450 | Reasonably accurate on the nose for Leyda Sauvignon Blanc, but also a little pickled smelling. Feels chunky and a li... | 85 |
| 63660 | Apple and mineral aromas are basic but clean, and the palate is fresh and lithe, with little to no extra weight. Fla... | 86 |
| 16200 | Lively aromas of grapefruit, white flowers and mineral lead into a light, fruity but rather simple palate that offer... | 86 |
| 30861 | Simple but solid apple and nectarine aromas are straight forward. This feels round and easy, without much acid-based... | 86 |
| 21333 | Simple but solid apple and nectarine aromas are straight forward. This feels round and easy, without much acid-based... | 86 |
| 52219 | Fruity on the nose, with a friendly mix of pineapple, apple, melon and powdered sugar aromas. Feels smooth and round... | 87 |
| 43618 | A medium-bodied Bordeaux blend with sweet aromas of cherry pie and a hint of fresh sage and tarragon. Simple and str... | 85 |
| 66311 | Dry, mild, dusty berry aromas are simple but correct for the variety. This feels scattered across the palate, with s... | 86 |
| 58458 | Slightly stalky, roasted aromas of earthy plum and berry lean in the direction of compost. This feels light and some... | 87 |
| 13944 | This straightforward Verdicchio opens with subdued aromas of stone fruit and citrus. The palate is a bit lean but of... | 86 |
| 44783 | Simple green-apple aromas are innocuous. This is fresh, easy and light on the palate. Flavors of apple and sweet gre... | 87 |
| 65070 | Citrus and peach aromas are basic but clean. The palate starts out fresh and easy before losing some steam, but alon... | 85 |
| 84252 | This easy-drinking white marries citrus and green pear with just a hint of herbal spice. Simple and straightforward,... | 87 |
| 54423 | Stylistically, this is a racy, no-oak wine that smells of pineapple and tastes of citrus and passion fruit. The mout... | 83 |

### Your turn

In your groups, how well do your models fit? What is the most Trumpish Trump speech? What is the least?

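One way in, continuing the hypothetical `speeches_df` / `speech_nb` / `speech_counts` names from the earlier starter sketch, and assuming `'Trump'` is one of the speaker labels:

{:.input_area}
```python
# Attach predicted probabilities to each speech, then sort.
speech_probs = pd.DataFrame(speech_nb.predict_proba(speech_counts),
                            columns = speech_nb.classes_)
speeches_with_probs = pd.concat([speeches_df, speech_probs], axis = 1)
trump_speeches = speeches_with_probs[speeches_with_probs['speaker'] == 'Trump']
trump_speeches.sort_values('Trump', ascending = False).head()  # most Trumpish
trump_speeches.sort_values('Trump').head()                     # least Trumpish
```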
### What about overfitting?

{:.input_area}
```python
from sklearn.model_selection import train_test_split

train, test = train_test_split(wine_df, test_size=0.2)
```

{:.input_area}
```python
len(train)
```

{:.output .output_data_text}
```
68102
```

{:.input_area}
```python
len(test)
```

{:.output .output_data_text}
```
17026
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase = True,
                             ngram_range = (3,3),
                             max_df = 1.00,
                             min_df = .05,
                             max_features = None)
vectorizer.fit(train['description'])
vectorizer.get_feature_names()
```

{:.output .output_data_text}
```
['on the finish', 'on the nose', 'on the palate']
```

{:.input_area}
```python
X_train = vectorizer.transform(train['description'])
```

{:.input_area}
```python
nb_classifier.fit(X_train, train['rating'])
```

{:.output .output_data_text}
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

{:.input_area}
```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```

{:.input_area}
```python
print(accuracy_score(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
0.8385216293207248
```

{:.input_area}
```python
print(confusion_matrix(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
[[21780  4975]
 [ 6022 35325]]
```

{:.input_area}
```python
print(classification_report(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
             precision    recall  f1-score   support

       High       0.78      0.81      0.80     26755
        Low       0.88      0.85      0.87     41347

avg / total       0.84      0.84      0.84     68102
```

Precision: the percentage of selected items that are correct. Recall: the percentage of correct items that are selected. For the `High` class above, precision = 21780 / (21780 + 6022) ≈ 0.78 and recall = 21780 / (21780 + 4975) ≈ 0.81.

{:.input_area}
```python
test_wf = vectorizer.transform(test['description'])
test_prediction = nb_classifier.predict(test_wf)
```

{:.input_area}
```python
print(accuracy_score(test['rating'], test_prediction))
```

{:.output .output_stream}
```
0.8360742393985668
```

{:.input_area}
```python
print(classification_report(test['rating'], test_prediction))
```

{:.output .output_stream}
```
             precision    recall  f1-score   support

       High       0.79      0.81      0.80      6880
        Low       0.87      0.85      0.86     10146

avg / total       0.84      0.84      0.84     17026
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range = (1,1),
                             stop_words = 'english',
                             max_df = .60,
                             min_df = 5,
                             max_features = None)
```

{:.input_area}
```python
vectorizer.fit(train['description'])
X_train = vectorizer.transform(train['description'])
nb_classifier.fit(X_train, train['rating'])
```

{:.output .output_data_text}
```
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
```

{:.input_area}
```python
print(accuracy_score(train['rating'], nb_classifier.predict(X_train)))
```

{:.output .output_stream}
```
0.8925435376347244
```

{:.input_area}
```python
print(accuracy_score(test['rating'], nb_classifier.predict(vectorizer.transform(test['description']))))
```

{:.output .output_stream}
```
0.8863502877951368
```

### Your turn

What happens to your model if you change some of the parameters for your vectorizer? Be sure to split the data between train and test!

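One way to run the comparison, sketched here with the wine data (swap in your own frames and columns): loop over a single vectorizer setting and watch train and test accuracy move together.

{:.input_area}
```python
# Sweep one parameter and compare in-sample vs. out-of-sample accuracy.
for min_df in [2, 10, 50, .05]:
    v = CountVectorizer(stop_words = 'english', max_df = .60, min_df = min_df)
    X_tr = v.fit_transform(train['description'])
    X_te = v.transform(test['description'])
    nb_classifier.fit(X_tr, train['rating'])
    print(min_df,
          accuracy_score(train['rating'], nb_classifier.predict(X_tr)),
          accuracy_score(test['rating'], nb_classifier.predict(X_te)))
```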
{:.input_area}
```python
from sklearn.linear_model import LogisticRegression
```

{:.input_area}
```python
ln_classifier = LogisticRegression()
```

{:.input_area}
```python
vectorizer = CountVectorizer(lowercase=True,
                             ngram_range = (1,2),
                             stop_words = 'english',
                             max_df = .80,
                             min_df = .005,
                             max_features = None)
vectorizer.fit(train['description'])
print(len(vectorizer.get_feature_names()))
ln_classifier.fit(vectorizer.transform(train['description']), train['rating'])
```

{:.output .output_stream}
```
913
```

{:.output .output_data_text}
```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

{:.input_area}
```python
print(accuracy_score(train['rating'], ln_classifier.predict(vectorizer.transform(train['description']))))
```

{:.output .output_stream}
```
0.9051863381398491
```

{:.input_area}
```python
print(accuracy_score(test['rating'], ln_classifier.predict(vectorizer.transform(test['description']))))
```

{:.output .output_stream}
```
0.8981557617761071
```

### What about a different model?

### Your turn

What is the out-of-sample accuracy of a logistic regression model on your data?

```python
from sklearn.linear_model import LogisticRegression
```

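A hedged sketch for the speech data, reusing the assumed `speeches_df` with its hypothetical `text` and `speaker` columns:

{:.input_area}
```python
# Same recipe as the wine example, with logistic regression on speeches.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score

speech_train, speech_test = train_test_split(speeches_df, test_size = 0.2)
vec = CountVectorizer(stop_words = 'english', max_df = .80, min_df = .005)
lr = LogisticRegression()
lr.fit(vec.fit_transform(speech_train['text']), speech_train['speaker'])
print(accuracy_score(speech_test['speaker'],
                     lr.predict(vec.transform(speech_test['text']))))
```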
![](images/knn1.png)

{:.input_area}
```python
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors = 3)
```

{:.input_area}
```python
from sklearn.feature_extraction.text import TfidfVectorizer
```

{:.input_area}
```python
tf_vector = TfidfVectorizer(lowercase = True,
                            ngram_range = (1,1),
                            stop_words = 'english',
                            max_df = .60,
                            min_df = .05,
                            max_features = None)
```

{:.input_area}
```python
tf_vector.fit(wine_df['description'])
```

{:.input_area}
```python
len(tf_vector.get_feature_names())
```

{:.input_area}
```python
review_tf = tf_vector.transform(wine_df['description'])
```

{:.input_area}
```python
wine_df['rating']
```

{:.input_area}
```python
knn_classifier.fit(review_tf, wine_df['rating'])
```

{:.input_area}
```python
knn_prediction = knn_classifier.predict(review_tf)
```

{:.input_area}
```python
# knn_prediction covers every row of wine_df, so score it against the
# full set of labels -- an in-sample check.
print(accuracy_score(wine_df['rating'], knn_prediction))
```

{:.input_area}
```python
print(classification_report(wine_df['rating'], knn_prediction))
```

### Your turn

What does a k-nearest neighbors model for your speech dataset look like? How does the accuracy compare?

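A sketch under the same assumptions, pairing tf-idf features with k-nearest neighbors and scoring out of sample; it reuses the hypothetical `speech_train` / `speech_test` split from the earlier sketch.

{:.input_area}
```python
# KNN on tf-idf features for the speech data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words = 'english', max_df = .60, min_df = .05)
speech_tf_train = tfidf.fit_transform(speech_train['text'])
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(speech_tf_train, speech_train['speaker'])
print(accuracy_score(speech_test['speaker'],
                     knn.predict(tfidf.transform(speech_test['text']))))
```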
![](images/knn2.png)

{:.input_area}
```python
knn_classifier = KNeighborsClassifier(n_neighbors = 15)
```

### But what's the best fitting model?

{:.input_area}
```python
from sklearn.model_selection import GridSearchCV
```

{:.input_area}
```python
parameters = {'n_neighbors' : (2, 3, 4)}
```

{:.input_area}
```python
grid = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
```

{:.input_area}
```python
grid.fit(review_word_counts, wine_df['rating'])
```

{:.input_area}
```python
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
```

{:.input_area}
```python
pipeline = Pipeline([
    ('vectorizer' , CountVectorizer()),
    ('classifier' , MultinomialNB())
])
```

{:.input_area}
```python
parameters = {'vectorizer__max_df' : (.2, .4),
              'vectorizer__min_df' : (100, 150)
             }
```

{:.input_area}
```python
grid_search = GridSearchCV(pipeline, parameters, n_jobs = -1, cv = 3, verbose = 1)
```

{:.input_area}
```python
grid_search.fit(wine_df['description'], wine_df['rating'])
```

{:.input_area}
```python
grid_search.best_score_
```

{:.input_area}
```python
grid_search.best_estimator_
```

{:.input_area}
```python
parameters = {'vectorizer__max_df' : [.1, .15, .2, .25],
              'vectorizer__min_df' : [25, 50, 100],
              'vectorizer__stop_words' : [None, 'english'],
              'vectorizer__ngram_range' : [(1,1), (1,2)]
             }
```

{:.input_area}
```python
grid_search = GridSearchCV(pipeline, parameters, n_jobs = -1, cv = 5, verbose = 1)
```

{:.input_area}
```python
grid_search.fit(wine_df['description'], wine_df['rating'])
```

{:.input_area}
```python
grid_search.best_score_
```

{:.input_area}
```python
grid_search.best_estimator_.get_params()
```
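If you want more than the single best score, `GridSearchCV` records every combination it tried in `cv_results_`, which loads straight into a DataFrame:

{:.input_area}
```python
# Inspect all parameter combinations, ranked by mean cross-validated score.
results = pd.DataFrame(grid_search.cv_results_)
results[['params', 'mean_test_score']].sort_values('mean_test_score',
                                                   ascending = False)
```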