Sushi Bars and Buffets: Measuring local culture with the Yelp API

Neal Caren - University of North Carolina, Chapel Hill web twitter scholar

Sociologists are often interested in how lived experiences vary by social class. This includes topics like variation in the chances of going to prison and life expectancies as well as the kinds of music people like or what type of sports their kids do.

For one project I'm working on, a team of sociologists and pediatricians are looking at variation in the local food environment as a possible explanation for the link between socioeconomic status and youth obesity. In the relatively poor county we studied, images of fast food, especially the "chicken 'n biscuits" provider Bojangles, dominated the roadside.

As a first look at a broader range of food environments, I decided to use the Yelp API to collect data on the mix of restaurants across the country. In this case, I'm not using any of the user provided review data that Yelp is known for, but rather using Yelp as a categorized directory of eating establishments. Since Yelp provides a wide variety of restaurant categories, it is possible to get more information than just the presence of fast food chains, but also look at how prominent buffets, sushi bars, or pizzerias are in different parts of the country.

Most of the APIs of interest to social scientists weren't designed for our use. They are primarily for web or mobile app developers who want to include the content on their pages. So while I might use the MapQuest API to look at how often intra-city trips involve highways, the target audience is business owners trying to help people get to their store. Similarly, scores of researchers have used data from Twitter APIs to study politics, but it was developed so that you could put a custom Twitter widget on your home page.

The good news is that since these services want you to use their data, the APIs are often well documented, especially for languages like Python that are popular in Silicon Valley. The bad news is that APIs don't always make available the parts of the service, like historical data, that are of most interest to researchers. The worst news is research using APIs frequently violates the providers Terms of Service, so it can be an ethical grey zone.

When you sign up as a developer to use an API, you usually agree to only use the API to facilitate other people using the service (e.g. customer's finding their way to your store) and that you won't store the data. API providers usually enforce this through rate limiting, meaning you can only access the service so many times per minute or per day. For example, you can only search status updates 180 times every 15 minutes according to Twitter guidelines. Yelp limits you to 10,000 calls per day. If you go over your limit, you won't be able to access the service for a bit. You will also get in trouble if you redistribute the data, so don't plan on doing that.

I began setting up my program by importing seven libraries. Four of these are part of any standard Python installation. Two others (numpy and requests) are used frequently and included in most Python bundles, like the free anacoda bundle.

The one package that you might need to install is oauth2. The Yelp API requires that requests be authenticated using the OAuth protocol.

In [2]:

from __future__ import division
import json
import csv
from collections import Counter

import numpy as np
import requests

import oauth2

Two of the major reasons that web services require API authentication is so that they know who you are and so they can make sure that you don't go over their rate limits. Since you shouldn't be giving your password to random people on the internet, API authentication works a little bit differently. Like many other places, in order to use the Yelp API you have to sign up as developer. After telling them a little bit about what you plan to do--feel free to be honest; they aren't going to deny you access if you put "research on food cultures" as the purpose--you will get a Consumer Key, Consumer Secret, Token, and Token Secret. Copy and paste them somewhere special.

Using the Yelp API goes something like this. First, you tell Yelp who you are and what you want. Assuming you are authorized to have this information, they respond with a URL where you can retrieve the data. The coding for this in practice is a little bit complicated, so there are often single use tools for accessing APIs, like Tweepy for Twitter.

There's no module to install for the Yelp API, but Yelp does provide some sample Python code. I've slightly modified the code below to show a sample search for restaurants near Chapel Hill, NC, sorted by distance. You can find more options in the search documentation. The API's search options include things like location and type of business, and allows you to sort either by distance or popularity.

In [3]:

consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'

consumer = oauth2.Consumer(consumer_key, consumer_secret)

category_filter = 'restaurants'
location = 'Chapel Hill, NC'
options =  'category_filter=%s&location=%s&sort=1' % (category_filter, location)
url = 'http://api.yelp.com/v2/search?' + options

oauth_request = oauth2.Request('GET', url, {})
oauth_request.update({'oauth_nonce'      : oauth2.generate_nonce(),
                      'oauth_timestamp'  : oauth2.generate_timestamp(),
                      'oauth_token'       : token,
                      'oauth_consumer_key': consumer_key})

token = oauth2.Token(token, token_secret)
oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
signed_url = oauth_request.to_url()

print signed_url

http://api.yelp.com/v2/search?sort=1&oauth_body_hash=2jmj7l5rSw0yVb%2FvlWAYkK%2FYBwk%3D&oauth_nonce=76998400&oauth_timestamp=1385485666&oauth_consumer_key=qDBPo9c_szHVrZwxzo-zDw&oauth_signature_method=HMAC-SHA1&category_filter=restaurants&oauth_token=jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF&location=Chapel+Hill%2C+NC&oauth_signature=ZaqLbf4VhvJ71cAxnoyFkKgKKD0%3D

The URL returned expires after a couple of seconds, so don't expect for the above link to work. The results are provided in the JSON file format, so I'm going to use the already imported requests module to download them.

In [26]:

resp = requests.get(url=signed_url)
chapel_hill_restaurants = resp.json()

chapel_hill_restaurants now stores a JSON object that can be treated quite similarly to a dictionary. Two of the entries are of interest, 'total' and 'businesses'. Total shows the maximum number of results that Yelp could return for that specific search. At most, Yelp will return 40 results, presumably to stop people from scraping their entire database.

In [37]:

print chapel_hill_restaurants['total']

The first page of results only returns a maximum of 20 results, so I would expect that is how items are in 'businesses'. To check what results were returned, I'll print out the rating and name of each of the restaurants, followed by the number of reviews.

In [51]:

for business in chapel_hill_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])

4.0 - Buns (173)
3.5 - Cosmic Cantina (61)
3.5 - R&R Grill (48)
4.0 - Mei Asian (22)
4.0 - Artisan Pizza Kitchen (36)
4.0 - Sutton's Drug Store (50)
4.0 - Bandido's Mexican Cafe (40)
3.5 - Ye Olde Waffle Shoppe (74)
3.5 - 35 Chinese Restaurant (50)
4.5 - Linda's Bar and Grill (58)
3.5 - Time-Out Restaurant (87)
4.5 - TRU Deli & Wine (15)
4.0 - Carolina Crossroads Restaurant (37)
4.0 - Cholanad (103)
4.5 - Mediterranean Deli (228)
4.0 - Sandwhich (150)
3.5 - 411 West (130)
4.0 - Lantern (162)
4.0 - Elaine's On Franklin (77)
4.0 - Talullas (80)

Our graduate students recently sponsored drinks at TRU, one of the highest rated places. It was good, but I suspect as the number of reviews gets higher, their rating will drop a bit.

I found out that 'ratings' held the Yelp rating by looking at the documentation, but the information can also be found by looking at the keys of a specific entry. In this case, I'll take advantage of the fact that information about the last restaurant is still stored in business.

In [42]:

business.keys()

Out[42]:

[u'is_claimed',
 u'rating',
 u'mobile_url',
 u'rating_img_url',
 u'review_count',
 u'name',
 u'rating_img_url_small',
 u'url',
 u'is_closed',
 u'id',
 u'phone',
 u'snippet_text',
 u'image_url',
 u'categories',
 u'display_phone',
 u'rating_img_url_large',
 u'menu_provider',
 u'menu_date_updated',
 u'snippet_image_url',
 u'location']

For this project, I'm only interested in the kind of food that restaurants serve, which is stored in 'categories'

In [43]:

business['categories']

Out[43]:

[[u'Middle Eastern', u'mideastern'],
 [u'Turkish', u'turkish'],
 [u'Mediterranean', u'mediterranean']]

Somewhat inconveniently, this is a list of lists, but since I only really care about the first item, I can extract that with some list comprehension:

In [47]:

categories = [category[0] for category in business['categories']]
categories

Out[47]:

[u'Middle Eastern', u'Turkish', u'Mediterranean']

In case you were wondering, the 'u'' in front of the restaurant categories, and all the other strings, means that the text is stored as unicode, which contains more characters than the vanilla ASCII set. Storing text as unicode can be quite useful, but can causes some difficulties when functions expect ASCII, like when you use the csv writer to output an à without properly encoding the document.

Since I want to gather restaurant information for lots of different areas, I put a slightly modified version of the code above in a new function that I call get_yelp_businesses.

In [48]:

def get_yelp_businesses(location, category_filter = 'restaurants'):
    # from https://github.com/Yelp/yelp-api/tree/master/v2/python
    consumer_key    = 'qDBPo9c_szHVrZwxzo-zDw'
    consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
    token           = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
    token_secret    = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'

    consumer = oauth2.Consumer(consumer_key, consumer_secret)

    url = 'http://api.yelp.com/v2/search?category_filter=%s&location=%s&sort=1' % (category_filter,location)

    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update({'oauth_nonce': oauth2.generate_nonce(),
                          'oauth_timestamp': oauth2.generate_timestamp(),
                          'oauth_token': token,
                          'oauth_consumer_key': consumer_key})

    token = oauth2.Token(token, token_secret)

    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)

    signed_url = oauth_request.to_url()
    resp = requests.get(url=signed_url)
    return resp.json()

The function expects that the first thing you input will be a location. Taking advantage of both oath2's ability to clean up the text so that it is functional when put in a URL (e.g., escape spaces) and Yelp's savvy ability to parse locations, the value for location can be fairly wide (e.g., "Chapel Hill" or "90210"). You can also add a category of business to search for from the list of acceptable values. If you don't provide a value, category_filter = 'restaurants' provides a default value of 'restaurants'. This function returns the JSON formatted results. Note that this doesn't have any mechanism for handling errors, which will need to happen elsewhere.

In [50]:

beverly_hills_restaurants = get_yelp_businesses('90210')
for business in beverly_hills_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])

4.0 - Sprinklesmobile (35)
4.0 - The Lounge at L'Ermitage (16)
4.0 - The Polo Lounge (222)
4.0 - The Fountain Coffee Shop (40)
4.0 - Mastro's Penthouse (93)
4.5 - The Larder at Maple Drive (14)
4.5 - Harajuku Crepe (573)
4.0 - Nic's Beverly Hills (445)
4.0 - Il Tramezzino (256)
4.0 - Euro Caffe (52)
4.0 - Sayuri (60)
4.0 - Sfixio (108)
4.5 - Da Pasquale Restaurant (230)
4.5 - DOMA (42)
4.5 - Ferrarini (24)
4.0 - Walter's 2 Go (27)
4.5 - Mastro's Steakhouse (2303)
4.0 - La Dolce Vita (136)
4.0 - Scarpetta (515)
4.0 - Fleming's Prime Steakhouse & Wine Bar (157)

I hate to admit it, but it appears that Beverly Hills has nicer restaurants than Chapel Hill. Also, based on the number of reviews, more popular ones.

Like many sociologists, I have a comma separated text file sitting around of big counties (population 50,000+) sorted by educational attainment. It is possible to read this in as you would a normal text file, but python's csv module is better at parsing the data. While the object it produces is list-like, it isn't iterable, but you can convert it with list:

In [83]:

big_counties = csv.reader(open('county_college.csv', 'rU'))
big_counties = list(big_counties)
print big_counties[0]


#remove header row
big_counties = big_counties[1:]

#How many counties?
print len(big_counties)

#first county in the list
print big_counties[0]

#Last county in the list
print big_counties[-1]

['fips', 'name', 'state', 'college_degree', 'population_2012', 'education']
973
['06037', 'Los Angeles County', 'CA', '29.2', '1.0e+07', 'Middle']
['05035', 'Crittenden County', 'AR', '13.5', '50021', 'Low']

Since this is a quick and dirty analysis, I tossed the top row, which has the column labels, and just made a mental note of the order of the columns. A more serious analysis would probably use pandas, which has a better ability to import csv files directly for use in statistical analysis. The last column has a grouping variable that I had previously used, where the top quarter of counties in terms of % college graduates are coded as "High", the bottom quarter coded as "Low" and everything else coded as "Middle". The counties were already sorted by population size, but that doesn't really matter.

To get all the restaurant data, I loop through each county and store the Yelp results in a new dictionary. To make sure it's plugging along, I have it print out some information along the way.

Things might go wrong half way through, and I don't want to look up counties twice as that is slow, rude to Yelp, and could lead me to hitting my API limit. One solution is to check to see if you have the results stored locally first, and only call the API when the information is missing. So I make sure to only call the Yelp API when I don't already have the county in in my 'county_restaurants' dictionary.

In [85]:

county_restaurants = {}

In [86]:

for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    if full_name not in county_restaurants:
        county_restaurants[full_name] = get_yelp_businesses(full_name)
        print county_restaurants[full_name]['total'], full_name

3 Los Angeles County, CA
40 Cook County, IL
40 Harris County, TX
7 Maricopa County, AZ
23 San Diego County, CA
40 Orange County, CA
40 Miami, FL
40 Kings County, NY
40 Dallas County, TX
40 Queens County, NY
1 Riverside County, CA
0 San Bernardino County, CA
40 King County, WA
40 Clark County, NV
...
4 Henry County, IL
1 Crittenden County, AR

This takes about 15 minutes to run through. Although it's running, it looks like we aren't getting all the data we wanted. In particular, some very large counties, like Los Angeles are showing up as having no or few restaurants. Presumably these areas has some nice dining options, but they likely only show up in Yelp when you select "Tucson" as the region, rather than "Pima County". Additionally, as noted in the guide for Developers, the API will only return the top 40 results. Each API call returns at most 20 businesses, and since the get_yelp_businesses function only returns one page, I don't have the full 40. I'm not sure that either of these bias the overall preliminary explanatory findings, so I think it's okay to press forward, although I would want to go back and fix some of the names. As a test of this, I already edited "Miami-Dade County" to "Miami" which you might have noticed in the county list.

Once all the county data is collected, a simple first pass at an analysis would be to count up how prominent different types of restaurants are in each type of county. This is slightly complicated by the fact that the number of counties differs between each of the education groups, the number of restaurants varies by county, and the number of categories varies by restaurant. One way to accurately summarize the data would be to look at the proportion of restaurants in a given county-type that are associated with each category. That is, what percentage of restaurants are described as pizza parlors or sushi bars?

First, I set up a counter to store how many restaurants are in each type of county:

In [122]:

restaurant_count = {'High'   : 0,
                    'Low'    : 0,
                    'Middle' : 0}

I filled in the keys and starting values to avoid errors the first time I encounter a county type. I didn't have to preset the value of zero. One other way would be to set up 'restaurant_count' as an empty dictionary, and try to increment the counter every time a restaurant is observed, and if there is an error set the value to 1.

In [128]:

restaurant_count = {}

try:
    # if it already has a value, increment the value by one
    restaurant_count['High'] +=1
except:
    # If this is the first time we are seeing it, start at one
    restaurant_count['High'] = 1

This is more flexible--it doesn't rely on the fixed categories of 'High', 'Low and 'Middle'.

Another Pythonic way of doing things is to use a defaultdict. This is similar to a standard dictionary, except that you can set a default value or object type for new keys.

In [2]:

from collections import defaultdict

restaurant_count = defaultdict(int)

Now 'restaurant_count' expects integers. By default, an unknown key is assumed to have a value of zero:

In [4]:

print restaurant_count['Super Fancy']

This way we don't have to worry about catching errors and have the flexibility if we want to shift to a different categorization scheme. In addition to numbers, default_dicts can be set up with other useful data types, such as list or dictionaries, or we can set up a couple of them inside another dictionary, which is what I wanted to do to keep track of the count of different restaurant categories by county type.

In []:

restaraunt_types = {'High'   : defaultdict(int),
                    'Low'    : defaultdict(int),
                    'Middle' : defaultdict(int)}

I fill up these counters by looping through the list of counties and looking up the list of restaurants from the 'county_restaurants' dictionary I created earlier. For each restaurant, I increment the restaurant counter for that county type. I also increment the appropriate counter for each category of food that they are listed as serving.

In [133]:

for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    ses = county[-1]
    restaurants = county_restaurants[full_name]['businesses']
    for restaurant in restaurants:
        #increment the restaurant counter
        restaurant_count[ses] += 1

        #Remove duplicate categories and the 'restaurant category'
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories if c!= ['Restaurants','restaurants']]

        #increment the ses-category counter
        for category in categories:
            restaraunt_types[ses][category] = restaraunt_types[ses][category] + 1

In [137]:

print restaurant_count

defaultdict(<type 'int'>, {'High': 4174, 'Middle': 6984, 'Low': 2989})

Because of limitations in the API, I only gathered data on a maximum of 20 restaurants per county, so 'restaurant_count' probably understates the true difference in the number of variation in restaurants across each education level. That said, I have no strong reason to suspect that there is any bias in the types of restaurants listed.

Finally, to print out the results, I listed the top 25 restaurant categories for each county type using the values storied in 'restaurant_count' to compute the denominator.

In [236]:

for ses in restaurant_count:
    print '%s county SES (n=%s)' % (ses, restaurant_count[ses])

    #loop over each category, sorted by frequency
    for category in sorted(restaraunt_types[ses], key = restaraunt_types[ses].get, reverse=True )[:25]:

        #Compute the % of overall restaurants that are of a type
        frequency  = restaraunt_types[ses][category]
        percent = frequency/restaurant_count[ses] * 100

        #Print out the results, including a leading zero so the columns look pretty.
        print '%04.1f%%  %s' % (percent , category)

    #Blank line in between each category
    print

High county SES (n=4174)
12.8%  Pizza
10.5%  American (New)
09.2%  American (Traditional)
08.4%  Sandwiches
07.7%  Mexican
07.7%  Italian
05.3%  Delis
05.1%  Chinese
04.6%  Burgers
04.3%  Breakfast & Brunch
04.1%  Fast Food
03.9%  Seafood
03.7%  Barbeque
03.2%  Cafes
02.9%  Sushi Bars
02.9%  Japanese
02.7%  Steakhouses
02.6%  Bars
02.2%  Coffee & Tea
02.0%  Thai
01.9%  Bakeries
01.7%  Diners
01.7%  Mediterranean
01.4%  Pubs
01.4%  Asian Fusion

Middle county SES (n=6984)
13.0%  Pizza
11.3%  American (Traditional)
08.8%  Mexican
08.5%  American (New)
07.1%  Sandwiches
06.1%  Italian
05.8%  Chinese
05.7%  Fast Food
04.8%  Burgers
04.5%  Barbeque
03.9%  Breakfast & Brunch
03.4%  Seafood
03.3%  Delis
03.1%  Steakhouses
02.7%  Cafes
02.6%  Japanese
02.5%  Diners
02.3%  Bars
01.9%  Sushi Bars
01.8%  Chicken Wings
01.6%  Thai
01.4%  Coffee & Tea
01.2%  Sports Bars
01.1%  Ice Cream & Frozen Yogurt
01.1%  Buffets

Low county SES (n=2989)
12.0%  Pizza
11.2%  Mexican
10.7%  American (Traditional)
08.4%  Fast Food
07.0%  American (New)
05.8%  Burgers
05.4%  Chinese
05.4%  Sandwiches
05.0%  Barbeque
04.3%  Italian
03.3%  Steakhouses
02.7%  Breakfast & Brunch
02.6%  Cafes
02.6%  Seafood
02.0%  Delis
01.8%  Japanese
01.7%  Bars
01.7%  Buffets
01.6%  Diners
01.6%  Chicken Wings
01.3%  Southern
00.9%  Caterers
00.9%  Sushi Bars
00.9%  Hot Dogs
00.8%  Thai

Everybody likes pizza. Assuming the Yelp data is representative and this method has something approaching plausibility, about one in eight restaurants in the US is a pizza place, and this is relatively constant across different types of counties. After that, there is a fair amount of variation. I need to do more research on what constitutes 'American (New)' cuisine, but it is more popular in high SES counties than in low SES counties, which favor 'American (Traditional)'.

Even more stratified is 'Sushi Bars' which constitute roughly 3% of restaurants in high SES areas, 2% in middle, and 1% in low SES counties. In contrast, approximately 2% of restaurants in low SES counties are 'Buffets', while they aren't even in the top 25 for high SES counties. The exact percent is:

In [192]:

print restaraunt_types['High']['Buffets'] / restaurant_count['High'] * 100

0.622903689506

Fast food is stratified as well and found approximately twice as often in low SES counties as in high SES counties:

In [197]:

high_ff = restaraunt_types['High']['Fast Food'] / restaurant_count['High'] * 100
low_ff  = restaraunt_types['Low']['Fast Food'] / restaurant_count['Low'] * 100
print low_ff/high_ff

2.05793171453

This code could be generalized to loop over a couple of popular food categories:

In [219]:

category_list = ['Mexican', u'Chinese', 'American (New)', 'American (Traditional)',
                 'Breakfast & Brunch', 'Delis', 'Pizza', 'Sandwiches', 'Fast Food',
                 'Burgers', 'Italian', 'Barbeque']

ratio = {}
for category in category_list:
    high_ff = restaraunt_types['High'][category] / restaurant_count['High'] * 100
    low_ff  = restaraunt_types['Low'][category] / restaurant_count['Low'] * 100
    ratio[category] = high_ff/low_ff

for category in sorted(ratio, key=ratio.get):
    print '%0.2f  %s' % (ratio[category], category)

0.49  Fast Food
0.69  Mexican
0.75  Barbeque
0.81  Burgers
0.87  American (Traditional)
0.94  Chinese
1.07  Pizza
1.50  American (New)
1.56  Sandwiches
1.58  Breakfast & Brunch
1.77  Italian
2.69  Delis

So the low SES eating environment has an overrepresentation of barbeque, Mexican and fast food places, while delis, Italian food, and breakfast & brunch places are more commonly found in high SES counties.

If I were doing this for a peer-reviewed manuscript, I would probably use a sample of census tracts rather than counties in order to get a more realistic picture of local food environments. I would also want to spend some time understanding how restaurants get categorized and what each category means.

For publication, I would also use statistical models appropriate for count variables and treat education as continuous rather than binning it. Python has pretty impressive capabilities for statistical models, so I put together some quick code to model the count of the number of barbeque joints as a function of the proportion of the adult population with a college degree using the statsmodels negative binomial regression function.

In [220]:

#Module for doing statistical analysis
import statsmodels.api as sm

def county_category(businesses, category):
    #count how many businesses in a county are of a certain type
    category_count = 0
    for restaurant in businesses:
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories]
        if category in categories:
            category_count += 1
    return category_count

#dependent variable
Y = [county_category( county_restaurants['%s, %s' % (c[1],c[2])]['businesses'],"Barbeque")
     for c in big_counties]

#Explantory variable
education_level = [float(c[-3]) for c in big_counties]

In [221]:

#negative binomial regression model
mod_nbin = sm.NegativeBinomial(Y, education_level)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()

                     NegativeBinomial Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  973
Model:               NegativeBinomial   Df Residuals:                      972
Method:                           MLE   Df Model:                            0
Date:                Tue, 19 Nov 2013   Pseudo R-squ.:               -0.006523
Time:                        15:38:39   Log-Likelihood:                -1057.1
converged:                       True   LL-Null:                       -1050.2
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0161      0.002     -9.268      0.000        -0.019    -0.013
alpha          0.4821      0.113      4.257      0.000         0.260     0.704
==============================================================================

Not surprisingly, college education (x1) is negatively correlated with the count of barbeque restaurants. This relationship could be spurious, however, as county population size is negatively associated with both high levels of education and high levels of barbeque.

In [222]:

population = [float(c[-2]) for c in big_counties]
population = [np.log(p) for p in population]

#combine
X = zip(education_level, population)
print X[:5]
print ''
#rerun regression model
mod_nbin = sm.NegativeBinomial(Y, X)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()

[(29.2, 16.11809565095832), (33.7, 15.464169183551656), (27.9, 15.274125580663791), (29.1, 15.176487111099874), (34.2, 14.978661367769956)]

                     NegativeBinomial Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  973
Model:               NegativeBinomial   Df Residuals:                      971
Method:                           MLE   Df Model:                            1
Date:                Tue, 19 Nov 2013   Pseudo R-squ.:              -0.0007531
Time:                        15:40:19   Log-Likelihood:                -1051.0
converged:                       True   LL-Null:                       -1050.2
                                        LLR p-value:                     1.000
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.0011      0.005      0.218      0.827        -0.009     0.011
x2            -0.0404      0.012     -3.501      0.000        -0.063    -0.018
alpha          0.4429      0.110      4.020      0.000         0.227     0.659
==============================================================================

Once population (x2) is in the model, education level (x1) is no longer significant. So the number of barbeque restaurants in town is largely a function of county size, but, because of the relationship between county size and education level, they are disproportionately found in low SES counties.

As before, we can revise the code to loop over more types of restaurants:

In [234]:

types = set(restaraunt_types['High'])

coefficients= {}

for type in types:
    Y = [county_category( county_restaurants['%s, %s' % (c[1],c[2])]['businesses'],type) for c in big_counties]
    if np.mean(Y) > .05:
        mod_nbin = sm.NegativeBinomial(Y, X)
        res_nbin = mod_nbin.fit(disp=False)
        if np.absolute(res_nbin.tvalues[0]) > 1.96:
            coefficients[type] = res_nbin.params[0]


for category in sorted(coefficients, key=coefficients.get, reverse=True):
    print '%.02f %s' % (coefficients[category], category)

0.11 Food Trucks
0.08 Vietnamese
0.08 French
0.08 Food Stands
0.07 Wine Bars
0.07 Vegetarian
0.07 Middle Eastern
0.06 Mediterranean
0.06 Grocery
0.06 Indian
0.06 Greek
0.06 Latin American
0.06 Desserts
0.05 Asian Fusion
0.05 Bakeries
0.05 Sushi Bars
0.05 Lounges
0.04 Thai
0.04 Delis
0.04 Coffee & Tea
0.04 Pubs
0.03 Beer, Wine & Spirits
0.03 Japanese
0.03 Hot Dogs
0.03 Salad
0.03 Sports Bars
0.02 American (New)
0.02 Breakfast & Brunch
0.02 Cafes
0.02 Italian
0.02 Sandwiches
0.02 Seafood
0.02 Caterers
0.02 Bars
0.02 Ice Cream & Frozen Yogurt
0.01 Pizza
-0.02 Buffets

This time, I only printed out the coefficient for SES when there was a statistically significant relationship. Controlling for population size, food trucks are the most distinctively high SES food category. This specific relationship may have something to do with the operationalization of SES, as college towns, which often have high education levels but not the highest income levels, are rife with food trucks.

Only one food category was negatively associated with education level, after controlling for population size. I think this is because the outcome variable wasn't about the relative proportion in a county, but the absolute count, and high SES areas are more likely to have restaurants of all types (except buffets).