Neal Caren - University of North Carolina, Chapel Hill
Sociologists are often interested in how lived experiences vary by social class. This includes topics like variation in the chances of going to prison or in life expectancy, as well as the kinds of music people like and the types of sports their kids play.
For one project I'm working on, a team of sociologists and pediatricians is looking at variation in the local food environment as a possible explanation for the link between socioeconomic status and youth obesity. In the relatively poor county we studied, images of fast food, especially the "chicken 'n biscuits" provider Bojangles, dominated the roadside.
As a first look at a broader range of food environments, I decided to use the Yelp API to collect data on the mix of restaurants across the country. In this case, I'm not using any of the user-provided review data that Yelp is known for, but rather using Yelp as a categorized directory of eating establishments. Since Yelp provides a wide variety of restaurant categories, it is possible to go beyond the presence of fast food chains and look at how prominent buffets, sushi bars, or pizzerias are in different parts of the country.
Most of the APIs of interest to social scientists weren't designed for our use. They are primarily for web or mobile app developers who want to include the content on their pages. So while I might use the MapQuest API to look at how often intra-city trips involve highways, the target audience is business owners trying to help people get to their store. Similarly, scores of researchers have used data from Twitter APIs to study politics, but it was developed so that you could put a custom Twitter widget on your home page.
The good news is that since these services want you to use their data, the APIs are often well documented, especially for languages like Python that are popular in Silicon Valley. The bad news is that APIs don't always make available the parts of the service, like historical data, that are of most interest to researchers. The worst news is that research using APIs frequently violates the provider's Terms of Service, so it can be an ethical grey zone.
When you sign up as a developer to use an API, you usually agree to only use the API to facilitate other people using the service (e.g., customers finding their way to your store) and that you won't store the data. API providers usually enforce this through rate limiting, meaning you can only access the service so many times per minute or per day. For example, you can only search status updates 180 times every 15 minutes according to Twitter guidelines. Yelp limits you to 10,000 calls per day. If you go over your limit, you won't be able to access the service for a bit. You will also get in trouble if you redistribute the data, so don't plan on doing that.
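In practice, the easiest way to stay on the right side of a rate limit is simply to pause between calls. Here's a minimal sketch; the function name and the one-second delay are my own conservative choices, not anything Yelp specifies.
import time
import requests

def polite_get(url, delay=1.0):
    #Pause before each request so calls are spaced out under the rate limit
    time.sleep(delay)
    return requests.get(url)

#usage: resp = polite_get(some_signed_url)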
I began setting up my program by importing seven libraries. Four of these are part of any standard Python installation. Two others (numpy and requests) are used frequently and included in most Python bundles, like the free Anaconda bundle.
The one package that you might need to install is oauth2. The Yelp API requires that requests be authenticated using the OAuth protocol.
from __future__ import division
import json
import csv
from collections import Counter
import numpy as np
import requests
import oauth2
Two of the major reasons that web services require API authentication are so that they know who you are and so they can make sure that you don't go over their rate limits. Since you shouldn't be giving your password to random people on the internet, API authentication works a little bit differently. Like many other places, in order to use the Yelp API you have to sign up as a developer. After telling them a little bit about what you plan to do--feel free to be honest; they aren't going to deny you access if you put "research on food cultures" as the purpose--you will get a Consumer Key, Consumer Secret, Token, and Token Secret. Copy and paste them somewhere special.
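One sensible "somewhere special" is a small JSON file that your scripts read at startup, so the keys never get pasted into code you might share. A sketch, with a file name and key names of my own invention:
import json

#yelp_keys.json is a hypothetical file that looks like:
#{"consumer_key": "...", "consumer_secret": "...",
# "token": "...", "token_secret": "..."}
with open('yelp_keys.json') as f:
    keys = json.load(f)

consumer_key = keys['consumer_key']
consumer_secret = keys['consumer_secret']
token = keys['token']
token_secret = keys['token_secret']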
Using the Yelp API goes something like this. First, you tell Yelp who you are and what you want. Assuming you are authorized to have this information, they respond with a URL where you can retrieve the data. The coding for this in practice is a little bit complicated, so there are often single-purpose tools for accessing APIs, like Tweepy for Twitter.
There's no module to install for the Yelp API, but Yelp does provide some sample Python code. I've slightly modified the code below to show a sample search for restaurants near Chapel Hill, NC, sorted by distance. You can find more options in the search documentation. The API's search options include things like location and type of business, and allow you to sort either by distance or popularity.
consumer_key = 'qDBPo9c_szHVrZwxzo-zDw'
consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
token = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
token_secret = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'
consumer = oauth2.Consumer(consumer_key, consumer_secret)
category_filter = 'restaurants'
location = 'Chapel Hill, NC'
options = 'category_filter=%s&location=%s&sort=1' % (category_filter, location)
url = 'http://api.yelp.com/v2/search?' + options
oauth_request = oauth2.Request('GET', url, {})
oauth_request.update({'oauth_nonce' : oauth2.generate_nonce(),
                      'oauth_timestamp' : oauth2.generate_timestamp(),
                      'oauth_token' : token,
                      'oauth_consumer_key': consumer_key})
token = oauth2.Token(token, token_secret)
oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
signed_url = oauth_request.to_url()
print signed_url
The URL returned expires after a couple of seconds, so don't expect the above link to work. The results are provided in the JSON file format, so I'm going to use the already imported requests module to download them.
resp = requests.get(url=signed_url)
chapel_hill_restaurants = resp.json()
chapel_hill_restaurants now stores a JSON object that can be treated quite similarly to a dictionary. Two of the entries are of interest, 'total' and 'businesses'. Total shows the maximum number of results that Yelp could return for that specific search. At most, Yelp will return 40 results, presumably to stop people from scraping their entire database.
print chapel_hill_restaurants['total']
The first page of results only returns a maximum of 20 results, so I would expect that is how many items are in 'businesses'. To check what results were returned, I'll print out the rating and name of each of the restaurants, followed by the number of reviews.
for business in chapel_hill_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])
Our graduate students recently sponsored drinks at TRU, one of the highest rated places. It was good, but I suspect as the number of reviews gets higher, their rating will drop a bit.
I found out that 'rating' held the Yelp rating by looking at the documentation, but the information can also be found by looking at the keys of a specific entry. In this case, I'll take advantage of the fact that information about the last restaurant is still stored in business.
business.keys()
For this project, I'm only interested in the kind of food that restaurants serve, which is stored in 'categories':
business['categories']
Somewhat inconveniently, this is a list of lists, but since I only really care about the first item in each, I can extract it with a list comprehension:
categories = [category[0] for category in business['categories']]
categories
In case you were wondering, the u'' in front of the restaurant categories, and all the other strings, means that the text is stored as Unicode, which contains many more characters than the vanilla ASCII set. Storing text as Unicode can be quite useful, but it can cause some difficulties when functions expect ASCII, like when you use the csv writer to output an à without properly encoding the document.
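For instance, here's a minimal sketch of the encoding step that keeps Python 2's csv writer happy; the file name and sample values are made up for illustration.
import csv

#sample Unicode values, including one non-ASCII character
categories = [u'Mexican', u'Caf\xe9s']

with open('categories.csv', 'wb') as f:
    writer = csv.writer(f)
    #encode each Unicode string as UTF-8 bytes before writing
    writer.writerow([c.encode('utf-8') for c in categories])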
Since I want to gather restaurant information for lots of different areas, I put a slightly modified version of the code above in a new function that I call get_yelp_businesses.
def get_yelp_businesses(location, category_filter='restaurants'):
    # from https://github.com/Yelp/yelp-api/tree/master/v2/python
    consumer_key = 'qDBPo9c_szHVrZwxzo-zDw'
    consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
    token = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
    token_secret = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'
    consumer = oauth2.Consumer(consumer_key, consumer_secret)
    url = 'http://api.yelp.com/v2/search?category_filter=%s&location=%s&sort=1' % (category_filter, location)
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update({'oauth_nonce': oauth2.generate_nonce(),
                          'oauth_timestamp': oauth2.generate_timestamp(),
                          'oauth_token': token,
                          'oauth_consumer_key': consumer_key})
    token = oauth2.Token(token, token_secret)
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    signed_url = oauth_request.to_url()
    resp = requests.get(url=signed_url)
    return resp.json()
The function expects that the first thing you input will be a location. Taking advantage of both oauth2's ability to clean up the text so that it is functional when put in a URL (e.g., escaping spaces) and Yelp's savvy ability to parse locations, the value for location can be fairly flexible (e.g., "Chapel Hill" or "90210"). You can also add a category of business to search for from the list of acceptable values. If you don't provide one, category_filter = 'restaurants' supplies a default value of 'restaurants'. The function returns the JSON-formatted results. Note that it doesn't have any mechanism for handling errors, which will need to happen elsewhere.
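For example, errors could be handled with a thin wrapper like the hypothetical one below, which retries a few times and then gives up; the wrapper name and retry count are my own choices, and the check assumes Yelp reports problems in the JSON body under an 'error' key, as the v2 API does.
import time

def try_get_yelp_businesses(location, attempts=3):
    #Hypothetical wrapper: retry a few times, then return None
    for attempt in range(attempts):
        try:
            result = get_yelp_businesses(location)
            #Yelp v2 reports problems as JSON with an 'error' key
            if 'error' not in result:
                return result
        except Exception:
            pass
        time.sleep(2)  #brief pause before trying again
    return None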
beverly_hills_restaurants = get_yelp_businesses('90210')
for business in beverly_hills_restaurants['businesses']:
    print '%s - %s (%s)' % (business['rating'], business['name'], business['review_count'])
I hate to admit it, but it appears that Beverly Hills has nicer restaurants than Chapel Hill. Also, based on the number of reviews, more popular ones.
Like many sociologists, I have a comma-separated text file sitting around of big counties (population 50,000+) sorted by educational attainment. It is possible to read this in as you would a normal text file, but Python's csv module is better at parsing the data. The reader object it produces is iterable, but it isn't a list you can index into, so I convert it with list:
big_counties = csv.reader(open('county_college.csv', 'rU'))
big_counties = list(big_counties)
print big_counties[0]
#remove header row
big_counties = big_counties[1:]
#How many counties?
print len(big_counties)
#first county in the list
print big_counties[0]
#Last county in the list
print big_counties[-1]
Since this is a quick and dirty analysis, I tossed the top row, which has the column labels, and just made a mental note of the order of the columns. A more serious analysis would probably use pandas, which has a better ability to import csv files directly for use in statistical analysis. The last column has a grouping variable that I had previously used, where the top quarter of counties in terms of % college graduates are coded as "High", the bottom quarter coded as "Low" and everything else coded as "Middle". The counties were already sorted by population size, but that doesn't really matter.
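For reference, the pandas version of the import is a one-liner; this sketch assumes the same county_college.csv file with its header row intact.
import pandas as pd

#read_csv keeps the header row as column labels and infers column types
counties_df = pd.read_csv('county_college.csv')
print counties_df.head()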
To get all the restaurant data, I loop through each county and store the Yelp results in a new dictionary. To make sure it's plugging along, I have it print out some information along the way.
Things might go wrong halfway through, and I don't want to look up counties twice, as that is slow, rude to Yelp, and could lead to me hitting my API limit. One solution is to check whether you have the results stored locally first, and only call the API when the information is missing. So I make sure to only call the Yelp API when I don't already have the county in my 'county_restaurants' dictionary.
county_restaurants = {}
for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    if full_name not in county_restaurants:
        county_restaurants[full_name] = get_yelp_businesses(full_name)
        print county_restaurants[full_name]['total'], full_name
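To be extra safe, the dictionary can also be written to disk after the loop, so a crashed session doesn't cost the API calls already made. A minimal sketch, with a file name of my own choosing:
import json

#save the collected results so they survive a restart
with open('county_restaurants.json', 'w') as f:
    json.dump(county_restaurants, f)

#in a later session, reload them before calling the API again
with open('county_restaurants.json') as f:
    county_restaurants = json.load(f)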
This takes about 15 minutes to run through. While it's running, it looks like we aren't getting all the data we wanted. In particular, some very large counties, like Los Angeles, are showing up as having no or few restaurants. Presumably these areas have some nice dining options, but they likely only show up in Yelp when you select "Tucson" as the region, rather than "Pima County". Additionally, as noted in the guide for developers, the API will only return the top 40 results. Each API call returns at most 20 businesses, and since the get_yelp_businesses function only returns one page, I don't have the full 40. I'm not sure that either of these biases the overall preliminary exploratory findings, so I think it's okay to press forward, although I would want to go back and fix some of the names. As a test of this, I already edited "Miami-Dade County" to "Miami", which you might have noticed in the county list.
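If the missing second page mattered, the v2 search API also accepts an offset parameter, so a follow-up call can pick up results 21 through 40. Here's a sketch of how the function could be extended, under the assumption that offset behaves as documented; the function name is mine.
def get_yelp_businesses_page(location, offset=0, category_filter='restaurants'):
    #Identical to get_yelp_businesses, except the URL asks Yelp to start
    #at a given position, e.g. offset=20 for the second page of results
    consumer_key = 'qDBPo9c_szHVrZwxzo-zDw'
    consumer_secret = '4we8Jz9rq5J3j15Z5yCUqmgDJjM'
    token = 'jeRrhRey_k-emvC_VFLGrlVHrkR4P3UF'
    token_secret = 'n-7xHNCxxedmAMYZPQtnh1hd7lI'
    consumer = oauth2.Consumer(consumer_key, consumer_secret)
    url = ('http://api.yelp.com/v2/search?category_filter=%s&location=%s'
           '&sort=1&offset=%s' % (category_filter, location, offset))
    oauth_request = oauth2.Request('GET', url, {})
    oauth_request.update({'oauth_nonce': oauth2.generate_nonce(),
                          'oauth_timestamp': oauth2.generate_timestamp(),
                          'oauth_token': token,
                          'oauth_consumer_key': consumer_key})
    token = oauth2.Token(token, token_secret)
    oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, token)
    return requests.get(url=oauth_request.to_url()).json()

#together, the two pages cover the most Yelp will return for one search
first_page = get_yelp_businesses_page('Chapel Hill, NC')
second_page = get_yelp_businesses_page('Chapel Hill, NC', offset=20)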
Once all the county data is collected, a simple first pass at an analysis would be to count up how prominent different types of restaurants are in each type of county. This is slightly complicated by the fact that the number of counties differs between each of the education groups, the number of restaurants varies by county, and the number of categories varies by restaurant. One way to accurately summarize the data would be to look at the proportion of restaurants in a given county-type that are associated with each category. That is, what percentage of restaurants are described as pizza parlors or sushi bars?
First, I set up a counter to store how many restaurants are in each type of county:
restaurant_count = {'High'   : 0,
                    'Low'    : 0,
                    'Middle' : 0}
I filled in the keys and starting values to avoid errors the first time I encounter a county type, but I didn't have to preset the values to zero. One other way would be to set up 'restaurant_count' as an empty dictionary, try to increment the counter every time a restaurant is observed, and set the value to 1 if that raises an error.
restaurant_count = {}

try:
    # if it already has a value, increment the value by one
    restaurant_count['High'] += 1
except KeyError:
    # if this is the first time we are seeing it, start at one
    restaurant_count['High'] = 1
This is more flexible--it doesn't rely on the fixed categories of 'High', 'Low', and 'Middle'.
Another Pythonic way of doing things is to use a defaultdict. This is similar to a standard dictionary, except that you can set a default value or object type for new keys.
from collections import defaultdict
restaurant_count = defaultdict(int)
Now 'restaurant_count' expects integers. By default, an unknown key is assumed to have a value of zero:
print restaurant_count['Super Fancy']
This way we don't have to worry about catching errors, and we keep the flexibility to shift to a different categorization scheme. In addition to numbers, defaultdicts can be set up with other useful data types, such as lists or dictionaries, or we can set up a couple of them inside another dictionary, which is what I wanted to do to keep track of the count of different restaurant categories by county type.
restaraunt_types = {'High'   : defaultdict(int),
                    'Low'    : defaultdict(int),
                    'Middle' : defaultdict(int)}
I fill up these counters by looping through the list of counties and looking up the list of restaurants from the 'county_restaurants' dictionary I created earlier. For each restaurant, I increment the restaurant counter for that county type. I also increment the appropriate counter for each category of food that they are listed as serving.
#start the counter fresh, so the 'Super Fancy' demonstration key doesn't linger
restaurant_count = defaultdict(int)

for county in big_counties:
    full_name = '%s, %s' % (county[1],county[2])
    ses = county[-1]
    restaurants = county_restaurants[full_name]['businesses']
    for restaurant in restaurants:
        #increment the restaurant counter
        restaurant_count[ses] += 1
        #drop the generic ['Restaurants', 'restaurants'] category
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories if c != ['Restaurants','restaurants']]
        #increment the ses-category counter
        for category in categories:
            restaraunt_types[ses][category] += 1
print restaurant_count
Because of limitations in the API, I only gathered data on a maximum of 20 restaurants per county, so 'restaurant_count' probably understates the true variation in the number of restaurants across education levels. That said, I have no strong reason to suspect that there is any bias in the types of restaurants listed.
Finally, to print out the results, I list the top 25 restaurant categories for each county type, using the values stored in 'restaurant_count' to compute the denominator.
for ses in restaurant_count:
    print '%s county SES (n=%s)' % (ses, restaurant_count[ses])
    #loop over each category, sorted by frequency
    for category in sorted(restaraunt_types[ses], key=restaraunt_types[ses].get, reverse=True)[:25]:
        #compute the % of overall restaurants that are of this type
        frequency = restaraunt_types[ses][category]
        percent = frequency/restaurant_count[ses] * 100
        #print the results, with a leading zero so the columns line up
        print '%04.1f%% %s' % (percent, category)
    #blank line in between each county type
    print
Everybody likes pizza. Assuming the Yelp data is representative and this method has something approaching plausibility, about one in eight restaurants in the US is a pizza place, and this is relatively constant across different types of counties. After that, there is a fair amount of variation. I need to do more research on what constitutes 'American (New)' cuisine, but it is more popular in high SES counties than in low SES counties, which favor 'American (Traditional)'.
Even more stratified are 'Sushi Bars', which constitute roughly 3% of restaurants in high SES areas, 2% in middle SES counties, and 1% in low SES counties. In contrast, approximately 2% of restaurants in low SES counties are 'Buffets', a category that doesn't even crack the top 25 for high SES counties. The exact percent is:
print restaraunt_types['High']['Buffets'] / restaurant_count['High'] * 100
Fast food is stratified as well and found approximately twice as often in low SES counties as in high SES counties:
high_ff = restaraunt_types['High']['Fast Food'] / restaurant_count['High'] * 100
low_ff = restaraunt_types['Low']['Fast Food'] / restaurant_count['Low'] * 100
print low_ff/high_ff
This code could be generalized to loop over a dozen popular food categories:
category_list = ['Mexican', u'Chinese', 'American (New)', 'American (Traditional)',
                 'Breakfast & Brunch', 'Delis', 'Pizza', 'Sandwiches', 'Fast Food',
                 'Burgers', 'Italian', 'Barbeque']
ratio = {}
for category in category_list:
    high_pct = restaraunt_types['High'][category] / restaurant_count['High'] * 100
    low_pct = restaraunt_types['Low'][category] / restaurant_count['Low'] * 100
    ratio[category] = high_pct/low_pct

for category in sorted(ratio, key=ratio.get):
    print '%0.2f %s' % (ratio[category], category)
So the low SES eating environment has an overrepresentation of barbeque, Mexican and fast food places, while delis, Italian food, and breakfast & brunch places are more commonly found in high SES counties.
If I were doing this for a peer-reviewed manuscript, I would probably use a sample of census tracts rather than counties in order to get a more realistic picture of local food environments. I would also want to spend some time understanding how restaurants get categorized and what each category means.
For publication, I would also use statistical models appropriate for count variables and treat education as continuous rather than binning it. Python has pretty impressive capabilities for statistical models, so I put together some quick code to model the count of barbeque joints as a function of the proportion of the adult population with a college degree, using the statsmodels negative binomial regression function.
#Module for doing statistical analysis
import statsmodels.api as sm

def county_category(businesses, category):
    #count how many businesses in a county are of a certain type
    category_count = 0
    for restaurant in businesses:
        categories = restaurant.get('categories',[])
        categories = [c[0] for c in categories]
        if category in categories:
            category_count += 1
    return category_count

#dependent variable
Y = [county_category(county_restaurants['%s, %s' % (c[1],c[2])]['businesses'], 'Barbeque')
     for c in big_counties]

#explanatory variable
education_level = [float(c[-3]) for c in big_counties]

#negative binomial regression model
mod_nbin = sm.NegativeBinomial(Y, education_level)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()
Not surprisingly, college education (x1) is negatively correlated with the count of barbeque restaurants. This relationship could be spurious, however, as county population size is negatively associated with both high levels of education and high levels of barbeque.
population = [float(c[-2]) for c in big_counties]
population = [np.log(p) for p in population]
#combine
X = zip(education_level, population)
print X[:5]
print ''
#rerun regression model
mod_nbin = sm.NegativeBinomial(Y, X)
res_nbin = mod_nbin.fit(disp=False)
print res_nbin.summary()
Once population (x2) is in the model, education level (x1) is no longer significant. So the number of barbeque restaurants in town is largely a function of county size, but, because of the relationship between county size and education level, they are disproportionately found in low SES counties.
As before, we can revise the code to loop over more types of restaurants:
types = set(restaraunt_types['High'])
coefficients = {}
for category in types:
    Y = [county_category(county_restaurants['%s, %s' % (c[1],c[2])]['businesses'], category)
         for c in big_counties]
    if np.mean(Y) > .05:
        mod_nbin = sm.NegativeBinomial(Y, X)
        res_nbin = mod_nbin.fit(disp=False)
        if np.absolute(res_nbin.tvalues[0]) > 1.96:
            coefficients[category] = res_nbin.params[0]

for category in sorted(coefficients, key=coefficients.get, reverse=True):
    print '%.02f %s' % (coefficients[category], category)
This time, I only printed out the coefficient for SES when there was a statistically significant relationship. Controlling for population size, food trucks are the most distinctively high SES food category. This specific relationship may have something to do with the operationalization of SES, as college towns, which often have high education levels but not the highest income levels, are rife with food trucks.
Only one food category was negatively associated with education level after controlling for population size. I think this is because the outcome variable wasn't the relative proportion of a county's restaurants but the absolute count, and high SES areas are more likely to have restaurants of all types (except buffets).
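One way to probe that interpretation would be to model rates instead of raw counts by passing the number of restaurants sampled in each county as an exposure term, which statsmodels count models accept. A sketch under that assumption, reusing X, county_restaurants, and county_category from above:
#exposure: how many restaurants were actually sampled in each county
exposure = [len(county_restaurants['%s, %s' % (c[1], c[2])]['businesses'])
            for c in big_counties]

#recompute the barbeque counts, since Y was overwritten in the loop above
Y_bbq = [county_category(county_restaurants['%s, %s' % (c[1], c[2])]['businesses'],
                         'Barbeque') for c in big_counties]

#drop counties with no sampled restaurants; exposure enters the model logged
keep = [i for i, e in enumerate(exposure) if e > 0]
mod_rate = sm.NegativeBinomial([Y_bbq[i] for i in keep],
                               [X[i] for i in keep],
                               exposure=[exposure[i] for i in keep])
res_rate = mod_rate.fit(disp=False)
print res_rate.summary()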