Neal Caren - University of North Carolina, Chapel Hill
I’ve compiled a list of Python tutorials and annotated analyses. I've tried to list pages that are accessible to social scientists with little background in Python and/or machine learning.
If you are totally new to Python, I would recommend installing Continuum's Anaconda Python distribution. It works on Macs and Windows, makes using IPython notebooks trivial, and solves most of the problems associated with installing various packages.
If you know of anything I've left out or if links go dead, please let me know.
Walkthroughs
One of the great things about IPython notebooks is that they can easily blend text and code. This has led to a sharp increase in the number of data analysis projects where people carefully explain an entire research project, including data collection/importation, management and analysis. The code is right there, and you can usually run it and/or modify it yourself. Looking at a few of these is an excellent introduction to what people are currently doing, even if you don't understand everything.
Diving into Open Data with IPython Notebook & Pandas by Julia Evans. An analysis of whether people bike when it rains using Pandas.
The Need for Openness in Data Journalism by Brian Keegan. Reanalysis of a 538 post on the Bechdel Test and films using Pandas and statsmodels. When I ran this one, I had an issue with the BeautifulSoup part.
Predicting customer churn with scikit-learn by Eric Chiang. I don't care about customer churn, but it's a well-written walkthrough of machine learning classification.
538 Model by Skipper Seabold. Recreation of the classic 538 prediction model using Pandas.
Heat and Violence in Chicago by Brian Keegan. Walkthrough of an impressive analysis of crime trends.
Powerpoetry Analysis by SumAll Foundation. Analysis of how individual poetry styles change over time using pandas.
World Cup Learning by Juan Pedro Fisanotti. Predict winners of World Cup soccer matches using the PyBrain library for machine learning. Data is also on Github.
Is Seattle Really Seeing an Uptick In Cycling? by Jake Vanderplas. Yes. An excellent time-series analysis.
Overviews
Introductions to using Python for data analysis that make sense to social scientists.
Using APIs
When a service wants you to use their data, they often provide it through an API. There are often specific Python libraries for accessing APIs that are popular, complex and/or require authentication. Otherwise, requests is quite useful.
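As a rough illustration of the requests approach, here is a minimal sketch. The endpoint `https://api.example.com/v1/posts`, the parameter names, and the `fetch_json` helper are all hypothetical, and a real API will have its own URL scheme and authentication rules. The sketch builds the request without sending it, so you can see the URL that would be called.

```python
import requests

# Hypothetical endpoint and query parameters, for illustration only.
BASE_URL = "https://api.example.com/v1/posts"
PARAMS = {"q": "protest", "limit": 10}


def fetch_json(url, params=None, timeout=10):
    """Send a GET request and return the decoded JSON body."""
    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.json()


# Build the request without sending it, to inspect the final URL.
prepared = requests.Request("GET", BASE_URL, params=PARAMS).prepare()
print(prepared.url)
```

With a real endpoint, `fetch_json(BASE_URL, PARAMS)` would return the parsed response, typically a dict or a list.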
Web Scraping
When they don't want to give you the data, you can sometimes grab it anyway by visiting one or more web pages and then extracting the parts you need. requests is a useful library for accessing web pages, and BeautifulSoup is a popular choice for pulling out the good stuff. If you don't know any HTML, regular expressions can sometimes work well too.
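To give a sense of what the extraction step looks like, here is a small sketch that parses a made-up HTML snippet rather than a live page. In practice the HTML would come from something like `requests.get(url).text`. The page content and the variable names are assumptions for illustration.

```python
import re

from bs4 import BeautifulSoup

# A made-up page, standing in for HTML downloaded with requests.
html = """
<html><body>
  <h1>Recent posts</h1>
  <ul>
    <li><a href="/posts/1">First post</a></li>
    <li><a href="/posts/2">Second post</a></li>
  </ul>
</body></html>
"""

# BeautifulSoup: find every link and keep its text and target.
soup = BeautifulSoup(html, "html.parser")
links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a")]

# Regular expression: grab whatever sits inside href="...".
hrefs_by_regex = re.findall(r'href="([^"]+)"', html)

print(links)
print(hrefs_by_regex)
```

The regular expression is shorter, but it breaks easily when the markup changes. The BeautifulSoup version tends to hold up better on messy pages.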
Data Management
Going from raw data--numbers or words--to Xs that can be included in a regression equation is about 80% of the work. There's a lot of data management in the walkthroughs, but I've found a couple of others that show the process quite clearly. Pandas is popular and super useful, especially the data frames.
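A tiny sketch of that raw-data-to-Xs step with a pandas data frame follows. The three rows are invented toy data, and the column names (`n_words`, `mentions_jobs`) are just illustrative choices.

```python
import pandas as pd

# Invented toy data: one row per speech.
raw = pd.DataFrame(
    {
        "speaker": ["A", "B", "C"],
        "party": ["left", "right", "left"],
        "text": ["we need jobs", "cut taxes now please", "jobs jobs jobs"],
    }
)

# Turn words into numbers: a word count and a keyword indicator.
raw["n_words"] = raw["text"].str.split().str.len()
raw["mentions_jobs"] = raw["text"].str.contains("jobs").astype(int)

# Turn the categorical variable into a dummy, dropping the reference category.
X = pd.get_dummies(
    raw[["n_words", "mentions_jobs", "party"]],
    columns=["party"],
    drop_first=True,
)
print(X)
```

The resulting `X` has one row per speech and only numeric or indicator columns, which is the shape most modeling libraries expect.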
Text Management
Playing with words.
Introduction to data analysis
Introductions and/or overviews of data analysis, usually using scikit-learn.
Classification
When the outcome variable is categorical. Social scientists usually start and stop with variations on logistic regression. Turns out, there are a lot of other things out there.
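For a feel of how easy it is to try something besides logistic regression, here is a minimal scikit-learn sketch. It uses synthetic data from `make_classification`, and the random forest is simply one arbitrary alternative picked for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 300 cases, 10 predictors, binary outcome.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, random_state=0
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Because every scikit-learn classifier shares the same `fit`/`predict` interface, swapping in another model is usually a one-line change.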
Unsupervised Learning
When you don't have an outcome variable and/or want to combine your explanatory variables. Sociologists usually learn about factor analysis and then never use it. For text data, topic modeling is what all the cool kids are doing.
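Here is a very small topic modeling sketch using scikit-learn's LDA implementation. The four documents are made up, and the choice of two topics is an assumption. With a corpus this tiny the topics are not meaningful, so treat it only as a picture of the workflow.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus.
docs = [
    "the union called a strike over wages",
    "workers joined the strike for better wages",
    "the senate passed the budget bill",
    "the budget vote split the senate",
]

# Step 1: turn documents into a document-by-word count matrix.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# Step 2: fit LDA; each row of doc_topics is a document's topic mixture.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Step 3: list the highest-weighted words for each topic.
index_to_word = {i: word for word, i in vectorizer.vocabulary_.items()}
for k, weights in enumerate(lda.components_):
    top_words = [index_to_word[i] for i in weights.argsort()[::-1][:3]]
    print(f"topic {k}: {top_words}")

print(doc_topics.round(2))
```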
Regression
While continuous outcomes are common in the social sciences, machine learning folks rarely talk about them.
Model/Feature Selection
Picking which model or variables to use often happens offstage in social science research. It doesn't have to be that way, though.
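One way to bring variable selection onstage is a cross-validated lasso, sketched below with scikit-learn's `LassoCV` on synthetic data. This is just one technique among many, chosen for illustration. The data come from `make_regression`, with only 3 of the 20 predictors actually related to the outcome.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 20 candidate predictors, only 3 of them informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0
)

# LassoCV picks the penalty strength (alpha) by 5-fold cross-validation.
model = LassoCV(cv=5, random_state=0).fit(X, y)

# Predictors whose coefficients were not shrunk to exactly zero.
selected = np.flatnonzero(model.coef_)

print("chosen alpha:", model.alpha_)
print("selected predictors:", selected)
```

The appeal is that the penalty is chosen by an explicit, reportable procedure instead of by trying specifications until one looks right.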
Networks
NetworkX and igraph are both fairly powerful tools for network analysis. I don't think you can use them for regression analysis, but you can use them to do things like compute centrality measures and make pretty pictures. You can also use Python to create/manipulate your network data for analysis/display elsewhere.
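A minimal NetworkX sketch of computing centrality measures on an invented five-person friendship network (the names and ties are made up):

```python
import networkx as nx

# Invented undirected network: each pair is a tie between two people.
edges = [
    ("Ann", "Bob"),
    ("Ann", "Cat"),
    ("Ann", "Dan"),
    ("Bob", "Cat"),
    ("Dan", "Eve"),
]
G = nx.Graph()
G.add_edges_from(edges)

# Degree centrality: share of the other nodes each node is tied to.
degree = nx.degree_centrality(G)

# Betweenness centrality: how often a node sits on shortest paths.
betweenness = nx.betweenness_centrality(G)

for person in sorted(G.nodes):
    print(person, round(degree[person], 2), round(betweenness[person], 2))
```

Ann is tied to three of the four other people, so her degree centrality is 0.75, and she also has the highest betweenness score.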
Plotting
matplotlib is the default plotting library for data scientists and plays well with pandas. seaborn makes it prettier. Other programs, like mpld3, Plotly, or bokeh are also worth trying out, especially for putting stuff together on the web.
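A short sketch of matplotlib and pandas working together follows. The yearly counts are made-up numbers. The `Agg` backend and the in-memory buffer are used here only so the example runs without a display; in a notebook you would normally just let the figure render.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts, for illustration only.
df = pd.DataFrame(
    {"year": [2010, 2011, 2012, 2013], "events": [12, 18, 9, 22]}
)

# pandas draws onto a matplotlib axis, which can then be customized.
fig, ax = plt.subplots(figsize=(6, 4))
df.plot(x="year", y="events", marker="o", legend=False, ax=ax)
ax.set_xlabel("Year")
ax.set_ylabel("Number of events")
ax.set_title("Events per year (toy data)")

# Save the figure as PNG bytes; fig.savefig("events.png") also works.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```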
Images as Data
Social scientists don't really analyze images much, but that might be the next big thing.