Workshop @ Waseda University, 2016/12/07

This page lists all of the software and packages mentioned during the workshop, along with a few others you may find useful. I’ll link to detailed guides I’ve written on these topics, and will update the lists as I write more in the coming months.


Core Software

Python – the programming language used for all of the examples and packages in this workshop.
I also recommend the PyCharm IDE for writing and organising your Python code. There’s a free educational license that you can sign up for.

As for databases, I suggest using MongoDB for storing social media data – but certain types of project may benefit from a more structured SQL database like MySQL or PostgreSQL. You may also simply have experience with SQL and prefer to use it.
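
To give a sense of why a document store suits this kind of data, here’s a minimal pymongo sketch (the database, collection and field names are purely illustrative): documents go in as-is, with no schema defined up front, and nested fields can still be queried directly.

```python
from pymongo import MongoClient

# Assumes a MongoDB server is running locally on the default port
db = MongoClient()['socialmedia']

# Documents are stored as-is - no table schema to define first
db.tweets.insert_one({'user': {'screen_name': 'waseda_bot'},
                      'text': 'Hello from the workshop!',
                      'retweet_count': 2})

# Nested fields can still be queried using dot notation
for tweet in db.tweets.find({'user.screen_name': 'waseda_bot'}):
    print(tweet['text'])
```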

If your project is very large, Google BigQuery (part of Google’s Cloud Platform) is worth looking into, but its costs can scale up very quickly. (Edit: Just after the workshop, Amazon announced a new cloud service called Athena, which is very similar to BigQuery; I haven’t tried using it yet, but it looks like a viable alternative.)


Python Packages

These are add-on packages for Python which make your life much easier when accessing, storing, handling and processing social media data.

Twython is a simple, easy-to-use interface for the Twitter API; pymongo is a similarly simple interface for MongoDB. Together, these two are all you need to start downloading and storing social media data from Twitter.
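
As a rough sketch of how the two fit together (the API credentials are placeholders you’d get by registering an app with Twitter, and the search term and database names are arbitrary):

```python
from twython import Twython
from pymongo import MongoClient

# Placeholder credentials - register an app with Twitter to obtain real ones
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
db = MongoClient()['socialmedia']  # assumes a local MongoDB server

# Search for recent tweets and store the raw JSON documents in MongoDB
results = twitter.search(q='#waseda', count=100)
if results['statuses']:
    db.tweets.insert_many(results['statuses'])
print('Stored %d tweets' % len(results['statuses']))
```

The same pattern works for Twython’s other endpoints, such as user timelines.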

numpy and pandas are a matched pair of packages which give Python statistical and scientific computing abilities similar to R’s. They’re required by a number of the other, more advanced packages on this list.
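
For example, a list of tweet dictionaries (like those stored above) can be loaded into a pandas DataFrame for quick, R-style summaries – the toy records here stand in for real API output:

```python
import pandas as pd

# Toy records standing in for tweets fetched via Twython
tweets = [
    {'user': 'alice', 'retweet_count': 3, 'text': 'Hello Waseda!'},
    {'user': 'bob',   'retweet_count': 0, 'text': 'Workshop today'},
    {'user': 'alice', 'retweet_count': 7, 'text': 'Slides are up'},
]

df = pd.DataFrame(tweets)
print(df['retweet_count'].describe())  # summary statistics, much like R's summary()
print(df.groupby('user').size())       # tweets per user
```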

matplotlib is a very popular package for creating graphs of your data. I also like Bokeh, a more modern package which creates interactive graphs you can manipulate in your web browser.
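
A minimal matplotlib sketch, using made-up hourly tweet counts:

```python
import matplotlib.pyplot as plt

# Made-up data: number of tweets collected in each hour of the day
hours = list(range(24))
counts = [5, 3, 2, 1, 1, 2, 8, 15, 30, 42, 38, 35,
          40, 44, 41, 39, 45, 50, 55, 48, 33, 20, 12, 7]

plt.bar(hours, counts)
plt.xlabel('Hour of day')
plt.ylabel('Tweets collected')
plt.title('Tweet volume by hour (toy data)')
plt.show()
```

Bokeh’s plotting API follows much the same shape, but renders to an interactive HTML page rather than a static image.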

SciKit-Learn is a machine learning package for Python which includes good text analysis functionality, as well as implementations of a large number of classification and clustering algorithms.
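
As a small illustration – not a serious analysis – here’s the classic TF-IDF-plus-k-means recipe applied to a handful of toy documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    'the economy is growing again',
    'economic growth slowed this quarter',
    'the football match ended in a draw',
    'a late goal decided the match',
]

# Turn the documents into TF-IDF vectors, then cluster them into two groups
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0)
for label, doc in zip(km.fit_predict(X), docs):
    print(label, doc)
```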

NLTK (the Natural Language ToolKit) is a package of tools specifically aimed at text analysis. It’s better at certain things than SciKit-Learn; for example, its tokenisers are more advanced and support “stemming” for European languages – reducing each word to its base form, so that variants of the same word aren’t counted separately in the corpus. Paired with SciKit-Learn, it makes a formidable machine learning system.
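
A sketch of that pairing – NLTK handles the tokenising and stemming, and the result is fed straight into SciKit-Learn’s vectoriser (you’ll need to run nltk.download('punkt') once to fetch the tokeniser’s data):

```python
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer('english')

def tokenize_and_stem(text):
    # NLTK tokenises the text; the Snowball stemmer then reduces each token
    # to its stem, so 'runs' and 'running' both end up as 'run'
    return [stemmer.stem(token) for token in word_tokenize(text)
            if token.isalpha()]

# Plug the NLTK tokeniser into SciKit-Learn's TF-IDF vectoriser
vectorizer = TfidfVectorizer(tokenizer=tokenize_and_stem)
X = vectorizer.fit_transform(['He runs every day', 'She was running late'])
print(vectorizer.get_feature_names())  # stems, not raw word forms
```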


Japanese Language Handling

Refer to my guide to setting up and using MeCab and its dictionaries with Python.
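
Once everything is set up (which is what the guide walks through), basic usage is only a few lines – this sketch assumes MeCab itself, a dictionary such as IPAdic, and the Python bindings are all installed:

```python
import MeCab

# Assumes MeCab, a dictionary such as IPAdic, and the Python bindings are installed
tagger = MeCab.Tagger()
print(tagger.parse('今日はいい天気ですね'))
# Each output line is one token, with its part of speech and other features

# The -Owakati option simply splits text into space-separated tokens -
# a form that SciKit-Learn's vectorisers can consume directly
wakati = MeCab.Tagger('-Owakati')
print(wakati.parse('今日はいい天気ですね'))
```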


Some Relevant Papers and Guides

Barberá, P. (2015). Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data. Political Analysis, 23(1), 76-91.

Brandon Rose’s excellent guide to simple document clustering in Python.

Ceron, A., Curini, L., & Iacus, S. M. (2016). iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content. Information Sciences.

Codecademy has great free tutorials for a range of languages and technologies, including Python, for those who aren’t familiar with programming.

Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229-247.