(Data Science Week @ Waseda 2019)


Slides from the Workshop


Example Code and Data

  • sentiment_workshop.zip – this file contains the example code and data used in the workshop. Unzip the file into a directory you can easily find, then open the code using the “Jupyter Lab” component in Anaconda.

Packages used in the workshop

Most of these packages come pre-installed with the standard Anaconda distribution, but if you installed Python from another source (or are running it on a server), you may need to install them yourself using the pip install command (a one-line example follows the list below).

  • numpy
  • pandas
  • scipy
  • scikit-learn
  • nltk
  • matplotlib (only required for drawing graphs and charts)
  • quadprog (only required for using the iSA aggregate algorithm)
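If you do need to install the packages yourself, a single pip command along the following lines should cover everything on the list (a sketch only – drop matplotlib or quadprog if you don't need them, and use pip3 or python -m pip depending on your setup):

```
pip install numpy pandas scipy scikit-learn nltk matplotlib quadprog
```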

You can find the Python version of the iSA aggregate algorithm on my GitHub page.


Word Segmentation in non-European languages

If you're working with a non-European language, you'll need additional software to divide sentences up into words (tokens) and to perform tasks like stemming or part-of-speech tagging (identifying nouns, verbs, etc.).
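To make clear what those steps involve, here is a minimal sketch (for English, where simple whitespace-based tokenization works) using the nltk package from the list above; the example sentence and the choice of the Porter stemmer are just illustrations, not part of the workshop materials:

```python
import nltk
from nltk.stem import PorterStemmer

# One-off downloads of the tokenizer and part-of-speech tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "The cats were chasing mice in the garden."

tokens = nltk.word_tokenize(text)                    # split the sentence into word tokens
stems = [PorterStemmer().stem(t) for t in tokens]    # reduce each word to its stem
pos_tags = nltk.pos_tag(tokens)                      # label each token as a noun, verb, etc.

print(tokens)
print(stems)
print(pos_tags)
```

For the languages below, where simple whitespace splitting isn't enough, you'll need the dedicated tools instead.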

  • Japanese: MeCab (install it on your computer, then use the mecab-python package to access it from Python), ChaSen/CaboCha or Janome. If you’re using data from Twitter, you may find this short script I wrote useful – it correctly identifies things like web addresses, usernames, emoji and kaomoji, which MeCab would otherwise skip or mishandle. A minimal example of calling MeCab from Python appears after this list.
  • Chinese (or Arabic): the Stanford Word Segmenter
  • Korean: open-korean-text was recommended by some colleagues, though I haven’t used it.
  • Thai: PyThaiNLP looks like a fairly comprehensive language-processing toolkit for Thai text, and another package called cutkum looks like a promising project for accurate word segmentation.
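As mentioned in the Japanese bullet above, here is a minimal sketch of calling MeCab from Python. It assumes MeCab itself and a dictionary are already installed, along with the Python binding (published as mecab-python3 on PyPI); the example sentence is just an illustration:

```python
import MeCab

text = "今日は良い天気ですね。"

# Default output: one line per token, with the surface form followed by
# part-of-speech and other morphological features
tagger = MeCab.Tagger()
print(tagger.parse(text))

# "-Owakati" output: just the tokens separated by spaces, which is often
# all you need for bag-of-words sentiment models
wakati = MeCab.Tagger("-Owakati")
tokens = wakati.parse(text).split()
print(tokens)
```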