A Python package for tokenising Japanese-language tweets

Last time I wrote about preparing Japanese-language Twitter data for machine learning purposes, I said I’d likely come back to the topic and discuss different approaches to using this data to find out useful or interesting things.

I will. Eventually. I’m working on a number of interesting projects at Waseda at the moment, using several different machine learning approaches (some supervised, some unsupervised) to cut through the noise of Twitter data and find out (hopefully) relevant political or sociological things. I’ll write up some practical guides to those approaches as soon as I have time.

In the meantime, though, I’ve battle-tested a lot of the techniques outlined in the previous blog post, fixed some bugs, and updated a handful of things to be faster or more effective. I think I now have a pretty solid system in place for processing Japanese-language tweets and breaking them down into handy bags-of-words – without losing semantically important features like emoji or kaomoji along the way.

Rather than update the last blog post, I’ve bundled all of the various functions into a Python object that you can use in your own code, and uploaded it to GitHub. You can find the code here. It’s not the world’s most professional Python code, and for now at least you’ll just need to download the file and drop it into your project – I haven’t got around to turning it into a “real” Python package yet – but it will hopefully be useful to some people working on this kind of project nonetheless.
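To give a rough idea of what “drop the file in and use the object” might look like in practice, here is a minimal usage sketch. The module, class, and method names below are my assumptions for illustration – check the actual file on GitHub for the real names – but the shape of the workflow (construct the tokeniser once, feed it raw tweet text, get back a list of tokens that still includes emoji and kaomoji) is what the post describes.

```python
# A minimal usage sketch. "tweet_tokenizer", "TweetTokenizer" and
# "tokenize" are hypothetical names used for illustration only; see the
# GitHub repository for the actual module and method names.
from tweet_tokenizer import TweetTokenizer  # the downloaded .py file, dropped into your project

tokenizer = TweetTokenizer()

tweet = "選挙の結果にびっくりした😲 #政治 (´・ω・`)"
tokens = tokenizer.tokenize(tweet)  # hypothetical method name

# The aim described in the post: a bag of words that keeps semantically
# meaningful features such as emoji, hashtags and kaomoji as tokens,
# rather than stripping them out during normalisation, e.g. something like
# ["選挙", "結果", "びっくり", "する", "😲", "#政治", "(´・ω・`)"]
print(tokens)
```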
