This is a more technical post than I usually write, but it will be useful to some people. My political science research involves some natural language processing and machine learning, which I use to analyse texts from Japanese newspapers and social media – so one of the challenges is teaching a computer to “read” Japanese. Luckily, there are some tools out there which make this (relatively) straightforward.
For this guide, I’m using Python 3.5. My development system runs on macOS 10.12 and I deploy my code to a server running Ubuntu 12.16, so this guide will include commands for setting up the software on both macOS and Linux/Ubuntu. I don’t use any version of Windows so I don’t know how to set this up on Windows – if anyone can provide a set of Windows commands that achieve the same goal, drop me an email and I’ll add them to the blog post.
What are we trying to accomplish?
The first hurdle to doing any analysis of Japanese text is segmentation, or tokenisation: breaking it down into usable chunks. In European languages, words have spaces between them, so you can just divide everything up at the word boundaries (the spaces, commas, full stops and so on), which yields an array of words that we can use to calculate things like frequencies or co-occurrence matrices. Japanese, however, has no spaces in its text, so there’s an extra pre-processing step required before we can start using these text analysis approaches.
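To make the problem concrete, here is a minimal sketch of the naive approach: keep every run of letters and digits, and cut wherever there is a space or punctuation mark. The function name `naive_tokenise` and the English sample sentence are my own illustrations, not part of any library.

```python
import re


def naive_tokenise(text):
    """Split text at spaces and punctuation by keeping runs of word characters."""
    return re.findall(r"\w+", text)


english = "The weather is nice today."
japanese = "今日はいい天気ですね。遊びに行かない？新宿で祭りがある！"

print(naive_tokenise(english))
print(naive_tokenise(japanese))
```

On the English sentence this gives five separate words. On the Japanese text it can only cut at the punctuation, so each whole sentence comes back as a single chunk, which is no use for counting words.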
In essence, we want to turn a string like this…
今日はいい天気ですね。遊びに行かない？新宿で祭りがある！
… into an array like this…
["今日", "は", "いい", "天気", "です", "ね", "遊び", "に", "行か", "ない", "新宿", "で", "祭り", "が", "ある"]
… which a computer can process to figure out frequencies, co-occurrences and so on.
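As a rough sketch of what "frequencies" and "co-occurrences" mean in practice, the snippet below counts both using only the Python standard library. Grouping the tokens by sentence is an assumption I've made for the example (the array above is flat), and `cooccurs` is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

# The token array from above, grouped by sentence (the grouping is an
# assumption made for this example).
sentences = [
    ["今日", "は", "いい", "天気", "です", "ね"],
    ["遊び", "に", "行か", "ない"],
    ["新宿", "で", "祭り", "が", "ある"],
]

# How often each token appears across all sentences.
frequencies = Counter(token for sentence in sentences for token in sentence)

# How often each unordered pair of distinct tokens shares a sentence.
cooccurrence = Counter()
for sentence in sentences:
    for pair in combinations(sorted(set(sentence)), 2):
        cooccurrence[pair] += 1


def cooccurs(a, b):
    """Look up a pair regardless of the order the tokens are given in."""
    return cooccurrence[tuple(sorted((a, b)))]


print(frequencies["天気"])
print(cooccurs("新宿", "祭り"))
print(cooccurs("今日", "新宿"))
```

With a real corpus you would feed in thousands of tokenised sentences rather than three, but the counting logic stays the same.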
A Software Shopping List
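That’s the objective. Here’s a quick list of what we’re going to use to get there: MeCab (the core segmentation software), mecab-ipadic (MeCab’s standard dictionary), mecab-ipadic-neologd (an add-on dictionary of neologisms and slang) and mecab-python3 (the library that lets Python talk to MeCab). We’ll be downloading everything from the command line.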
This is not a definitive list of every piece of software that can segment Japanese text. There’s a big range out there, from the lightweight tinysegmenter through to fully featured software like kuromoji. I’m using this setup for two reasons. First, in my experience it’s the best at handling text data sourced from the Internet. Second, I think it’s the best trade-off of simplicity versus accuracy. tinysegmenter is much easier to set up and use, but its output is unreliable: it often breaks apart words that are actually common phrases or proper names. The name of the present Japanese prime minister, 安倍晋三, is rendered by tinysegmenter as ["安倍", "晋", "三"], not the correct ["安倍", "晋三"] or ["安倍晋三"]. The same problem occurs with the word for prime minister itself, 総理大臣, which comes out as ["総理", "大臣"] when what you (probably) want is ["総理大臣"]. MeCab works nicely with Python, and it’s easy to set it up with an extensive dictionary of common phrases and neologisms, so its output is very accurate.
One other thing: if you’re following this tutorial on macOS, I expect you to have a little bit of familiarity with the Terminal. If you don’t know your way around Terminal at all, your homework is to go and install Homebrew, a package manager for macOS; we’ll be using it to install a lot of the rest of the software. There are detailed instructions on the homepage and once you’ve done that, you’ll be able to get cracking with installing MeCab and its dictionaries.
First, let’s get MeCab (the core segmentation software) and MeCab-ipadic up and running. (Again, for macOS users, this assumes that you have successfully installed the Homebrew package.)
macOS:

    brew install mecab
    brew install mecab-ipadic

Ubuntu:

    sudo apt-get install mecab mecab-ipadic libmecab-dev
    sudo apt-get install mecab-ipadic-utf8
That’ll take a while, but once it’s done, MeCab should be up and running. You can test it by typing “mecab” at the terminal; in the blank line it gives you afterwards, type some Japanese text, and press enter. The result should look like this:
    $ mecab
    無事にインストール出来ました ！
    無事	名詞,形容動詞語幹,*,*,*,*,無事,ブジ,ブジ
    に	助詞,副詞化,*,*,*,*,に,ニ,ニ
    インストール	名詞,一般,*,*,*,*,インストール,インストール,インストール
    出来	動詞,自立,*,*,一段,連用形,出来る,デキ,デキ
    まし	助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
    た	助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
    ！	記号,一般,*,*,*,*,！,！,！
    EOS
As you can see, it’s working nicely and correctly identifying parts of speech in this test sentence. Press Ctrl-C to get back to the terminal command line, and let’s continue.
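If you ever need to work with this raw text output directly, it is easy to pick apart. As I understand the default format with mecab-ipadic, each line is the token, then a tab, then a comma-separated list of features (part of speech first, dictionary form seventh, reading eighth), and `EOS` marks the end of a sentence. The sketch below assumes that layout; `parse_mecab_output` is a hypothetical helper name, and the sample is copied from the first two lines of the output above.

```python
def parse_mecab_output(output):
    """Turn MeCab's text output into a list of (surface, features) tuples."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # EOS marks the end of a sentence; skip blank lines too
        surface, _, feature_string = line.partition("\t")
        tokens.append((surface, feature_string.split(",")))
    return tokens


sample = (
    "無事\t名詞,形容動詞語幹,*,*,*,*,無事,ブジ,ブジ\n"
    "に\t助詞,副詞化,*,*,*,*,に,ニ,ニ\n"
    "EOS\n"
)

for surface, features in parse_mecab_output(sample):
    # Print the token, its part of speech and its dictionary form.
    print(surface, features[0], features[6])
```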
The next step is getting mecab-ipadic-neologd working. This is a bit more complex, since we need to download the dictionary of neologisms and slang (vital for handling text from the Internet) and then recompile it for MeCab. First, we install the tools used to download (“clone”) the most recent version of the dictionary, then we compile and install the dictionary itself.
macOS:

    brew install git curl xz
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
    cd mecab-ipadic-neologd
    ./bin/install-mecab-ipadic-neologd -n

Ubuntu:

    sudo apt-get install git curl
    git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
    cd mecab-ipadic-neologd
    sudo ./bin/install-mecab-ipadic-neologd -n
You’ll need to type “yes” at some point in the install process to confirm that you’re okay with overwriting some dictionary defaults. Now let’s check that it worked. To use this dictionary at the command line, we need to specify it when we invoke MeCab:
macOS:

    mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/

Ubuntu:

    mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd/
(Note the different paths – if you’re on Ubuntu you’ll also need to change the path given in the Python files below to match.)
If you can type some Japanese text and have it tokenised (as in the previous example), then everything is working.
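Because the dictionary path differs between systems, you may prefer to look it up rather than hard-code it. MeCab ships with a `mecab-config` tool whose `--dicdir` option prints the directory where dictionaries live. The sketch below builds the path from that, on the assumption that the neologd installer put its dictionary in a `mecab-ipadic-neologd` folder inside that directory (its default, as far as I know). The function names are my own.

```python
import subprocess


def neologd_path(dicdir=None):
    """Return the expected mecab-ipadic-neologd directory.

    If dicdir is not given, ask `mecab-config --dicdir` for MeCab's
    dictionary directory (this requires MeCab to be installed).
    """
    if dicdir is None:
        dicdir = subprocess.check_output(
            ["mecab-config", "--dicdir"], universal_newlines=True
        ).strip()
    return dicdir.rstrip("/") + "/mecab-ipadic-neologd"


def tagger_args(dicdir=None):
    """Build the argument string to pass to MeCab.Tagger."""
    return "-d {}".format(neologd_path(dicdir))
```

Calling `tagger_args()` with no arguments on a machine with MeCab installed should give you a string like the `-d …` options shown above, which you can pass straight to the Python code in the next section.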
Working with Python
The next step is to get MeCab talking to Python. Again, this tutorial assumes you’re using Python 3 (I’m on 3.5, but any recent version of Python should be fine); there is also a library for Python 2, but I haven’t used or installed it. If you’re an advanced Python user and are using a virtual environment setup for your packages, you should switch to that environment now. (All commands from here onwards should work the same on both macOS and Ubuntu.)
At the terminal, type:
pip3 install mecab-python3
(The command may be “pip”, not “pip3”, on your system, but a lot of systems use Python 2 internally for various things, and therefore rename the Python 3 tools with a “3” suffix to avoid clashes. Also, depending on your setup, you may need to “sudo” that command on Ubuntu.)
Now open your Python editor, whether that’s an IDE (I use PyCharm personally) or just a text window, and let’s see if this is working. Try the following code:
    import MeCab

    test = "今日はいい天気ですね。遊びに行かない？新宿で祭りがある！"
    mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
    print(mt.parse(test))
When you run this code, you should see the same kind of output you got at the Terminal earlier. Okay, so Python is now talking to MeCab, but how can we turn that output into something useful that we can use in our analysis?
    import MeCab

    test = "今日はいい天気ですね。遊びに行かない？新宿で祭りがある！"
    mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

    # parseToNode returns the first node; follow .next to reach the rest
    parsed = mt.parseToNode(test)
    components = []
    while parsed:
        components.append(parsed.surface)
        parsed = parsed.next
    print(components)
This will give you the following output…
['', '今日', 'は', 'いい', '天気', 'です', 'ね', '。', '遊び', 'に', '行か', 'ない', '？', '新宿', 'で', '祭り', 'が', 'ある', '！', '']
… which is what we’ve been aiming for from the outset! The parseToNode function of the Tagger returns the first token as a node object, and each node’s “.next” attribute leads to the following one, so you can step through every token in the text. The actual token is accessed as “.surface”; if you want to see details of the part-of-speech or pronunciation, you can access that as “.feature”.
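If you find yourself doing this a lot, it may be worth wrapping the loop in a function. The sketch below is one way to do it; `tokenise` is a hypothetical name, and it takes the tagger as an argument so that it only relies on the `parseToNode`, `.surface`, `.feature` and `.next` behaviour described above. It also skips the empty strings at either end of the list.

```python
def tokenise(tagger, text):
    """Return (surface, feature) pairs for each non-empty token in text.

    `tagger` is anything with a parseToNode method, such as a MeCab.Tagger.
    """
    tokens = []
    node = tagger.parseToNode(text)
    while node:
        # The first and last nodes have an empty surface; skip them.
        if node.surface:
            tokens.append((node.surface, node.feature))
        node = node.next
    return tokens
```

You would call it as `tokenise(mt, test)`, using the `mt` Tagger and `test` string from the code above, and get back a list of pairs rather than bare strings, which keeps the part-of-speech information to hand for later filtering.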
You’ve now got a system that will take Japanese text input (like a tweet, a Facebook post, or a blog entry) and turn it into a list of tokens. You can then apply exactly the same techniques to those tokens that you would to (more easily segmented) European languages.
A Few Cautionary Words
Before turning this system loose on a large volume of tweets or other social media data, I would strongly suggest writing some code to clean up your input – in particular, code to strip out and separately handle URLs, image links and (on Twitter) @mentions. MeCab doesn’t handle these well – it breaks up URLs into little chunks which will throw off a lot of machine learning algorithms, for example.
One suggestion (and the one I use in my own research) is to strip non-text elements from the text before handing it over to MeCab, and then add things like URLs and images back into the list of tokens that MeCab returns to you – either as the full URL (ideally the non-shortened version) and @name, or as a token referencing a table that you can look up later (“URL1248”, “USER2234”).
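Here is a minimal sketch of that idea, not the exact code from my research. It pulls URLs and @mentions out of the text and records them in a lookup table under placeholder names. The regular expressions are deliberately simple assumptions: they only match ASCII characters, so that Japanese text written directly after a URL or username isn't swallowed. `strip_entities` is a hypothetical name.

```python
import re

# ASCII-only patterns, so adjacent Japanese text is left alone.
URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")


def strip_entities(text):
    """Return (cleaned_text, lookup), where lookup maps placeholder -> original."""
    lookup = {}

    def make_replacer(prefix):
        def replacer(match):
            placeholder = "{}{}".format(prefix, len(lookup) + 1)
            lookup[placeholder] = match.group(0)
            return " "  # leave a gap so neighbouring words don't merge
        return replacer

    # URLs first, because a URL can itself contain an @ sign.
    text = URL_RE.sub(make_replacer("URL"), text)
    text = MENTION_RE.sub(make_replacer("USER"), text)
    return text.strip(), lookup


cleaned, lookup = strip_entities("@taro 新宿で祭りがある！ https://example.com/matsuri")
print(cleaned)
print(lookup)
```

You would then pass `cleaned` to MeCab and append the placeholder names from `lookup` to the resulting token list, so each URL or username survives as a single token.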
Also, if you’re planning on analysing hashtags, note that MeCab doesn’t understand those either (it’ll tokenise the # and the text separately), so you may also want to pre-process those.
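A similar sketch works for hashtags. This one assumes a hashtag is a `#` (or the full-width `＃`) followed by a run of word characters, which in Python 3 includes kana and kanji; real-world hashtag rules are fussier than that, so treat it as a starting point. `split_hashtags` is a hypothetical name.

```python
import re

# '#' or full-width '＃', then word characters (letters, digits, underscore;
# in Python 3 this also covers kana and kanji).
HASHTAG_RE = re.compile(r"[#＃]\w+")


def split_hashtags(text):
    """Return (text_without_hashtags, list_of_hashtags)."""
    hashtags = HASHTAG_RE.findall(text)
    cleaned = HASHTAG_RE.sub(" ", text).strip()
    return cleaned, hashtags


cleaned_text, hashtags = split_hashtags("新宿で祭りがある！ #新宿 #matsuri")
print(cleaned_text)
print(hashtags)
```

Each hashtag can then be added to your token list whole, instead of being split into a `#` and some text.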
Finally, it’s quite likely that you’ll want to strip some elements out of the list MeCab returns to you as well. It has a habit of returning empty elements at the beginning and end of text (these are the markers MeCab puts at the start and end of each sentence), which you can remove. Depending on the analysis you’re conducting, you may also wish to strip out punctuation, or even certain parts of speech, in which case you should check the “.feature” segment of each node to decide whether to keep or dispose of each given token.
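As a sketch of that last step, the function below filters a list of (token, feature) pairs by the first field of the feature string, which with mecab-ipadic is the top-level part of speech (記号 for symbols and punctuation, 助詞 for particles, and so on). `filter_tokens` is a hypothetical name; the sample features are taken from the MeCab output shown earlier, apart from the feature text on the two empty elements, which is a stand-in.

```python
def filter_tokens(tokens, drop_pos=("記号",)):
    """Keep surfaces whose top-level part of speech is not in drop_pos.

    `tokens` is a list of (surface, feature) pairs, where feature is the
    comma-separated string that MeCab exposes as node.feature.
    """
    kept = []
    for surface, feature in tokens:
        if not surface:
            continue  # empty elements from the start and end of the text
        if feature.split(",")[0] in drop_pos:
            continue  # unwanted part of speech
        kept.append(surface)
    return kept


tokens = [
    ("", "BOS/EOS"),  # stand-in feature string for the empty elements
    ("無事", "名詞,形容動詞語幹,*,*,*,*,無事,ブジ,ブジ"),
    ("に", "助詞,副詞化,*,*,*,*,に,ニ,ニ"),
    ("！", "記号,一般,*,*,*,*,！,！,！"),
    ("", "BOS/EOS"),
]

print(filter_tokens(tokens))
print(filter_tokens(tokens, drop_pos=("記号", "助詞")))
```

By default this only drops punctuation and the empty elements; passing a longer `drop_pos` lets you throw away particles or anything else your analysis doesn't need.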