
Tidying up Japanese SNS data for Machine Learning

In my last post about performing text analysis of Japanese language texts, I outlined how to install and use the MeCab system to break up Japanese sentences into their component parts (“tokens”) which can then be used for analysis. At the end of the post, I mentioned that if you’re using text sourced from social media like Twitter or Facebook, you might want to pre-process your data to deal with things like usernames, hashtags and URLs, which MeCab doesn’t understand or handle reliably.

Since then, I’ve spent some time building a tokenisation system to deal with very large volumes of data – the databases for the research project I’m working on at present sum up to about 15 million Japanese language tweets, and we expect to end up dealing with many times that volume by the end of the project. What have I learned? Quite a few things (not least a lot of stuff about using Google Cloud Platform, since this all rapidly outgrew my humble laptop), but one of the main ones was this; Twitter data is a god-damned mess. It’s not just hashtags, URLs and so on; people also tweet a lot of Emojis, which aren’t handled very well by a lot of analysis systems, and in Japan there’s also a tendency to tweet lots of “kaomoji” (you know, stuff that looks like “(。ŏ﹏ŏ)” or “((;,;;  ิ;;◞౪◟;; ิ;))“, and no, I don’t really know what that second one is meant to convey either…) as well as expressing feelings with single characters in brackets, like (笑) meaning laughter, or (涙) implying crying, which can also end up confusing the tagger.

A lot of conventional approaches to machine learning and text analysis just throw out those elements of the data. A common approach is to strip all punctuation, since it isn’t considered to have semantic meaning that’s useful to the machine learning system – but a kaomoji or a bracketed character clearly does have semantic meaning. The inclusion of a laughing kaomoji, or an emoji with a sweatdrop, or a bracketed character for crying, can radically alter the sentiment and meaning expressed by a tweet – in fact, they’re especially important on Twitter, where the 140 character limit means that people seek to find “economical” ways of expressing complex emotions and thoughts.

As a result, the tokenisation system I’ve built for our research is a fair bit more complex than I’d originally intended; it now strips out usernames, hashtags, URLs, emoji, kaomoji and bracketed characters from the data before passing it to MeCab to tokenise the remaining Japanese. There’s also a post-processing stage where I make sure that the keywords we used to build the data set (i.e. the Twitter search terms, which should appear in every single tweet in the data) are being tokenised as a single word, and not split up into separate words, as this could mess up analysis results further down the line. For the benefit of anyone trying to build something similar, this post will introduce all the systems I pass the tweets through in order to achieve this processing, in the order in which they’re done.


Finding Emoji in Tweets

Emoji – be they smiley faces, grinning turds or tiny depictions of churches and love hotels – have become ubiquitous on the Internet, but they turn out to be rather difficult to handle in text mining / machine learning approaches. In fact, some systems which don’t handle Emoji properly can end up making serious errors, as they not only misinterpret the emoji itself, but allow it to “pollute” their understanding of surrounding text characters too. MeCab doesn’t do a terrible job with Emoji, but frequently misinterprets them – so let’s find them and strip them out before passing the tweet over.

The problem is that this is a much harder task than it looks, because the standard for Emoji changes rapidly and isn’t simple. A certain number of ranges of characters in the Unicode standard (which is a system designed to create a standardised list of every character in every world language, thus ending garbled foreign language characters (文字化け) for good) are defined as being emoji – but the list isn’t fully agreed and is often updated. The most recent list I could find is from late 2016, and I’ve uploaded a copy of it here – feel free to download it and use it in your own project. The format it’s in is a Regular Expression (kind of a programming mini-language that allows you to do complex matching of characters and strings based on a set of conditions), and the way to use it in your Python program is as follows:

import re

with open('emoji_list.txt', encoding='utf-8') as emojifile:
    emoji_regex = emojifile.read().strip()
emoji_finder = re.compile(emoji_regex)

Now you can use emoji_finder as follows:

some_emoji = emoji_finder.findall(a_tweet)

This will return a list of the emoji in the tweet. I suggest adding them to the master list of tokens, and deleting them from the tweet itself before moving on to the next step of the tokenisation process. This is what you’ll do at every step; adding the elements you’ve extracted to the token list (along with a tag to say what kind of element they are, similar to the Part of Speech (POS) tagging that MeCab provides), and removing them from the tweet itself. Here’s a bit of sample code that does that:

for an_emoji in emoji_finder.findall(a_tweet):
    some_tokens.append((an_emoji, 'EMOJI'))
    a_tweet = a_tweet.replace(an_emoji, ' ')

Note that we’re replacing each emoji with a blank space, not deleting it entirely. This is deliberate; if the emoji was separating two words / sentences, i.e. the user was using it in place of punctuation, then shoving them back together could confuse MeCab and cause inaccurate tokenisation. If you’re building your own tokeniser, you’ll write a variation of the above loop for every step along the way, so I won’t repeat the code for each one.
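Since every step follows the same extract, tag and replace pattern, it can be wrapped up in one small helper. What follows is only a rough sketch: the name extract_and_strip and the two-emoji stand-in pattern are assumptions made for illustration, not part of the tokeniser described in this post.

```python
import re

def extract_and_strip(text, finder, tag, tokens):
    """Move every match of `finder` out of `text` and into `tokens`.

    Each match is stored as a (match, tag) tuple and replaced in the text
    by a single space, so neighbouring words are not pushed together.
    """
    for match in finder.finditer(text):
        tokens.append((match.group(0), tag))
    stripped = finder.sub(' ', text)
    return stripped, tokens

# Stand-in pattern for illustration only: it matches just two emoji.
demo_finder = re.compile('[\U0001F600\U0001F4A9]')
demo_text, demo_tokens = extract_and_strip(
    'おはよう\U0001F600今日も\U0001F4A9', demo_finder, 'EMOJI', [])
```

In real use you would pass in the compiled emoji_finder (or any of the later finders) instead of demo_finder, and keep handing the same token list from step to step.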


Finding Usernames and Hashtags in Tweets

Now that we’ve stripped out the Emoji, we can handle the tasks dealing with “ordinary” Unicode characters. First let’s do the easy ones – usernames (which begin with @ on Twitter) and hashtags (which begin with #).

some_usernames = re.findall("@([a-z0-9_]+)", a_tweet, re.I)

some_hashtags = re.findall("#([a-z0-9_]+)", a_tweet, re.I)

Again, each of these functions returns a list. Note that they strip off the @ and # marks, so you should add those back in when you’re using a_tweet.replace() to get rid of them from your tweet text.
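One caveat worth flagging: [a-z0-9_] only covers ASCII, so a hashtag written in kana or kanji (#新宿, say) slips through. In Python 3, \w matches Unicode word characters in string patterns, so a variant along these lines may be closer to what you want for Japanese tweets. Treat it as a sketch; the function name and tags are made up for illustration, and \w may be more permissive than Twitter’s own hashtag rules.

```python
import re

# Usernames on Twitter are ASCII-only, so the original pattern is kept.
username_finder = re.compile(r'@([a-z0-9_]+)', re.I)
# \w matches Unicode letters, digits and underscore in Python 3 str patterns,
# so this also catches hashtags written in kana or kanji.
hashtag_finder = re.compile(r'#(\w+)')

def strip_names_and_tags(a_tweet):
    tokens = []
    for name in username_finder.findall(a_tweet):
        # findall() returns the captured group only, so add the @ back.
        tokens.append(('@' + name, 'USERNAME'))
        a_tweet = a_tweet.replace('@' + name, ' ')
    for tag in hashtag_finder.findall(a_tweet):
        tokens.append(('#' + tag, 'HASHTAG'))
        a_tweet = a_tweet.replace('#' + tag, ' ')
    return a_tweet, tokens

demo_text, demo_tokens = strip_names_and_tags('@rob_f 今日は祭り! #新宿 #matsuri')
```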


Finding URLs in Tweets

URLs have a number of consistent features, but they come in all sorts of shapes and sizes, and we need a system that effectively matches all of those and pulls them out of the tweet. The code below is a Python adaptation of a regular expression originally created by John Gruber, which is designed to match any kind of URL, and seems to do the job very effectively – I haven’t yet found any URLs it doesn’t match.

Don’t worry too much about what the regular expression actually does; this is very much one of those cases where there’s no shame in copy and pasting a complex piece of code that’s well tested but which you don’t fully understand… (Incidentally, though I’ve put the two commands together here, you should actually create your “url_finder” object at the outset and re-use it over and over again for every tweet, instead of running the re.compile() command each time.)

url_finder = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

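# The pattern contains capturing groups, so findall() returns tuples; the first element is the full URL.
some_urls = [groups[0] for groups in url_finder.findall(a_tweet)]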
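To illustrate the “compile once, reuse for every tweet” advice, here is a minimal sketch. The pattern is a deliberately simplified stand-in (an assumption for brevity; it only recognises http:// and https:// URLs), and strip_urls is a hypothetical name. It uses finditer() because findall() hands back tuples rather than strings whenever a pattern has more than one capturing group, whereas group(0) on a match object is always the whole match.

```python
import re

# Compiled once at module level and reused for every tweet.
# Simplified stand-in for the full pattern: http:// and https:// URLs only.
simple_url_finder = re.compile(r'https?://[^\s<>]+', re.I)

def strip_urls(a_tweet, tokens):
    # group(0) is the whole match, however many groups the pattern has.
    for match in simple_url_finder.finditer(a_tweet):
        tokens.append((match.group(0), 'URL'))
    return simple_url_finder.sub(' ', a_tweet), tokens

tweets = ['記事はこちら https://example.com/a?b=1 です', 'URLなし']
results = [strip_urls(tweet, []) for tweet in tweets]
```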

Finding Bracketed Characters in a Tweet

Next up, we’ll locate all those bracketed characters like (爆) and (汗) that often pop up in Japanese tweets (I don’t know if these have a name?). These are sort-of like a special case of the Kaomoji we’ll discuss in a moment, so be sure to strip them out before you do the Kaomoji – as mentioned before, these processing steps are being presented in order, and doing them in a different order may create some odd results.

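# re_text is the character class defined in the Kaomoji section below; define it before this line.
# \uff08 and \uff09 are the full-width brackets, so both bracket styles are matched.
bracket_finder = re.compile(r'[\(\uff08]' + re_text + r'[\)\uff09]')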

some_brackets = bracket_finder.findall(a_tweet)

This will find the characters complete with their brackets, and will only work for a single bracketed character (so it wouldn’t detect (爆笑) for example; but that’s a rare usage, and you start running into weird stuff like confusing this kind of character for people’s names being put in brackets etc.).
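A quick self-contained check of that behaviour, as a sketch: the character class is copied from re_text in the Kaomoji section, the full-width brackets are written as escapes, and the sample sentence is made up.

```python
import re

# Same character class as re_text in the Kaomoji section.
single_char = '[0-9A-Za-zぁ-ヶ一-龠]'
# \uff08 and \uff09 are the full-width opening and closing brackets.
demo_bracket_finder = re.compile(r'[\(\uff08]' + single_char + r'[\)\uff09]')

# One ASCII-bracketed character, one two-character case, one full-width case.
sample = 'テスト終わった(笑) 結果は(爆笑)だった\uff08汗\uff09'
found = demo_bracket_finder.findall(sample)
```

The two-character (爆笑) is skipped, while the single characters are picked up in either bracket style.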


Finding Kaomoji in a Tweet

This is the really hard one. There’s no set standard for Kaomoji, and new ones seem to be invented almost every day. I really struggled to come up with a way to tokenise these peculiar beasts, until I came across a research paper from 2015 by a pair of researchers at Meiji University, Kurosaki Yuta and Takagi Tomohiro. They wanted to conduct a sentiment analysis test on Kaomoji, which is interesting in itself, but the part of their research that was really useful to me is the regular expression they constructed for locating the Kaomoji in text. Below is my Python version of their regular expression.

re_text = '[0-9A-Za-zぁ-ヶ一-龠]'
re_nontext = '[^0-9A-Za-zぁ-ヶ一-龠]'
re_allowtext = '[ovっつ゜ニノ三二]'
re_hwkana = '[\uff66-\uff9f]'                    # half-width katakana range, written as escapes
re_openbracket = r'[\(∩꒰\uff08]'                 # \uff08 is the full-width opening bracket
re_closebracket = r'[\)∩꒱\uff09]'                # \uff09 is the full-width closing bracket
re_aroundface = '(?:' + re_nontext + '|' + re_allowtext + ')*'
re_face = '(?!(?:' + re_text + '|' + re_hwkana + '){3,}).{3,}'
kao_finder = re.compile(re_aroundface + re_openbracket + re_face + re_closebracket + re_aroundface)

some_kaomoji = kao_finder.findall(a_tweet)

This works really well, except for one thing; it doesn’t know how to handle tweets with more than one Kaomoji present, so if you have a tweet like 「おはようございます!b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ」, it will match the outside edges of the Kaomoji and then extract everything between them – so we get a list with a single entry like this: ['b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ'], rather than what we actually want, which is both Kaomoji separately: ['b(⌒o⌒)d', 'ヾ(〃^▽^)ノ'].

My solution to this is not terribly elegant, but it is effective in every case I’ve tried thus far; I wrote a function that recursively divides up the string into smaller and smaller elements, and checks to see if there’s an individual Kaomoji lurking in them. Here’s what it looks like, with a sample call to the function at the bottom:

def kaomoji_find(a_tweet, facelist=None):
    # Use None as the default so one list isn't shared between calls.
    if facelist is None:
        facelist = []
    faces = kao_finder.findall(a_tweet)
    for kao in faces:
        if len(kao) > 10:
            # A long match with several text characters probably spans more than one Kaomoji.
            if len(re.findall(re_text, kao)) > 4:
                firstthird = kao_finder.match(kao[:len(kao) // 3])
                if firstthird is not None:
                    facelist.append(firstthird.group())
                    facelist = kaomoji_find(a_tweet.replace(firstthird.group(), ''), facelist)
                else:
                    firsthalf = kao_finder.match(kao[:len(kao) // 2])
                    if firsthalf is not None:
                        facelist.append(firsthalf.group())
                        facelist = kaomoji_find(a_tweet.replace(firsthalf.group(), ''), facelist)
            else:
                facelist.append(kao)
        else:
            facelist.append(kao)
    return facelist

some_kaomoji = kaomoji_find(a_tweet)

Keeping Project Keywords Whole in Tokenisation Output

Once you’ve done all the above steps, you’re ready to feed the remainder of the tweet to MeCab for tokenisation, just like we did before; then you can stick the MeCab tokens and the tokens collected in the above steps all together to create the “bag of words” for this tweet. Remember, the order of the words doesn’t actually matter to Bag of Words approaches to machine learning, so it doesn’t matter how we stick the lists of tokens together.
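As a small illustration of that last point, the token lists can simply be concatenated and counted. This is a sketch with made-up tokens and hypothetical variable names; your own pipeline may want to keep the tags rather than counting on the surface form alone.

```python
from collections import Counter

# Hypothetical output of the earlier steps for one tweet.
extracted_tokens = [('\U0001F602', 'EMOJI'), ('#新宿', 'HASHTAG'), ('(笑)', 'BRACKET')]
# Hypothetical (token, part_of_speech) tuples from MeCab for the remaining text.
mecab_tokens = [('今日', '名詞'), ('祭り', '名詞'), ('今日', '名詞')]

# Order is irrelevant to a bag of words, so plain concatenation is enough.
all_tokens = extracted_tokens + mecab_tokens
bag_of_words = Counter(surface for surface, tag in all_tokens)
```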

There’s one final wrinkle to deal with, though. If you’ve compiled your data set by searching for a certain keyword, that keyword or phrase will appear in every tweet – and you’ll want to be sure that it’s always tokenised in the same way. You don’t want MeCab splitting up your keyword in unpredictable ways, because this can mess with various kinds of analysis that you might be doing with your bag of words further down the line.

One approach to fixing this would be to treat keywords like we treated emoji and kaomoji – stripping them out of the text before passing them to MeCab. Don’t do this! Tokenising algorithms like MeCab use the surrounding characters in text to figure out where word boundaries are most likely to be; they rely heavily on the context in which a word appears to figure out which word it is and even where the word begins and ends. By taking out your keyword, you’re mangling up a sentence and preventing MeCab from tokenising it correctly (and honestly, if you’re using Twitter data, there are going to be enough mangled, ungrammatical sentences in there to make MeCab have whole baskets of kittens anyway).

The solution, then, is to let MeCab tokenise the tweet with the keywords still intact, then check through its tokens to see if it’s split the keyword(s) up anywhere, and replace them with a complete, un-split version if so. Again, apologies for my messy Python, but here’s the function I created to accomplish that:

import MeCab

def find_tokens(tweet, keywords=None):
    if keywords is None: keywords = []
    mt = MeCab.Tagger("-d /usr/lib/mecab/dic/mecab-ipadic-neologd")
    mt.parse('')   # Who knows why this is required but it seems to fix UnicodeDecodeError appearing randomly.
    parsed = mt.parseToNode(tweet)
    components = []
    while parsed:
        if parsed.surface != '' and parsed.feature.split(',')[0] != "記号":
            components.append((parsed.surface, parsed.feature.split(',')[0]))
        parsed = parsed.next
    for a_keyword in keywords:
        cindex = 0
        while True:
            if cindex >= len(components):
                break
            temp_key = a_keyword
            if components[cindex][0] == temp_key:      # If the keyword is already tagged as one item, no problem.
                cindex += 1
                continue
            elif components[cindex][0] == temp_key[:len(components[cindex][0])]:  # We just matched the start of a keyword.
                match = False
                tempindex = cindex
                temp_key = temp_key.replace(components[tempindex][0], '', 1)
                while True:
                    tempindex += 1
                    if tempindex >= len(components): 
                        break
                    else:               # Test next element.
                        if components[tempindex][0] == temp_key[:len(components[tempindex][0])]:  
                            temp_key = temp_key.replace(components[tempindex][0], '', 1)
                            if temp_key == '':
                                match = True
                                break
                            else:
                                continue
                        else:
                            break
                if match:
                    components[cindex] = (a_keyword, 'PROJECT_KEYWORD')
                    del components[cindex+1:tempindex+1]      
                cindex += 1
                continue
            else:
                cindex += 1     # This component doesn't match the start of a keyword, so continue.
                continue

    return components

A few notes on the above code. Firstly, it works like the other functions in this post – pass it a tweet, and it’ll pass you back a list of tokens – but it also allows you to optionally give it a list of keywords which it will check through and make sure they’re tokenised as a single item.

This version of the code also throws out all punctuation and whitespace (that’s the parsed.feature.split(',')[0] != "記号" part). I figured since we’ve extracted kaomoji etc., we can live without the remaining punctuation – it’s unlikely to be of value to analysis. If you have a different set of circumstances or requirements, you can remove that part of the code to hang on to punctuation tokens. Finally, this code doesn’t just output a set of tokens, it outputs a list of tuples in the form (a_token, part_of_speech) – with the part_of_speech bit being something like 名詞 or 動詞, indicating what kind of word MeCab reckons this is. For some analysis tasks, it can be useful to do something like excluding particles (助詞) or auxiliary verbs (助動詞) – again, this really depends what you’re trying to learn from your text.
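Both post-processing ideas from this section (re-joining a split keyword, and dropping certain parts of speech) can be tried out without MeCab installed by working on a hand-written list of (a_token, part_of_speech) tuples. The sketch below is a simplified illustration rather than the function above: merge_keyword and drop_parts_of_speech are hypothetical names, and the sample tokens are made up.

```python
def merge_keyword(components, keyword, tag='PROJECT_KEYWORD'):
    """Re-join consecutive tokens whose surfaces spell out `keyword`."""
    merged = []
    i = 0
    while i < len(components):
        joined = ''
        j = i
        # Extend the window while it is still a prefix of the keyword.
        while j < len(components) and keyword.startswith(joined + components[j][0]):
            joined += components[j][0]
            j += 1
            if joined == keyword:
                break
        if joined == keyword and j - i > 1:
            # The keyword was split over several tokens: replace them with one.
            merged.append((keyword, tag))
            i = j
        else:
            # Either no match, or the keyword was already a single token.
            merged.append(components[i])
            i += 1
    return merged

def drop_parts_of_speech(components, excluded=('助詞', '助動詞')):
    """Remove tokens whose part of speech is in `excluded`."""
    return [(surface, pos) for surface, pos in components if pos not in excluded]

demo = [('総理', '名詞'), ('大臣', '名詞'), ('が', '助詞'), ('総理', '名詞'), ('を', '助詞')]
merged = merge_keyword(demo, '総理大臣')
content_words = drop_parts_of_speech(merged)
```

Note that the lone 総理 near the end is left alone, because the tokens after it do not complete the keyword.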


Next Steps

And that’s it! Combined with the MeCab instructions in the previous post, that’s pretty much the set of components you need to build a pretty effective bag of words representation of a Japanese language social media post. It’s obviously a lot more complex than tokenisation for a “traditional” piece of text like a newspaper article, simply because people use text in unusual and non-traditional ways on social media. At some point I intend to do a test to see whether there’s a major difference in, for example, the sentiment analysis results you get from using a normal bag of words and my improved, pre-processed bag of words; I suspect there should be a measurable difference because we’re saving a number of elements with significant relevance to sentiment, such as kaomoji, that would be thrown out by a traditional bag-of-words processor. I’ll have to get the rest of our tool pipeline up and running before I can run a side-by-side test, though.

(Next post in this intermittent series will likely be something about how we end up processing and learning from our bag of words, introducing some of the core Python tools for natural language processing and machine learning such as scikit-learn and Gensim.)

Japanese Text Analysis in Python

This is a more technical post than I usually write, but it will be useful to some people. My political science research involves some natural language processing and machine learning, which I use to analyse texts from Japanese newspapers and social media – so one of the challenges is teaching a computer to “read” Japanese. Luckily, there are some tools out there which make this (relatively) straightforward.

For this guide, I’m using Python 3.5. My development system runs on macOS 10.12 and I deploy my code to a server running Ubuntu 12.16, so this guide will include commands for setting up the software on both macOS and Linux/Ubuntu. I don’t use any version of Windows so I don’t know how to set this up on Windows – if anyone can provide a set of Windows commands that achieve the same goal, drop me an email and I’ll add them to the blog post.

What are we trying to accomplish?

The first hurdle to doing any analysis of Japanese text is segmentation, or tokenisation; breaking it down into usable chunks. For European languages, words have spaces between them, so you can just divide everything up at the word boundaries (the spaces, commas, full stops and so on), which yields an array of words that we can use to calculate things like frequencies or co-occurrence matrices. Japanese, however, has no spaces in its text, so there’s an extra pre-processing step required before we can start using these text analysis approaches.

In essence, we want to turn a string like this…

"今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"

… into an array like this…


["今日", "は", "いい", "天気", "です", "ね", "遊び", "に", "行か", "ない", "新宿", "で", "祭り", "が", "ある"]

… which a computer can process to figure out frequencies, co-occurrences and so on.
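To make “frequencies and co-occurrences” a little more concrete, here is a minimal sketch of what you might do once the text has been tokenised. The two short token lists are made up, and counting each unordered pair once per document is just one of several possible definitions of co-occurrence.

```python
from collections import Counter
from itertools import combinations

# Two already-tokenised texts (hypothetical examples).
docs = [
    ["今日", "は", "いい", "天気"],
    ["今日", "は", "祭り"],
]

# How often each token appears across all texts.
frequencies = Counter(token for doc in docs for token in doc)

# How many texts each unordered pair of distinct tokens appears in together.
cooccurrences = Counter()
for doc in docs:
    cooccurrences.update(combinations(sorted(set(doc)), 2))
```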

A Software Shopping List

That’s the objective. Here’s a quick list of what we’re going to use to get there. (The URLs are for completeness only; we’ll be downloading everything from the command line.)

System Software:
MeCab

Dictionaries:
MeCab-ipadic
MeCab-ipadic-neologd

Python Libraries:
mecab-python3

This is not a definitive list of every piece of software that can segment Japanese text. There’s a big range out there, from the lightweight tinysegmenter through to fully featured software like kuromoji. I’m using this setup for two reasons. First, in my experience it’s the best at handling text data sourced from the Internet. Second, I think it’s the best trade-off of simplicity versus accuracy. tinysegmenter is much easier to set up and use, but its output is unreliable; it often breaks apart words that are actually common phrases or proper names (the name of the present Japanese prime minister, 安倍晋三, is rendered by tinysegmenter as [“安倍”, “晋”, “三”], not the correct [“安倍”, “晋三”] or [“安倍晋三”]; the same problem occurs with the word for prime minister itself, 総理大臣, which comes out as [“総理”, “大臣”] when what you (probably) want is [“総理大臣”]). MeCab works nicely with Python, and it’s easy to set it up with an extensive dictionary of common phrases and neologisms so its output is very accurate.

One other thing; if you’re following this tutorial on macOS, I expect you to have a little bit of familiarity with the Terminal. If you don’t know your way around Terminal at all, your homework is to go and install the “Homebrew” package; we’ll be using it to install a lot of the rest of the software. There are detailed instructions on the homepage and once you’ve done that, you’ll be able to get cracking with installing MeCab and its dictionaries.

Installing MeCab

First, let’s get MeCab (the core segmentation software) and MeCab-ipadic up and running. (Again, for macOS users, this assumes that you have successfully installed the Homebrew package.)

macOS:

brew install mecab
brew install mecab-ipadic

Ubuntu:

sudo apt-get install mecab mecab-ipadic libmecab-dev
sudo apt-get install mecab-ipadic-utf8

That’ll take a while, but once it’s done, MeCab should be up and running. You can test it by typing “mecab” at the terminal; in the blank line it gives you afterwards, type some Japanese text, and press enter. The result should look like this:

$ mecab
無事にインストール出来ました !
無事 名詞,形容動詞語幹,*,*,*,*,無事,ブジ,ブジ
に 助詞,副詞化,*,*,*,*,に,ニ,ニ
インストール 名詞,一般,*,*,*,*,インストール,インストール,インストール
出来 動詞,自立,*,*,一段,連用形,出来る,デキ,デキ
まし 助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
! 記号,一般,*,*,*,*,!,!,!
EOS

As you can see, it’s working nicely and correctly identifying parts of speech in this test sentence. Press Ctrl-C to get back to the terminal command line, and let’s continue.

The next step is getting mecab-ipadic-neologd working. This is a bit more complex, since we need to download the dictionary of neologisms and slang (vital for handling text from the Internet) and then recompile it for MeCab. First, we install the tools used to download (“clone”) the most recent version of the dictionary, then we compile and install the dictionary itself.

macOS:

brew install git curl xz
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n

Ubuntu:

sudo apt-get install git curl
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo ./bin/install-mecab-ipadic-neologd -n

You’ll need to type “yes” at some point in the install process to confirm that you’re okay with overwriting some dictionary defaults. Now let’s check that it worked. To use this dictionary at the command line, we need to specify it when we invoke MeCab:

macOS:

mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/

Ubuntu:

mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd/

(Note the different paths – if you’re on Ubuntu you’ll also need to change the path given in the Python files below to match.)

If you can type some Japanese text and have it tokenised (as in the previous example), then everything is working.

Working with Python

The next step is to get MeCab talking to Python. Again, this tutorial assumes you’re using Python 3 (I’m on 3.5, but any recent version of Python should be fine); there is also a library for Python 2, but I haven’t used or installed it. If you’re an advanced Python user and are using a virtual environment setup for your packages, you should switch to that environment now. (All commands from here onwards should work the same on both macOS and Ubuntu.)

At the terminal, type:

pip3 install mecab-python3

(The command may be “pip“, not “pip3“, on your system, but a lot of systems use Python 2 internally for various things, and therefore rename the Python 3 tools with a “3” suffix to avoid clashes. Also, depending on your setup, you may need to “sudo” that command on Ubuntu.)

Now open your Python editor, whether that’s an IDE (I use PyCharm personally) or just a text window, and let’s see if this is working. Try the following code:

import MeCab
test = "今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"
mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
print(mt.parse(test))

When you run this code, you should see the same output you got at the Terminal earlier. (Note that if you’re on Ubuntu, you’ll need to change the location of mecab-ipadic-neologd in that code to match the location we were using earlier; the same goes for subsequent code in this example.) Okay, so Python is now talking to MeCab, but how can we turn that output into something useful that we can use in our analysis?

import MeCab
test = "今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"
mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

parsed = mt.parseToNode(test)
components = []
while parsed:
    components.append(parsed.surface)
    parsed = parsed.next

print(components)

This will give you the following output…


['', '今日', 'は', 'いい', '天気', 'です', 'ね', '。', '遊び', 'に', '行か', 'ない', '?', '新宿', 'で', '祭り', 'が', 'ある', '!', '']

… which is what we’ve been aiming for from the outset! The parseToNode function of the Tagger generates an object that you can iterate over to see each token in the text. The actual token is accessed as “.surface“; if you want to see details of the part-of-speech or pronunciation, you can access that as “.feature“.
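The “.feature” value is a comma-separated string like the ones in the terminal output earlier, so it is easy to split into named fields. The following is a sketch only: the English field names are my own labels (an assumption, not something MeCab provides), and the order is the one used by the IPA dictionary in the example output above.

```python
# Field order of the IPA dictionary's feature string; the names are illustrative labels.
IPADIC_FIELDS = ['pos', 'pos_detail1', 'pos_detail2', 'pos_detail3',
                 'conjugation_type', 'conjugation_form', 'base_form',
                 'reading', 'pronunciation']

def parse_feature(feature):
    values = feature.split(',')
    # Some entries may carry fewer fields; zip() simply stops at the shorter list.
    return dict(zip(IPADIC_FIELDS, values))

# Feature string taken from the terminal example for the token 出来.
info = parse_feature('動詞,自立,*,*,一段,連用形,出来る,デキ,デキ')
```

With a real node you would pass parsed.feature instead of the literal string; info['base_form'] then gives the dictionary form, which can be handy if you want to count 出来 and 出来る as the same word.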

You’ve now got a system that will take Japanese text input (like a tweet, a Facebook post, or a blog entry) and turn it into a list of tokens. You can then apply exactly the same techniques to those tokens that you would to (more easily segmented) European languages.

A Few Cautionary Words

Before turning this system loose on a large volume of tweets or other social media data, I would strongly suggest writing some code to clean up your input – in particular, code to strip out and separately handle URLs, image links and (on Twitter) @mentions. MeCab doesn’t handle these well – it breaks up URLs into little chunks which will throw off a lot of machine learning algorithms, for example.

One suggestion (and the one I use in my own research) is to strip non-text elements from the text before handing it over to MeCab, and then add things like URLs and images back into the list of tokens that MeCab returns to you – either as the full URL (ideally the non-shortened version) and @name, or as a token referencing a table that you can look up later (“URL1248”, “USER2234”).
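The lookup-table variant might look something like the sketch below. Everything here is an assumption made for illustration: the two patterns are simplified stand-ins, strip_to_placeholders is a hypothetical name, and the numbering scheme is just one way of generating tokens like “URL1” or “USER1”.

```python
import re

url_pattern = re.compile(r'https?://[^\s<>]+')      # simplified stand-in
mention_pattern = re.compile(r'@[A-Za-z0-9_]+')

def strip_to_placeholders(text, pattern, prefix, lookup):
    """Remove matches from `text`, returning placeholder tokens for them.

    `lookup` maps each original string to its placeholder, so the same URL
    or user always gets the same token across tweets.
    """
    placeholders = []
    for original in pattern.findall(text):
        # setdefault() reuses the existing placeholder if one has been assigned.
        key = lookup.setdefault(original, '{}{}'.format(prefix, len(lookup) + 1))
        placeholders.append(key)
    return pattern.sub(' ', text), placeholders

url_lookup, user_lookup = {}, {}
text = '@taro 見て https://example.com/x @taro'
text, user_tokens = strip_to_placeholders(text, mention_pattern, 'USER', user_lookup)
text, url_tokens = strip_to_placeholders(text, url_pattern, 'URL', url_lookup)
```

The cleaned text can then go to MeCab, and user_tokens and url_tokens can be appended to the token list it returns.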

Also, if you’re planning on analysing hashtags, note that MeCab doesn’t understand those either (it’ll tokenise the # and the text separately), so you may also want to pre-process those.

Finally, it’s quite likely that you’ll want to strip some elements out of the list MeCab returns to you as well. It has a habit of returning empty elements at the beginning and end of text, which you can remove; depending on the analysis you’re conducting, you may also wish to strip out punctuation, or even certain parts of speech (in which case you should check the “.feature” segment of each node to decide whether to keep or dispose of each given token).