Tidying up Japanese SNS data for Machine Learning

In my last post about performing text analysis of Japanese language texts, I outlined how to install and use the MeCab system to break up Japanese sentences into their component parts (“tokens”) which can then be used for analysis. At the end of the post, I mentioned that if you’re using text sourced from social media like Twitter or Facebook, you might want to pre-process your data to deal with things like usernames, hashtags and URLs, which MeCab doesn’t understand or handle reliably.

Since then, I’ve spent some time building a tokenisation system to deal with very large volumes of data – the databases for the research project I’m working on at present sum up to about 15 million Japanese language tweets, and we expect to end up dealing with many times that volume by the end of the project. What have I learned? Quite a few things (not least a lot of stuff about using Google Cloud Platform, since this all rapidly outgrew my humble laptop), but one of the main ones was this: Twitter data is a god-damned mess. It’s not just hashtags, URLs and so on; people also tweet a lot of emoji, which aren’t handled very well by a lot of analysis systems, and in Japan there’s also a tendency to tweet lots of “kaomoji” (you know, stuff that looks like “(。ŏ﹏ŏ)” or “((;,;;  ิ;;◞౪◟;; ิ;))”, and no, I don’t really know what that second one is meant to convey either…) as well as expressing feelings with single characters in brackets, like (笑) meaning laughter, or (涙) implying crying, which can also end up confusing the tagger.

A lot of conventional approaches to machine learning and text analysis just throw out those elements of the data. A common approach is to strip all punctuation, since it isn’t considered to have semantic meaning that’s useful to the machine learning system – but a kaomoji or a bracketed character clearly does have semantic meaning. The inclusion of a laughing kaomoji, or an emoji with a sweatdrop, or a bracketed character for crying, can radically alter the sentiment and meaning expressed by a tweet – in fact, they’re especially important on Twitter, where the 140 character limit means that people seek to find “economical” ways of expressing complex emotions and thoughts.

As a result, the tokenisation system I’ve built for our research is a fair bit more complex than I’d originally intended; it now strips out usernames, hashtags, URLs, emoji, kaomoji and bracketed characters from the data before passing it to MeCab to tokenise the remaining Japanese. There’s also a post-processing stage where I make sure that the keywords we used to build the data set (i.e. the Twitter search terms, which should appear in every single tweet in the data) are being tokenised as a single word, and not split up into separate words, as this could mess up analysis results further down the line. For the benefit of anyone trying to build something similar, this post will introduce all the systems I pass the tweets through in order to achieve this processing, in the order in which they’re done.

Finding Emoji in Tweets

Emoji – be they smiley faces, grinning turds or tiny depictions of churches and love hotels – have become ubiquitous on the Internet, but they turn out to be rather difficult to handle in text mining / machine learning approaches. In fact, some systems which don’t handle Emoji properly can end up making serious errors, as they not only misinterpret the emoji itself, but allow it to “pollute” their understanding of surrounding text characters too. MeCab doesn’t do a terrible job with Emoji, but frequently misinterprets them – so let’s find them and strip them out before passing the tweet over.

The problem is that this is a much harder task than it looks, because the standard for Emoji changes rapidly and isn’t simple. A certain number of ranges of characters in the Unicode standard (which is a system designed to create a standardised list of every character in every world language, thus ending garbled foreign language characters (文字化け) for good) are defined as being emoji – but the list isn’t fully agreed and is often updated. The most recent list I could find is from late 2016, and I’ve uploaded a copy of it here – feel free to download it and use it in your own project. The format it’s in is a Regular Expression (kind of a programming mini-language that allows you to do complex matching of characters and strings based on a set of conditions), and the way to use it in your Python program is as follows:

import re

# The file holds one long regular expression; read it as UTF-8 and compile it once.
with open('emoji_list.txt', encoding='utf-8') as emojifile:
    emoji_regex = emojifile.read().strip()
emoji_finder = re.compile(emoji_regex)

Now you can use emoji_finder as follows:

some_emoji = emoji_finder.findall(a_tweet)
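
If you don't have the emoji list file to hand, a rough stand-in can be built from a few Unicode blocks. This is only a sketch for trying out the rest of the pipeline: the three ranges below are an assumption chosen for illustration, not the contents of the list above, and they cover far fewer characters than a full emoji list would.

```python
import re

# Assumed, deliberately incomplete set of ranges, for illustration only:
#   U+2600-U+26FF    Miscellaneous Symbols
#   U+1F300-U+1F5FF  Miscellaneous Symbols and Pictographs
#   U+1F600-U+1F64F  Emoticons
rough_emoji_finder = re.compile(
    '[\u2600-\u26ff\U0001f300-\U0001f5ff\U0001f600-\U0001f64f]'
)

# The emoji are written as escapes so the sample survives any text encoding.
sample_tweet = 'おはよう\U0001f600 今日は晴れ\u2600'
rough_emoji = rough_emoji_finder.findall(sample_tweet)
```

Here rough_emoji ends up holding the two emoji from the sample, in the order they appear.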

This will return a list of the emoji in the tweet. I suggest adding them to the master list of tokens, and deleting them from the tweet itself before moving on to the next step of the tokenisation process. This is what you’ll do at every step; adding the elements you’ve extracted to the token list (along with a tag to say what kind of element they are, similar to the Part of Speech (POS) tagging that MeCab provides), and removing them from the tweet itself. Here’s a bit of sample code that does that:

some_tokens = []   # the master list of (token, tag) tuples for this tweet
for an_emoji in emoji_finder.findall(a_tweet):
    some_tokens.append((an_emoji, 'EMOJI'))
    a_tweet = a_tweet.replace(an_emoji, ' ')

Note that we’re replacing the emoji with a blank space, not deleting them entirely. This is deliberate; if the emoji was separating two words / sentences, i.e. the user was using it in place of punctuation, then shoving them back together could confuse MeCab and cause inaccurate tokenisation. If you’re building your own tokeniser, you’ll create a variation of the above function for every step along the way, so I won’t repeat the code for each one.
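
Since every step follows the same extract-tag-replace pattern, it could be wrapped in one small helper. The following is a minimal sketch, not the code used in the project: the name extract_tokens and the 'NUMBER' tag are made up for illustration, and the digit pattern is only a stand-in for a real finder.

```python
import re

def extract_tokens(finder, a_tweet, tag, some_tokens):
    """Move every match of `finder` out of the tweet and into the token list."""
    # finditer + group(0) always yields the full matched text, even when the
    # pattern contains capturing groups.
    for found in [m.group(0) for m in finder.finditer(a_tweet)]:
        some_tokens.append((found, tag))
        # Replace with a space rather than deleting, so neighbouring words stay apart.
        a_tweet = a_tweet.replace(found, ' ', 1)
    return a_tweet

# Stand-in pattern for the demonstration; any compiled regex can be passed in.
digit_finder = re.compile(r'[0-9]+')
some_tokens = []
remaining = extract_tokens(digit_finder, '今日は25度、明日は30度', 'NUMBER', some_tokens)
```

The helper returns the tweet with the matches blanked out, and some_tokens picks up one tagged tuple per match.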

Finding Usernames and Hashtags in Tweets

Now that we’ve stripped out the Emoji, we can handle the tasks dealing with “ordinary” unicode characters. First let’s do the easy ones – usernames (which begin with @ on Twitter) and hashtags (which begin with #).

some_usernames = re.findall("@([a-z0-9_]+)", a_tweet, re.I)

some_hashtags = re.findall("#([a-z0-9_]+)", a_tweet, re.I)

Again, each of these functions returns a list. Note that they strip off the @ and # marks, so you should add those back in when you’re using a_tweet.replace() to get rid of them from your tweet text.
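
One way to avoid adding the marks back in by hand is to leave the @ and # inside the match. This is a sketch under that assumption, with a made-up sample tweet and made-up tag names. Note that the character class only covers ASCII letters, digits and underscores, so a hashtag written in Japanese script would need a wider pattern.

```python
import re

# Same character classes as above, but with the prefix kept inside the match.
username_finder = re.compile(r'@[a-z0-9_]+', re.I)
hashtag_finder = re.compile(r'#[a-z0-9_]+', re.I)

a_tweet = '@Sample_User 会場に着いた #tokyo2020 #Event'
some_tokens = []
for finder, tag in ((username_finder, 'USERNAME'), (hashtag_finder, 'HASHTAG')):
    for found in finder.findall(a_tweet):
        some_tokens.append((found, tag))
        # The match already includes @ or #, so it can be replaced directly.
        a_tweet = a_tweet.replace(found, ' ')
```

After the loop, a_tweet holds only the Japanese text plus some surrounding spaces.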

Finding URLs in Tweets

URLs have a number of consistent features, but they come in all sorts of shapes and sizes, and we need a system that effectively matches all of those and pulls them out of the tweet. The code below is a Python adaptation of a regular expression originally created by John Gruber, which is designed to match any kind of URL, and seems to do the job very effectively – I haven’t yet found any URLs it doesn’t match.

Don’t worry too much about what the regular expression actually does; this is very much one of those cases where there’s no shame in copy and pasting a complex piece of code that’s well tested but which you don’t fully understand… (Incidentally, though I’ve put the two commands together here, you should actually create your “url_finder” object at the outset and re-use it over and over again for every tweet, instead of running the re.compile() command each time.)

url_finder = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

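# The pattern contains capturing groups, so findall() would return tuples of groups;
# finditer() with group(0) gives each complete URL as a string instead.
some_urls = [match.group(0) for match in url_finder.finditer(a_tweet)]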

Finding Bracketed Characters in a Tweet

Next up, we’ll locate all those bracketed characters like (爆) and (汗) that often pop up in Japanese tweets (I don’t know if these have a name?). These are sort-of like a special case of the Kaomoji we’ll discuss in a moment, so be sure to strip them out before you do the Kaomoji – as mentioned before, these processing steps are being presented in order, and doing them in a different order may create some odd results.

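re_text = '[0-9A-Za-zぁ-ヶ一-龠]'   # the same character class is used again in the Kaomoji section below
# \uff08 and \uff09 are the full-width brackets, written as escapes so they can't be confused with the ASCII ones.
bracket_finder = re.compile('[(\uff08]' + re_text + '[)\uff09]')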

some_brackets = bracket_finder.findall(a_tweet)

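This will find the characters complete with their brackets, and will only work for a single character between the brackets (so it wouldn’t detect (爆笑) for example; but that’s a rare usage, and you start running into weird stuff like confusing this kind of character for people’s names being put in brackets etc.). Note that re_text matches kana, Latin letters and digits as well as kanji, so something like (a) or (1) will be picked up too.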
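
Here is a small self-contained check of that behaviour. It's a sketch that assumes both ASCII and full-width brackets should be accepted, and the sample tweet is made up.

```python
import re

# Same character class as re_text in the kaomoji section of this post.
re_text = '[0-9A-Za-zぁ-ヶ一-龠]'
# \uff08 and \uff09 are the full-width brackets, written as escapes for clarity.
bracket_finder = re.compile('[(\uff08]' + re_text + '[)\uff09]')

sample_tweet = '電車で寝過ごした(汗) でも間に合った\uff08笑\uff09 (爆笑)'
some_brackets = bracket_finder.findall(sample_tweet)
```

The two single-character items are found with their brackets attached, while the two-character (爆笑) is left in the tweet for MeCab to deal with.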

Finding Kaomoji in a Tweet

This is the really hard one. There’s no set standard for Kaomoji, and new ones seem to be invented almost every day. I really struggled to come up with a way to tokenise these peculiar beasts, until I came across a research paper from 2015 by a pair of researchers at Meiji University, Kurosaki Yuta and Takagi Tomohiro. They wanted to conduct a sentiment analysis test on Kaomoji, which is interesting in itself, but the part of their research that was really useful to me is the regular expression they constructed for locating the Kaomoji in text. Below is my Python version of their regular expression.

re_text = '[0-9A-Za-zぁ-ヶ一-龠]'
re_nontext = '[^0-9A-Za-zぁ-ヶ一-龠]'
re_allowtext = '[ovっつ゜ニノ三二]'
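re_hwkana = '[\uff66-\uff9f]'   # half-width katakana, U+FF66 to U+FF9F
re_openbracket = '[(∩꒰\uff08]'   # \uff08 is the full-width opening bracket
re_closebracket = '[)∩꒱\uff09]'   # \uff09 is the full-width closing bracket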
re_aroundface = '(?:' + re_nontext + '|' + re_allowtext + ')*'
re_face = '(?!(?:' + re_text + '|' + re_hwkana + '){3,}).{3,}'
kao_finder = re.compile(re_aroundface + re_openbracket + re_face + re_closebracket + re_aroundface)

some_kaomoji = kao_finder.findall(a_tweet)

This works really well, except for one thing; it doesn’t know how to handle tweets with more than one Kaomoji present, so if you have a tweet like 「おはようございます!b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ」, it will match the outside edges of the Kaomoji and then extract everything between them – so we get a list with a single entry like this: ['b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ'], rather than what we actually want, which is both Kaomoji separately: ['b(⌒o⌒)d', 'ヾ(〃^▽^)ノ'].
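
The cause is ordinary greedy matching, which is easy to see with a much simpler pattern. The sketch below is not the real kao_finder, just a bracket, three or more characters of anything, and a closing bracket. A non-greedy version is included purely for contrast; whether that change alone would be enough for the full pattern is untested here.

```python
import re

# Simplified stand-ins for illustration; neither is the real kaomoji pattern.
greedy_face = re.compile(r'\(.{3,}\)')    # .{3,} grabs as much as it can
lazy_face = re.compile(r'\(.{3,}?\)')     # .{3,}? stops at the first closing bracket

sample = 'おはよう(^_^)/ いい天気(T_T)'
greedy_result = greedy_face.findall(sample)
lazy_result = lazy_face.findall(sample)
```

The greedy pattern returns one long string running from the first opening bracket to the last closing bracket, while the non-greedy one returns the two faces separately.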

My solution to this is not terribly elegant, but it is effective in every case I’ve tried thus far; I wrote a function that recursively divides up the string into smaller and smaller elements, and checks to see if there’s an individual Kaomoji lurking in them. Here’s what it looks like, with a sample call to the function at the bottom:

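def kaomoji_find(a_tweet, facelist=None):
    if facelist is None:
        facelist = []
    for kao in kao_finder.findall(a_tweet):
        # A long match with several ordinary text characters in it has probably
        # swallowed two or more kaomoji plus the text between them.
        if len(kao) > 10 and len(re.findall(re_text, kao)) > 4:
            # Look for a complete kaomoji at the start of the first third,
            # and failing that, at the start of the first half.
            firstpart = (kao_finder.match(kao[:len(kao) // 3])
                         or kao_finder.match(kao[:len(kao) // 2]))
            if firstpart is not None:
                facelist.append(firstpart.group())
                # Recurse on what's left of this match once the first face is removed.
                kaomoji_find(kao.replace(firstpart.group(), '', 1), facelist)
                continue
        # Short matches, and long ones we couldn't split, are kept as they are.
        facelist.append(kao)
    return facelist

some_kaomoji = kaomoji_find(a_tweet)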

Keeping Project Keywords Whole in Tokenisation Output

Once you’ve done all the above steps, you’re ready to feed the remainder of the tweet to MeCab for tokenisation, just like we did before; then you can stick the MeCab tokens and the tokens collected in the above steps all together to create the “bag of words” for this tweet. Remember, the order of the words doesn’t actually matter to Bag of Words approaches to machine learning, so it doesn’t matter how we stick the lists of tokens together.
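
As a minimal sketch of that final step, the token lists can simply be concatenated and counted. The lists below are made-up examples in the (token, tag) form used throughout this post, and using collections.Counter as the bag is an assumption for illustration; any structure that counts tokens would do.

```python
from collections import Counter

# Hypothetical token lists, as (token, tag) tuples from the earlier steps.
emoji_tokens = [('\U0001f600', 'EMOJI')]
kaomoji_tokens = [('(^_^)', 'KAOMOJI')]
mecab_tokens = [('今日', '名詞'), ('天気', '名詞'), ('今日', '名詞')]

# Order is irrelevant to a bag of words, so plain concatenation is enough.
all_tokens = emoji_tokens + kaomoji_tokens + mecab_tokens
bag_of_words = Counter(token for token, tag in all_tokens)
```

Each distinct token becomes a key, with the number of times it appeared in the tweet as its value.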

There’s one final wrinkle to deal with, though. If you’ve compiled your data set by searching for a certain keyword, that keyword or phrase will appear in every tweet – and you’ll want to be sure that it’s always tokenised in the same way. You don’t want MeCab splitting up your keyword in unpredictable ways, because this can mess with various kinds of analysis that you might be doing with your bag of words further down the line.

One approach to fixing this would be to treat keywords like we treated emoji and kaomoji – stripping them out of the text before passing them to MeCab. Don’t do this! Tokenising algorithms like MeCab use the surrounding characters in text to figure out where word boundaries are most likely to be; they rely heavily on the context in which a word appears to figure out which word it is and even where the word begins and ends. By taking out your keyword, you’re mangling up a sentence and preventing MeCab from tokenising it correctly (and honestly, if you’re using Twitter data, there are going to be enough mangled, ungrammatical sentences in there to make MeCab have whole baskets of kittens anyway).

The solution, then, is to let MeCab tokenise the tweet with the keywords still intact, then check through its tokens to see if it’s split the keyword(s) up anywhere, and replace them with a complete, un-split version if so. Again, apologies for my messy Python, but here’s the function I created to accomplish that:

import MeCab

def find_tokens(tweet, keywords=None):
    if keywords is None:
        keywords = []
    mt = MeCab.Tagger("-d /usr/lib/mecab/dic/mecab-ipadic-neologd")
    mt.parse('')   # Who knows why this is required but it seems to fix UnicodeDecodeError appearing randomly.
    parsed = mt.parseToNode(tweet)
    components = []
    while parsed:
        pos = parsed.feature.split(',')[0]
        if parsed.surface != '' and pos != "記号":
            components.append((parsed.surface, pos))
        parsed = parsed.next
    for a_keyword in keywords:
        cindex = 0
        while cindex < len(components):
            surface = components[cindex][0]
            if surface == a_keyword:      # If the keyword is already tagged as one item, no problem.
                cindex += 1
            elif a_keyword.startswith(surface):  # We just matched the start of a keyword.
                match = False
                tempindex = cindex
                temp_key = a_keyword[len(surface):]   # The part of the keyword still to be matched.
                while True:
                    tempindex += 1
                    if tempindex >= len(components):
                        break           # Ran out of tokens before completing the keyword.
                    next_surface = components[tempindex][0]
                    if not temp_key.startswith(next_surface):
                        break           # The next token doesn't continue the keyword.
                    temp_key = temp_key[len(next_surface):]
                    if temp_key == '':
                        match = True    # Every piece of the keyword has been found.
                        break
                if match:
                    components[cindex] = (a_keyword, 'PROJECT_KEYWORD')
                    del components[cindex+1:tempindex+1]
                cindex += 1
            else:
                cindex += 1     # This component doesn't match the start of a keyword, so continue.

    return components

A few notes on the above code. Firstly, it works like the other functions in this post – pass it a tweet, and it’ll pass you back a list of tokens – but it also allows you to optionally give it a list of keywords which it will check through and make sure they’re tokenised as a single item.
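
To see the keyword-merging step on its own, without needing MeCab installed, here is a stripped-down sketch that works on a hand-written token list. The function name merge_keyword and the sample tokens are made up for illustration; it isn't the function above, just the same idea in isolation.

```python
def merge_keyword(components, keyword):
    """Return a copy of components with runs of tokens spelling `keyword` merged."""
    merged = []
    i = 0
    while i < len(components):
        joined = ''
        j = i
        # Extend the run while the joined surfaces remain a prefix of the keyword.
        while j < len(components) and keyword.startswith(joined + components[j][0]):
            joined += components[j][0]
            j += 1
            if joined == keyword:
                break
        if joined == keyword and j - i > 1:
            # Several tokens together spell the keyword: replace them with one.
            merged.append((keyword, 'PROJECT_KEYWORD'))
            i = j
        else:
            merged.append(components[i])
            i += 1
    return merged

# Hand-written tokens standing in for MeCab output.
split_tokens = [('東京', '名詞'), ('オリンピック', '名詞'), ('が', '助詞'), ('楽しみ', '名詞')]
merged_tokens = merge_keyword(split_tokens, '東京オリンピック')
```

The first two tokens are collapsed into a single PROJECT_KEYWORD item and the rest pass through untouched.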

This version of the code also throws out all punctuation and whitespace (that’s the parsed.feature.split(',')[0] != "記号" part). I figured since we’ve extracted kaomoji etc., we can live without the remaining punctuation – it’s unlikely to be of value to analysis. If you have a different set of circumstances or requirements, you can remove that part of the code to hang on to punctuation tokens. Finally, this code doesn’t just output a set of tokens, it outputs a list of tuples in the form (a_token, part_of_speech) – with the part_of_speech bit being something like 名詞 or 動詞, indicating what kind of word MeCab reckons this is. For some analysis tasks, it can be useful to do something like excluding particles (助詞) or auxiliary verbs (助動詞) – again, this really depends what you’re trying to learn from your text.
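
Filtering on the part-of-speech tag is a one-liner once the tokens are in that tuple form. A minimal sketch, with a made-up token list standing in for the output of find_tokens:

```python
# Hypothetical output in the (a_token, part_of_speech) form described above.
tokens = [('猫', '名詞'), ('が', '助詞'), ('鳴い', '動詞'), ('た', '助動詞')]

# Parts of speech to exclude: particles (助詞) and auxiliary verbs (助動詞).
excluded_pos = {'助詞', '助動詞'}
content_tokens = [(token, pos) for token, pos in tokens if pos not in excluded_pos]
```

Only the noun and the verb stem are left in content_tokens.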

Next Steps

And that’s it! Combined with the MeCab instructions in the previous post, that’s pretty much the set of components you need to build a pretty effective bag of words representation of a Japanese language social media post. It’s obviously a lot more complex than tokenisation for a “traditional” piece of text like a newspaper article, simply because people use text in unusual and non-traditional ways on social media. At some point I intend to do a test to see whether there’s a major difference in, for example, the sentiment analysis results you get from using a normal bag of words and my improved, pre-processed bag of words; I suspect there should be a measurable difference because we’re saving a number of elements with significant relevance to sentiment, such as kaomoji, that would be thrown out by a traditional bag-of-words processor. I’ll have to get the rest of our tool pipeline up and running before I can run a side-by-side test, though.

(Next post in this intermittent series will likely be something about how we end up processing and learning from our bag of words, introducing some of the core Python tools for natural language processing and machine learning such as scikit-learn and Gensim.)
