A quick look back at work that got out the door in 2021. The big accomplishment for the year was finally completing my PhD thesis, the final title of which was “Party Control and its Effects on Factions, Media, and Citizens: The Case of Japan’s Liberal Democratic Party under the Second Abe Administration” – I’m very glad to have finally finished it and hope to update and adapt a few of the chapters into journal papers this year.

Journal Papers

  • “How populist attitude scales fail to capture support for populists in power” (with Sebastian Jungkunz and Airo Hino), in PLoS One.
    (What’s interesting about it: several different scales exist for measuring the extent of populist attitudes among citizens in survey research, each with its own strengths and weaknesses. However, testing the populism questions in the CSES surveys conducted in dozens of different nations reveals a potentially major problem – when the populist party is in power, the surveys no longer effectively measure their supporters’ populist attitudes. We argue that this happens because all of these scales assume that the anti-elite attitudes of populist voters are aimed at politicians – but when a populist is actually in power, the “elite” might instead mean the media, experts, business leaders, or some other group. This can cause a paradoxical situation where the supporters of a populist incumbent don’t actually seem to be populist according to surveys and opinion polls.)

  • 「日本におけるポピュリズムと陰謀論の信念」 (“Populism and Conspiracy Theory Beliefs in Japan”), in 日本世論調査協会報「よろん」 (Yoron, Journal of the Japanese Association for Public Opinion Research), 127号, pp. 11-21.
    (What’s interesting about it: this article is an update on progress in an ongoing project with Airo Hino, Sebastian Camatarri and Sebastian Jungkunz, in which we are analysing the conspiracy beliefs of Japanese citizens and the impact of those beliefs on their political behaviours and inclinations. This paper showed how conspiracy beliefs among citizens can be broadly divided into “classic” conspiracies (such as secret organisations controlling the world, alien cover-ups, or religious groups secretly controlling society) and “anti-government” conspiracies (such as governments hiding information, covering up their own criminal activity, or secretly testing new drugs and technologies on citizens), and that respondents’ level of belief in such conspiracies and their degree of populist attitudes are both connected to their party preferences in elections.)


2020 wasn’t exactly a great year for getting things done, but hey – some things got done nonetheless.

Journal Papers

  • “Covariance in diurnal patterns of suicide-related expressions on Twitter and recorded suicide deaths” (with Jeremy Boo and Michiko Ueda), in Social Science & Medicine.
    (What’s interesting about it: there’s a surprising lack of clarity about whether suicidal statements made on social media are actually connected in any way to suicide deaths – some people think the link is obvious, others think that kind of talk on social media is just casual or attention-seeking. We took a five-year sample of Twitter data and suicide death records and showed that, with a small time lag, the hour-by-hour pattern of suicide deaths was indeed correlated with the volume of Twitter traffic regarding suicidal ideation.)

  • “COVID-19, digital privacy, and the social limits on data-focused public health responses” (with Airo Hino), in International Journal of Information Management.
    (What’s interesting about it: this was a really early attempt to untangle some of the issues surrounding the use of digital tracking and contact tracing tools for managing the COVID-19 pandemic. It touches on a few different issues, ranging from privacy concerns (and how being cavalier about those concerns could sink the whole digital side of the pandemic response – which, in the end, it did in many countries) to the path-dependency created by the dominance of tech giants.)

Book Reviews

  • 「図書紹介:『内容分析の進め方 ーメディア・メッセージを読み解く』」、日本世論調査協会報「よろん」、126号、pp. 52-54. (“Book Review: ‘Analyzing Media Messages: Using Quantitative Content Analysis in Research’”, YORON: Journal of the Japanese Association for Public Opinion Research, Vol. 126, pp. 52-54)


Last time I wrote about preparing Japanese-language Twitter data for machine learning purposes, I said I’d likely come back to the topic and discuss different approaches to using this data to find out useful or interesting things.

I will. Eventually. I’m working on a number of interesting projects at Waseda at the moment, using several different machine learning approaches (some supervised, some unsupervised) to cut through the noise of Twitter data and find out (hopefully) relevant political or sociological things. I’ll write up some practical guides to those approaches as soon as I have time.

In the meanwhile, though, I’ve battle-tested a lot of the techniques outlined in the previous blog post, fixed some bugs and updated a handful of things to be faster or more effective. I think I’ve now got a pretty solid system in place for processing Japanese-language tweets and breaking them down into handy bags-of-words – without losing semantically important features like emoji or kaomoji along the way.

Rather than update the last blog post, I’ve bundled all of the various functions into a Python object that you can use in your own code, and uploaded it to Github. You can find the code here. It’s not the world’s most professional Python code and for now at least, you’ll need to just download the file and drop it into your project – I haven’t got around to turning it into a “real” Python package yet – but it will hopefully be useful to some people working on this kind of project nonetheless.
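Usage, once you’ve dropped the file into your project folder, should look something along these lines – to be clear, the module, class and method names below are placeholders of my own rather than the real ones, so check the actual file on Github:

from tweet_preprocessor import TweetPreprocessor   # hypothetical module / class names

preprocessor = TweetPreprocessor(keywords=['選挙'])   # optional project keywords
bag_of_words = preprocessor.tokenise(a_tweet)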

The UK’s Labour Party yesterday announced its manifesto for the upcoming general election. It’s a dramatically different vision to the path the Conservative party has laid out for the country, and unsurprisingly, it is the most left-wing manifesto the party has had since the rise of “New Labour” under Tony Blair in the late 1990s. Contrary to the scoffing of some on the right, however, it’s far from being a loony-leftie document that appeals only to the fringes. If anything, it’s an incredibly populist manifesto; the majority of its policies enjoy very broad support from the British electorate.

The right-wing press argues that the manifesto shows Labour, under Jeremy Corbyn, trying to return the UK to the “bad old days” of the 1970s (yes, that’s the same right-wing press that’s had a months-long priapism over notions like returning to Imperial weights and measures, or issuing blue-coloured passports; the existence of irony, it seems, is just another EU strategy to undermine Brexit). To some extent, they’re right; the manifesto does hark back to the pre-Thatcher years in parts, with policies aimed at undoing some of the more egregious mistakes of the neo-liberal policy regimes of the past 35 years. These include the ideologically motivated privatisation of a number of natural monopolies like public transport and energy, or the underhanded social engineering that saw council housing being sold off cheaply and never replaced, both of which date back to the early 1980s and are targeted in the manifesto.

Rather than arguing back and forth about the benefits of various different aspects of the manifesto, though, the point I want to make is that regardless of whether you consider these policies to be economically sensible or politically desirable, they’re undeniably popular. Opinion poll after opinion poll has shown – with margins that defy post-Brexit, post-Trump scepticism about polling – that the British public support the renationalisation of rail and other core services, want to see council housing stocks replenished, and favour the roll-back of the most extreme deregulations of the labour market, such as zero-hours contracts. If you go through the Labour manifesto line-by-line with British voters, you’ll find a strong majority in favour of pretty much every major policy in the document. The last manifesto to enjoy such a strong level of support was probably Blair’s in 1997 – a very different manifesto for a very different time.

Blair won the 1997 election in a historic landslide. Corbyn, for all that his policies resonate, is going to lose, and lose badly – likely handing Theresa May’s Conservatives a significantly boosted majority in the House of Commons, and perhaps losing key seats once seen as Labour strongholds. This is in spite of the fact that May’s Conservative policies are actually pretty unpopular; “Hard Brexit” is opposed by a plurality of the electorate, and some of her policies around things like education, the NHS and fox-hunting (yes, the fox-hunting debate is back) are opposed by a significant majority. It doesn’t matter; she’s going to win the most convincing Conservative electoral victory in a generation.

What this means, from a political science wonk perspective, is that a significant part of the British electorate is going to go out and vote for a party whose policies they disagree with. It flies in the face of certain fields of theory, which try to link the policy preferences of voters to their choices in elections, or to model the behaviour of candidates as principal-agent relationships – in which voters (principals) elect candidates as their “agents”, who go on to represent the policy interests of the voters in order to ensure future re-election. There’s more complexity to those models, but in essence they all assume the same fundamental thing – that voters have policy preferences, that they evaluate the distance between their own preferences and those of electoral candidates, and that they assess the candidates according to that measure. If you have an election in which a large portion of voters who prefer nationalisation, labour market protections and investment in social housing knowingly go out and elect candidates who want to privatise the NHS, deregulate labour markets and leave housing entirely in the private sector – well, something is up.

Specifically, what’s up is valence issues. You can broadly divide the issues of concern to voters in elections into two categories. The first category is position issues – these are issues on which parties, and voters, have divided views. Things like immigration policy, Brexit, nationalisation, labour market reforms and so on are position issues, because different voters and parties have different positions on these issues. Even where a majority of voters lean a certain direction (for example, about 80% of UK voters oppose a repeal of the fox-hunting ban), the existence of a minority who believe otherwise turns this into a position issue. We pay a lot of attention to position issues, because they fit neatly with a lot of fundamental theories about policy preferences. Perhaps more importantly, they also fit comfortably with most peoples’ basic understanding of how democracy is meant to work, and provide points of disagreement and debate which are interesting to follow as they unfold in newspapers and other media.

The second category of issue is valence issues. Valence issues are things on which the vast majority of people and parties actually agree. For example, “enhanced prosperity”, or “lower crime”, or “better education”, or “lower unemployment”; these are all things that just about every voter, and every political party, agrees to be positive. There’s lots of disagreement about how you achieve those things, of course, but fundamentally if you’re talking about issues of economic growth, human security and so on, you’re talking about a valence issue – something everyone wants to attain, regardless of where they fall on the political spectrum or how they feel about all the various position issues.

Jeremy Corbyn isn’t going to lose this election over position issues. On position issues, he’s good; the British electorate agrees with him, so much so that in an election where only the position issues mattered, he’d likely win the biggest majority Labour has ever held. You can imagine this in the form of a thought experiment; imagine a voting system where party and candidate names never appeared, and people simply selected their preferred policies, with their vote ultimately going to the party whose policies most closely match the voter’s. Assuming a kind of “veil of ignorance”, wherein voters could not guess which policies belonged to which party and thus couldn’t bias their selections according to party identification, Labour would win a huge majority this time out.

But Jeremy Corbyn is going to lose, because this election – like many in recent years – isn’t about position issues, it’s about valence issues. What valence issues boil down to is a simple question: given a core value that everyone agrees about, like “prosperity” or “security”, do you trust a given party or candidate to be able to deliver it? It’s not an assessment of policy, or a weighing of manifesto promises; it’s a simple, visceral and quite emotional choice of whether you think a person or a party has the competence to deliver the key social goods that a nation requires. Time and again in recent decades, we’ve seen electorates go to the polls, hold their noses, and vote for a party they fundamentally disagree with on many issues simply because they believe that that party is more competent and capable on the most fundamental issues of all, the valence issues.

Theresa May – for all that she has not been a particularly competent or capable leader, much as she was not particularly impressive as Home Secretary before – understands valence issues to a degree that Corbyn does not. While Corbyn has crafted policies on position issues which most of the UK electorate agrees with, May has focused entirely on projecting an image of strength and competence. She may be mocked for her constant and rather robotic delivery of her “strong and stable government” line, but it’s a good line; it speaks directly to the heart of the valence issues most people are basing their choices on. In fact, it’s rather hard to pin down the Conservatives’ exact policy positions on many things in this election, precisely because the whole party is running on valence. They’re avoiding talking about position issues, partially because they remain a party deeply divided on many of them, but mostly because their entire election pitch is that they’re a safe, competent pair of hands on the wheel, with little reference to where they’re actually planning on steering.

Look also at how right-wing media and politicians alike respond to Labour’s policies. Rather than presenting an alternative or a competing worldview, their attacks are always based on claims that Labour is being unrealistic, or living in a fantasy land; that no matter how much you may like Labour’s policies (because the right wing knows that Labour’s positions are more popular), Labour in general and Corbyn specifically are too incompetent, too chaotic and too risky to put into power.

That’s why the Labour manifesto, for all that it’s a great document, isn’t going to mean much of anything in the long run. The problem isn’t that it’s too left-wing or too radical; it’s pretty apparent that the British public is quite receptive to some radical policy prescriptions on key areas right now. Rather, the problem is that Labour under Corbyn has done little to make people feel like the party has the competence to execute those policies. While those of us following the Brexit negotiations closely may be dumbfounded by the lack of competence and professionalism being demonstrated by the Conservative leadership in this area, that’s not the story that’s filtering through to the majority of UK voters. For them, the Conservatives are a competent party with some distasteful policies – and they’ll vote for that over a chaotic, incompetent party with lovely policies any day.

How did Labour get here? The blame, ultimately, has to rest with Corbyn; he’s leader, and the buck stops there. Certainly, the failure of the party’s centrists to unite behind the leader (even after their coup attempt collapsed) is also a major factor, but if Corbyn had cultivated a personal popularity beyond core leftist support then even his ideological opponents would have fallen in line. The party is fractured not because Corbyn has a different ideology to many of the Blair-era MPs, but because Corbyn is an electoral liability to the party. His great failure, I think, is that he truly believes that politics is about putting out the right policies and creating a manifesto people agree with; he has neglected the actual role of a modern party leader, which involves building a personal image of competence and leadership, and being an electoral asset for your party members around the country.

You can blame the media’s coverage of Corbyn and Labour for that negative image, as many of the party faithful do, and there’s some merit to that; but in the age of SNS and new media, Corbyn has shown no aptitude for engaging with the public through alternative channels and effectively challenging the narratives of the right-wing press. Again, I think, the problem is that he wants to let his policies do the talking, not realising that most people will not cast their vote based on policies. That’s a miscalculation that’s likely going to cost Labour a great many seats next month – because the greatest manifesto in the world is meaningless if you don’t believe Jeremy Corbyn is capable of delivering on its promises.

In my last post about performing text analysis of Japanese language texts, I outlined how to install and use the MeCab system to break up Japanese sentences into their component parts (“tokens”) which can then be used for analysis. At the end of the post, I mentioned that if you’re using text sourced from social media like Twitter or Facebook, you might want to pre-process your data to deal with things like usernames, hashtags and URLs, which MeCab doesn’t understand or handle reliably.

Since then, I’ve spent some time building a tokenisation system to deal with very large volumes of data – the databases for the research project I’m working on at present add up to about 15 million Japanese language tweets, and we expect to end up dealing with many times that volume by the end of the project. What have I learned? Quite a few things (not least a lot of stuff about using Google Cloud Platform, since this all rapidly outgrew my humble laptop), but one of the main ones was this: Twitter data is a god-damned mess. It’s not just hashtags, URLs and so on; people also tweet a lot of emoji, which aren’t handled very well by a lot of analysis systems, and in Japan there’s also a tendency to tweet lots of “kaomoji” (you know, stuff that looks like “(。ŏ﹏ŏ)” or “((;,;;  ิ;;◞౪◟;; ิ;))”, and no, I don’t really know what that second one is meant to convey either…) as well as to express feelings with single characters in brackets, like (笑) meaning laughter, or (涙) implying crying, all of which can end up confusing the tagger.

A lot of conventional approaches to machine learning and text analysis just throw out those elements of the data. A common approach is to strip all punctuation, since it isn’t considered to have semantic meaning that’s useful to the machine learning system – but a kaomoji or a bracketed character clearly does have semantic meaning. The inclusion of a laughing kaomoji, or an emoji with a sweatdrop, or a bracketed character for crying, can radically alter the sentiment and meaning expressed by a tweet – in fact, they’re especially important on Twitter, where the 140 character limit means that people seek to find “economical” ways of expressing complex emotions and thoughts.

As a result, the tokenisation system I’ve built for our research is a fair bit more complex than I’d originally intended; it now strips out usernames, hashtags, URLs, emoji, kaomoji and bracketed characters from the data before passing it to MeCab to tokenise the remaining Japanese. There’s also a post-processing stage where I make sure that the keywords we used to build the data set (i.e. the Twitter search terms, which should appear in every single tweet in the data) are being tokenised as a single word, and not split up into separate words, as this could mess up analysis results further down the line. For the benefit of anyone trying to build something similar, this post will introduce all the systems I pass the tweets through in order to achieve this processing, in the order in which they’re done.


Finding Emoji in Tweets

Emoji – be they smiley faces, grinning turds or tiny depictions of churches and love hotels – have become ubiquitous on the Internet, but they turn out to be rather difficult to handle in text mining / machine learning approaches. In fact, some systems which don’t handle Emoji properly can end up making serious errors, as they not only misinterpret the emoji itself, but allow it to “pollute” their understanding of surrounding text characters too. MeCab doesn’t do a terrible job with Emoji, but frequently misinterprets them – so let’s find them and strip them out before passing the tweet over.

The problem is that this is a much harder task than it looks, because the standard for Emoji changes rapidly and isn’t simple. A certain number of ranges of characters in the Unicode standard (which is a system designed to create a standardised list of every character in every world language, thus ending garbled foreign language characters (文字化け) for good) are defined as being emoji – but the list isn’t fully agreed and is often updated. The most recent list I could find is from late 2016, and I’ve uploaded a copy of it here – feel free to download it and use it in your own project. The format it’s in is a Regular Expression (kind of a programming mini-language that allows you to do complex matching of characters and strings based on a set of conditions), and the way to use it in your Python program is as follows:

import re

# Read the emoji-matching regular expression from the file and compile it once.
with open('emoji_list.txt') as emojifile:
    emoji_regex = ''.join(emojifile.readlines()).strip()
emoji_finder = re.compile(emoji_regex)

Now you can use emoji_finder as follows:

some_emoji = emoji_finder.findall(a_tweet)

This will return a list of the emoji in the tweet. I suggest adding them to the master list of tokens, and deleting them from the tweet itself before moving on to the next step of the tokenisation process. This is what you’ll do at every step: add the elements you’ve extracted to the token list (along with a tag to say what kind of element they are, similar to the Part of Speech (POS) tagging that MeCab provides), then remove them from the tweet itself. Here’s a bit of sample code that does that:

some_tokens = []    # The master token list for this tweet.
for an_emoji in emoji_finder.findall(a_tweet):
    some_tokens.append((an_emoji, 'EMOJI'))
    a_tweet = a_tweet.replace(an_emoji, ' ')

Note that we’re replacing the emoji with a blank space, not deleting them entirely. This is deliberate; if the emoji was separating two words / sentences, i.e. the user was using it in place of punctuation, then shoving them back together could confuse MeCab and cause inaccurate tokenisation. If you’re building your own tokeniser, you’ll create a variation of the above function for every step along the way, so I won’t repeat the code for each one.
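If you’re doing that, it may be worth generalising the pattern into a small helper. Here’s a minimal sketch of one – the name extract_and_remove is my own invention for this post, not something from my actual codebase – which works for any of the regex-based steps:

def extract_and_remove(finder, tag, a_tweet, some_tokens):
    # Find every match for this step's pattern, tag it and add it to the
    # token list, then blank it out of the tweet before the next step runs.
    for a_match in finder.findall(a_tweet):
        some_tokens.append((a_match, tag))
        a_tweet = a_tweet.replace(a_match, ' ')
    return a_tweet, some_tokens

a_tweet, some_tokens = extract_and_remove(emoji_finder, 'EMOJI', a_tweet, some_tokens)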


Finding Usernames and Hashtags in Tweets

Now that we’ve stripped out the Emoji, we can handle the tasks dealing with “ordinary” unicode characters. First let’s do the easy ones – usernames (which begin with @ on Twitter) and hashtags (which begin with #).

some_usernames = re.findall("@([a-z0-9_]+)", a_tweet, re.I)

some_hashtags = re.findall("#([a-z0-9_]+)", a_tweet, re.I)

Again, each of these calls returns a list. Note that they strip off the @ and # marks, so you should add those back in when you’re using a_tweet.replace() to get rid of them from your tweet text.
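For instance, something like this (a sketch – the 'USERNAME' and 'HASHTAG' tags are my own labels, in the same spirit as 'EMOJI' above):

for a_username in some_usernames:
    some_tokens.append(('@' + a_username, 'USERNAME'))
    a_tweet = a_tweet.replace('@' + a_username, ' ')
for a_hashtag in some_hashtags:
    some_tokens.append(('#' + a_hashtag, 'HASHTAG'))
    a_tweet = a_tweet.replace('#' + a_hashtag, ' ')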


Finding URLs in Tweets

URLs have a number of consistent features, but they come in all sorts of shapes and sizes, and we need a system that effectively matches all of those and pulls them out of the tweet. The code below is a Python adaptation of a regular expression originally created by John Gruber, which is designed to match any kind of URL; it seems to do the job very effectively – I haven’t yet found any URLs it doesn’t match.

Don’t worry too much about what the regular expression actually does; this is very much one of those cases where there’s no shame in copy and pasting a complex piece of code that’s well tested but which you don’t fully understand… (Incidentally, though I’ve put the two commands together here, you should actually create your “url_finder” object at the outset and re-use it over and over again for every tweet, instead of running the re.compile() command each time.)

url_finder = re.compile(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')

# Note: because this pattern contains nested capture groups, findall() would
# return tuples of groups rather than plain URL strings - use finditer() and
# take the whole match instead.
some_urls = [a_match.group(0) for a_match in url_finder.finditer(a_tweet)]

Finding Bracketed Characters in a Tweet

Next up, we’ll locate all those bracketed characters like (爆) and (汗) that often pop up in Japanese tweets (I don’t know if these have a name?). These are sort-of like a special case of the Kaomoji we’ll discuss in a moment, so be sure to strip them out before you do the Kaomoji – as mentioned before, these processing steps are being presented in order, and doing them in a different order may create some odd results.

# Note: re_text is the same single-character class used in the kaomoji section
# below - it matches one alphanumeric, kana or kanji character.
re_text = '[0-9A-Za-zぁ-ヶ一-龠]'
bracket_finder = re.compile(r'[\((]' + re_text + r'[\))]')

some_brackets = bracket_finder.findall(a_tweet)

This will find the characters complete with their brackets, and will only work for single kanji (so it wouldn’t detect (爆笑) for example; but that’s a rare usage, and you start running into weird stuff like confusing this kind of character for people’s names being put in brackets etc.).
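Incidentally, if you adopted the extract_and_remove helper sketched earlier, this step (like the emoji step) collapses to a single call – 'BRACKETED' is, again, just a label of my own choosing:

a_tweet, some_tokens = extract_and_remove(bracket_finder, 'BRACKETED', a_tweet, some_tokens)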


Finding Kaomoji in a Tweet

This is the really hard one. There’s no set standard for Kaomoji, and new ones seem to be invented almost every day. I really struggled to come up with a way to tokenise these peculiar beasts, until I came across a research paper from 2015 by a pair of researchers at Meiji University, Kurosaki Yuta and Takagi Tomohiro. They wanted to conduct a sentiment analysis test on Kaomoji, which is interesting in itself, but the part of their research that was really useful to me is the regular expression they constructed for locating the Kaomoji in text. Below is my Python version of their regular expression.

re_text = '[0-9A-Za-zぁ-ヶ一-龠]'         # "real" text: alphanumerics, kana and kanji
re_nontext = '[^0-9A-Za-zぁ-ヶ一-龠]'     # anything that isn't "real" text
re_allowtext = '[ovっつ゜ニノ三二]'        # text characters that often form parts of faces
re_hwkana = '[ヲ-゚]'                     # half-width katakana
re_openbracket = r'[\(∩꒰(]'             # characters that can open a face
re_closebracket = r'[\)∩꒱)]'            # characters that can close a face
re_aroundface = '(?:' + re_nontext + '|' + re_allowtext + ')*'    # the "arms" around the face
re_face = '(?!(?:' + re_text + '|' + re_hwkana + '){3,}).{3,}'    # the face itself: 3+ chars, rejecting runs of plain text
kao_finder = re.compile(re_aroundface + re_openbracket + re_face + re_closebracket + re_aroundface)

some_kaomoji = kao_finder.findall(a_tweet)

This works really well, except for one thing; it doesn’t know how to handle tweets with more than one Kaomoji present, so if you have a tweet like 「おはようございます!b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ」, it will match the outside edges of the Kaomoji and then extract everything between them – so we get a list with a single entry like this: ['b(⌒o⌒)d 今日もいい天気じゃん!ヾ(〃^▽^)ノ'], rather than what we actually want, which is both Kaomoji separately: ['b(⌒o⌒)d', 'ヾ(〃^▽^)ノ'].

My solution to this is not terribly elegant, but it is effective in every case I’ve tried thus far; I wrote a function that recursively divides up the string into smaller and smaller elements, and checks to see if there’s an individual Kaomoji lurking in them. Here’s what it looks like, with a sample call to the function at the bottom:

def kaomoji_find(a_tweet, facelist=None):
    if facelist is None: facelist = []
    faces = kao_finder.findall(a_tweet)
    for kao in faces:
        if len(kao) > 10:    # A long match may actually be several kaomoji plus the text between them.
            if len(re.findall(re_text, kao)) > 4:
                # Too much plain text inside the match - look for a single kaomoji
                # in progressively smaller slices, then recurse on the remainder.
                firstthird = kao_finder.match(kao[:int(len(kao) / 3)])
                if firstthird is not None:
                    facelist.append(firstthird.group())
                    facelist = kaomoji_find(a_tweet.replace(firstthird.group(), ''), facelist)
                else:
                    firsthalf = kao_finder.match(kao[:int(len(kao) / 2)])
                    if firsthalf is not None:
                        facelist.append(firsthalf.group())
                        facelist = kaomoji_find(a_tweet.replace(firsthalf.group(), ''), facelist)
            else:
                facelist.append(kao)
        else:
            facelist.append(kao)
    return facelist

some_kaomoji = kaomoji_find(a_tweet)

Keeping Project Keywords Whole in Tokenisation Output

Once you’ve done all the above steps, you’re ready to feed the remainder of the tweet to MeCab for tokenisation, just like we did before; then you can stick the MeCab tokens and the tokens collected in the above steps all together to create the “bag of words” for this tweet. Remember, the order of the words doesn’t actually matter to Bag of Words approaches to machine learning, so it doesn’t matter how we stick the lists of tokens together.

There’s one final wrinkle to deal with, though. If you’ve compiled your data set by searching for a certain keyword, that keyword or phrase will appear in every tweet – and you’ll want to be sure that it’s always tokenised in the same way. You don’t want MeCab splitting up your keyword in unpredictable ways, because this can mess with various kinds of analysis that you might be doing with your bag of words further down the line.

One approach to fixing this would be to treat keywords like we treated emoji and kaomoji – stripping them out of the text before passing them to MeCab. Don’t do this! Tokenising algorithms like MeCab use the surrounding characters in text to figure out where word boundaries are most likely to be; they rely heavily on the context in which a word appears to figure out which word it is and even where the word begins and ends. By taking out your keyword, you’re mangling up a sentence and preventing MeCab from tokenising it correctly (and honestly, if you’re using Twitter data, there are going to be enough mangled, ungrammatical sentences in there to make MeCab have whole baskets of kittens anyway).

The solution, then, is to let MeCab tokenise the tweet with the keywords still intact, then check through its tokens to see if it’s split the keyword(s) up anywhere, and replace them with a complete, un-split version if so. Again, apologies for my messy Python, but here’s the function I created to accomplish that:

import MeCab

def find_tokens(tweet, keywords=None):
    if keywords is None: keywords = []
    mt = MeCab.Tagger("-d /usr/lib/mecab/dic/mecab-ipadic-neologd")
    mt.parse('')   # Who knows why this is required but it seems to fix UnicodeDecodeError appearing randomly.
    parsed = mt.parseToNode(tweet)
    components = []
    while parsed:
        if parsed.surface != '' and parsed.feature.split(',')[0] != "記号":
            components.append((parsed.surface, parsed.feature.split(',')[0]))
        parsed = parsed.next
    for a_keyword in keywords:
        cindex = 0
        while True:
            if cindex >= len(components):
                break
            temp_key = a_keyword
            if components[cindex][0] == temp_key:      # If the keyword is already tagged as one item, no problem.
                cindex += 1
                continue
            elif components[cindex][0] == temp_key[:len(components[cindex][0])]:  # We just matched the start of a keyword.
                match = False
                tempindex = cindex
                temp_key = temp_key.replace(components[tempindex][0], '', 1)
                while True:
                    tempindex += 1
                    if tempindex >= len(components): 
                        break
                    else:               # Test next element.
                        if components[tempindex][0] == temp_key[:len(components[tempindex][0])]:  
                            temp_key = temp_key.replace(components[tempindex][0], '', 1)
                            if temp_key == '':
                                match = True
                                break
                            else:
                                continue
                        else:
                            break
                if match:
                    components[cindex] = (a_keyword, 'PROJECT_KEYWORD')
                    del components[cindex+1:tempindex+1]      
                cindex += 1
                continue
            else:
                cindex += 1     # This component doesn't match the start of a keyword, so continue.
                continue

    return components

A few notes on the above code. Firstly, it works like the other functions in this post – pass it a tweet, and it’ll pass you back a list of tokens – but it also allows you to optionally give it a list of keywords which it will check through and make sure they’re tokenised as a single item.

This version of the code also throws out all punctuation and whitespace (that’s the parsed.feature.split(',')[0] != "記号" part). I figured since we’ve extracted kaomoji etc., we can live without the remaining punctuation – it’s unlikely to be of value to analysis. If you have a different set of circumstances or requirements, you can remove that part of the code to hang on to punctuation tokens. Finally, this code doesn’t just output a set of tokens, it outputs a list of tuples in the form (a_token, part_of_speech) – with the part_of_speech bit being something like 名詞 or 動詞, indicating what kind of word MeCab reckons this is. For some analysis tasks, it can be useful to do something like excluding particles (助詞) or auxiliary verbs (助動詞) – again, this really depends what you’re trying to learn from your text.
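To make the overall flow concrete, here’s a minimal sketch of how the pieces in this post might be chained together into a single function – tokenise_tweet and the tag labels are my own placeholder names, the ordering follows the steps as presented above, and it assumes the finders, the extract_and_remove helper and find_tokens() have all been defined as shown:

def tokenise_tweet(a_tweet, keywords=None):
    # Run one tweet through every extraction step in order, then hand the
    # remaining text to MeCab and return the combined bag of tokens.
    some_tokens = []
    a_tweet, some_tokens = extract_and_remove(emoji_finder, 'EMOJI', a_tweet, some_tokens)
    for a_username in re.findall("@([a-z0-9_]+)", a_tweet, re.I):
        some_tokens.append(('@' + a_username, 'USERNAME'))
        a_tweet = a_tweet.replace('@' + a_username, ' ')
    for a_hashtag in re.findall("#([a-z0-9_]+)", a_tweet, re.I):
        some_tokens.append(('#' + a_hashtag, 'HASHTAG'))
        a_tweet = a_tweet.replace('#' + a_hashtag, ' ')
    for a_match in url_finder.finditer(a_tweet):
        some_tokens.append((a_match.group(0), 'URL'))
        a_tweet = a_tweet.replace(a_match.group(0), ' ')
    a_tweet, some_tokens = extract_and_remove(bracket_finder, 'BRACKETED', a_tweet, some_tokens)
    for a_kao in kaomoji_find(a_tweet):    # kaomoji need the recursive finder, not a plain findall
        some_tokens.append((a_kao, 'KAOMOJI'))
        a_tweet = a_tweet.replace(a_kao, ' ')
    return some_tokens + find_tokens(a_tweet, keywords)

some_tokens = tokenise_tweet(a_tweet, keywords=['選挙'])   # '選挙' is just an example keyword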


Next Steps

And that’s it! Combined with the MeCab instructions in the previous post, that’s pretty much the full set of components you need to build an effective bag-of-words representation of a Japanese language social media post. It’s obviously a lot more complex than tokenisation for a “traditional” piece of text like a newspaper article, simply because people use text in unusual and non-traditional ways on social media. At some point I intend to test whether there’s a major difference in, for example, the sentiment analysis results you get from using a normal bag of words versus my improved, pre-processed bag of words; I suspect there should be a measurable difference, because we’re saving a number of elements with significant relevance to sentiment, such as kaomoji, that would be thrown out by a traditional bag-of-words processor. I’ll have to get the rest of our tool pipeline up and running before I can run a side-by-side test, though.

(Next post in this intermittent series will likely be something about how we end up processing and learning from our bag of words, introducing some of the core Python tools for natural language processing and machine learning such as scikit-learn and Gensim.)