Japanese Text Analysis in Python

This is a more technical post than I usually write, but it will be useful to some people. My political science research involves some natural language processing and machine learning, which I use to analyse texts from Japanese newspapers and social media – so one of the challenges is teaching a computer to “read” Japanese. Luckily, there are some tools out there which make this (relatively) straightforward.

For this guide, I’m using Python 3.5. My development system runs on macOS 10.12 and I deploy my code to a server running Ubuntu 16.04, so this guide will include commands for setting up the software on both macOS and Linux/Ubuntu. I don’t use any version of Windows so I don’t know how to set this up on Windows – if anyone can provide a set of Windows commands that achieve the same goal, drop me an email and I’ll add them to the blog post.

What are we trying to accomplish?

The first hurdle to doing any analysis of Japanese text is segmentation, or tokenisation; breaking it down into usable chunks. For European languages, words have spaces between them, so you can just divide everything up at the word boundaries (the spaces, commas, full stops and so on), which yields an array of words that we can use to calculate things like frequencies or co-occurrence matrices. Japanese, however, has no spaces in its text, so there’s an extra pre-processing step required before we can start using these text analysis approaches.
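To make the problem concrete, here is a minimal sketch (the English sentence is just an illustrative example of mine) showing that a plain whitespace split, which is a passable first step for English, does nothing useful for Japanese:

```python
# A naive whitespace split: a crude but workable first step for English...
english = "The weather is nice today"
english_tokens = english.split()

# ...but Japanese text contains no spaces, so the same call returns the
# whole string as a single "token".
japanese = "今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"
japanese_tokens = japanese.split()

print(english_tokens)
print(len(japanese_tokens))
```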

In essence, we want to turn a string like this…

"今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"

… into an array like this…


["今日", "は", "いい", "天気", "です", "ね", "遊び", "に", "行か", "ない", "新宿", "で", "祭り", "が", "ある"]

… which a computer can process to figure out frequencies, co-occurrences and so on.
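As a rough illustration of why the array form is so useful, the sketch below counts token frequencies and adjacent-token pairs with nothing but the standard library. Counting adjacent pairs is only a simple stand-in for a proper co-occurrence matrix, and the variable names are my own.

```python
from collections import Counter

# The segmented example sentence from above.
tokens = ["今日", "は", "いい", "天気", "です", "ね", "遊び", "に",
          "行か", "ない", "新宿", "で", "祭り", "が", "ある"]

# How often does each token occur?
frequencies = Counter(tokens)

# How often does each pair of neighbouring tokens occur? This is a very
# simple proxy for co-occurrence within a window of two tokens.
adjacent_pairs = Counter(zip(tokens, tokens[1:]))

print(frequencies["天気"])
print(adjacent_pairs[("天気", "です")])
```

With a single short sentence every count is trivially small, but the same two lines work unchanged on millions of tokens.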

A Software Shopping List

That’s the objective. Here’s a quick list of what we’re going to use to get there. (The URLs are for completeness only; we’ll be downloading everything from the command line.)

System Software:
MeCab

Dictionaries:
MeCab-ipadic
MeCab-ipadic-neologd

Python Libraries:
mecab-python3

This is not a definitive list of every piece of software that can segment Japanese text. There’s a big range out there, from the lightweight tinysegmenter through to fully featured software like kuromoji. I’m using this setup for two reasons. First, in my experience it’s the best at handling text data sourced from the Internet. Second, I think it’s the best trade-off of simplicity versus accuracy. tinysegmenter is much easier to set up and use, but its output is unreliable; it often breaks apart words that are actually common phrases or proper names (the name of the present Japanese prime minister, 安倍晋三, is rendered by tinysegmenter as [“安倍”, “晋”, “三”], not the correct [“安倍”, “晋三”] or [“安倍晋三”]; the same problem occurs with the word for prime minister itself, 総理大臣, which comes out as [“総理”, “大臣”] when what you (probably) want is [“総理大臣”]). MeCab works nicely with Python, and it’s easy to set it up with an extensive dictionary of common phrases and neologisms so its output is very accurate.

One other thing; if you’re following this tutorial on macOS, I expect you to have a little bit of familiarity with the Terminal. If you don’t know your way around Terminal at all, your homework is to go and install the “Homebrew” package; we’ll be using it to install a lot of the rest of the software. There are detailed instructions on the homepage and once you’ve done that, you’ll be able to get cracking with installing MeCab and its dictionaries.

Installing MeCab

First, let’s get MeCab (the core segmentation software) and MeCab-ipadic up and running. (Again, for macOS users, this assumes that you have successfully installed the Homebrew package.)

macOS:

brew install mecab
brew install mecab-ipadic

Ubuntu:

sudo apt-get install mecab mecab-ipadic libmecab-dev
sudo apt-get install mecab-ipadic-utf8

That’ll take a while, but once it’s done, MeCab should be up and running. You can test it by typing “mecab” at the terminal; in the blank line it gives you afterwards, type some Japanese text, and press enter. The result should look like this:

$ mecab
無事にインストール出来ました !
無事 名詞,形容動詞語幹,*,*,*,*,無事,ブジ,ブジ
に 助詞,副詞化,*,*,*,*,に,ニ,ニ
インストール 名詞,一般,*,*,*,*,インストール,インストール,インストール
出来 動詞,自立,*,*,一段,連用形,出来る,デキ,デキ
まし 助動詞,*,*,*,特殊・マス,連用形,ます,マシ,マシ
た 助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
! 記号,一般,*,*,*,*,!,!,!
EOS

As you can see, it’s working nicely and correctly identifying parts of speech in this test sentence. Press Ctrl-C to get back to the terminal command line, and let’s continue.

The next step is getting mecab-ipadic-neologd working. This is a bit more complex, since we need to download the dictionary of neologisms and slang (vital for handling text from the Internet) and then recompile it for MeCab. First, we install the tools used to download (“clone”) the most recent version of the dictionary, then we compile and install the dictionary itself.

macOS:

brew install git curl xz
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n

Ubuntu:

sudo apt-get install git curl
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
sudo ./bin/install-mecab-ipadic-neologd -n

You’ll need to type “yes” at some point in the install process to confirm that you’re okay with overwriting some dictionary defaults. Now let’s check that it worked. To use this dictionary at the command line, we need to specify it when we invoke MeCab:

macOS:

mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd/

Ubuntu:

mecab -d /usr/lib/mecab/dic/mecab-ipadic-neologd/

(Note the different paths – if you’re on Ubuntu you’ll also need to change the path given in the Python files below to match.)

If you can type some Japanese text and have it tokenised (as in the previous example), then everything is working.
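Because the dictionary lives in a different place on macOS and Ubuntu, one option for code that has to run on both is a small helper that picks whichever path exists. This is only a sketch: the helper names are my own, the candidate list contains just the two paths used in this guide, and other setups may well install the dictionary elsewhere.

```python
import os

# The two install locations used in this guide; extend the list if your
# system puts the dictionary somewhere else.
NEOLOGD_CANDIDATES = [
    "/usr/local/lib/mecab/dic/mecab-ipadic-neologd",  # macOS (Homebrew)
    "/usr/lib/mecab/dic/mecab-ipadic-neologd",        # Ubuntu
]


def find_dictionary(candidates=NEOLOGD_CANDIDATES):
    """Return the first candidate directory that exists, or None."""
    for path in candidates:
        if os.path.isdir(path):
            return path
    return None


def tagger_argument(candidates=NEOLOGD_CANDIDATES):
    """Build the "-d <path>" option string for MeCab.

    Returns an empty string if no candidate exists, which leaves MeCab
    on its default dictionary.
    """
    path = find_dictionary(candidates)
    return "-d " + path if path else ""


print(tagger_argument())
```

The resulting string can be passed to MeCab.Tagger in place of a hard-coded path.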

Working with Python

The next step is to get MeCab talking to Python. Again, this tutorial assumes you’re using Python 3 (I’m on 3.5, but any recent version of Python should be fine); there is also a library for Python 2, but I haven’t used or installed it. If you’re an advanced Python user and are using a virtual environment setup for your packages, you should switch to that environment now. (All commands from here onwards should work the same on both macOS and Ubuntu.)

At the terminal, type:

pip3 install mecab-python3

(The command may be “pip”, not “pip3”, on your system, but a lot of systems use Python 2 internally for various things, and therefore rename the Python 3 tools with a “3” suffix to avoid clashes. Also, depending on your setup, you may need to “sudo” that command on Ubuntu.)

Now open your Python editor, whether that’s an IDE (I use PyCharm personally) or just a text window, and let’s see if this is working. Try the following code:

import MeCab
test = "今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"
mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")
print(mt.parse(test))

When you run this code, you should see the same kind of output you got at the Terminal earlier: one token per line, each followed by its part-of-speech details. Okay, so Python is now talking to MeCab, but how can we turn that output into something useful that we can use in our analysis?

import MeCab
test = "今日はいい天気ですね。遊びに行かない?新宿で祭りがある!"
mt = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd")

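# parseToNode returns the first node in a linked chain of tokens
parsed = mt.parseToNode(test)
components = []
while parsed:
    # .surface holds the text of the token itself
    components.append(parsed.surface)
    # move on to the next token; the chain ends with None
    parsed = parsed.next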

print(components)

This will give you the following output…


['', '今日', 'は', 'いい', '天気', 'です', 'ね', '。', '遊び', 'に', '行か', 'ない', '?', '新宿', 'で', '祭り', 'が', 'ある', '!', '']

… which is what we’ve been aiming for from the outset! The parseToNode function of the Tagger returns the first node in a chain of tokens, and following each node’s “.next” attribute steps through every token in the text. The actual token is accessed as “.surface”; if you want to see details of the part-of-speech or pronunciation, you can access that as “.feature”.
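To show how “.feature” can be used, here is a small sketch that walks a node chain and pairs each token with its top-level part of speech (the first comma-separated field, as in the MeCab output shown earlier). The function name is my own, and the stand-in nodes are hypothetical objects with illustrative ipadic-style values so that the example runs without MeCab installed; in real use you would pass in the result of mt.parseToNode(text).

```python
from types import SimpleNamespace


def tokens_with_pos(node):
    """Walk a MeCab-style node chain, returning (surface, part_of_speech) pairs."""
    pairs = []
    while node:
        # .feature is a comma-separated string; its first field is the
        # top-level part of speech (e.g. 名詞 for nouns).
        pos = node.feature.split(",")[0]
        pairs.append((node.surface, pos))
        node = node.next
    return pairs


# Stand-in nodes (illustrative values only) mimicking the attributes used
# above: .surface, .feature and .next.
end = SimpleNamespace(surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*", next=None)
wa = SimpleNamespace(surface="は", feature="助詞,係助詞,*,*,*,*,は,ハ,ワ", next=end)
kyou = SimpleNamespace(surface="今日", feature="名詞,副詞可能,*,*,*,*,今日,キョウ,キョー", next=wa)
start = SimpleNamespace(surface="", feature="BOS/EOS,*,*,*,*,*,*,*,*", next=kyou)

print(tokens_with_pos(start))
```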

You’ve now got a system that will take Japanese text input (like a tweet, a Facebook post, or a blog entry) and turn it into a list of tokens. You can then apply exactly the same techniques to those tokens that you would to (more easily segmented) European languages.

A Few Cautionary Words

Before turning this system loose on a large volume of tweets or other social media data, I would strongly suggest writing some code to clean up your input – in particular, code to strip out and separately handle URLs, image links and (on Twitter) @mentions. MeCab doesn’t handle these well – it breaks up URLs into little chunks which will throw off a lot of machine learning algorithms, for example.

One suggestion (and the one I use in my own research) is to strip non-text elements from the text before handing it over to MeCab, and then add things like URLs and images back into the list of tokens that MeCab returns to you – either as the full URL (ideally the non-shortened version) and @name, or as a token referencing a table that you can look up later (“URL1248”, “USER2234”).

Also, if you’re planning on analysing hashtags, note that MeCab doesn’t understand those either (it’ll tokenise the # and the text separately), so you may also want to pre-process those.
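The pre-processing described above might look something like the sketch below. It is not the code I use in my research: the regular expressions are deliberately rough, the function names are my own, and the URL and username in the sample are made up. It pulls URLs, @mentions and hashtags out of the text before tokenisation and builds a placeholder table for them.

```python
import re

# Rough patterns, checked only against simple examples; real social media
# data will need more careful rules.
URL_RE = re.compile(r"https?://[\w\-./?%&=#~+:@]+", re.ASCII)
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")
HASHTAG_RE = re.compile(r"[#＃]\w+")

# URLs go first because they can themselves contain "#" and "@".
PATTERNS = [("URL", URL_RE), ("USER", MENTION_RE), ("HASHTAG", HASHTAG_RE)]


def strip_entities(text):
    """Return (cleaned_text, entities); entities is a list of (kind, value)."""
    entities = []
    for kind, pattern in PATTERNS:
        entities.extend((kind, found) for found in pattern.findall(text))
        text = pattern.sub(" ", text)
    # Collapse the gaps left behind by the removed elements.
    return re.sub(r"\s+", " ", text).strip(), entities


def placeholder_table(entities):
    """Map placeholder tokens such as URL1 or USER2 to the original values."""
    table = {}
    for number, (kind, value) in enumerate(entities, start=1):
        table["{}{}".format(kind, number)] = value
    return table


# Made-up sample; example.com and @rob_f are placeholders.
sample = "新宿で祭りがある! https://example.com/a?b=1 @rob_f #祭り"
clean_text, entities = strip_entities(sample)
table = placeholder_table(entities)
print(clean_text)
print(table)
```

The cleaned text goes to MeCab, and the keys of the table can then be appended to the resulting token list. Note that the entities come back grouped by kind rather than in their original order in the text.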

Finally, it’s quite likely that you’ll want to strip some elements out of the list MeCab returns to you as well. It has a habit of returning empty elements at the beginning and end of text, which you can remove; depending on the analysis you’re conducting, you may also wish to strip out punctuation, or even certain parts of speech (in which case you should check the “.feature” segment of each node to decide whether to keep or dispose of each given token).
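A minimal sketch of that clean-up step follows, assuming you have already collected (surface, feature) pairs from the nodes. The function name and the sample pairs are illustrative; 記号 is the part-of-speech label MeCab gave the exclamation mark in the test output earlier, so it is used here to drop punctuation.

```python
def filter_tokens(pairs, drop_pos=("記号",)):
    """Keep the surfaces worth analysing from (surface, feature) pairs.

    Empty surfaces (the markers at the start and end of the text) are
    always dropped, as is any token whose top-level part of speech is
    listed in drop_pos.
    """
    kept = []
    for surface, feature in pairs:
        if not surface:
            continue
        if feature.split(",")[0] in drop_pos:
            continue
        kept.append(surface)
    return kept


# Illustrative pairs in the same style as MeCab's output.
pairs = [
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
    ("天気", "名詞,一般,*,*,*,*,天気,テンキ,テンキ"),
    ("です", "助動詞,*,*,*,特殊・デス,基本形,です,デス,デス"),
    ("。", "記号,句点,*,*,*,*,。,。,。"),
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
]

print(filter_tokens(pairs))
print(filter_tokens(pairs, drop_pos=("記号", "助動詞")))
```

Passing a longer drop_pos tuple is one way to remove whole parts of speech, such as auxiliary verbs, when they add nothing to your analysis.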

Control, Freedom and the End of China’s Boom

Yesterday’s New York Times featured a well-written and quite balanced article looking back over eight years of a China-based correspondent’s experience of the country – “Notes on the China I’m Leaving Behind” (hat tip to Peng Jingchao for the link). Ignoring the uncomfortably convenient anecdote in the last paragraph, the author’s description of the evolution of Chinese society in the 21st century is one that strikes a lot of interesting notes.

I work with two fantastic researchers who are investigating the Chinese media and the systems through which control is exerted over media narratives at a state and regional level in the country. In essence, they are trying to lay bare the cogwheels and levers that the New York Times piece hints at – the mechanisms that allow the shaping of narratives and belief systems, even while encouraging the outward appearance of more freedom, more marketisation and more democracy. It’s tricky research; most of these systems are not formal or legislative, but conducted through mutual understandings, through winks and nods and carefully coded speech, and can only be uncovered by looking at the fine detail of the outcome in the form of actual reporting of events across the country’s various media outlets.

What I’ve gleaned from watching their work progress, and from talking to other researchers who engage with Chinese social media and the control of information on China’s separate, mirror-world version of the Internet, is a sense of just what an extraordinary and darkly impressive enterprise the Chinese government is presently engaged with. It is committed to market capitalism, to economic growth (at almost any cost) and to the advancement of living standards and growth of the middle class; it is also committed to keeping the Chinese Communist Party firmly in control of the nation, and as such, its objective is to decouple democracy from capitalism, severing economic freedom from political freedom. In a philosophical sense, what China is doing right now is an utter repudiation of the beliefs that underpinned the West during the Cold War; by advancing capitalism without democracy, markets without freedom, China would prove that these things were never inextricably linked, that one can happily thrive without the other.

That’s not exactly news, of course. Countries like Singapore – which, I suspect, China’s leaders have viewed as a hugely instructive model – have effectively managed to combine fantastic economic growth and high standards of living with deeply undemocratic regimes for many years. They provide just enough of the trappings of democracy to keep international relationships nice and smooth (democratic countries often make uncomfortable noises when dealing with undemocratic states) and to allow their comfortable middle classes, enjoying the benefits of economic growth, to dismiss the complaints or unrest of less-advantaged groups as mere “troublemaking”. China is this socio-political experiment writ large upon the canvas of the world’s largest state; an attempt to generalise the model successfully implemented by the ruling elites of Singapore and elsewhere upon a population of well over a billion people. Its tools in this enterprise range from the blunt force of arrested and imprisoned activists to altogether more subtle and powerful techniques of information control – through education, through media and, increasingly, through the very Internet tools that activists so often lionise as harbingers of democracy.

Buried in the New York Times article is, I think, the most important truth about this whole process – that the Chinese authorities are afraid, primarily, of one thing, namely the Chinese people. In discourses about China, at least those taking place outside China (and especially here in Japan, a country which by and large doesn’t know quite what to think of the huge, vastly important neighbour with whom it shares such a complex and contested history), there’s a tendency to emphasise China’s external relationships. A great deal of focus is placed upon territorial disputes with Japan, Vietnam and the Philippines, on the curious and difficult relationship with Taiwan and on the complex, inter-dependent and occasionally belligerent jostling for power with its fellow superpower, the USA. When people talk about internal relationships in China, they talk of Tibet or the Uighur people, about the contested status of Hong Kong, or about the treatment of prominent activists like Ai Weiwei. I’m not a China specialist by trade (though as mentioned, I work with several), but I can’t help but feel that these foci miss the point; they’re chosen through the lens of what people outside China care about, and miss the reality of what people within the country, and people within the government and the CCP, care about.

I’d contend that the most important relationship within China, the one that really matters, is none of those listed above – it’s the relationship between the Chinese Communist Party and the huge, burgeoning Chinese middle class. These are the people who have benefitted from China’s economic growth, who enjoy a quality of life undreamed of by their parents or grandparents and who are deeply proud of China’s rise in the world (but who also still tend to see China as being bullied, disrespected or put down by powerful rivals like the USA and Japan). They are also a generation far more educated than their parents’ generation, far more exposed to global influences – and thus far less likely to accept the “elites know best”, top-down rule of the CCP. If the CCP ever loses power, it will not be because of Tibetans, Uighurs, human rights activists or interventions from its neighbours; it will be because the Chinese middle class demands democracy en masse, either violently or otherwise.

Right now, that isn’t happening. The majority of the China experts I speak to see no deep wellspring of democratic sentiment, no silent majority wishing for democratic freedom. They see a middle class that’s far more interested in its freedom to consume than in its freedom to vote; a population lifted in the space of a single generation from rural poverty to urban comfort. They have flat-screen TVs, smartphones, cars; they take holidays abroad, eat well, often consuming exotic food their parents would never have tasted, buy consumer goods and electronics, and each year sees an incremental increase in quality of living which, in almost any country you care to mention, easily quenches any thirst for democratic freedom. When life is so materially better today than it was ten years ago, why risk it all by speaking out for something so abstract, so removed from your own daily existence, as democracy?

Why, then, are China’s elites afraid? Because sustaining power through economic growth can’t be done indefinitely. Economies slow down or go into recession, and the meteoric growth of a country transitioning from a rural, agricultural economy to an urban, high-tech economy is largely an exercise in picking low-hanging fruit. Giving apartments, cars and TVs to a billion people who didn’t have them before is pretty heady stuff, economically, but as the rest of the developed world has been discovering for the past decade or more, eking out growth becomes a hell of a lot harder once all that low-hanging fruit is picked. China, too, is slowing down; it has announced much lower growth numbers over the past year than in previous years, and many good economists even question those numbers, suspecting that the figures are being artificially inflated to keep things looking good. If growth stalls or, worse, starts to go backwards, it will create two major sources of unrest within China – firstly, those still in poverty who have been anticipating that economic growth will reach them eventually, but now fear that they have ended up on the wrong side of a permanent socio-economic cleavage within the country; and secondly, the new middle classes, who have become accustomed to rapid improvements in their quality of living and now find this movement stalled. Oddly, history shows that it could be the second group who are most dangerous to China’s authorities; a concept called “revolution of rising expectations” emerged in the 1950s (though Alexis de Tocqueville had explored similar ideas as far back as the mid-1800s) which showed empirically that it’s not the impoverished and hopeless who rebel against governments, it’s the segments of society that have seen rising standards of living and abruptly find their raised expectations unfulfilled.

It is inevitable that this will happen in China – which is why the authorities are so determined to explore every other avenue of control available to them, before the economic honeymoon period comes to an end. This is the most powerful and important motivator of the behaviour of the Chinese state right now, and I don’t think it’s extreme to say that almost every single action taken by the Chinese state can be read and understood in the context of this desire to control its own middle class. Its international disputes – over mere rocks and reefs in the South and East China Seas – are often explored in economic or military terms, but are arguably far more important in internal propaganda terms, by setting up minor conflicts with Japan, America and their allies which can be easily exploited for nationalist propaganda purposes. The suppression of activists and the international condemnation it attracts is played as the cultural imperialism of the West directed against China and its system of values. Trade agreements like the Trans-Pacific Partnership are presented (not entirely unfairly) as evidence of China’s rivals trying to contain its economic growth; and of course, historic disputes with Japan, over everything from Nanjing to Yasukuni and back again, are used to stoke nationalist sentiments and give the Chinese people a sense of facing a common enemy (a strategy which Japan’s own hapless nationalists fall over themselves to enable, over and over again, like particularly unintelligent dogs in a Pavlovian experiment).

Not everything that the Chinese authorities are doing to secure their position in the post-boom years is bad, of course. The country’s economic growth has vastly improved the standards of living of hundreds of millions of people. The nation’s lack of democracy shouldn’t disguise its extraordinary achievements; the sheer number of people who have jumped in a single generation from village peasant lives barely changed since medieval times to being urban, college-educated professionals is staggering and hugely impressive. Nothing has moved the needle on the world’s problems with poverty in the past decade as much as China’s advancement. Under Xi Jinping, the country has also started to tackle the political corruption that was endemic at local levels, an effort largely designed to stamp out a likely source of future unrest in the Chinese people.

It’s anyone’s guess whether any of this – the information control, the stoking of nationalist fires, the careful shrouding of the harsh machinery of totalitarianism in the soft language of democracy and freedom, or even the laudable crackdown on corruption – will count for anything when China’s economic growth finally stalls badly enough for its middle classes to feel the pinch. But this is the context in which we need to read what’s happening in China today. The authorities know that the stability and security of their position has enjoyed a blessed existence under the protection of the country’s economic growth, but they see the end of that protection in sight. Extending economic growth is a priority, of course, but building the structures that will protect their position in a post-growth world is the motivation that drives China’s authorities today – and this is the only analytical lens that makes sense of the country’s actions towards its neighbours, its trading partners and its own people.