{"id":1085,"date":"2017-08-07T15:47:11","date_gmt":"2017-08-07T06:47:11","guid":{"rendered":"http:\/\/www.robfahey.co.uk\/blog\/?p=1085"},"modified":"2017-08-07T15:47:11","modified_gmt":"2017-08-07T06:47:11","slug":"tokenising-japanese-tweets-python","status":"publish","type":"post","link":"http:\/\/www.robfahey.co.uk\/blog\/tokenising-japanese-tweets-python\/","title":{"rendered":"A Python package for tokenising Japanese-language tweets"},"content":{"rendered":"<p>Last time I wrote about <a href=\"http:\/\/www.robfahey.co.uk\/blog\/tidying-japanese-sns-data-machine-learning\/\">preparing Japanese-language Twitter data for machine learning purposes<\/a>, I said I&#8217;d likely come back to the topic and discuss different approaches to using this data to find out useful or interesting things.<\/p>\n<p>I will. Eventually. I&#8217;m working on a number of interesting projects at Waseda at the moment, using several different machine learning approaches (some supervised, some unsupervised) to cut through the noise of Twitter data and find out (hopefully) relevant political or sociological things. I&#8217;ll write up some practical guides to those approaches as soon as I have time.<\/p>\n<p>In the meantime, though, I&#8217;ve battle-tested a lot of the techniques outlined in the previous blog post, fixed some bugs and updated a handful of things to be faster or more effective. I think I&#8217;ve now got a pretty solid system in place for processing Japanese-language tweets and breaking them down into handy bags-of-words &#8211;\u00a0<em>without<\/em> losing semantically important features like emoji or kaomoji along the way.<\/p>\n<p>Rather than update the last blog post, I&#8217;ve bundled all of the various functions into a Python object that you can use in your own code, and uploaded it to GitHub. <a href=\"https:\/\/github.com\/robfahey\/ja_tokeniser\">You can find the code here<\/a>. 
It&#8217;s not the world&#8217;s most professional Python code and, for now at least, you&#8217;ll need to just download the file and drop it into your project &#8211; I haven&#8217;t got around to turning it into a &#8220;real&#8221; Python package yet &#8211; but it will hopefully be useful to some people working on this kind of project nonetheless.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Last time I wrote about preparing Japanese-language Twitter data for machine learning purposes, I said I&#8217;d likely come back to the topic and discuss different approaches to using this data to find out useful or interesting things. I will. Eventually. I&#8217;m working on a number of interesting projects at Waseda at the moment, using several &hellip;<\/p>\n<p><a href=\"http:\/\/www.robfahey.co.uk\/blog\/tokenising-japanese-tweets-python\/\" class=\"more-link\">Read More<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[296,294,295],"tags":[313,309,300,298,297,22,310],"class_list":["post-1085","post","type-post","status-publish","format-standard","hentry","category-nlp","category-programming","category-python","tag-ja_tokeniser","tag-machine-learning","tag-mecab","tag-natural-language-processing","tag-python","tag-twitter","tag-310"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p71QYy-hv","jetpack-related-posts":[{"id":1012,"url":"http:\/\/www.robfahey.co.uk\/blog\/tidying-japanese-sns-data-machine-learning\/","url_meta":{"origin":1085
,"position":0},"title":"Tidying up Japanese SNS data for Machine Learning","author":"Rob Fahey","date":"19\/04\/2017","format":false,"excerpt":"In my last post about performing text analysis of Japanese language texts, I outlined how to install and use the MeCab system to break up Japanese sentences into their component parts (\"tokens\") which can then be used for analysis. At the end of the post, I mentioned that if you're\u2026","rel":"","context":"In &quot;NLP&quot;","block_context":{"text":"NLP","link":"http:\/\/www.robfahey.co.uk\/blog\/category\/nlp\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":564,"url":"http:\/\/www.robfahey.co.uk\/blog\/japanese-text-analysis-in-python\/","url_meta":{"origin":1085,"position":1},"title":"Japanese Text Analysis in Python","author":"Rob Fahey","date":"02\/12\/2016","format":false,"excerpt":"This is a more technical post than I usually write, but it will\u00a0be useful to some people. My political science research involves some natural language processing and machine learning, which I use to analyse texts\u00a0from Japanese newspapers and social media - so one of the challenges is teaching a computer\u2026","rel":"","context":"In &quot;NLP&quot;","block_context":{"text":"NLP","link":"http:\/\/www.robfahey.co.uk\/blog\/category\/nlp\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":988,"url":"http:\/\/www.robfahey.co.uk\/blog\/tokyos-tough-new-governor-is-picking-all-the-right-battles\/","url_meta":{"origin":1085,"position":2},"title":"Tokyo\u2019s Tough New Governor is Picking all the Right Battles","author":"Rob Fahey","date":"14\/12\/2016","format":false,"excerpt":"My first piece for Japan Forward, the new English language site launched by the Sankei Shimbun earlier this week, has been published.\u00a0It's about Koike Yuriko, the new governor of Tokyo, and the uncompromising stances she's been taking against corruption and crony politics in the large projects she 
inherited from her\u2026","rel":"","context":"Similar post","block_context":{"text":"Similar post","link":""},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1133,"url":"http:\/\/www.robfahey.co.uk\/blog\/2020-research-writing-updates\/","url_meta":{"origin":1085,"position":3},"title":"2020 Research &#038; Writing Updates","author":"Rob Fahey","date":"04\/02\/2021","format":false,"excerpt":"2020 wasn't exactly a great year for getting things done, but hey; some things got done nonetheless. Journal Papers \"Covariance in diurnal patterns of suicide-related expressions on Twitter and recorded suicide deaths\" (with Jeremy Boo and Michiko Ueda), in Social Science & Medicine.(What's interesting about it: there's a surprising lack\u2026","rel":"","context":"In &quot;research&quot;","block_context":{"text":"research","link":"http:\/\/www.robfahey.co.uk\/blog\/category\/research\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":187,"url":"http:\/\/www.robfahey.co.uk\/blog\/japan-supreme-court-social-change\/","url_meta":{"origin":1085,"position":4},"title":"Don&#8217;t look to Japan&#8217;s Supreme Court for Social Change","author":"Rob Fahey","date":"16\/12\/2015","format":false,"excerpt":"Japan\u2019s Supreme Court today announced a pair of decisions that are attracting significant media and public attention. 
The one dominating most of the headlines, it seems, is the ruling that a law forbidding married couples from keeping their original names (rather than one party changing their name) is perfectly constitutional,\u2026","rel":"","context":"In &quot;japan&quot;","block_context":{"text":"japan","link":"http:\/\/www.robfahey.co.uk\/blog\/category\/japan\/"},"img":{"alt_text":"Supreme Court of Japan","src":"https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/12\/Saikosaibansho.jpg?fit=593%2C278&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/12\/Saikosaibansho.jpg?fit=593%2C278&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/12\/Saikosaibansho.jpg?fit=593%2C278&resize=525%2C300 1.5x"},"classes":[]},{"id":301,"url":"http:\/\/www.robfahey.co.uk\/blog\/japan-military-budget-increase\/","url_meta":{"origin":1085,"position":5},"title":"Restraint, not Aggression, in Japan&#8217;s Military Budget Increase","author":"Rob Fahey","date":"02\/09\/2016","format":false,"excerpt":"The remilitarisation of Japan is a popular theme for the international media. It gives a clear, dramatic narrative to international news coverage that might otherwise bore readers. 
In this narrative Japan's leadership seek to cast off the shackles of\u00a0the post-1945 world order, to rewrite the pacifist constitution, rebuild their military\u2026","rel":"","context":"In &quot;japan&quot;","block_context":{"text":"japan","link":"http:\/\/www.robfahey.co.uk\/blog\/category\/japan\/"},"img":{"alt_text":"JSDF troops with their flag","src":"https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/07\/Flag_of_JSDF20070408.jpg?fit=593%2C302&resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/07\/Flag_of_JSDF20070408.jpg?fit=593%2C302&resize=350%2C200 1x, https:\/\/i0.wp.com\/www.robfahey.co.uk\/blog\/wp-content\/uploads\/2015\/07\/Flag_of_JSDF20070408.jpg?fit=593%2C302&resize=525%2C300 1.5x"},"classes":[]}],"_links":{"self":[{"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/posts\/1085","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/comments?post=1085"}],"version-history":[{"count":1,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/posts\/1085\/revisions"}],"predecessor-version":[{"id":1086,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/posts\/1085\/revisions\/1086"}],"wp:attachment":[{"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/media?parent=1085"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/categories?post=1085"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.robfahey.co.uk\/blog\/wp-json\/wp\/v2\/tags?post=1085"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":
true}]}}