[NLP][Python] Use OpenCC to convert simplified Chinese to traditional Chinese via opencc-python-reimplemented

Chinese characters can be roughly divided into “Traditional” and “Simplified”. If you need to convert between the two, I think the most convenient tool is OpenCC, called from Python.

OpenCC (Open Chinese Convert) supports multiple platforms, including Windows, Linux, and macOS.

As the name suggests, it is an open source project. If you are interested, see the link at the end of the article. You can also try OpenCC online at: https://opencc.byvoid.com/

Below, I will show how to call OpenCC from Python to convert Traditional Chinese to Simplified Chinese (or vice versa).


Preparation

First, take a look at the following GitHub project, which reimplements OpenCC in pure Python: https://github.com/yichen0831/opencc-python

It can be installed with pip:

pip3 install opencc-python-reimplemented

How to use OpenCC

Here is a simple example (the input text is Traditional Chinese):

# -*- coding: utf-8 -*-
from opencc import OpenCC

cc = OpenCC('t2s')
text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'

print(cc.convert(text))


Output:

傅达仁今将执行安乐死,却突然爆出自己20年前遭纬来体育台封杀,他不懂自己哪里得罪到电视台。

The output text is Simplified Chinese.

We only need to import OpenCC from the opencc package and call OpenCC('MODE') to initialize the converter.

The available conversion modes include the following:

  • hk2s: Traditional (Hong Kong) -> Simplified
  • s2hk: Simplified -> Traditional (Hong Kong)
  • s2t: Simplified -> Traditional
  • s2tw: Simplified -> Traditional (Taiwan)
  • s2twp: Simplified -> Traditional (Taiwan, with idiomatic phrase conversion)
  • t2hk: Traditional -> Traditional (Hong Kong)
  • t2s: Traditional -> Simplified
  • t2tw: Traditional -> Traditional (Taiwan)
  • tw2s: Traditional (Taiwan) -> Simplified
  • tw2sp: Traditional (Taiwan) -> Simplified (with idiomatic phrase conversion)
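
The mode names in this list follow a source2target pattern, with a trailing "p" marking idiomatic phrase conversion. As a minimal sketch of that naming convention, the helper below validates a mode name and turns it into a readable label before a converter is built. SUPPORTED_MODES, VARIANTS, describe_mode and make_converter are hypothetical names introduced here for illustration; they are not part of opencc-python-reimplemented, and the set only contains the modes listed above.

```python
# Mode names copied from the list above (hypothetical helper, not library API).
SUPPORTED_MODES = {
    "hk2s", "s2hk", "s2t", "s2tw", "s2twp",
    "t2hk", "t2s", "t2tw", "tw2s", "tw2sp",
}

# Readable labels for the abbreviations used in the mode names.
VARIANTS = {
    "s": "Simplified",
    "t": "Traditional",
    "hk": "Traditional (Hong Kong)",
    "tw": "Traditional (Taiwan)",
}


def describe_mode(mode: str) -> str:
    """Return a readable description such as 'Traditional -> Simplified'."""
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"unknown conversion mode: {mode!r}")
    source, _, target = mode.partition("2")
    # A trailing 'p' on the target marks idiomatic phrase conversion.
    phrases = target.endswith("p")
    if phrases:
        target = target[:-1]
    label = f"{VARIANTS[source]} -> {VARIANTS[target]}"
    return label + " (with phrase conversion)" if phrases else label


def make_converter(mode: str):
    """Validate the mode name first, then build the converter."""
    describe_mode(mode)  # raises ValueError on a typo such as 't2sp'
    from opencc import OpenCC  # imported lazily; needs the pip package installed
    return OpenCC(mode)


print(describe_mode("tw2sp"))
```

Checking the name up front means a typo fails early with a clear message, before any text is passed to the converter.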

As the list of modes shows, the 't2s' mode used so far only converts the characters to Simplified Chinese. If we also want commonly used words converted to their regional equivalents, we need to switch to another mode:

# -*- coding: utf-8 -*-
from opencc import OpenCC

cc = OpenCC('tw2sp')
text = '傅達仁今將執行安樂死,卻突然爆出自己20年前遭緯來體育台封殺,他不懂自己哪裡得罪到電視台。'

print(cc.convert(text))

Output:

傅达仁今将运行安乐死,却突然爆出自己20年前遭纬来体育台封杀,他不懂自己哪里得罪到电视台。

This time the original word “執行” was converted to the idiomatic term “运行” instead of the character-by-character result “执行”.
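
When comparing the output of two modes on longer text, it can be hard to spot what changed by eye. The following is a small sketch that uses Python's standard difflib module to list the differing spans; changed_spans is a hypothetical helper name, and the two strings are the article's outputs shortened to the relevant clause.

```python
from difflib import SequenceMatcher


def changed_spans(first: str, second: str) -> list[tuple[str, str]]:
    """Return (first_part, second_part) pairs where the two strings differ."""
    matcher = SequenceMatcher(None, first, second, autojunk=False)
    return [
        (first[i1:i2], second[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# The two outputs shown in this article, shortened to the relevant clause.
t2s_output = "傅达仁今将执行安乐死"
tw2sp_output = "傅达仁今将运行安乐死"

print(changed_spans(t2s_output, tw2sp_output))  # [('执', '运')]
```

Only the first character of the word is reported, because “行” is shared by “执行” and “运行”.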

Source: [NLP][Python] Use OpenCC to convert simplified Chinese to traditional Chinese via opencc-python-reimplemented – Clay-Technology World