Create a Traditional to Simplified Chinese converter using Python
Introduction
Simplified and Traditional Chinese are the two standard character sets used to write the Chinese language. As the names suggest, Simplified Chinese is the reduced version (usually fewer strokes) of its more complicated cousin, Traditional Chinese (usually more strokes). I will show you an easy way to convert Traditional Chinese text to Simplified Chinese.
Preparation
The first thing we need is a mapping table from Traditional Chinese to Simplified Chinese. I got one from the tutormandarin.net website. After some web scraping and transformation, I came up with this CSV file.
Importing the libraries
The main library we need is csv. If you also want to convert a webpage, add urllib as well.
Read the CSV and store it in a dictionary
We need to keep the whole mapping table in memory. I decided to use a dictionary because it allows quick lookups, and with only 900+ entries it will not be resource intensive. We need to open the file with the right encoding for this to work. There are many ways to figure out the encoding, such as opening the file in a text editor like Sublime Text or Visual Studio Code. Another way is via Python, which I show at the bottom of this post. Then read the CSV into a list and turn it into a dictionary.
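A minimal sketch of this step is shown here. It assumes the CSV has two columns, traditional first and simplified second; the helper name load_mapping and the file name mapping.csv are only illustrative, and an in-memory sample stands in for the real file so the snippet runs on its own.

```python
import csv
import io

def load_mapping(csv_file):
    """Build a {traditional: simplified} dict from a two-column CSV stream."""
    mapping = {}
    for row in csv.reader(csv_file):
        # Assumption: column 0 is traditional, column 1 is simplified.
        # Blank or incomplete rows are skipped.
        if len(row) >= 2 and row[0] and row[1]:
            mapping[row[0]] = row[1]
    return mapping

# In-memory stand-in for the real file. With a file on disk you would pass
# open("mapping.csv", encoding="utf-8", newline="") instead (hypothetical name).
sample = io.StringIO("體,体\n國,国\n學,学\n")
mapping = load_mapping(sample)
print(mapping)
```

Looking up a character is then a plain dictionary access, for example mapping["國"].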
Read the file to be converted
I have included two versions, one for a local file and one for a webpage. As with the CSV, we need to specify the right encoding; otherwise the program will crash.
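The two variants could look roughly like the sketch here. The function names are hypothetical, UTF-8 is only an assumed default, and the demo uses a temporary file so that nothing depends on the network or on files you may not have.

```python
import os
import tempfile
import urllib.request

def read_local_lines(path, encoding="utf-8"):
    """Return the lines of a local text file, decoded with the given encoding."""
    with open(path, encoding=encoding) as f:
        return f.read().splitlines(keepends=True)

def read_webpage_lines(url, encoding="utf-8"):
    """Download a page and return its lines, decoded with the given encoding."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode(encoding).splitlines(keepends=True)

# Demo on a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write("學習中文\n臺灣\n")
    demo_path = tmp.name

lines = read_local_lines(demo_path)
os.remove(demo_path)
print(lines)
```

For a page, the call would be something like read_webpage_lines(url, encoding="utf-8"), with the encoding adjusted to whatever the site actually serves.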
Create the output file
We need to create a new file to store the output, opened in write mode with an explicit encoding. The encoding should be the same as that of the file being converted.
Actual replacement
Here is the fun part. We go through the text line by line and character by character, replacing each character using the dictionary. We wrap the replacement in a try/except block because a character that is not a dictionary key raises a KeyError; in that case we simply ‘pass’ and move on to the next character. After processing all the characters in a line, we write the newly converted line, now in simplified characters, to the new file, then move on to the next line and repeat. After completing all the lines, close the new file. Don’t worry, the program does not take long to run: I tested it on the tw.yahoo.com website, and it took 2 seconds!
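The loop described above can be sketched as follows. This is an illustration under assumptions rather than the exact original code: the mapping is a tiny hand-written subset, the function names are made up, and the output goes to a temporary directory.

```python
import os
import tempfile

# Tiny illustrative subset of the mapping table.
mapping = {"學": "学", "習": "习", "臺": "台", "灣": "湾"}

def convert_line(line, mapping):
    """Replace every traditional character in the line that has a mapping."""
    for ch in line:  # iterates over the characters of the original line
        try:
            line = line.replace(ch, mapping[ch])
        except KeyError:
            # Character is not in the table (punctuation, Latin letters,
            # characters that are the same in both scripts): leave it alone.
            pass
    return line

def convert_lines(lines, out_path, mapping, encoding="utf-8"):
    """Write the converted lines to out_path using the given encoding."""
    # The with-block closes the output file once all lines are written.
    with open(out_path, "w", encoding=encoding) as out:
        for line in lines:
            out.write(convert_line(line, mapping))

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "simplified.txt")
    convert_lines(["學習中文\n", "臺灣\n"], out_path, mapping)
    with open(out_path, encoding="utf-8") as f:
        result = f.read()

print(result)
```

An equivalent way to avoid the try/except is "".join(mapping.get(ch, ch) for ch in line), which falls back to the original character when no mapping exists.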
Summary
So there you have it, a Traditional Chinese to Simplified Chinese converter in just 15 lines of code. You can even turn it into a Simplified to Traditional converter by swapping the dictionary's keys and values or swapping the columns of the CSV file! The full code is below.
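On the reversal idea, a short sketch of swapping keys and values is shown here, using a hand-written three-entry table as an assumption. It also shows one caveat: several traditional characters can share a single simplified form, so the inverted dictionary keeps only one of them.

```python
# Illustrative table: both 發 and 髮 simplify to 发.
t2s = {"發": "发", "髮": "发", "國": "国"}

# Swap keys and values to get a simplified-to-traditional table.
s2t = {simplified: traditional for traditional, simplified in t2s.items()}

print(s2t)
# When two traditional characters share a simplified form, the later entry
# overwrites the earlier one, so the inverted table has fewer entries.
print(len(t2s), len(s2t))
```

For a character-by-character converter this means the reverse direction is only approximate wherever such many-to-one pairs occur.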
Bonus: Check your file for encoding type
Checking which encoding a file uses is simple. All you need to do is read the file as bytes and then attempt to decode it with a candidate encoding. If the encoding is wrong, decoding will usually raise a UnicodeDecodeError. So we try each candidate in turn until one succeeds. I have already gathered the encodings that can support Chinese characters, so we only need to loop through that list until we find one that works.
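A minimal sketch of that loop is shown here. The candidate list is an assumption (a handful of Python's standard codecs that cover Chinese text, not necessarily the list used for this post), and guess_encoding is a hypothetical helper name.

```python
# Assumed candidate list; the order matters because the first success wins.
CANDIDATE_ENCODINGS = ["utf-8", "big5", "big5hkscs", "gb2312", "gbk", "gb18030"]

def guess_encoding(raw, candidates=CANDIDATE_ENCODINGS):
    """Return the first candidate that decodes the bytes without error, else None."""
    for encoding in candidates:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
        return encoding
    return None

# With a real file you would read the bytes first, e.g.
# raw = open("input.txt", "rb").read()  (hypothetical file name)
sample = "中文".encode("big5")
print(guess_encoding(sample))
```

Keep in mind that a successful decode is only a hint: some byte sequences decode without error under more than one of these encodings, so it is worth eyeballing the decoded text before trusting the result.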