Chinese JavaScript String Comparison
March 20, 2015

Problem:

Coming soon in KAWO v5: an audit trail of every change made to a post. To show the before and after differences, we needed a client-side string comparison algorithm that could handle Chinese characters.

Solution:

Step One: User Experience

We started with some quick mock-ups of how we wanted it to look.

We quickly decided we preferred the first option, with separate before and after diffs.

Step Two: Search for a 3rd Party Library  

Our app is written entirely in JavaScript, so we started by searching for a JS package that calculates the difference between two strings. We found a few, but most of them produced the second, more confusing type of diff, and none of them could cope with Chinese characters.

Step Three: Write Our Own Package

Not only is the Chinese language difficult to master for foreigners, it sometimes also introduces extra challenges for us programmers.

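In byte-oriented encodings such as UTF-8, Chinese characters are multi-byte: one Chinese character typically takes up three bytes, while basic Latin characters take up only one. This can cause unexpected results when calculating string lengths, because the answer depends on whether you count bytes or characters (JavaScript's own string length counts UTF-16 code units, which is different again from the byte count). We will write more on this topic when we cover “How Weibo calculates your post’s length”.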
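
To make the length mismatch concrete, here is a minimal sketch. The helper name utf8ByteLength is our own invention for illustration, and it assumes an environment where TextEncoder is available (modern browsers and recent Node.js versions).

```javascript
// Hypothetical helper (not part of any library): number of bytes the
// string would occupy when encoded as UTF-8.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

const latinText = 'hello';
const chineseText = '你好世界';

// For basic Latin text the character count and the byte count agree.
console.log(latinText.length, utf8ByteLength(latinText)); // 5 5

// For these Chinese characters each one is a single UTF-16 code unit
// but three UTF-8 bytes, so the two numbers differ.
console.log(chineseText.length, utf8ByteLength(chineseText)); // 4 12
```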

Another key difference from western languages is that words in Chinese are not separated by spaces. This makes even a simple programming task such as ‘separating all words in a sentence’ a lot more challenging.
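
A quick illustration of why the usual approach breaks down. The function splitOnSpaces is a hypothetical name for the naive word splitter many western-language tools rely on, and the sample sentences are our own.

```javascript
// Hypothetical naive word splitter: break the text on runs of whitespace.
function splitOnSpaces(text) {
  return text.trim().split(/\s+/);
}

// An English sentence falls apart into its words as expected.
console.log(splitOnSpaces('separating all words in a sentence').length); // 6

// A Chinese sentence contains no spaces, so it comes back as one "word".
console.log(splitOnSpaces('把句子里的词全部分开').length); // 1
```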

The closest thing we could find to what we wanted was this algorithm written by John Resig: it compares two strings and returns an object with the changes marked. We decided to use it as our starting point. However, John’s algorithm (like many published language-related algorithms) assumes that the text to compare is in a common western language. It splits the text into arrays of words with their positions in the string and compares them.
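
To give a feel for what a word-level comparison produces, here is a rough sketch. It is not John Resig's algorithm: we swap in a plain longest-common-subsequence table as a stand-in, and the function name diffTokens and the shape of the returned object are assumptions made for this example.

```javascript
// Hypothetical sketch: compare two token arrays and mark what changed.
// Uses a longest-common-subsequence (LCS) table, NOT John Resig's method.
function diffTokens(before, after) {
  const rows = before.length;
  const cols = after.length;

  // lcs[i][j] = length of the LCS of before[i..] and after[j..]
  const lcs = [];
  for (let i = 0; i <= rows; i++) {
    lcs.push(new Array(cols + 1).fill(0));
  }
  for (let i = rows - 1; i >= 0; i--) {
    for (let j = cols - 1; j >= 0; j--) {
      lcs[i][j] = before[i] === after[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table, labelling tokens as unchanged, removed or added.
  const result = { before: [], after: [] };
  let i = 0;
  let j = 0;
  while (i < rows && j < cols) {
    if (before[i] === after[j]) {
      result.before.push({ token: before[i], changed: false });
      result.after.push({ token: after[j], changed: false });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      result.before.push({ token: before[i], changed: true }); // removed
      i++;
    } else {
      result.after.push({ token: after[j], changed: true }); // added
      j++;
    }
  }
  while (i < rows) result.before.push({ token: before[i++], changed: true });
  while (j < cols) result.after.push({ token: after[j++], changed: true });
  return result;
}

// Word-level comparison of two western sentences.
const sample = diffTokens(
  'the quick brown fox'.split(/\s+/),
  'the slow brown fox'.split(/\s+/)
);
console.log(sample.before.filter(t => t.changed).map(t => t.token)); // [ 'quick' ]
console.log(sample.after.filter(t => t.changed).map(t => t.token));  // [ 'slow' ]
```

Returning separate before and after lists mirrors the separated diff layout we preferred in the mock-ups.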

To get all the words in a Chinese sentence, you would have to use a segmenter, since one or more characters can form a word. Luckily, in our case the actual meaning of the Chinese characters wasn’t important, only the characters and their locations. That left us with two scenarios: the western parts had to be split by spaces, and everything else by character. By modifying the regular expressions, we were able to make the script work for Chinese and other languages.

We’ve made our modified version freely available on GitHub as a JavaScript module that is easy to install using Bower. Enjoy!

https://github.com/mailmangroup/diffChinese
