TL;DR: When you click the publish button it must just work. Achieving this simple goal has required relentless dedication to robustness.
In the early days, like so many other companies, we made the classic error of constantly pushing to add new features. We assumed these additions would translate into satisfied customers.
No matter how hard we worked on new features, users weren’t really loving KAWO. We spent a lot of time on features that were hardly being used, when we should have been focussing on the part that matters most to our users: publishing.
We realised that reliability was the biggest issue. When users clicked the ‘Publish’ button, there was an unacceptably large chance that the post would not get published. When they clicked ‘Schedule’, the post occasionally wouldn’t get sent out exactly on time.
So, in the summer of 2014, we set ourselves a three-month goal of making these two core features of KAWO rock solid. Here are just a few of the steps we took…
Availability of images
One major cause of errors was the use of images. When you upload an image on KAWO, we store that image on Amazon S3. We found that loading the images from within China was sometimes slow and unreliable, so we tried many ways to improve this, which we previously wrote about here on the blog.
Try, retry… and try again
Weibo API requests to upload images were the most prone to failure. So, we rebuilt our publishing service to be multithreaded and attempt to publish each image up to 10(!) times before finally giving up.
After the images are uploaded, the next step is the request to publish the post. Again, we designed our publishing server to attempt this 10 times before giving up. With more testing we also learned that waiting a couple of seconds between attempts resulted in a higher success rate.
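The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration in Python, not our production code: the names with_retry, upload_all and PublishError are hypothetical, the 2 second delay stands in for "a couple of seconds", and the upload function is passed in by the caller rather than being a real Weibo API call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


class PublishError(Exception):
    """Raised when every attempt has failed (hypothetical name)."""


def with_retry(action: Callable[[], T],
               max_attempts: int = 10,
               delay_seconds: float = 2.0,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call action until it succeeds, pausing between failed attempts."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:  # a real service would catch narrower errors
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)  # wait before the next attempt
    raise PublishError(f"gave up after {max_attempts} attempts") from last_error


def upload_all(images: Iterable[bytes],
               upload_one: Callable[[bytes], str],
               **retry_options) -> List[str]:
    """Upload images on separate threads, retrying each one independently."""
    images = list(images)
    with ThreadPoolExecutor(max_workers=max(1, len(images))) as pool:
        # pool.map keeps results in the same order as the input images.
        return list(pool.map(
            lambda image: with_retry(lambda: upload_one(image), **retry_options),
            images,
        ))
```

The same with_retry helper can wrap the final publish request once upload_all has returned, so a single flaky call never fails the whole post on its own.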
Automatically re-size images
Originally, our server sent Weibo whatever image the user had uploaded. But Weibo always downsizes images, and larger images were much more likely to fail. Our server now reduces every image to less than 1024 pixels before uploading to the Weibo API. We also convert lossless PNGs to more compact lossy JPEGs.
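As a rough sketch of the sizing step, the function below works out the target dimensions while preserving the aspect ratio. It assumes the limit applies to the longer side of the image and treats 1024 as an inclusive upper bound; both are assumptions made for illustration, and fit_within is a hypothetical name. The actual pixel resampling and the PNG to JPEG conversion would be handled by an imaging library and are not shown.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int]:
    """Return dimensions scaled down so the longer side is at most max_side.

    Images that already fit are returned unchanged (never scaled up).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Clamp to [1, max_side] to guard against rounding at the edges.
    new_width = min(max_side, max(1, round(width * scale)))
    new_height = min(max_side, max(1, round(height * scale)))
    return new_width, new_height
```

For example, a 4000 by 3000 photo would be reduced to 1024 by 768, while an 800 by 600 image would be left as it is.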
Even with all these improvements there are still times posts fail to publish. Sometimes it’s our fault, sometimes the Sina API doesn’t respond. We still needed a safety net of alerts when a post fails.
Whenever a post fails, our product lead, Alex, and I both receive an email from our server detailing which post on which account failed to publish. We can then use our internal dashboard to see the exact reason and begin troubleshooting.
With email increasingly unreliable in mainland China, our server also uses Nexmo.com’s impressive SMS service to send us a message whenever a post fails.
We both take the stability of KAWO so seriously that no matter what time of day or night, as soon as we see the message we investigate. Here is an example of a recent late night WeChat conversation between Alex and me when we responded to failed posts.
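The idea of alerting over more than one channel can be sketched like this. It is only an illustration: FailedPost, format_alert and send_alerts are hypothetical names, and the channels are plain callables supplied by the caller, so no real email or SMS API is shown. The point is that a failure in one channel should not stop the others from being tried.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class FailedPost:
    """Minimal description of a failed post (hypothetical fields)."""
    post_id: str
    account: str
    reason: str


def format_alert(failure: FailedPost) -> str:
    """Build the message text shared by every channel."""
    return (f"Post {failure.post_id} on account {failure.account} "
            f"failed to publish: {failure.reason}")


def send_alerts(failure: FailedPost,
                channels: Dict[str, Callable[[str], None]]) -> List[str]:
    """Send the alert through every channel and return those that succeeded."""
    message = format_alert(failure)
    delivered = []
    for name, send in channels.items():
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # One broken channel (for example, blocked email) must not
            # prevent the remaining channels from being tried.
            continue
    return delivered
```

In practice the channels dictionary would hold one function that sends an email and another that sends an SMS, and an empty return value would itself be worth logging.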
In the life of a developer errors are almost the only certainty. The only solution is to be aggressive in monitoring and fixing them.