Sorry Server: What it is and why you should use it
A quick look at what a Sorry Server is, along with some key lessons from running one in production. They might change how you see it and, hopefully, convince you to implement one in your environment.
What is a Sorry Server?
"Sorry Server" is a term used informally in web development. It typically refers to a server that serves a fallback page in order to handle errors gracefully and inform users when something goes wrong with the service. In most cases the content served is static, so that it is as light on resources and as fast to serve as possible.
Common use cases
Rate limiting
In this case you show clients a sorry page when you are experiencing higher load or notice a growing queue in your service and want to rate limit the intake of requests.
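The decision of when to hand out the sorry page can be sketched with a token bucket. This is only a minimal illustration, not how any particular load balancer does it: the class name LoadShedder, the route helper, the 503 status and the Retry-After value are all assumptions made for the example.

```python
import time


class LoadShedder:
    """Token bucket: requests beyond the allowed rate get the sorry page."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens added per second
        self.capacity = burst         # maximum tokens the bucket can hold
        self.tokens = float(burst)
        self.clock = clock            # injectable so the logic can be tested
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill for the elapsed time, capped at the bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Static body, built once; nothing is computed per request
SORRY_BODY = b"<html><body><h1>Sorry, we are busy. Please retry shortly.</h1></body></html>"


def route(shedder):
    """Return (status, headers, body) for one incoming request."""
    if shedder.allow():
        # Placeholder: a real setup would forward to the backend here
        return 200, {}, b"forwarded to the real backend"
    # Assumed policy: 503 plus a Retry-After hint for well-behaved clients
    return 503, {"Retry-After": "5", "Content-Type": "text/html"}, SORRY_BODY
```

With LoadShedder(rate_per_sec=100, burst=200), roughly 100 requests per second reach the backend and everything above that gets the static page.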
Maintenance pages
Another standard use case is using a sorry server to inform clients that maintenance is currently in progress.
Graceful degradation
When your service is experiencing increased load and cannot keep up with incoming traffic, you can show the sorry server only on certain parts of your system. This limits functionality, but doesn't outright cut everyone off.
Error handling
When your servers detect an error, be it a code issue or a resource issue, they can forward the user to the sorry server instead of showing a raw error.
Lessons from production
Here are some takeaways from production, where a sorry server saved us:
In-place upgrades of a SPOF service
Sometimes you have to patch or upgrade a SPOF (single point of failure) service that has no additional server to fall back to during the upgrade. Running commands like apt upgrade or do-release-upgrade can lead to very unexpected server behaviour. For example, on a PHP web server the upgrade could unload the PHP module in Apache and then restart the httpd server. For a period of time, this exposes the raw PHP files of every client hosting websites on that server. This is something that happened recently with Ubuntu Server. It is a perfect example of where you should put a sorry server up and make sure that, for that period of time, nothing connects to the server being upgraded.
Updating your code in place
Let's say you are hosting your code with some web hosting provider. A developer would usually just upload files via FTP or rsync. Unfortunately, that is not an immutable deploy. Between the moment you start syncing the new files over the old ones and the moment the transfer finishes, unpredictable errors can occur, even wrong execution of code, because clients are constantly executing whatever happens to be synced at that time. We saw this a lot when I was working at an advertising company, and it is why we decided to go with immutable squashfs images of the code that would simply be remounted (there was no Docker at the time). This kind of deployment is also a case for a sorry server: put it in front of your server while you are updating files and you avoid these problems.
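One way to wire this into a deploy script is a flag file that switches the sorry page on for the duration of the sync. The sketch below is a minimal illustration, not the setup described above. It assumes your web server or load balancer is configured to serve the sorry page while the flag file exists, and the names maintenance_mode and in_maintenance are hypothetical.

```python
import contextlib
from pathlib import Path


@contextlib.contextmanager
def maintenance_mode(flag):
    """Create the flag file, run the deploy, remove the flag only on success."""
    flag.touch()
    # If the body raises, execution never continues past this yield, so the
    # flag stays in place and half-synced code is not exposed to clients.
    yield
    flag.unlink(missing_ok=True)


def in_maintenance(flag):
    """Check the front end would run to decide whether to serve the sorry page."""
    return flag.exists()
```

Usage would look like "with maintenance_mode(Path("maintenance.flag")):" wrapped around the rsync or FTP step. Leaving the flag in place after a failed sync is a deliberate choice here: someone has to look at the half-deployed state before traffic comes back.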
Handling of all types of responses
Like most crypto exchanges, we serve some HTML websites, and on certain endpoints we also offer API access that returns JSON data. This means that a standard sorry server that just returns HTML output does not cut it. We have to think about our clients. What happens to code that connects to the API service and tries to parse the response as JSON, but gets HTML instead? That leads to awful errors. This is why we customize the endpoints served by our sorry server, swapping from HTML to JSON output on the appropriate paths and returning valid JSON with an error message explaining that we are in maintenance mode.
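As a rough sketch of that idea, here is a tiny sorry server built on Python's standard http.server module that picks the response format per request. The /api/ prefix, the JSON field names, the port and the Retry-After value are assumptions for illustration only, not our actual configuration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical path prefixes that API clients call; adjust to your own routes
API_PREFIXES = ("/api/",)

# Both bodies are built once at startup and never change
HTML_BODY = b"<html><body><h1>Sorry, we are down for maintenance.</h1></body></html>"
JSON_BODY = json.dumps(
    {"error": "maintenance", "message": "Service is in maintenance mode."}
).encode("utf-8")


def sorry_response(path, accept=""):
    """Pick (status, content_type, body) so API clients still receive valid JSON."""
    if path.startswith(API_PREFIXES) or "application/json" in accept:
        return 503, "application/json", JSON_BODY
    return 503, "text/html; charset=utf-8", HTML_BODY


class SorryHandler(BaseHTTPRequestHandler):
    def _reply(self):
        status, ctype, body = sorry_response(self.path, self.headers.get("Accept", ""))
        self.send_response(status)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Retry-After", "60")
        self.end_headers()
        self.wfile.write(body)

    # Answer the common methods the same way
    do_GET = do_POST = do_PUT = do_DELETE = _reply


if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), SorryHandler).serve_forever()
```

A client that calls json.loads on the body of a request to /api/... gets a dictionary with an error field it can handle, instead of a parse exception on HTML.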
CDN implementation
For the actual exchange infrastructure we have our own sorry servers, of course, but for certain third-party software that we run, having the CDN serve a maintenance page is just fine. That is also an option for those of you who don't have several servers and load balancers to utilize.
Serve static content
A very tricky thing happened to us back in the day, when we tried to be very smart about our maintenance sorry page and actually served dynamic content on it. We tried to keep it lightweight, but it still connected to the database. The real problem was that we didn't think about all the edge cases. One of them was providing a favicon.ico file, which certain browsers like Firefox request. Those requests spammed the database, because every attempt to get that file ended up executing the dynamic content. This doubled the number of requests we had to handle, one for the favicon and one for the page, and of course led to a certain degradation of database performance. So keep these maintenance pages static, without any logic in them.
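In code, "static without any logic" can be as simple as a lookup table that is built once at startup. The sketch below is an illustration under assumptions: the table layout and the lookup helper are made up for this example, and answering favicon.ico with an empty 204 is just one option (serving a small prebuilt icon file would work as well).

```python
# Everything the sorry server can answer is prepared once, up front.
# No database, no templates, no per-request logic.
MAINTENANCE_HTML = b"<html><body><h1>Sorry, we are down for maintenance.</h1></body></html>"

# Paths that browsers fetch on their own get an explicit, cheap answer
STATIC_RESPONSES = {
    # Cache-Control is only a hint; browser behaviour may vary
    "/favicon.ico": (204, {"Cache-Control": "public, max-age=86400"}, b""),
}

# Every other path gets the same prebuilt maintenance page
DEFAULT_RESPONSE = (
    503,
    {"Content-Type": "text/html; charset=utf-8", "Retry-After": "60"},
    MAINTENANCE_HTML,
)


def lookup(path):
    """Return (status, headers, body) for a request path with a dictionary lookup only."""
    # Drop the query string so "/favicon.ico?v=2" still matches
    return STATIC_RESPONSES.get(path.split("?", 1)[0], DEFAULT_RESPONSE)
```

Whatever a browser asks for, the cost per request stays a single dictionary lookup, so an extra favicon request can no longer reach anything expensive.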
Footnote
Incorporating a "Sorry Server" isn't just a safety net, it's a strategy to enhance resilience and user satisfaction in production environments. If you are not using one yet, hopefully the examples from production above have convinced you to at least have a look at it. It may turn out to be really easy to add to your maintenance procedure.