Battlefront Chronicles: #2 Lost clicks
"Battlefront Chronicles" series shares stories of production mishaps, how they were handled, and the lessons learned along the way.
Please keep in mind that the stories in the Battlefront Chronicles may not all be my own. Some of them come from other engineers working in the field, and their contributors may choose to remain anonymous.
Advertising company
Back in the day, I was working at an advertising company, making sure our servers delivered ads as fast as possible, to as many clients as possible. At the time it was the second biggest advertising company in Central Europe, and we were doing roughly 40-60k requests per second on our frontend servers.
Infrastructure
There were quite a lot of supporting servers and services, but ~20 main virtual servers were responsible for delivering ads to the clients. They all ran on hosts with specially patched kernels that could run containerized applications; the technology was called VServer (Linux-VServer). Pretty similar to Docker, but with very little overhead. Alongside these, we also ran a few fully virtualized frontend serving instances on KVM and XenServer. Keep in mind, these were very early days and money was scarce, so we had to make do with what we had, often using very old hardware or refurbished servers bought from other companies that were replacing their own infrastructure with newer gear.
Incident
One day, the CTO approached me with a very interesting finding. Looking at the charts, we saw that the click rate on our ads was lower. Digging further, it seemed to have started gradually, within the last two weeks or so. Advertising is a very chaotic environment, and external events affect it heavily. For example, during the Soccer World Cup we had no clue what was going on: we were serving maybe 30% of our usual traffic for an hour or so, then got hit by a sudden enormous spike. We later figured out it was half time of the game. Events like these drastically affect serving, but not internal metrics like the click rate or the number of ads shown, so this was indeed a weird situation.
Investigation
As the sysadmin in charge at the time, I dug into the problem further and figured out that only 1 out of 20 servers had a lower click rate. Very odd! Why only this server? Its metrics were fine, everything was fine. So I stepped back and tried to understand what was actually going on. To put it simply: when an ad is served, an “ad served” event is sent back to our servers; when it’s seen, an “ad opened” event follows, and so on. Logically, “ad served” and “ad opened” usually happen quite soon after serving, but “ad clicked” depends entirely on the person, and on whether they click the ad at all. Looking at the code gave me additional information that helped debug the situation: we were doing an extra check, and if an “ad clicked” event came through way too late, something like 10-15 seconds after the serve, we discarded it as an invalid click. Digging even further, that was exactly what was happening on this offending server. This started ringing bells. It had something to do with time!
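In (simplified) code, the check looked something like this sketch. The names, the timestamp format, and the exact threshold are assumptions for illustration, not our actual code:

```python
import time

MAX_CLICK_DELAY = 10.0  # seconds; clicks "older" than this were discarded as invalid

def is_valid_click(served_at: float, received_at: float) -> bool:
    """served_at is stamped by the server that delivered the ad;
    received_at is stamped by whichever server receives the click event."""
    delay = received_at - served_at
    return 0.0 <= delay <= MAX_CLICK_DELAY

# Normal case: the user clicks 3 seconds after the ad was served.
served_at = time.time()
received_at = served_at + 3.0
print(is_valid_click(served_at, received_at))  # True
```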
The culprit
After the initial investigation, I checked the server time on the offending server and compared it with the other servers. Bingo. A difference! This server’s clock was off by about 10 seconds!
So what, you would say…

But no, this directly affects our “ad clicked” event. The check compares the timestamp of when the ad was served, stamped by another server whose clock was in sync, against the arrival time of the “ad clicked” event on this server, whose clock was off by about 10 seconds. That offset alone was enough to push genuine clicks past the cutoff, so they were automatically flagged and discarded, as the snippet below shows.
Mind blown.
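With the sketch from earlier, the arithmetic of the failure becomes obvious (the timestamps are made up; the 10-second skew is from this incident):

```python
MAX_CLICK_DELAY = 10.0  # the same cutoff as in the sketch above
CLOCK_SKEW = 10.0       # the offending server's clock ran ~10 s off

served_at = 1_700_000_000.0                 # stamped by an in-sync server
received_at = served_at + 3.0 + CLOCK_SKEW  # a genuine 3-second click, stamped by the skewed clock

measured_delay = received_at - served_at    # 13.0 s as seen by the check
print(measured_delay <= MAX_CLICK_DELAY)    # False -> the real click is discarded
```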
Reasoning
Okay, so why does this even happen? I mean, we already had ntpd running on all of the hosts! Investigating a bit more, I figured out there was one difference between the rest of the ad serving servers and the faulty one: this one was actually running under full virtualization! Oh, noes.
Unfortunately, full virtualization can affect the guest’s clock, since the hypervisor exposes time through an emulated virtual clock. Because of this, the guest’s clock can drift significantly over time.
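A quick way to confirm such a drift is to ask an NTP server how far off the local clock is. A minimal sketch using the third-party ntplib package (just an illustration, not something we ran back then):

```python
import ntplib  # third-party: pip install ntplib

client = ntplib.NTPClient()
# Ask a public NTP pool server for our local clock's offset.
response = client.request("pool.ntp.org", version=3)
print(f"local clock offset: {response.offset:+.3f} s")
# On the offending server, this would have reported an offset of roughly 10 seconds.
```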
The fix
I immediately installed ntpd on the fully virtualized servers as well. I also found other fully virtualized VMs whose clocks had been drifting over months of uptime. As soon as ntpd synced them up, the problem was solved: ad discards were no longer happening, and we could go and assess the damage.
Aftermath
We had actually been discarding all ad clicks on one of the twenty servers, which added up to a lot of clicks over a two-week span. The loss in revenue wasn’t that substantial though, and that was actually part of the problem: had the loss been greater, we would have noticed much earlier and gone investigating. But since only 1/20th of the traffic was being discarded, it stayed within the margin of error of our reporting, ad serving being quite a chaotic environment, as mentioned before. What we could do after that was set alerts on these discards; in case something like this happens again, that should warn us quickly enough that something is wrong.
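A sketch of what such an alert can look like (names and the threshold are illustrative assumptions, not our actual monitoring code). Comparing each server’s discard rate against the fleet average works better here than a fixed number, since overall traffic is too noisy for absolute thresholds:

```python
def servers_to_alert(discards: dict[str, int], clicks: dict[str, int],
                     factor: float = 3.0) -> list[str]:
    """Return servers whose discard rate exceeds `factor` times the fleet average."""
    fleet_rate = sum(discards.values()) / max(sum(clicks.values()), 1)
    return [
        server
        for server in discards
        if discards[server] / max(clicks[server], 1) > factor * fleet_rate
    ]

# Healthy servers discard the odd stale click; "fe-07" discards everything.
discards = {"fe-01": 2, "fe-02": 1, "fe-03": 1, "fe-07": 950}
clicks = {"fe-01": 1000, "fe-02": 980, "fe-03": 990, "fe-07": 950}
print(servers_to_alert(discards, clicks))  # ['fe-07']
```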
Final thought
This incident taught me to always have metrics for time drift from then on. No matter where I work or what I do, servers should run ntpd or chrony, because applications rely heavily on server time. Any clock drift pops up in very weird ways, and it takes a lot of time to debug and pinpoint the exact issue.
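For completeness, here is a minimal, dependency-free sketch of such a drift metric: a raw SNTP query that reports the local clock’s offset, suitable for feeding into whatever alerting you have. The server name and the alert threshold are placeholders:

```python
import socket
import struct
import time

NTP_TO_UNIX = 2_208_988_800  # seconds between the NTP epoch (1900) and the Unix epoch (1970)

def clock_offset(server: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Rough local-clock offset in seconds, from a single SNTP query."""
    packet = b"\x1b" + 47 * b"\x00"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        t_sent = time.time()
        sock.sendto(packet, (server, 123))
        data, _ = sock.recvfrom(48)
        t_received = time.time()
    secs, frac = struct.unpack("!II", data[40:48])  # server transmit timestamp
    server_time = secs - NTP_TO_UNIX + frac / 2**32
    # Approximate: assume the server answered halfway through the round trip.
    return server_time - (t_sent + t_received) / 2

if __name__ == "__main__":
    offset = clock_offset()
    print(f"clock offset: {offset:+.3f} s")
    if abs(offset) > 1.0:  # placeholder threshold
        print("ALERT: clock drift detected")
```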