Did you ever have one of those weeks?
About a month ago we did. As a team, we never want to re-live that kind of stress.
Based on the feedback I was getting from support, the stress our subscribers were under was even greater, and understandably so.
When the podcasting service you have relied on for the last 14 years suddenly stops serving your media and RSS, and there is no immediate fix or explanation, the feeling of losing control is hard to put into words. I can't adequately describe the level of frustration so many of you felt and voiced to our support team.
What we experienced was best described as an internally originated cascading failure.
One issue caused another, which caused another, and so on. Finding the root cause of this failure took far longer than anyone wanted, and we're very sorry for that.
All of our equipment should have been providing streaming telemetry, reporting state and status to monitoring servers that could have alerted us to the issue well before it became the showstopper we experienced.
Unfortunately it wasn’t working as it should have been.
When the first issue occurred, it caused a bottleneck in outbound data flow, and we should have been on that within minutes. Instead, we didn't see it, so it went unaddressed.
When the second issue occurred, it was causing iTunes Music Service (iTMS) servers to hit us 10,000 times a second for RSS updates. It looked like an attack, which we attributed to badly configured servers on their end.
Finally, when the third issue occurred and stopped us from responding at all, we realized this wasn't an attack; it was all internal. Even then, we weren't able to diagnose the problem correctly, because each time we looked at the servers that were the root cause, they would stop exhibiting the behavior driving the cascading failure, only to resume it ten minutes after we logged out and support had raised the "issue fixed" flag.
We're very sorry for the inconvenience to you all, both for how long it took us to realize what was happening and for how long it took us to resolve the issues.
What I can tell you at this stage is that every piece of equipment that we manage is being upgraded or replaced to accommodate five times the load it is currently experiencing, including internal and external networks.
Where we were using quad-core servers to host media and content, we're replacing each of those with 32- or 64-core machines. We no longer use virtual machines to host websites or web content, because networking issues in the various virtualization stacks caused congestion and did nothing to mitigate it.
Our backup strategy, as it always has been, ensures that your media gets backed up to three monstrous media servers. Two are located within our network, and another is hosted externally to ensure that all of your data is secure even if our data center stops functioning.
Every piece of equipment will send status reports every 30 seconds to a service that tracks load and usage and identifies potential issues.
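To give a sense of what a status report like that involves, here is a minimal sketch of a heartbeat payload plus a simple anomaly check. The field names, thresholds, and 30-second interval constant are illustrative assumptions for this post, not the actual service we run.

```python
import json
import socket
import time

# Illustrative sketch only: field names and thresholds are assumptions,
# not Hipcast's real monitoring service.

HEARTBEAT_INTERVAL = 30  # seconds between status reports


def build_report(hostname, load_avg, disk_used_pct):
    """Assemble one status report to send to the tracking service."""
    return {
        "host": hostname,
        "ts": int(time.time()),
        "load": load_avg,
        "disk_used_pct": disk_used_pct,
    }


def is_anomalous(report, load_limit=8.0, disk_limit=90.0):
    """Flag a report that suggests a developing problem."""
    return report["load"] > load_limit or report["disk_used_pct"] > disk_limit


if __name__ == "__main__":
    # In a real sender this would loop every HEARTBEAT_INTERVAL seconds
    # and POST the JSON to the tracking service; here we just build one.
    report = build_report(socket.gethostname(), load_avg=2.5, disk_used_pct=40.0)
    print(json.dumps(report))
    print("anomalous:", is_anomalous(report))
```

The point of a design like this is that the tracking service, not the host, decides when something is wrong, so a server that is too overloaded to reason about itself still gets flagged when its reports look bad or stop arriving.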
Finally, the network that hosts your media and RSS will be on separate equipment from the service that hosts the website, allowing us to ensure that a sudden explosion in consumption of media doesn’t stop you from accessing the site.
We realize that you have choices when it comes to selecting a podcast host, and we want to make sure that Hipcast is worthy of your continued trust.