In late October, the Roblox global game network went offline for three days. The site is used by 50 million players daily. Discovering and addressing the root causes of this disruption required a massive effort by engineers at both Roblox and a major technology supplier, HashiCorp.
Roblox finally provided an excellent analysis in a blog post at the end of January. As it turns out, Roblox was bitten by an odd coincidence of several events. The processes that Roblox and HashiCorp went through to diagnose and ultimately fix things are instructive for any company that is running infrastructure as code at large scale or making heavy use of containers and microservices across its infrastructure.
There are a number of lessons that can be learned from the Roblox outage.
Roblox relies heavily on the HashiCorp software suite.
Roblox's online multiplayer games are distributed all over the world to provide the lowest possible network latency and to ensure a fair playing field between players who may be connecting from far apart. To that end, Roblox uses HashiCorp Consul, Nomad and Vault to manage a pool of more than 18,000 servers and 170,000 containers spread all over the world. The HashiCorp software is used to discover and schedule workloads, and to store and rotate cryptographic keys.
Rob Cameron, Technical Director of Infrastructure at Roblox, gave a presentation at HashiCorp’s 2020 user conference on how the company uses these technologies and why they are essential to the company’s business model (link takes you to both transcript and video). Cameron said, “If you are in the US and you want to play with someone in France, go ahead. We will find out and give you the best possible gaming experience by placing computing servers as close to the players as possible.”
The Roblox engineering team initially followed a series of false leads.
In tracking the cause of the outage, engineers first noticed a performance issue and assumed faulty hardware, which they replaced. When performance continued to suffer, they came up with a second theory about heavy traffic, and the entire Consul cluster was upgraded with twice the CPU cores (going from 64 cores to 128) and faster SSD storage. Other attempts were made, including restoring from a previous healthy snapshot, going back to 64-core servers, and making other configuration changes. These were also unsuccessful.
Lesson 1: While hardware issues aren’t uncommon at the scale Roblox runs, an initial intuition to blame a hardware problem can be wrong. As we will see, the outage was due to a combination of software errors.
Roblox and HashiCorp engineers eventually found two root causes.
The first was a bug in BoltDB, an open source database used within Consul to store certain log data, that didn’t clean up disk usage properly. The problem was compounded by an unusually high load on the new Consul streaming feature that Roblox had recently enabled.
Lesson 2: Everything old is new again. What is interesting about these causes is that they come down to the same kinds of low-level resource management issues that have haunted system designers since the early days of computing. BoltDB failed to free disk storage as old log data was deleted. Consul streaming suffered from write contention under very high load. Getting to the root cause of these issues required deep knowledge of how BoltDB tracks free pages in its file and how Consul streaming uses Go concurrency.
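To make the free-page idea concrete, here is a minimal sketch of a page-based store that keeps a freelist. It is a toy model written for this article, not BoltDB's actual code: the class name ToyPageStore and its fields are invented for illustration. It shows only the general pattern that deleting data marks pages as reusable without shrinking the file, so the bookkeeping for free pages can grow large even when little live data remains.

```python
class ToyPageStore:
    """Toy model of a page-based file with a freelist (hypothetical, not BoltDB)."""

    def __init__(self):
        self.high_water_mark = 0  # pages ever allocated, i.e. file size in pages
        self.freelist = []        # ids of pages that are free for reuse
        self.live = {}            # key -> id of the page holding that key

    def put(self, key):
        """Store a key, reusing a free page before growing the file."""
        if key in self.live:
            return self.live[key]
        if self.freelist:
            page = self.freelist.pop()
        else:
            page = self.high_water_mark
            self.high_water_mark += 1
        self.live[key] = page
        return page

    def delete(self, key):
        """Remove a key; its page goes on the freelist, the file does not shrink."""
        page = self.live.pop(key)
        self.freelist.append(page)

    def stats(self):
        return (self.high_water_mark, len(self.live), len(self.freelist))


store = ToyPageStore()
for i in range(1000):
    store.put(f"log-{i}")
for i in range(900):
    store.delete(f"log-{i}")
print("file pages, live pages, free pages:", store.stats())
```

After the deletes, the toy file still spans 1,000 pages even though only 100 hold live data, and the freelist has 900 entries to track. Any store built this way has to manage that list carefully, which is the class of problem the engineers had to dig into.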
Scale means something completely different today.
When running thousands of servers and containers, manual management and monitoring processes are not really possible. Monitoring the health of such a complex and large-scale network requires decoding dense dashboards.
Lesson 3: Any large-scale service provider should develop automation and orchestration procedures that can quickly home in on failures or abnormal values before they take down the entire network. For Roblox, split-second differences in latency are important, which is why they use the HashiCorp software stack. But how the services are partitioned is also critical. Roblox ran all of its back-end services on a single Consul cluster, and that ended up being a single point of failure for its infrastructure. Roblox has since added a second location and started creating multiple availability zones to add redundancy to its Consul cluster.
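As an illustration of what "focusing on abnormal values" can mean in practice, the sketch below flags nodes whose latency is far from the fleet median, using the median absolute deviation (a standard robust outlier test). This is an assumption-laden example, not a description of Roblox's tooling: the function name flag_outliers, the node names, the sample latencies, and the 3.5 threshold are all chosen for illustration.

```python
from statistics import median


def flag_outliers(latencies_ms, threshold=3.5):
    """Return names of nodes whose latency is an outlier versus the fleet.

    Uses the modified z-score based on the median absolute deviation (MAD),
    which is less distorted by a few extreme values than mean and stddev.
    """
    values = list(latencies_ms.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Every typical node reports the same value; anything different stands out.
        return sorted(name for name, v in latencies_ms.items() if v != med)
    return sorted(
        name
        for name, v in latencies_ms.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


# Hypothetical sample: five nodes, one of them clearly unhealthy.
sample = {"node-a": 10, "node-b": 11, "node-c": 9, "node-d": 10, "node-e": 250}
print(flag_outliers(sample))
```

With the sample data, only node-e is flagged. A real system would feed a check like this from live telemetry and tie it to alerting or automated remediation.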
One of the reasons Roblox uses HashiStack is to control costs.
“We build and operate our foundation infrastructure locally because at the scale we know we will reach as our platform grows, we have been able to significantly control costs compared to using the public cloud and managing network latency,” Roblox wrote in their blog post. The HashiStack is an efficient way to manage a global network of services, and it allows Roblox to move quickly: they can create multi-node sites in two days. “With HashiStack, we have a repeatable design pattern to run workloads no matter where we go,” Cameron said during his 2020 presentation. However, too much relied on one Consul cluster: not just the entire Roblox infrastructure, but also the monitoring and telemetry needed to understand the state of that infrastructure.
Lesson 4: Network debugging skills are paramount. If you don’t know what’s going on across your network infrastructure, you’re toast. But debugging thousands of microservices isn’t just about checking router logs; it takes a deep dive into how the different pieces fit together. This was a particular challenge for Roblox because they built their entire infrastructure on their own dedicated server hardware, and because there was a circular dependency between Roblox’s monitoring systems and Consul: when Consul struggled, so did the telemetry needed to diagnose it. In the aftermath, Roblox removed this dependency and extended telemetry to provide better insight into the performance of Consul and BoltDB, and into traffic patterns between Roblox services and Consul.
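Circular dependencies of this kind can be checked for mechanically if service dependencies are written down. The sketch below is a generic depth-first search for a cycle in a dependency map. The service names and the map itself are hypothetical, chosen only to mirror the monitoring-and-Consul situation described above, and the function find_cycle is not part of any HashiCorp tool.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    deps maps each service to the services it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for nxt in deps.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # nxt is already on the current path, so we have closed a loop.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical map: diagnosing Consul needs monitoring, and monitoring needs Consul.
services = {
    "consul": ["monitoring"],
    "monitoring": ["consul"],
    "game-servers": ["consul"],
}
print(find_cycle(services))
```

For the hypothetical map, the function reports the loop consul, monitoring, consul. Running a check like this against a real dependency inventory is one way to catch such a loop before an outage exposes it.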
Be transparent about service interruptions with your customers.
This means more than just saying “we were down, now we’re back online”. Details are important to communicate. Yes, it took Roblox over two months to publish their story. But the document they produced, digging into the problems, showing their false starts, and describing how the engineering teams at Roblox and HashiCorp worked together to solve the problems, is pure gold. It inspires trust in Roblox, HashiCorp, and their engineering teams.
When I emailed HashiCorp’s PR team, they replied, “Because of the critical role our software plays in customer environments, we actively engage with our customers to provide recommended best practices and proactive guidance in designing their environments.” We hope your critical infrastructure provider is ready when the next outage occurs.
Roblox was obviously pushing the limits of what the HashiStack can provide, but the good news is that they eventually found and fixed the issues. A three-day outage isn’t a great result, but given the size and complexity of Roblox’s infrastructure, tracking down and fixing the causes was an impressive feat nonetheless. And there are lessons to be learned even in less complex environments, where some software library may still hide a low-level bug that will suddenly reveal itself in the future.