How to Talk About Software at Scale
I’ve recently noticed that, among all the interview preparation materials out there, there’s not much guidance on how to excel in open-ended, “no code” technical interviews. When companies interview candidates for senior-level roles, a common theme is software architecture at scale. Not many companies actually run software at the scale of, say, Netflix, Amazon, Shopify, GitHub, and so on. So how do you prepare to answer questions like these, if you’ve never worked on a similar-scale project?
This blog post is my attempt to break down how I’ve passed these types of interviews, despite not having much personal experience architecting, deploying, and supporting enormous production systems.
An aside: If you know that the company you are interviewing with has WELL under a million users, and a fairly steady growth rate that is unlikely to “hockey stick” in the next… 1-3 years, I would consider it a red flag if they asked me a lot of “scale” questions in the interview. That suggests to me that the hiring decision is premature, the optimization for scale is premature, and they’re probably going to burn engineering resources on the wrong priorities, at a point in time when they need to invest in market differentiation via feature set development. Not heavy lifting. Not yet, anyway.
I’m going to occasionally reference a real interview question from a company that I won’t name. An interviewer asked me once: “If you were to build Twitter today, how would you architect it, and what would you identify as the main challenges around building and scaling a platform for hundreds of millions of daily active users?”
My intended audience
I want to call out that this blog post is not directed at junior-level interviewees. I’m assuming that you, the reader, are here because you have a few years of professional full-stack or backend development experience, your past work hasn’t been on any large-scale products, and you have theoretical knowledge but lack practical experience. You’re now looking for a more challenging role at a large company, whose main product has over a million users. That’s the interview scenario where these types of questions are likely to be asked.
Ideally, you’re going for at least a Senior Engineer position, if not an Architect or Lead role. For this reason, there is advanced terminology in this post that I have not defined — I’m assuming that you have encountered concepts like database replication previously.
If you’re earlier in your journey, I would recommend a few first steps:
- First, shameless plug: watch my beginner-friendly talk about the challenges of distributed systems
- Second, read “Designing Data-Intensive Applications” by Martin Kleppmann, cover to cover.
- Check out this systems design primer. Hat tip to Sam Blausten for telling me about it!
1. Think about how you’re going to store data.
What database should we choose?
The first question I always ask myself is: what database are we going to primarily use to store user data? And I believe that, in the year 2020, unless you have a very, very, very, very compelling reason otherwise, the correct answer here is either MySQL or PostgreSQL. Certainly for a Twitter clone, or any application that’s essentially “CRUD at scale”. There are some subtle differences between the two that are probably worth researching a bit beforehand.
The reason is that both MySQL and PostgreSQL have literally billions of hours of battle-tested usage at tremendous production scale, by many companies who have trod this path before, and they are both designed to store and process lots of data, quickly. Almost all developers know how to interact with SQL databases. There are good reasons for arbitrarily-shaped document stores, data lakes, NoSQL, and other types of databases, but for primary user data storage, I would choose the path more taken.
What else do we have to think about, when running databases at scale?
When you start having a few tens of thousands of active users who are generating millions of rows of data daily, you’ll eventually need a data partitioning strategy. At some point, it’s not going to be feasible to store your entire database on one host. You’ll have to chunk up the data in some logical way, so that multiple hosts can stay in sync and make sure that read and write requests are forwarded to the correct places. This, by the way, is database sharding.
Sharding
I usually use encyclopedias as a real-world example of sharding. Encyclopedias aren’t published as one massive book: you can’t physically bind a book that large, which is very similar to how data throughput and storage have physical limits! On a dev team, you’re probably not going to literally be chunking up one table (at least, not for a long time: MySQL and PostgreSQL tables can store hundreds of millions of rows, if not billions). More realistically, you’d be dividing up your tables into logical groupings, and putting each group on a different host. How those groups are defined… is entirely up to the team. Some teams like to organize them by domain: if you have clearly defined hemispheres in your codebase, like Community vs. Enterprise mode, it might be possible to house their associated tables on different hosts. It’s also worth thinking about how you’ll be JOIN-ing across these groupings: SQL JOINs are expensive already, and joining frequently across schema domains can be… well, production-outage-generating, if not done properly.
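If I were sketching this on a whiteboard, here’s roughly what key-based shard routing can look like in application code. This is a minimal sketch under made-up assumptions (the shard hosts, the tweets table, and the hashing scheme are all invented for illustration), and many teams lean on a proxy or a sharding-aware library rather than hand-rolling this:

```typescript
import { Pool } from "pg";
import { createHash } from "crypto";

// Hypothetical: one connection pool per shard host.
const shardHosts = [
  "shard-0.db.internal",
  "shard-1.db.internal",
  "shard-2.db.internal",
];
const pools = shardHosts.map((host) => new Pool({ host, database: "app" }));

// Hash the shard key (here, a user id) to pick a shard deterministically.
function poolForUser(userId: string): Pool {
  const hash = createHash("sha1").update(userId).digest();
  return pools[hash.readUInt32BE(0) % pools.length];
}

// All reads and writes for a given user land on the same shard.
export async function tweetsByUser(userId: string) {
  const { rows } = await poolForUser(userId).query(
    "SELECT id, body, created_at FROM tweets WHERE author_id = $1 ORDER BY created_at DESC LIMIT 50",
    [userId]
  );
  return rows;
}
```

One thing worth saying out loud in an interview: simple modulo hashing like this reshuffles almost every key the moment you add a shard, which is why consistent hashing or a directory-based lookup tends to come up as soon as resharding is on the table.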
Replication
You also will want to think about data replication strategies. This will require a bit of knowledge about your anticipated user base. If you’re building Twitter, it’s safe to assume that your users are globally distributed, with perhaps high concentrations in North America and Europe, so you might consider keeping replicas in datacenters in those geographies. There is a cost to replicating data and keeping it in sync, so approach this as an investment decision, like most of the other decisions in this post.
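To make the “investment” framing a bit more concrete, here’s a rough sketch of read/write splitting against one writable primary and a couple of regional read replicas. The host names are hypothetical, and in practice this routing often lives in a proxy or in your ORM’s replica support rather than hand-written code:

```typescript
import { Pool } from "pg";

// Hypothetical hosts: a single writable primary plus regional read replicas.
const primary = new Pool({ host: "primary.us-east.db.internal", database: "app" });
const replicas: Record<string, Pool> = {
  "us-east": new Pool({ host: "replica.us-east.db.internal", database: "app" }),
  "eu-west": new Pool({ host: "replica.eu-west.db.internal", database: "app" }),
};

// Writes always go to the primary.
export function writer(): Pool {
  return primary;
}

// Reads go to the replica closest to the user, falling back to the primary.
// Replication lag means a read here may be slightly stale; that's the trade-off.
export function reader(region: string): Pool {
  return replicas[region] ?? primary;
}
```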
Consistency & Availability Trade-offs
I don’t think it’s required to be able to explain Raft, Paxos, or other consensus algorithms in an interview, but having some familiarity with common distributed read/write strategies can help you craft a stronger answer about data consistency requirements. For example, if you’re interviewing at a bank, data integrity is extremely important, and you’d want to use transactions to block competing write operations. Otherwise, someone could withdraw $100 three times, from a balance of $200!
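To ground the bank example, here’s a minimal sketch of a withdrawal wrapped in a transaction, with PostgreSQL-style row locking. The accounts table and its columns are invented; the point is that SELECT ... FOR UPDATE blocks competing withdrawals on the same row until the transaction commits or rolls back:

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the environment

export async function withdraw(accountId: string, amount: number): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Lock this account's row so concurrent withdrawals wait their turn.
    const { rows } = await client.query(
      "SELECT balance FROM accounts WHERE id = $1 FOR UPDATE",
      [accountId]
    );
    if (rows.length === 0 || Number(rows[0].balance) < amount) {
      throw new Error("insufficient funds");
    }
    await client.query(
      "UPDATE accounts SET balance = balance - $1 WHERE id = $2",
      [amount, accountId]
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```

Without the row lock (or an equivalent check enforced by the database), two concurrent requests could both read a $200 balance and each approve a $150 withdrawal.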
For things like social media platforms, it’s less important if a user doesn’t see Chrissy Teigen’s latest tweet within seconds of posting, but the website being generally online is highly important, so we’d prioritize availability over consistency, and let “eventual consistency” do its thing for much of the user data.
The key here is to understand the domain, and ask clarifying questions about what the business requirements actually are, before locking yourself into technology decisions that will be expensive to migrate away from later.
Rent a hosted database, or host it yourself?
This hasn’t come up for me personally, and I doubt you will be explicitly asked unless the role involves Solutions Architecting, or the company has a business reason to be vigilant about where its data is stored (it’s, say, a bank, or a law firm, or in the business of storing medical data). For some massive-scale companies, like Facebook and Amazon, it makes sense to run your own data centers. These companies, importantly, also have highly specialized staff who administer said data centers. Because that staffing reality isn’t feasible for most companies, for the overwhelming majority of the rest of us it usually makes business sense to pay someone else to be your database administrators. That “someone else” is probably AWS. Or Google Cloud, or Azure, if they have the right offerings.
2. At a high level, understand failure scenarios in distributed systems.
If you’re asked to draw boxes and lines on a whiteboard, odds are, you’ll be representing services and the communication patterns between them. For critical service-to-service communication paths, people commonly reach for message queues, like RabbitMQ. For cross-domain message-sending, people commonly reach for pub-sub firehoses, like Kafka. I think the specific technologies you choose matter less than the reasons you want to use them, at least in an interview. All these tools are solutions, and in order to talk intelligently about solutions, you need to understand what problems they solve.
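The mechanics matter less than the guarantees, but it can help to have a small example in your back pocket. Here’s a sketch of the queue pattern using RabbitMQ via the Node amqplib client; the queue name, payload shape, and sendEmail helper are all made up. What you’re buying is durability and redelivery: the message survives a broker restart, and it isn’t acknowledged (and therefore removed) until the work actually succeeds:

```typescript
import amqp from "amqplib";

const QUEUE = "email-notifications"; // hypothetical queue name

export async function publishWelcomeEmail(userId: string): Promise<void> {
  // In a real service you'd keep one long-lived connection, not one per call.
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  // durable + persistent: the queue and message survive a broker restart.
  await channel.assertQueue(QUEUE, { durable: true });
  channel.sendToQueue(QUEUE, Buffer.from(JSON.stringify({ userId })), {
    persistent: true,
  });
  await conn.close();
}

export async function startWorker(): Promise<void> {
  const conn = await amqp.connect("amqp://localhost");
  const channel = await conn.createChannel();
  await channel.assertQueue(QUEUE, { durable: true });
  await channel.consume(QUEUE, async (msg) => {
    if (msg === null) return;
    const { userId } = JSON.parse(msg.content.toString());
    await sendEmail(userId); // hypothetical downstream integration
    // Acknowledge only after the work succeeds; un-acked messages get redelivered.
    channel.ack(msg);
  });
}

async function sendEmail(userId: string): Promise<void> {
  console.log(`sending welcome email to user ${userId}`);
}
```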
Also, why distributed systems? Every system is a distributed system, and has been for a long time.
A cynical aside: On the job, the decision you’ll make is not “What is the best tool for this problem?”, but rather, “What is the best tool, that’s been approved by security audit, signed off by Finance, available from our vendors, that my team kind of knows how to use or can learn quickly, that solves the problem?” — and that specific answer is unknowable, until you’re on the inside.
The single most valuable thing I did while working at Pivotal to learn about all the ways that large, sprawling, distributed systems fail in practice (even if the UML diagrams are pristine) was to spend some time riding shotgun with Support engineers. If this is something you can do right now, I strongly recommend it: ask to do a one-week rotation into Customer Support, and witness all the ways that customers are struggling to use the systems you’ve built. It’s not only a great empathy-building exercise that will make you more effective in any senior-level role; it’s an incredibly safe way to gain experience seeing how systems fail. Some companies have dedicated SRE teams who carry pagers; if you can shadow someone for a week, wake up when they wake up, watch incidents unfold in real time, and participate in the post-incident reviews — that is also a safe way to learn.
How can things go wrong?
My SRE friends will tell you that no two incidents are the same. Especially if they do their jobs correctly: over time, alert-producing incidents should become increasingly strange, and difficult to remediate with automation. (That means that they’re building automated recovery tools effectively.)
But still, I think there are a few genres of failure scenarios that you can broadly talk about to impress the socks off of interviewers:
- Out-of-memory errors, stop-the-world garbage collection, and other application-level failures. Heck, even “undefined is not a function” counts, if you’re writing JavaScript APIs.
- Configuration change errors
- External attacks (especially DDoS-style attacks by malicious or ignorant third parties)
- Unanticipated surges in legitimate usage that look like DDoS attacks
- Cluster-based consensus problems
- An infamous genre of problems I saw a few years ago: RabbitMQ nodes would get knocked offline by a network partition, then attempt to rejoin their old cluster, but the order in which the nodes joined was important for establishing quorum. Of course, they attempted to rejoin in the wrong order, which led to an infinite loop of attempted recovery.
There are a lot more! If you want a pleasant way to read about real-world failures, I recommend checking out Incident Labs’ “Post-Incident Review” zine: https://zine.incidentlabs.io
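To make one of those genres a little more concrete: a surprising number of “the recovery made it worse” incidents, like that RabbitMQ rejoin loop, or a retry storm after a traffic surge, are softened by capping and jittering retries instead of hammering away in lockstep. A minimal, generic sketch:

```typescript
// Retry a flaky operation with capped exponential backoff plus jitter.
// Jitter keeps a fleet of recovering nodes from all retrying at the same
// instant, which is one way "recovery" turns into a self-inflicted DDoS.
export async function withRetries<T>(
  operation: () => Promise<T>,
  maxAttempts = 5
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const cappedMs = Math.min(30_000, 250 * 2 ** attempt); // never wait > 30s
      const delayMs = Math.random() * cappedMs; // "full jitter"
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```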
3. Show that you can reason about trade-offs between performance and maintainability.
Fancy whiteboard diagrams aside, I think one thing that will set you apart from the herd is to communicate that you know that the long-term success of software systems relies entirely on the humans who build and maintain them. I think there are two levels to this discussion, broadly framed as “performance (speed) vs. maintainability”: the “ground-level” discussion, which deals with your day-to-day lines of code, and the “500-feet-up” level, which is more forward-looking, and probably closer to the altitude that senior-level engineers are expected to operate at. The underlying assumption here is that large organizations are interested in building systems that are fast and reliable, and they want to hire people who understand how to make this happen.
Every component you add to a system adds cognitive complexity. If you are using a tool in a slightly off-the-beaten-track manner, add another point to your cognitive complexity debt. If you’re replicating or partitioning, or using a less mainstream load balancing strategy, add more. And so on. The point is that there are many design choices that can improve the overall stability, speed, or both, of the system, but will incur a real cost to the humans who have to learn them, build within or around them, support them, and perhaps one day deprecate them. Don’t forget that teams will change, people will leave and join the organization, and your decisions today heavily influence the steepness of that onboarding curve. That is the “maintainability” piece of this conversation.
Yet another aside: If my interviewers aren’t raising one eyebrow as I throw another RabbitMQ into the middle of some network calls, and if they’re not holding me accountable to justify every design choice, I would wonder about how critically the organization is thinking about maintainability. I would wonder how the on-call developers are doing, and if they need a hug.
The “close to the ground” level
I make probably 5-10 of these decisions each day. These are small decisions about performance that are usually represented in a few lines of code and are relatively easy to change in the future: I just have to open a new PR, or maybe, if there’s likely to be cascading impact, write an RFC and get some feedback, then make the change. The time between learning about the problem, coming up with a solution, and implementing the solution is on the order of a few days or weeks, possibly hours or minutes.
For example, if I figure out a good way to optimize a SQL lookup, perhaps by adding a few extra indexes, adding a subquery, or maybe doing some clever pagination, I can shave a couple of milliseconds off the response time. But what if my code contains a bug, and I’m not on call when it manifests in production? Knowing not just how to make the code faster, but also questioning the maintainability of that clever code and finding some cognitive debt mitigation strategies (such as: extra unit tests with well-named variables and clear descriptions, extra documentation in the tracking issue, descriptive commit messages, and so on) is the baseline for a senior-level role. I would also expect strong mid-level developers to make these decisions reasonably well.
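As a concrete example of the kind of small, reversible change I mean, here’s keyset pagination in place of OFFSET-based paging. The tweets table, columns, and index are hypothetical; the idea is that OFFSET forces the database to scan and discard every earlier row, while “give me rows older than the last one the client saw” can ride an index:

```typescript
import { Pool } from "pg";

const pool = new Pool();

// Assumes a supporting index exists, something like:
//   CREATE INDEX tweets_author_created_at_idx ON tweets (author_id, created_at DESC);
export async function nextPage(authorId: string, beforeCreatedAt: Date) {
  const { rows } = await pool.query(
    `SELECT id, body, created_at
       FROM tweets
      WHERE author_id = $1 AND created_at < $2
      ORDER BY created_at DESC
      LIMIT 50`,
    [authorId, beforeCreatedAt]
  );
  return rows;
}
```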
The “500 feet up” level
The same principle holds for bigger decisions, or investments, in system-level changes. Unlike a SQL query, though, which you can simply delete, the choice to add a new component, or migrate to a new framework, language, or hosting platform, is much harder to change later. I mentioned already that you can’t just flippantly throw RabbitMQs into the middle of architecture diagrams without justification. As part of that justification, it’s worth talking about the scale at which you’d reach for what is essentially a system-level layer of abstraction.
To keep running with this RabbitMQ example, you can also go further and talk about why, from a business perspective, it makes sense to introduce stronger guarantees around message-sending. Do you need the ability to audit requests sent through a web form? To retry failed sends because of known stability issues somewhere downstream? I think that demonstrating a deep understanding of the business needs (and, if you don’t have one, asking questions until you do) shows that you can be trusted to consistently make measured, well-informed choices when the cost of change is high later on.
4. Understand the basics of system visibility and troubleshooting
This has already been thousands of words more than I intended to write, so I’ll end briefly with how to talk about monitoring systems. System health is quite a holistic concept, especially when you expand the definition of the “system” to include the humans, so it’s hard to come up with a standard set of guidelines for what you should monitor in a hypothetical system that you’ve been asked to architect on the spot. But there are a few things that people typically say.
First, your core business value lives in your applications, so start by monitoring the things that would take down your applications: excessive memory usage, excessive CPU usage, non-200 response rates, slow database response times, and application-level error rates. The “Monitoring” chapter of “The SRE Handbook” does a good job of explaining what to track and why.
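As a starting point, here’s a minimal sketch of exposing those application-level metrics from a Node/Express service using the prom-client library. Treat the details as assumptions: the histogram buckets are a guess, and labeling by raw request path is deliberately naive (in a real system it can explode label cardinality):

```typescript
import express from "express";
import client from "prom-client";

// Process-level metrics: CPU, memory, event loop lag, and so on.
client.collectDefaultMetrics();

const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

const app = express();

// Time every request and record its status code.
app.use((req, res, next) => {
  const endTimer = httpRequestDuration.startTimer();
  res.on("finish", () => {
    endTimer({
      method: req.method,
      route: req.path, // naive: use route templates in real life
      status_code: String(res.statusCode),
    });
  });
  next();
});

// Scraped by a monitoring system (e.g. Prometheus) on an interval.
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.send(await client.register.metrics());
});

app.listen(3000);
```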
Second, I would deliver a short speech about how it’s no good to monitor anything if developers don’t immediately know what to do when a line on a dashboard crosses into the scary red part of the graph. All alerts, in or out of work hours, need to be actionable, and that action needs to be immediately obvious, ideally as part of the alert notification’s message body. This will usually require a bit of training, which you don’t necessarily need to volunteer to conduct yourself during the interview. Entire consultancies exist to do this sort of training.
A parting hot take on SLOs
If you’re further asked about things like SLOs and SLAs: a year or two ago, I would have played ball and said something like, “We should strive for as many nines as we can sustainably meet.” But now my feelings have changed: your uptime requirements should be determined in close collaboration with product managers, designers, and support teams, who likely have a deeper understanding of customers’ needs. Even within one company, SLOs can vary between teams: not all teams work in the critical path. It’s okay to take a step back, and I think it might even be impressive to dodge a question about SLOs a little bit, by acknowledging that uptime requirements are determined by customer satisfaction — not the other way around.
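One small thing that does help in these conversations is having the error-budget arithmetic at your fingertips, because it turns “how many nines” into a concrete number of minutes:

```typescript
// Turn an availability SLO into a monthly budget of allowed downtime.
function allowedDowntimeMinutesPerMonth(sloPercent: number): number {
  const minutesPerMonth = 30 * 24 * 60; // 43,200 minutes in a 30-day month
  return minutesPerMonth * (1 - sloPercent / 100);
}

console.log(allowedDowntimeMinutesPerMonth(99.9));  // ~43.2 minutes
console.log(allowedDowntimeMinutesPerMonth(99.99)); // ~4.3 minutes
```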
Anyway, I hope this was useful. I mostly wrote this for “me, but in 2018”. I’ve tried to include all the words that I wish I knew back then. Hopefully this helps set you off in the right direction to start Googling and digging into some concepts more deeply. I hope you’ll find, like I did, that systems design, and learning about distributed systems, can be a really fun, challenging pursuit.