code
I think "move fast and break things" is fine for many features. But the first step is identifying what you cannot break. These are things like billing code (as much as I'd like to, I probably shouldn't accidentally charge you a million dollars and then email you later with an "oops, sorry!"), upgrades (hardware and software upgrades are always dicey to perform), and data migrations (it's usually much harder to roll back data changes).
For the last two years we've been upgrading GitHub's permissions code to be faster, safer, cleaner, and generally better. It's a scary process, though. This is an absolute, 100% can't-ever-break use case. The private repository you pay us for can't suddenly be flipped open to the entire internet because of a bug in our deployed code. A 0.02% failure rate isn't an option; a 0% failure rate is mandatory.
But we like to move fast. We love to deploy new code incrementally hundreds of times a day. And there's good reason for that: it's safer overall. Incremental deploys are easier to understand and fix than one gigantic deploy once a year. But it lends itself to those small bugs, which, in this permissions case, is unacceptable.
So tests are good to have. This is unsurprising to say in this day and age; everyone generally understands now that testing and continuous integration are absolutely critical to software development. But that's not what's at stake here. You can have the best, most comprehensive test suite in the world, but tests are still different from production.
There are a lot of reasons for this. One is data: you may have flipped some bit (accidentally or intentionally) for some tables for two weeks back in December of 2010, and you've all but forgotten about that today. Or your cognitive model of the system may be idealized. We noticed that while doing our permissions overhaul. We'd have a nice, neat table of all the permissions of users, organizations, teams, public and private repositories, and forks, but that neat table would fall down on arcane edge cases once we looked at production data.
And that's the rub: you need your tests to pass, of course, but you also need to verify that you don't change production behavior. Think of it as another test suite: for better or worse, the behavior deployed now is the state of the system from your users' perspective. You can then either fix the behavior or update your tests; just make sure you don't break the user experience.
Parallel code paths
One of the approaches we've taken is running parallel code paths.
What happens is this: a request comes in as usual and runs the existing (old) code. At the same time (or just right after it), we also run the new code that we think will be better/faster/harder/stronger (pick one). Once all that's done, we return whatever the existing (old) code returned. So, from the user's perspective, nothing has changed. They don't see the effects of the new code at all.
There are some caveats, of course. In this case we're typically performing read-only operations. If you're doing writes, it takes a bit more smarts: either write your code so it can safely run both branches, roll back the effects of the new code afterwards, or make the new code a no-op (or have it write somewhere else entirely). Twitter, for example, has a very service-oriented architecture, so when they spin up a new service they redirect traffic and dual-write to it; that lets them measure performance and accuracy and catch bugs, and they throw the redundant data away until they're ready to switch all traffic over for real.
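To make the write case concrete, here's a sketch of that shadow-write shape. The PermissionsWriter class, legacy_store, new_store, and logger are all invented for the example (this isn't our actual code or Twitter's); the point is just that the old store stays the source of truth while the new one silently receives the same writes:

# Hypothetical shadow-write wrapper: the legacy store stays the source of truth,
# and the new store gets the same write so the two can be compared later.
class PermissionsWriter
  def initialize(legacy_store, new_store, logger)
    @legacy_store = legacy_store
    @new_store    = new_store
    @logger       = logger
  end

  def save(user_id, permissions)
    result = @legacy_store.write(user_id, permissions)  # existing (old) path

    begin
      @new_store.write(user_id, permissions)            # new path; its result is ignored for now
    rescue => e
      @logger.warn("shadow write failed: #{e.message}") # never let the new path break the request
    end

    result                                              # callers only ever see the old behavior
  end
end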
We wrote a Ruby library named Science to help us out with this. You can check it out and run it yourself in the github/dat-science repository. The general idea would be to run it like this:
science "my-cool-new-change" do |e|
  e.control   { user.existing_slow_method }
  e.candidate { user.new_code_we_think_is_great }
end
It's just like when you Did Science™ in the lab back in school: you have a control, which is your existing code, and a candidate, which is the new code you want to introduce. The science block makes sure both are run appropriately. The real power happens with what you can do after the code runs, though.
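To make that concrete, here's a stripped-down toy version of what a helper like science does under the hood. This is not dat-science's actual implementation (check the repository for the real API); it just shows the shape of the idea: run both blocks, time them, compare the results, report the comparison somewhere, and always hand back whatever the control returned.

require "benchmark"

# Toy experiment runner, purely illustrative; NOT the dat-science internals.
class ToyExperiment
  def initialize(name)
    @name = name
  end

  def control(&block)
    @control = block
  end

  def candidate(&block)
    @candidate = block
  end

  def run
    control_result   = nil
    candidate_result = nil

    control_time = Benchmark.realtime { control_result = @control.call }
    candidate_time = Benchmark.realtime do
      begin
        candidate_result = @candidate.call
      rescue => e
        candidate_result = e           # record candidate errors, never raise them
      end
    end

    publish(
      match:          control_result == candidate_result,
      control_time:   control_time,
      candidate_time: candidate_time
    )

    control_result                     # callers always get the control's answer
  end

  def publish(payload)
    # In real life this would feed a metrics system (Graphite, statsd, etc.).
    puts "#{@name}: #{payload.inspect}"
  end
end

# A science-style helper in the spirit of the example above.
def science(name)
  experiment = ToyExperiment.new(name)
  yield experiment
  experiment.run
end

Run against the earlier example, both blocks execute, the match flag and the timings get published, and the caller still sees whatever the old method returned.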
We use Graphite literally all over the company. If you haven't seen Coda Hale's Metrics, Metrics Everywhere talk, do yourself a favor and give it a watch. Graphing behavior of your application gives you a ton of insight into your entire system.
Science (and its sister library, github/dat-analysis) can generate a graph of the number of times the code was run (the top blue bar to the left) and compare it to the number of mismatches between the control and the candidate (in red, on the bottom). In this case you see a downward trend: the developer saw that their initial deploy might have missed a couple use cases, and over subsequent deploys and fixes the mismatches decreased to near-zero, meaning that the new code is matching production's behavior in almost all cases.
What's more, we can analyze performance, too. We can look at the average duration of the two code blocks and confirm whether the new code we're running is faster, but we can also break down requests by percentile. In the slide to the right, we're looking at the 75th and 99th percentiles, i.e. the slowest requests. In this particular case, our candidate code is actually quite a bit slower than the control: perhaps that's acceptable for this particular case, or maybe it should set off huge red sirens warning that the code's not ready to deploy to everyone yet... it depends on the code.
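As an aside, if you were collecting the raw durations yourself instead of letting Graphite compute the percentiles, the math is simple enough to sketch. The numbers here are made up, just to show the p75/p99 breakdown:

# Simple percentile: pick the value at the (rounded) fractional index of the sorted list.
def percentile(durations, pct)
  sorted = durations.sort
  index  = ((pct / 100.0) * (sorted.length - 1)).round
  sorted[index]
end

durations = [12, 14, 15, 18, 22, 25, 31, 40, 95, 240]  # made-up timings, in milliseconds
puts "p75: #{percentile(durations, 75)}ms"              # => p75: 40ms
puts "p99: #{percentile(durations, 99)}ms"              # => p99: 240ms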
All of this gives you evidence to prove the safety of your code before you deploy it to your entire userbase. Sometimes we'll run these experiments for weeks or months as we whittle down all the edge cases, some of them quite tricky. All the while, we can deploy quickly and iteratively at the pace we've grown accustomed to, even on dicey code. It's a really nice balance of speed and safety.
build into your existing process
Something else I've been thinking a lot about lately is how your approach to building product is structured.
Typically process is added to a company vertically. For example: say your team's been having some problems with code quality. Too many bugs have been slipping into production. What a bummer. One way to address that is to add more process to your process. Maybe you want your lead developers to review every line of code before it gets merged. Maybe you want to add a layer of human testing before deploying to production. Maybe you want a code style audit to give you some assurance of new code maintainability.
These are all fine approaches, in some sense. There's nothing wrong with wanting clean code; far from it. But I think this vertical layering of process is what gets aggravating, or just straight-up confusing, if you have to deal with it day in and day out.
I think there's something to be said for scaling the breadth of your process. It's an important distinction. By limiting the number of layers of process, it becomes simpler to explain and conceptually understand (particularly for new employees). "Just check continuous integration" is easier to remember than "push your code, ping the lead developers, ping the human testing team, and kick off a code standards audit".
We've been doing more of this lateral process scaling at GitHub informally, but I think there's more to it than even we initially noticed. Since continuous integration is so critical for us, people have been adding more tests that aren't necessarily tests in the classic sense of the word. Instead of "will this code break the application", our tests are more and more measuring "will this code be maintainable and more resilient to errors in the future".
For example, here are a few tests we've added that don't necessarily have user-facing impact but are considered breaking the build if they go red:
- Removing a CSS declaration without removing the associated class attribute in the HTML
- ...and vice versa: removing a class attribute without cleaning up the CSS
- Adding an <img> tag that's not on our CDN, for performance, security, and scaling reasons (sketched below)
- Invalid SCSS or CoffeeScript (we use SCSS-Lint and CoffeeLint)
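To give a flavor of what these checks look like, here's a sketch of the CDN check written as an ordinary Minitest test. The template path, CDN hostname, and regex are simplified placeholders, but the shape is the point: it's just a plain test that goes red in CI.

require "minitest/autorun"

# Build-breaking check: every <img> tag in our templates should point at the CDN.
# Hypothetical paths and hostname, purely illustrative.
class ImageSourceTest < Minitest::Test
  CDN_HOST = "cdn.example.com"

  def test_all_images_are_served_from_the_cdn
    offenders = Dir.glob("app/views/**/*.erb").flat_map do |template|
      File.read(template)
          .scan(/<img[^>]+src=["']([^"']+)["']/i)
          .flatten
          .reject { |src| src.include?(CDN_HOST) || src.start_with?("/") }
          .map { |src| "#{template}: #{src}" }
    end

    assert_empty offenders, "These <img> tags aren't on the CDN:\n#{offenders.join("\n")}"
  end
end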
None of these are world-ending problems: a stray, unused class attribute doesn't really hurt you or your users. But from a code quality and maintainability perspective, yeah, it's a big deal in the long term. Instead of having everyone focus on spotting these during code review, why not just shove it in CI and let computers handle the hard stuff? It frees our coworkers up from gruntwork and lets them focus on what really matters.
Incidentally, some of these are super helpful during refactoring. Yesterday I shipped some new dashboards on github.com, so today I removed thousands of lines of the old dashboard code. I could remove the code in bulk, see which tests failed, and then go in and pretty carelessly remove the now-unused CSS. That made it much, much quicker because I didn't have to worry about the gruntwork.
And that's what you want. You want your coworkers to think less about bullshit that doesn't matter and spend more consideration on things that do. Think about consolidating your process. Instead of layers, ask if you can merge them into one meeting. Or one code review. Or automate the need away entirely. The layers of process are what get you.