How Facebook Ships CodePosted: January 17, 2011 | |
I’m fascinated by the way Facebook operates. It’s a very unique environment, not easily replicated (nor would their system work for all companies, even if they tried). These are notes gathered from talking with many friends at Facebook about how the company develops and releases software.
Seems like others are also interested in Facebook… The company’s developer-driven culture is coming under greater public scrutiny and other companies are grappling with if/how to implement developer-driven culture. The company is pretty secretive about its internal processes, though. Facebook’s Engineering team releases public Notes on new features and some internal systems, but these are mostly “what” kinds of articles, not “how”… So it’s not easy for outsiders to see how Facebook is able to innovate and optimize their service so much more effectively than other companies. In my own attempt as an outsider to understand more about how Facebook operates, I assembled these observations over a period of months. Out of respect for the privacy of my sources, I’ve removed all names and mention of specific features/products. And I’ve also waited for over six months to publish these notes, so they’re surely a bit out-of-date. I hope that releasing these notes will help shed some light on how Facebook has managed to push decision-making “down” in its organization without descending into chaos… It’s hard to argue with Facebook’s results or the coherence of Facebook’s product offerings. I think and hope that many consumer internet companies can learn from Facebook’s example.
- as of June 2010, the company has nearly 2000 employees, up from roughly 1100 employees 10 months ago. Nearly doubling staff in under a year!
- the two largest teams are Engineering and Ops, with roughly 400-500 team members each. Between the two they make up about 50% of the company.
- product manager to engineer ratio is roughly 1-to-7 or 1-to-10
- all engineers go through 4 to 6 week “Boot Camp” training where they learn the Facebook system by fixing bugs and listening to lectures given by more senior/tenured engineers. estimate 10% of each boot camp’s trainee class don’t make it and are counseled out of the organization.
- after boot camp, all engineers get access to live DB (comes with standard lecture about “with great power comes great responsibility” and a clear list of “fire-able offenses”, e.g., sharing private user data)
- [EDIT thx fryfrog] “There are also very good safe guards in place to prevent anyone at the company from doing the horrible sorts of things you can imagine people have the power to do being on the inside. If you have to “become” someone who is asking for support, this is logged along with a reason and closely reviewed. Straying here is not tolerated, period.”
- any engineer can modify any part of FB’s code base and check-in at-will
- very engineering driven culture. “product managers are essentially useless here.” is a quote from an engineer. engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime. [EDITORIAL] The author of this blog post is a product manager, so this sentiment really caught my attention. As you’ll see in the rest of these notes, though, it’s apparent that Facebook’s culture has really embraced product management practices so it’s not as though the role of product management is somehow ignored or omitted. Rather, the culture of the company seems to be set so that *everyone* feels responsibility for the product.
- during monthly cross-team meetings, the engineers are the ones who present progress reports. product marketing and product management attend these meetings, but if they are particularly outspoken, there is actually feedback to the leadership that “product spoke too much at the last meeting.” they really want engineers to publicly own products and be the main point of contact for the things they built.
- resourcing for projects is purely voluntary.Engineers decide which ones sound interesting to work on. a PM lobbies group of engineers, tries to get them excited about their ideas. Engineer talks to their manager, says “I’d like to work on these 5 things this week.” Engineering Manager mostly leaves engineers’ preferences alone, may sometimes ask that certain tasks get done first.
- arguments about whether or not a feature idea is worth doing or not generally get resolved by just spending a week implementing it and then testing it on a sample of users, e.g., 1% of Nevada users.
- engineers generally want to work on infrastructure, scalability and “hard problems” — that’s where all the prestige is. can be hard to get engineers excited about working on front-end projects and user interfaces. this is the opposite of what you find in some consumer businesses where everyone wants to work on stuff that customers touch so you can point to a particular user experience and say “I built that.” At facebook, the back-end stuff like news feed algorithms, ad-targeting algorithms, memcache optimizations, etc. are the juicy projects that engineers want.
commits that affect certain high-priority features (e.g., news feed) get code reviewed before merge.News Feed is important enough that Zuckerberg reviews any changes to it, but that’s an exceptional case.
- [CORRECTION -- thx epriest] “There is mandatory code review for all changes (i.e., by one or more engineers). I think the article is just saying that Zuck doesn’t look at every change personally.”
- [CORRECTION thx fryfrog] “All changes are reviewed by at least one person, and the system is easy for anyone else to look at and review your code even if you don’t invite them to. It would take intentionally malicious behavior to get un-reviewed code in.”
no QA at all, zero. engineers responsible for testing, bug fixes, and post-launch maintenance of their own work. there are some unit-testing and integration-testing frameworks available, but only sporadically used.
- [CORRECTION thx fryfrog] “I would also add that we do have QA, just not an official QA group. Every employee at an office or connected via VPN is using a version of the site that includes all the changes that are next in line to go out. This version is updated frequently and is usually 1-12 hours ahead of what the world sees. All employees are strongly encouraged to report any bugs they see and these are very quickly actioned upon.”
- re: surprise at lack of QA or automated unit tests — “most engineers are capable of writing bug-free code. it’s just that they don’t have an incentive to do so at most companies. when there’s a QA department, it’s easy to just throw it over to them to find the errors.” [EDIT: please note that this was subjective opinion, I chose to include it in this post because of the stark contrast that this draws with standard development practice at other companies]
- [CORRECTION thx epriest] “We have automated testing, including “push-blocking” tests which must pass before the release goes out. We absolutely do not believe “most engineers are capable of writing bug-free code”, much less that this is a reasonable notion to base a business upon.”
- re: surprise at lack of PM influence/control — product managers have a lot of independence and freedom. The key to being influential is to have really good relationships with engineering managers. Need to be technical enough not to suggest stupid ideas. Aside from that, there’s no need to ask for any permission or pass any reviews when establishing roadmaps/backlogs. There are relatively few PMs, but they all feel like they have responsibility for a really important and personally-interesting area of the company.
- by default all code commits get packaged into weekly releases (tuesdays)
- with extra effort, changes can go out same day
- tuesday code releases require all engineers who committed code in that week’s release candidate to be on-site
- engineers must be present in a specific IRC channel for “roll call” before the release begins or else suffer a public “shaming”
- ops team runs code releases by gradually rolling code outthere are 9
concentriclevels for rolling out new code
- facebook has around 60,000 servers
- [CORRECTION thx epriest] “The nine push phases are not concentric. There are three concentric phases (p1 = internal release, p2 = small external release, p3 = full external release). The other six phases are auxiliary tiers like our internal tools, video upload hosts, etc.”
- the smallest level is only 6 servers
- e.g., new tuesday release is rolled out to 6 servers (level 1), ops team then observes those 6 servers and make sure that they are behaving correctly before rolling forward to the next level.
- if a release is causing any issues (e.g., throwing errors, etc.) then push is halted. the engineer who committed the offending changeset is paged to fix the problem. and then the release starts over again at level 1.
- so a release may go thru levels repeatedly: 1-2-3-fix. back to 1. 1-2-3-4-5-fix. back to 1. 1-2-3-4-5-6-7-8-9.
- ops team is really well-trained, well-respected, and very business-aware. their server metrics go beyond the usual error logs, load & memory utilization stats — also include user behavior. E.g., if a new release changes the percentage of users who engage with Facebook features, the ops team will see that in their metrics and may stop a release for that reason so they can investigate.
- during the release process, ops team uses an IRC-based paging system that can ping individual engineers via Facebook, email, IRC, IM, and SMS if needed to get their attention. not responding to ops team results in public shaming.
- once code has rolled out to level 9 and is stable, then done with weekly push.
- if a feature doesn’t get coded in time for a particular weekly push, it’s not that big a deal (unless there are hard external dependencies) — features will just generally get shipped whenever they’re completed.
- getting svn-blamed, publicly shamed, or slipping projects too often will result in an engineer getting fired. “it’s a very high performance culture”. people that aren’t productive or aren’t super talented really stick out. Managers will literally take poor performers aside within 6 months of hiring and say “this just isn’t working out, you’re not a good culture fit”. this actually applies at every level of the company, even C-level and VP-level hires have been quickly dismissed if they aren’t super productive.
- [CORRECTION, thx epriest] “People do not get called out for introducing bugs. They only get called out if they ask for changes to go out with the release but aren’t around to support them in case something goes wrong (and haven’t found someone to cover for you).”
- [CORRECTION, thx epriest] “Getting blamed will NOT get you fired. We are extremely forgiving in this respect, and most of the senior engineers have pushed at least one horrible thing, myself included. As far as I know, no one has ever been fired for making mistakes of this nature.”
- [CORRECTION, thx fryfrog] “I also don’t know of anyone who has been fired for making mistakes like are mentioned in the article. I know of people who have inadvertently taken down the site. They work hard to fix what ever caused the problem and everyone learns from it. The public shaming is far more effective than fear of being fired, in my opinion.”
It’ll be super interesting to see how Facebook’s development culture evolves over time — and especially to see if the culture can continue scaling as the company grows into the thousands-of-employees.
What do you think? Would “developer-driven culture” work at your company?