lishcenthetelevisyen: devops

Showing posts with label devops. Show all posts

Tuesday, 3 September 2013

This is how Facebook Develops and Deploys Software. Should you care?

A recently published academic paper by Prof. Dror Feitelson at Hebrew University, Eitan Frachtenberg a research scientist at Facebook, and Kent Beck (who is also doing something at Facebook), describes Facebook’s approach to developing and deploying their front-end software. While it would be more interesting to understand how back-end development is done (this is where the real heavy lifting is done scaling up to handle hundreds of millions of users), there are a few things in the paper that are worth knowing about.

Continuous Deployment at Facebook is not Continuous Deployment

Rather than planning work out in projects or breaking work into time boxed Sprints, Facebook developers do most of their work in independent, small changes which are released frequently. This makes sense in Facebook’s online business model, everyone constantly tuning the platform and trying out new options and applications in different user communities, seeing what sticks. It’s a credit to their architecture that so many small, independent changes can actually be done independently and cheaply.

Facebook says that they follow Continuous Deployment, but it’s not Continuous Deployment the way that IMVU made popular where every change is pushed out to customers immediately, or even how a company like Etsy does Continuous Deployment.

At Facebook, code can be released twice a day, but this is done mostly for bug fixes and internal code. New production code is released once per week: thousands of changes by hundreds of developers are packaged up by their small release team on Sundays, run through automated regression testing, and released on Tuesday if the developers who contributed the changes are present.
Release engineers assess the risk of changes based on the size of the change, the amount of discussion done in code reviews (which is recorded through an internal code review tool), and on each developer’s “push karma”: how many problems they have seen from code by this developer before.

A tool called “Gatekeeper” controls what features are available to what customers to support dark launching, and all code is released incrementally – to staging, then a subset of users, and so on. Changes can be rolled-back if necessary – individually, or, as a last resort, an entire code release. However, like a lot of Silicon Valley devops shops, they mostly follow the “Real Men only Roll Forward” motto.

Code Ownership

A key to the culture at Facebook is that developers are individually responsible for the code that they wrote, for testing it and supporting it in production. This is reflected in their code ownership model:

Developers must also support the operational use of their software — a combination that’s become known as “devops.” This further motivates writing good code and testing it thoroughly. Developers’ personal stake in keeping the system running smoothly complements the engineering procedures and lets the system maintain quality at scale. Methodologies and tools aren’t enough by themselves because they can always be misused. Thus, a culture of personal responsibility is critical.

Consequently, most source files are modified by only a few engineers. Although at least one other engineer reviews all changes before they’re committed, a third of the source files have only been edited by one engineer, and another quarter by two. Only 10 percent of the files are handled by more than seven engineers. On the other hand, the distribution of engineers per file has a heavy tail, with the most widely shared file handled by no fewer than 870 distinct engineers. These widely shared files are predominantly library files and also include major configuration and top-level PHP files.

Testing? We don’t need no stinking testing…

Facebook doesn't have an independent test team, because, they say, they don’t need one.

First, they depend a lot on code reviews to find bugs:

At Facebook, code review occupies a central position. Every line of code that’s written is reviewed by a different engineer than the original author. This serves multiple purposes: the original engineer is motivated to ensure that the code is of high quality, the reviewer comes with a fresh mind and might find defects or suggest alternatives, and, in general, knowledge about coding practices and the code itself spreads throughout the company.

Developers are also responsible for writing unit tests and their own regression tests – they have “tens of thousands of regression tests” (which doesn't sound like near enough for 10+ million lines of mostly PHP code compiled into C++, both languages which are easy to make coding mistakes in) and automated performance tests.

And developers also test the software by using the development version of Facebook for their personal Facebook use. According to the authors, “this is just one aspect of the departure from traditional software development”. But Facebook developers using their own software internally (and passing this off as “testing”) is no different than the early days at Microsoft where employees were supposed to “eat their own dog food”, a practice that did little if anything to improve the quality of Microsoft products.

Facebook also depends on customers to test the software for them. Software is released in steps for A/B testing and “live experimentation” on subsets of the user base, whether customers want to participate in this testing or not. Because their customer base is so large, they can get meaningful feedback from testing with even a small percentage of users, which at least minimizes the risk and inconvenience to customers.

Security???

While performance is an important consideration for developers at Facebook, there is no mention of security checks or testing anywhere in this description of how Facebook develops and deploys software. No static analysis, dynamic analysis/scanning, pen testing or explanation of how the security team and developers work together, not even for “privacy sensitive code” – although this code is “held to a higher standard” they don’t explain what this “higher standard” is. Presumably they rely on use of libraries and frameworks to handle at least some appsec problems, and possibly look for security bugs in their code reviews, but they don't say.

There isn’t much information available on Facebook’s appsec program anywhere. The security team at Facebook seems to spend a lot of time educating people on how to use Facebook safely and how to develop Facebook apps safely and running their bug bounty program which pays outsiders to find security bugs for them.

A search on security on Facebook mostly comes back with a long list of public security failures, privacy violations and application security vulnerabilities found over the years and continuing up to the present day. Maybe the lack of an effective appsec program is the reason for this.

This is the way Facebook is Developed. Should you care?

While it’s interesting to get a look inside a high-profile organization like Facebook and how they approach development at scale, it’s not clear why this paper was written. There is little about what Facebook is doing (on their front-end development at least) that is unique or innovative, except maybe the way they use BitTorrent to push code changes out to thousands of servers like Twitter does, something that I already heard about a few years ago at Velocity and that has been written about before.

I like the idea of developers being responsible for their work, all the way into production, which is a principle that we also follow. Code reviews are good. Dark launching features is a good practice and has been a common practice in systems for a long time (even before it was called "dark launching"). Not having testers or doing appsec is not good. Otherwise, I'm not sure what the rest of us can learn from or would want to use from this.

Thursday, 25 July 2013

Maintaining Software Sucks – and what we can do about it

If you ask most developers, they will tell you that working in maintenance sucks. Understanding and fixing somebody else’s lousy code is hard. It’s tedious. And it’s frustrating – because you know you would do a better job if you were given the chance to do it over and do it right.

I enjoy maintaining code I've built. It’s my personal responsibility to my users. On the other hand, maintaining code I didn’t build – and living with the consequences of decisions I had no hand in making – is very painful for me.

Jeff Atwood, Programming Horror

Many software developers see maintenance as a prison sentence – or at least a probationary term. They went to school to become a software engineer, and instead they ended up working as a janitor. For “real developers”, maintenance is boring and “spirit-crushing”.

I have done software maintenance in the past and it is honestly the biggest pile of shit I have ever undertaken.

Comment by hotdog, on Jeff Atwood’s blog post The Noble Art of Maintenance Programming, October 2008

Maintenance is hard work

Robert Glass, in his essay Software Maintenance is a Solution, not a Problem, explains why maintenance is so difficult and unpopular. Maintenance is:

Intellectually complex – it requires innovation while placing severe constraints on the innovator. Maintainers have to work within the existing design as well as within regulatory and governance restrictions, they have to be careful to maintain compatibility with other systems, and they can’t afford to make mistakes that could impact customers who are already relying on the system (which is why more than half of maintenance work is testing).
Technically difficult – the maintainer must be able to work with a concept and a design and its code all at the same time, and pick it all up quickly.
Unfair – the maintainer never gets all the things the maintainer needs. Like good documentation, or good tools, or the time to get work done properly.
No-win – the maintainer only deals with users who have problems, instead of hobnobbing with executive sponsors on high-profile strategic projects.
Dirty work – the maintainer has to work at the grubby level of detailed coding and debugging, instead of architecture and creative design, worrying about how to build and deploy the code and making sure that it is running correctly.
Living in the past – maintainers are always living with somebody’s design and technology choices and mistakes, instead of solving new problems and learning new tools (which means there’s no chance to pad their resumes).
Conservative – the going motto for maintenance is "if it ain’t broke, don't fix it”, so maintainers have to put up with shortcomings and shortcuts and compromises, and do the least amount possible to get the job done.

But even though this is hard work, people doing maintenance work are under appreciated – and usually under paid too. The high-profile “software architect” positions with the high salaries go to the flashy stars working on strategic development projects. So why should you – or anyone – care if maintenance work is offshored to India or Romania or somewhere else?

Maintenance is too important to suck

But the truth is that maintenance is too important to be ignored – or offshored. It’s too important to organizations and too important to software developers. Maintenance – not new development – is where most of the money is spent on software,and like it or not, it is where most of us will spend most of our careers, whether we are working on enterprise systems, or building online communities or working at a Cloud vendor or a shrink wrap software publisher, or writing mobile apps or embedded software.

In the US today, 60% of software professionals (1.5 million of an estimated total population of 2.5 million) are working on maintaining software systems (Capers Jones, Software Engineering Best Practices, 2010), and the number of people working on maintenance worldwide will continue to climb because we keep writing more software, and this software all needs to be supported and maintained. We can look forward to a future where almost all of us will spend almost all of our time in maintenance, rather than building something new.

So we have to change the way that people think about software maintenance, how it is done, and especially how it is managed.

It’s not Maintenance that Sucks. What Sucks is how we Manage Maintenance

Attempts to manage maintenance using formal governance models based on ITIL service management or structured software lifecycle standards like IEE-STD-1219 (ISO/IEC 14764) or applying the CMMi maturity model to software maintenance are hopelessly expensive, bureaucratic and impractical.

In most organizations, maintenance is handled reactively, as an unavoidably necessary part of business-as-usual operational support. What management cares most about is minimizing cost and hassle. What is the least that we have to do to keep users from complaining, and how can we get this done in the cheapest and fastest way possible? If we can’t avoid doing it, can we outsource it, maybe save some money, and make it somebody else’s problem, so we can focus on other “more important” things?

This short-term thinking is continued year after year, until the investment that the organization made in building the system, and in the people who built it, has been completely exhausted. The smart people who once cared about doing a good job left years ago. The technology has fallen behind to the point that it has become another “legacy system” that will need to be replaced. Only a handful of people understand the code and the technology, and each change takes longer and costs more than it’s worth. Technical risks and business risks keep piling up. Your customers are tired of putting up with bugs and waiting for things that they need. At some point there’s no choice but to throw it out and start over again. What a sad, sick waste of time and money.

We have to correct the misunderstanding that software maintenance is a drain on an organization, that it takes away from innovation: that innovation can only happen in expensive greenfield development projects, and that money spent working on software that you already have written is money wasted. That it isn’t important to pay attention to what your business is doing today and that you can’t build on this for the future.

Online, software-driven businesses like Facebook and Twitter and Netflix keep growing and innovating by building on what they already have, by investing in their people and in the software base that they have already written. A lot of the work that people are doing at these companies today is maintenance: adding another feature to code that is already there, fixing bugs, making the system more secure, scaling to handle increased load, analyzing feedback data, integrating with new partners, maybe rewriting a mobile app to be faster and simpler to use.

This is the same kind of maintenance work that other people are doing at other companies, and at the heart of it, the technology isn’t that much more exciting. What’s different is how their work is done, how it is managed, and how people approach their jobs. For them, it isn't “maintenance” – it’s engineering. Developers are valued and trusted and respected, whether they are building something new or supporting what’s already there. They are given the responsibility and the opportunity to solve interesting problems, to experiment and try out new ideas with customers and to see if their ideas make a difference. They work in small teams following lightweight incremental development methods and Continuous Integration and Continuous Delivery and Continuous Deployment, and other ideas and self-service tools coming out of Devops to get things done quickly and efficiently, with minimal paperwork and other bullshit.

These organizations show how software development and maintenance should be managed: hire the best people you can find; grant them agency and the opportunity to work with customers and make a meaningful difference; provide them with constant feedback so that they can keep learning and making the product better; give them the tools they need and encourage them to keep improving the software and how they build it; and hold them accountable for the system’s (and the company’s) success.

This works. What other organizations are doing doesn't. It’s not maintenance that sucks. It’s how we think about it and how we manage it that sucks. And this is something that we can fix.

Thursday, 4 April 2013

How do you measure Devops?

If you’re trying to convince yourself (or the team or management) that your operations program needs to be changed for the better, and that trying a Devops approach makes sense – or that your operations organization is improving, and that whatever changes you have made actually make a difference – you have to measure something(s). But what?

Measuring Culture

John Clapham at Nokia suggests that you should try to measure how healthy your operations culture is. At the Devops Days conference this year in London he talked about ways to measure and monitor culture – behaviour, attitudes and values – to determine whether people were focused on the “right things”, and to assess the team’s motivation and satisfaction. Nokia had already started a Devops program, and wanted to see whether the momentum for change and improvement was still there after the initial push and evangelism had worn off. So they came up with a set of vital signs that they felt would capture the important behaviours and attitudes:

Cycle time – time from development to deployment in production. Are we moving faster, or fast enough?

Shared purpose – do people all share/believe in the same goals, believe in improving how development and ops work together?

Motivation – does everyone care about what they are doing?

Collaboration – are people working together willingly?

Effectiveness – is everyone’s time being spent in a useful way? How much time is being wasted?

Cycle time is the only measure that is relatively easy to measure and report. The rest are highly subjective and fuzzy. Nokia tried to collect this information through a questionnaire that asked questions like: Do you believe there are opportunities to improve ways of working? How much time do you spend on stability, overhead, improvements, innovation? What’s in your way: lack of time, pressure to focus on features, poor tools, lack of management support, nothing…?

Operations Vital Signs that you Can and Should Measure

Clapham’s closing question was: “What vital signs would you look for?”

I'm not convinced that you can measure an organization’s cultural effectiveness, or that it would be really useful if you could. You can’t tell from a wishy-washy questionnaire whether change is making a real difference to the organization’s effectiveness and you are on the right track; or help you understand what you need to change and what the impact of change would be on the bottom line (or the top line). To do this you need concrete and results-based measurements which point out strong points and weaknesses, and that you can use to make a case for change, or justify your decisions.

Puppet Labs and IT Revolution Press have recently published a “State of Devops Report”, which is full of interesting data. The report stresses the importance of metrics in understanding how your organization is performing and why a Devops program is, or would be, worthwhile. They provide a list of objective measures, broken down into two major types.

Agility and reliability metrics:

Deployment rate/frequency

Change lead time – how long it takes to get a change approved and into production

Change failure rate (John Allspaw's brilliant presentation “Ops Meta-Metrics” explains the importance of correlating deployment frequency/size/type and failures – type and severity – in production)

Mean time to recover (and mean time to detect)

Functional metrics:

Test cycle time – how long does it take to test a change?

Deployment time – how long does it take to roll out a new change once tested and approved?

Defect rate in production (defect escape rate)

Helpdesk ticket counts – how much time is spent firefighting?

There are two other important measures that are missing from this list:

Operations costs

Employee retention – a key measure of whether people are happy

Measuring the success of a DevOps program is simple:

If you aren't saving money

If you can’t make change easier and faster

If you don’t improve quality and reliability and your organization’s ability to respond to problems

If you can’t keep good people

... then whatever you’re doing is not working or you’re not doing it right. It doesn't matter if you are “doing DevOps” or using certain tools or if people seem to be more collaborative or believe that they have a greater sense of shared purpose. What matters is the outcome. Make sure that you’re measuring the right things – so that you know that you are doing the right things.

Friday, 8 March 2013

Book Review: The Phoenix Project

Everyone who attended the “Rugged Devops” panel at RSA this year received a free copy of The Phoenix Project (by Gene Kim, Kevin Behr and George Spafford – the authors of Visible Ops), the fictional story of the education and transformation of an IT manager, an IT organization, and eventually of an entire company.

I'm not sure why they wrote the Phoenix Project as a novel. But they did. So I’ll review it first as a piece of story telling, and then look at the messaging inside.

The reason that I don’t like didactic fiction is that so much of it is so poorly written – generally forced and artificial. I was pleasantly surprised by the Phoenix Project. The first half of the novel tells a story about an IT manager forced into taking on responsibility for saving his company. Our well-meaning hero, an ex-marine sergeant (for some reason unclear to me, the hero, the CEO, and even the mysterious guru all have a military background) with an MBA but without any ambition except to quietly provide for his family. He has been successfully running his own little part of the IT group, so successfully that he is dramatically promoted to take over all of IT operations (and so successfully that his own group is never mentioned again in the story – it seems to run on auto-pilot without him). For a successful manager, our hero knows alarmingly little about how the rest of the IT organization works, or about the business, and so is unprepared for his new responsibility. He reluctantly accepts the big job, and then regrets it immediately as he realizes what a shit storm he has walked into.

It’s a compelling narrative that draws you in and is seems realistic even with the stock characters: the sociopathic SOB CEO, the unpopular everything-by-the-book CISO, the Software Development Manager who only cares about hitting deadlines (but not about delivering software that works), the Machiavellian Marketing executive, and the autistic genius that the entire IT organization of several hundred people all depend on to get all the important stuff done or fixed.

As a pure piece of story telling, things start to unravel in the second half with the emergence of the IT / Lean Manufacturing guru – when the story ends and the devops fable begins. From this point on, the plot depends on several miraculous conversions: everyone except the marketing exec sees the light and literally transform overnight. They even start to groom themselves nicely and dress better. Lifetime friendships are made over a few drinks, and everyone learns to trust and share: there’s a particularly cringe-inducing management meeting in which people bare their souls and weep. Conflicts and problems disappear magically usually because of the guru’s intervention – including an unnecessary scene where the guru shows up at a crucial audit meeting and helps convince the partner of the audit firm (an old buddy of his) that there aren't any real problems with the company’s messed up IT controls (“these aren't the droids you’re looking for”).

But the real heroes are the people running the manufacturing group: the one part of this spectacularly mismanaged company that somehow functions perfectly is a manufacturing plant where everyone can all go to learn about Kanban and Lean process management and so on. Without the help of the smug and smart-alecky guru - who apparently helped create this manufacturing success - and his tiresome Zen master teaching approach (and sometimes even with his help), our hero is unable to understand what’s wrong or what to do about it. He doesn't understand anything about demand management, how to schedule and prioritize project work, that firefighting is actual work, how to get out of firefighting mode, how to recognize and manage bottlenecks in workflow, or even how important it is to align IT with business priorities (where did he get that MBA any ways?). Luckily, the factory is right there for him learn from, if he only opens his eyes and mind.

What will you learn from The Phoenix Project?

The other problem with this story telling approach is that it takes so damn long to get to the point – it’s a 338 page book with about 50 pages of meat. Like Goldratt's The Goal (which is referenced a couple of times in this book), The Phoenix Project leads the reader to understanding through a kind of detective story. You have to be patient and watch as the hero stumbles and goes down blind alleys and ignores obvious clues and only with help eventually figures out the answers. Unfortunately, I'm not a patient reader.

This is a gentle introduction to Lean IT and devops. If you've read anything on Kanban and devops you won’t find anything surprising, although the authors do a good job of explaining how Lean manufacturing concepts can be applied to managing IT work. The ideas covered in the book are standard Lean and Theory of Constraints stuff, with a little of David Anderson’s Kanban and some devops – especially Continuous Deployment as originally described by John Allspaw and Continuous Delivery.

The guru’s lessons are mostly about visualizing and understanding and limiting demand – that you should stop taking on more work until you can get things actually working so that you’re not spending all of your time task-switching and firefighting; identifying workflow bottlenecks and protecting or optimizing around them; how reducing batch size in development will improve control and to get faster feedback; that in order to do this you have to work on simplifying and standardizing deployment; and how valuable it is to get developers and operations to work together.

My complaints aren't with the ideas – I buy into a lot of Devops and agree that Kanban and Lean have a lot to offer to IT ops and support teams (although I'm not sold on Kanban by itself for development, certainly not at scale). But I was disappointed with the unrealistic turnaround in the second half of the book. It’s all rainbows and unicorns at the end. IT, the business, development and security all start working together seamlessly. Management is completely aligned. Performance problems? No problem – just go the Cloud. And they even bring in the famous Chaos Monkey in the last couple of pages just because.

Spoiler Art: Everything goes so well in a few months that the company is back on track, plans to outsource IT and to break up the company are cancelled, the selfish head of marketing is canned, and our hero is promoted to CIO and put on the fast track to corporate second in command. Sorry: nothing this bad gets that good that easily. It is a fable after all, and too hard to swallow.

The Phoenix Project is a unique book – when was the last time that you read an actual novel about IT management?! It was worth reading, and if it introduces devops and Lean ideas to more people in IT, The Phoenix Project will have accomplished something useful. But it’s not a book that you can use to get things done. There are lessons and recipes and patterns but they take work to pull out of the story. There’s no index, no good way to go back to find things that you thought were useful or interesting. So I am looking forward to the Devops Cookbook: something practical that hopefully explains how these ideas can work in businesses that don’t look all like Facebook and Twitter.