3 - DevOps in a Cloud World

Posted on Sunday, Mar 29, 2020
In this episode, I'm very fortunate to have my first guest come and join me! And what better way to kick this off, than a topic area which is very close to my own heart - DevOps. I'm very excited that I was able to invite Abel Wang, Principal Developer Advocate and DevOps lead to come and join me to talk of his experiences at Microsoft. We jump through a few different areas, from What is DevOps, how it links back to requirements, Data DevOps and shifting left. There's plenty to learn from in this one, so let's dive straight in!

Show Notes

Hello and welcome back to Cloud with Chris! You're with me - Chris Reddington, and we'll be talking about all things cloud. I've been doing some updates to my website, so please check out the “Become a Guest” or “Topic Suggestion” Links, as you'll now see some different forms you can complete.

And of course, as a reminder - I'm also looking to increase my presence on YouTube, so if you could subscribe to my Channel, Cloud with Chris, that would be greatly appreciated!

Chris: Hey everyone. Welcome to this episode of Cloud with Chris! Now in this episode, I have got my first guest joining me and I'm very pleased to say that I have the very well-known Abel Wang with me today. Abel!

Abel: Hey, how's it going - so happy to be on your show!

Chris: Awesome, thank you for coming. I also think I need to say a big thank you to you as well, because the whole reason I'm doing this show is because of the discussions that we had back a few weeks or a couple of months ago now. So thank you for encouraging me to do this first off, as it's a great journey that I'm starting to embark upon.

Abel: Absolutely! Man, that conversation seems like a lifetime ago doesn't it?

Chris: It does, doesn't it? A lot has happened in the world since then! But we'll try and keep current topics and affairs aside for now because I'm sure there's plenty that people can see in their newsfeeds going on right now.

Abel: Fair enough, fair enough!

Chris: I think one of the things that we've spoken about quite a few times individually is DevOps. I know it's something that you're very passionate about, and something that I'm very passionate about and it's a topic that I wanted to bring up for us today and talk through for the listeners. So, maybe a good starting point (and I know it's probably something you spoke about many times over and over), but, when you talk about DevOps what is that anchoring point for you? What does it mean when you start talking about DevOps? Maybe let's start there.

Abel: Well, first you need to set the level, right? You need to be able to say what exactly DevOps is, because if you ask ten different people what DevOps is, you're going to get like twenty different answers. I always point them to Microsoft's definition of DevOps, and I'm not saying that this is the only definition, or the only right definition, but it's a definition that helps frame a conversation, right? So for us, DevOps is something very specific. To us, DevOps is the union of people, process and products to enable the continuous delivery of value to our end users. So the key there is we need to be able to continuously deliver value. Not continuously deliver code. Not continuously deliver features, because those things ultimately don't necessarily help our end users. We need to continuously deliver value, so that kind of starts the conversation and then we can dive in deeper from there.

Chris: Gotcha. I really love that definition. It's one I've used a few times, because I think that grounding - especially the last two words, end users - makes sure that it's not just for the sake of the system, or because we need this cool feature. It's the fact that we're doing this for the people who are physically using our system, and sometimes we just need a bit of reminding about that, right? In previous episodes we've spoken about requirements and we spoke about the importance of designing to certain specs, but all of that is for a reason. All of that is for our end users and that's why it's important.

Abel: It's super important, right. I was on a team where our focus really was on delivering features. We thought that pumping out features meant we were winning, right? We were doing things correctly. Then we later added telemetry… so using telemetry we could actually start seeing what our end users were really doing in our application. So by doing that, we started being able to see - hey, that new feature that I wrote - is it delivering value? It was shocking for me to realise how many new features I created that nobody touched and nobody cared about, right? So even at Microsoft, we ran into this problem big time. We thought we knew what our customers wanted and we would do extensive interviews with our customers, right? The top customers we had, we pulled them into Redmond and we'd have interviews with them. We questioned them. We figured out exactly what they needed, what new features they needed. So, we thought we were really good at it. Once we added telemetry, we found that we were actually pretty bad at it. About a third of the time, you know what? We were dead on. The feature that we created, our end users really really wanted. A third of the time, people were kind of like, yeah, who cares? And a third of the time, the features that we created were so bad that end users were like, get rid of this moving forward, otherwise I'm using another product. So that was shocking to us to figure out, that we were so bad at guessing what our end users needed. But then after we did a little bit more research, we realised that this is pretty much how everybody is. A third of the time, highly successful. A third of the time, meh. And a third of the time, you're just completely off base. So we realised that adding telemetry is absolutely vital. We needed to get that feedback. Now, you know you've heard the term - “if you fail, fail fast”. It's ok to fail, but fail fast. So, that's what we wanted to do with all the features that we are producing.
Let's create a feature, and if we fail, we can build fast, right? We can push it out really quickly; it doesn't even have to be finished. But then end users can start using it. If this is something they like, if this is something they need and are using, great! We can double down on it and keep on developing this particular feature. If it's something that's giving no value, because nobody's even using it - we can fail really quickly and then move on to something else. If you look at how we do our features with telemetry, our success rate is way higher than just a third of the time because we do specifically that. We abandon features that nobody is using, and we quickly double down on the type of features that we can see with our telemetry that people really need.
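
To make the telemetry idea concrete, here is a minimal Python sketch (not from the episode). It assumes telemetry arrives as simple (user, feature) events, and the function names and the 30% and 5% thresholds are purely illustrative assumptions, not anything Microsoft has described using. It shows one way to turn usage data into a "double down / watch / abandon" decision per feature.

```python
def feature_usage(events, active_users):
    """Return the share of active users who touched each feature at least once."""
    users_per_feature = {}
    for user_id, feature in events:
        users_per_feature.setdefault(feature, set()).add(user_id)
    return {name: len(users) / active_users for name, users in users_per_feature.items()}


def triage(usage, all_features, double_down_at=0.30, abandon_below=0.05):
    """Classify every shipped feature using hypothetical adoption thresholds."""
    decisions = {}
    for feature in all_features:
        share = usage.get(feature, 0.0)  # a feature with no events has zero adoption
        if share >= double_down_at:
            decisions[feature] = "double down"
        elif share < abandon_below:
            decisions[feature] = "abandon"
        else:
            decisions[feature] = "watch"
    return decisions


# Illustrative events: (user id, feature name). Repeat use by one user counts once.
events = [
    ("u1", "export"), ("u2", "export"), ("u3", "export"), ("u4", "export"),
    ("u1", "export"), ("u1", "dark_mode"),
]
usage = feature_usage(events, active_users=10)
decisions = triage(usage, ["export", "dark_mode", "wizard"])
print(decisions)
```

The point of the sketch is only that the decision comes from observed usage rather than from interviews; real telemetry pipelines would also look at retention, errors and so on.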

Chris: Gotcha, gotcha. That whole summary, and that whole scenario, is really interesting because - what I love - is you grounded it in your own experience there - what you've done from working on product yourself as well. Let's take this in a slightly different direction - I keep mentioning that one of the things we've really focused on (it's Cloud with Chris) is that we need to build these cloud systems, and we need to build to some kind of requirement, because once again, grounding with our users… They are expecting something from our platform. They expect some level of uptime. They expect some level of availability and resilience against disasters. How often did that get incorporated, and how do you deal with some of those things, perhaps from the experiences that you've just been mentioning?

Abel: Well, uptime is vital. If you think about that definition of DevOps - continuously deliver value, right? Value doesn't mean it has to come from code. If you provide uptime, then that is value in itself. If my app goes down, clearly I'm not giving any value. So that becomes extraordinarily important, and this kind of goes along with the whole evolution of DevOps. If you look at DevOps when we first started, even before DevOps was a term - this would be like the late eighties, early nineties kind of time frame - we couldn't even build software successfully, almost ever, right? All of our software came in over budget, over time, with too many bugs and so on. It was kind of a colossal mess, and because of that we started coming up with different processes to write software better. Then things like agile and extreme programming came about, and it got to the point where we got pretty good at building software. I mean, if you did agile correctly, sprint after sprint you'd start delivering exactly what you were promising, right? It turned into this nirvana type of situation, but then we realised that - yes - we can produce all this great software sprint after sprint, but we couldn't get it out into the data centres. We couldn't get it out into the hands of our end users fast enough, right? So, because of that we started concentrating on CI/CD, and that's when the whole DevOps phrase really took off. Then we got to the point where CI/CD was working fantastically. I could check in code or just touch my repo and kick off the build, which can run my tasks and deploy all the way out, even into production, right? Then, we started noticing other types of things. Like, how do you deploy databases? What about security? You know, things like that started popping up. Another big thing that started popping up, once we got all of that down, is site reliability.
The whole thing we're hearing so much more about now is site reliability engineering and things like that, and that becomes absolutely vital. It is kind of interesting in the team that I'm most closely associated with. What we do for site reliability is that we have what are called feature teams. Each team owns a particular feature set. Now, we own that feature set all up. So we own the UI, we own the data layer, we own the back end code, we own the pipelines that deploy the code into production. We own everything. And, we even own the reliability of whatever features we wrote. So if there is a live site incident - god forbid something like that happens - that issue gets triaged until we can figure out which team is responsible. Then there are two people in the team itself that are basically on call, so if anything bad happens, these two people on the team have to go and fix it live in production. Immediately a call bridge is created, and they jump on there, these two poor people. This role rotates. It is so horrendous; they are literally not allowed off that call bridge until they've slapped a band aid on the problem to fix it. They're still not allowed off that call until they figure out the root cause and come up with a plan to make sure that this same problem never happens again. So, just like before, where instead of our engineers just saying “Oh, I'm a front end coder” or “I'm a backend coder” or “I'm a SQL database person”, we started building these cross functional teams where people can pick up any type of work - we have also now incorporated SRE directly into our teams themselves as well, and that's actually worked out really well for us.

Chris: Awesome. I think there's so many different learnings in there. This idea of ownership between the dev side and “I break, I write it, I own it” almost is the cycle there that I'm thinking of, but also then the idea of you don't just own that small piece of code that you've written - but if you are one of those folks that are on call, anything could potentially crop up. That's not going to be in your control, so you have to have that familiarity with the breadth of the product. So it also fosters the idea of ownership, not just on your small feature but the whole entire solution - having that left to right awareness as well. I think that's really key.

Abel: It really is key. I think one of the funnier things that we do is when a new person joins our team. To get them up to speed on the code - the fastest way we figured out how to do that is just throw them into the fire, which is not exactly nice, but man, they ramp up quickly. So we'll end up having a more senior engineer that will pair up with the junior dev that just joined the team. It doesn't have to be junior - but the new person that just joined the team - and for that particular sprint they are on call, right? When you get that live site incident, guess what - you learn that code really really quickly. I don’t know if this is the best in terms of stress, but man, we pick up the code best.

Chris: Thrown into the deep end is the term that comes into mind there. Gotcha, I'm with you. One of the things that intrigued me though - there were quite a few things that you talked about in one of the scenarios earlier. But, one of the things that jumped out was around DevOps for databases. I think you mentioned security as well, but if we focus maybe on the database side. One of the things that I hear day in, day out from the customers that I work with - lots of customers focus on DevOps for the application side, and think that DevOps is apps. End of, period - that's the deal with DevOps. But, actually - when you start speaking with them and helping them understand that DevOps is more than just applications… It is the databases, it is the security process, it is all of those different parts of the life cycle. You can see that their eyes start to widen, both in awe, thinking “wow, I can do all these things”. But, also in fear - because “Hey, hold on, you want me to give up that control I have on my databases and actually let some process manage that?” How do you see that? Because, I guess you probably see some similar things in that space.

Abel: Oh, absolutely - hundred percent. I agree with you completely, that the database should be part of this DevOps process, right. It really should behave no differently than code. But, if you look at the reason why there's so much friction - especially when you first get started - you have to look at how we traditionally managed databases… So I'm going to go all the way back to the eighties, because that's kind of like when I first started writing code (Well, actually I wasn't quite - I'm not that old), but early nineties is when I really got started. So, during that time frame - we didn't even store the schema of our database in source control. So the gold standard for what the database schema should be would ultimately be whatever is in production. Whatever that production database is, that's what the gold standard should be. So if I had to recreate my database locally, I would just have to look at the production database and hopefully make it look the same. It was kind of interesting, but we did that for a reason, because we quickly found out that engineers, especially coders like me - we are horrible with databases. We're super dangerous because we know just enough to do really bad things. So DBAs would sit in front of the databases to protect them, and if I needed a database task done, like updating a schema or whatever, I would discuss it with the DBA, and they would make the change directly to the production database. So with the advent of DevOps, where we have this idea that I should be able to check in code and it will kick off the build, and that build will compile everything, get everything ready and then deploy it to Dev, QA and all the way out even to production - the question becomes, what do we do about our database? The app is easy enough - conceptually, that's easy to grasp.
So, one of the first things that has to happen is that you need to be able to store the schema of your database in source control in some fashion. There are a variety of tools that can let you do that. From Microsoft, there's SQL Server Data Tools, from Redgate, there's the Redgate tools, and there are some open source tools as well that do this type of stuff. Really, frankly - I don't care what tools you use, as long as you use some type of tool that can help encapsulate that database, and then you can check that database schema into source control. Once you do that, now you're able to version your database schema right alongside your code. So I can go to any point in time in my source control system and be like, oh, there's my application code and that's what the schema of my database is. Then I can build it and apply it. The next thing that needs to happen is that the database schema changes have to be automated in some fashion. So, traditionally how did we do things? Okay, if the database needs to change, scripts need to be written. Those scripts have to be verified by the DBA (very carefully, because you don't want to mess up that database) - and then, because it's so vital that you don't mess up the database - the DBAs run the scripts directly against the production database. This worked great back in the nineties, but it starts falling apart now if we need to deploy like fifty times a day, or something like that. Or, even if we're going to deploy once every two weeks, we can't have that bottleneck of the DBAs needing to review everything and hand-applying changes to the database. So, there are tools out there that can basically auto gen the scripts for you. They take whatever schema your database has running in production, and auto gen the scripts that will update that database schema so that it matches the schema that you have checked into source control.
Whenever I say this to DBAs, their heads explode and they look at me like I'm in space, and they're like “Are you kidding me? I don't trust your tools, especially if it's Microsoft tools!” and I'm like “Okay, I get it. So how about we do this? I can add an extra step into the pipeline so that I can auto gen the scripts. Then you can read the scripts, review them, and if you think the scripts look good you can approve them. Once you approve them, then the next stage will go ahead and deploy those changes into the database, either in production or in whatever environment”. Usually DBAs are like okay, because they still have that level of control. Then, once they do this enough, eventually they realise that - wow - these tools really do generate scripts that are good enough, that are safe to fly, and then they can kind of remove themselves from this process. Then they can really start automating this process completely, right. So, I understand the concern - I also understand the reluctance to get started. But, once you do, there's really no turning back, because then all of a sudden, I can change the schema of my database and I check in my code, it builds my app, and it builds the database changes that need to happen. When it deploys, it deploys my application code and also automatically updates the schema for me in a way that is safe and correct.
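
As a rough illustration of that flow (schema checked into source control, a generated change script, then an approval gate before deployment), here is a small Python sketch using the standard library's sqlite3 module. It is not how SQL Server Data Tools or the Redgate tools work internally; the `desired` dictionary stands in for the schema in source control, and the function names are hypothetical. It only handles added tables and columns, which is an assumption made to keep the example short.

```python
import sqlite3


def live_schema(conn):
    """Read {table: {column: type}} from the running database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    return {table: {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}
            for table in tables}


def generate_script(desired, live):
    """Generate the statements that move the live schema towards the desired one."""
    statements = []
    for table, columns in desired.items():
        if table not in live:
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols})")
            continue
        for name, ctype in columns.items():
            if name not in live[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype}")
    return statements


def deploy(conn, statements, approved):
    """Apply the script only after the (hypothetical) DBA approval step has passed."""
    if not approved:
        raise PermissionError("migration script has not been approved by a DBA")
    for statement in statements:
        conn.execute(statement)
    conn.commit()


# "Production" currently has one table; source control says it should have more.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
desired = {
    "customers": {"id": "INTEGER", "name": "TEXT", "email": "TEXT"},
    "orders": {"id": "INTEGER", "customer_id": "INTEGER"},
}
script = generate_script(desired, live_schema(conn))
print(script)                        # this is what the DBA would review
deploy(conn, script, approved=True)  # in a pipeline, approval gates this stage
```

Once the reviewers trust the generated scripts, the `approved` gate is the only piece that needs to change to make the stage fully automatic.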

Chris: Gotcha. Now, it's interesting because one of the things that you commonly think about from a DevOps perspective in general is this idea of I've got one version of my app - v1.1, v2.2 - and I've got these different versions. But, I guess that mindset almost gets a little bit complicated to think of from a database perspective as well. If you think that v1 of my schema, for example, has FirstName and LastName as two separate columns, but in my v2 I merge those columns - I've suddenly got a direct dependency between the version of my schema and the version of my applications. So, what are your thoughts around that - and the idea of rolling back for example?

Abel: Ah, rolling back is one of the most fun conversations that I have - because rolling back your app code - that is super easy to do. There's a variety of ways we can do that. We can do that with blue-green environments, or you can do that by copying your app code somewhere else and then swapping it back in. There are plenty of ways that we can do that easily. But, how do you roll back your database schema? The short answer is you can't, right. There are a lot of people that say they can and I pretty much just wanna punch them in the face and call them liars. It's great, because if my database is like five terabytes big, how am I supposed to roll that data back?

Chris: Right, right.

Abel: You can't, it's just too much. So if that's the case then how are we supposed to do this DevOps process? Because there are going to be times when we make a mistake and we have to roll back in some fashion. So one of the things that we did at Microsoft is - we wanted to make sure - when we apply changes to our database, we can't roll back. So that means the only way that we can fix things is by rolling forward. One of the things that we do to help is when we make a change to the database, we make sure that the changes are backwards compatible - for at least one extra sprint, right - or one extra deployment. So this way, you don't even need to roll back; it just magically works. So what I'm talking about is something like this, right - let's say I need to delete a column in my database and I'm going to move that column to a whole other table. It needs to be backwards compatible, right - so you're going to have a point in time where you're going to have the old column and also the new column. Your code needs to be able to push the information to both of these columns, because you have to make sure everything is in sync. So this way, you can roll forward or roll backwards your code without changing that database at all - so that makes it much safer. We had to take that even one step further - and I'm talking about Azure DevOps services. For Azure DevOps services, we needed to make sure that we have no downtime during our updates as well. Because we service the entire world - and at the same time - this is people's source code, this is their build and release pipelines, this is their plans. Enterprises literally can't do their own business if we stop, so we can't have that moment where we're like, oh, please hold on for half an hour as we update.
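
The "write to both the old column and the new column" idea can be sketched in a few lines of Python with sqlite3. This is an illustrative assumption of what such a transition period might look like, not Microsoft's code: the `users.phone` column is being moved to a hypothetical `contact_details` table, and during the transition every write goes to both places so that either version of the application code can be deployed against the same database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Old schema: the phone number lives on the users table.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")
# Backwards-compatible change: add the new home for the data, keep the old column.
conn.execute("CREATE TABLE contact_details (user_id INTEGER PRIMARY KEY, phone TEXT)")


def save_phone(conn, user_id, phone):
    """Transition-period write: keep the old and new locations in sync."""
    with conn:  # both writes commit together, or neither does
        conn.execute("UPDATE users SET phone = ? WHERE id = ?", (phone, user_id))
        conn.execute(
            "INSERT OR REPLACE INTO contact_details (user_id, phone) VALUES (?, ?)",
            (user_id, phone))


def read_phone_old(conn, user_id):
    """What the previous release of the app does."""
    row = conn.execute("SELECT phone FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0] if row else None


def read_phone_new(conn, user_id):
    """What the next release of the app does."""
    row = conn.execute(
        "SELECT phone FROM contact_details WHERE user_id = ?", (user_id,)).fetchone()
    return row[0] if row else None


conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
save_phone(conn, 1, "555-0100")
print(read_phone_old(conn, 1), read_phone_new(conn, 1))
```

Because both readers see the same value, the application code can be rolled forward or back without touching the database; the old column is only dropped in a later deployment.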

Chris: I'm envisaging those under construction or updating pages right now. That's what I'm thinking of in my mind. I can't imagine that going down well.

Abel: We can't do that, so we have to have no downtime. So to do no downtime is actually really kind of interesting. So the way we solved that problem was, to make one change we would have to do multiple deployments. So, first we deploy the binaries - right - because binaries are easy to roll back. But our binaries have to be multi schema compatible. So then they will be compatible with the old schema and with the new schema, and they need to be able to ask the database and say “Hey, what version are you in?”. Depending on what version the database is at, the code will then handle the data however it needs to.

Chris: Ah, so you stage the code ready for the database, right.

Abel: Right. First, we update the code. Then, we update the database. So when we update the database, it's going to be in this weird hybrid state where it's basically the old schema and also the new schema. Then, we get to a point where we can update our code so it just uses the new schema. Then, we can go ahead and clean up the database, and while we're doing all of this it's important to remember that to do this with no downtime, we can't lock any table - so we're copying a lot of these tables, right - because you can't lock them. So then you update the database so it's only the new schema, and you clean everything up. There are multiple steps that you have to take to update your database schema. So, this is pretty complex - but what we gain from this complexity is no downtime. I'm not saying everybody needs to make their database changes in this fashion. But if you require no downtime, guess what - this problem has been solved - there's a pattern that you can follow that totally takes care of this.
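
Here is a minimal sketch of that sequencing in Python with sqlite3, using the FirstName / LastName merge Chris mentioned earlier. Everything in it is an illustrative assumption: the `schema_version` table, the version numbers (1 = old, 2 = hybrid, 3 = new only) and the function names are hypothetical, and a single in-memory SQLite database obviously doesn't demonstrate lock-free behaviour. It only shows the order of the steps and that version-aware code keeps working at each one.

```python
import sqlite3


def schema_version(conn):
    """The code asks the database which schema version it is at."""
    return conn.execute("SELECT version FROM schema_version").fetchone()[0]


def display_name(conn, person_id):
    """Multi-schema-compatible read: works against old, hybrid and new schemas."""
    if schema_version(conn) >= 2:
        row = conn.execute(
            "SELECT full_name FROM people WHERE id = ?", (person_id,)).fetchone()
        return row[0]
    row = conn.execute(
        "SELECT first_name, last_name FROM people WHERE id = ?", (person_id,)).fetchone()
    return f"{row[0]} {row[1]}"


def expand(conn):
    """Step 2: hybrid state, old columns and new column exist side by side."""
    with conn:
        conn.execute("ALTER TABLE people ADD COLUMN full_name TEXT")
        conn.execute("UPDATE people SET full_name = first_name || ' ' || last_name")
        conn.execute("UPDATE schema_version SET version = 2")


def contract(conn):
    """Final step: clean up by copying into a table that has only the new schema."""
    with conn:
        conn.execute("CREATE TABLE people_new (id INTEGER PRIMARY KEY, full_name TEXT)")
        conn.execute("INSERT INTO people_new SELECT id, full_name FROM people")
        conn.execute("DROP TABLE people")
        conn.execute("ALTER TABLE people_new RENAME TO people")
        conn.execute("UPDATE schema_version SET version = 3")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_version (version INTEGER)")
conn.execute("INSERT INTO schema_version VALUES (1)")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO people VALUES (1, 'Grace', 'Hopper')")

# Step 1 is deploying display_name itself; it already works on the old schema.
names = [display_name(conn, 1)]
expand(conn)
names.append(display_name(conn, 1))
contract(conn)
names.append(display_name(conn, 1))
print(names)
```

The same application code returns the same answer before, during and after the schema change, which is what lets the binaries and the database be deployed separately.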

Chris: Gotcha, gotcha. I guess there's one of those things there that - we just need to think about how we bring those DBAs along the journey as well, because some of them who might be listening to this right now might be thinking “Hold on now, there's processes that are coming over and taking my job”. That's not necessarily the case, right - because there's still ways to bring those folks in?

Abel: Absolutely, so that's not the case at all. The role of the DBA is still absolutely vital. The way that they do their job is going to change a little bit, so instead of going directly to a production database and just laying their changes on there, they would be making their changes and checking them into source control. Then, having the pipelines basically work through and run all the scripts and do whatever. So they have to do things a little bit differently. Another part where the database administrators really come into play is when us developers make changes to the schema. Because, if the schema is checked into source control - that means if I need to add a column or do whatever for this new feature - I'll just go in and do it, but before my code gets merged into master, I'm going to have to go through a pull request. So now, this is the vital step where the DBAs step in - right - because they are going to be part of the pull requests. A pull request, you can think of it as just a code review - so they're going to be part of this code review - they're going to look at the database changes I've made. They will be able to say - “You know what, these look good! Let's go ahead and approve it” and they can merge that into master, which will then go through our pipeline and push it all the way out even into production. Or, they can look at what we're doing - test it out, make sure it looks good and be like - “Ohh, wow - Abel has no idea how to make these changes to the database. That looks terrible” and they can reject it with comments, and then I can continue on the pull request and keep on tweaking it until my DBA looks at my stuff and says “You know what, that's the way it's done correctly”. So yeah, their roles will change a little bit, but they absolutely have a place and they are absolutely necessary within this whole new DevOps world we live in.

Chris: I'm with you. I think one of the things that you subtly addressed in there - as well - was this whole idea of shifting left. I think this is one of those concepts we've been aware of, but the term has really started growing I'd say over the recent months / recent years maybe. Certainly, I'm hearing more about it anyway. This idea of, “We want to bring everything earlier on into that cycle from a DevOps perspective”, because, the earlier - (We all know this, right) - The earlier we can catch some of those bugs, the easier they are to solve, the less cost they take then, as well on the organisation to actually go and remediate as well. So, from a data perspective you just mentioned - We bring the DBAs in along that journey, get them to do that pull request - But I'm sure there's plenty of things we could do to even validate automatically again there, right?

Abel: Oh, absolutely, absolutely. There's a whole bunch of tools that you can attach to your pull requests. So, before a human being even looks at the changes that I proposed - they can be validated by all these different types of scanning tools. So you mentioned shifting left - when you look at the progression of DevOps, you'll see this concept of shifting left pop up over and over and over again. For the longest time, we used to do quality by trying to bolt it on at the end of our app. We'd write our app and when we were done, we handed it over to the QA team, and the QA team would break it, and they would file like a mountain of bugs. Then we'd have to go back and fix it. The way that QA would do this is basically doing end to end functional testing. That mostly worked, except the problem is that it would take a really long time to do. To run all these end to end functional tests, it could take weeks, it could take months. For a product as big as Windows, it might even take over a year. So, we needed to speed this process up. We still needed to maintain quality. So, we shifted left: instead of trying to bolt quality on at the end with the QA team, we now try to build quality into the application as we write that code. So as we're writing this code, we're writing things like unit tests. As we check in our code, we have pull requests where everybody reviews the code, and it can also start running these unit tests, and all these scans start happening as well - to help find quality problems earlier in the cycle; earlier in that dev cycle. Same thing with security as well. We used to do security the exact same way - we bolted it on at the end. Here is the finished app, now spend three months while the security team does an audit. Nobody has time to do that anymore - so how do you maintain security? So a lot of the things that we do right now are in our pull requests.
There are security experts that review our code for each of our pull requests. As well as that, we have automatic scanners that will scan our code to make sure everything looks good and we're not breaking anything from a security standpoint as well. So this shifting left concept is a great pattern that we all use throughout DevOps.
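
As a small illustration of what an automated pull request check might look like, here is a Python sketch that scans a proposed database migration for statements a reviewer would probably want to look at closely. The patterns and the function name are hypothetical and deliberately simplistic; real scanning tools parse SQL properly and cover far more cases. The idea is only that this runs before any human looks at the change.

```python
import re

# Hypothetical rules: each maps a human-readable finding to a pattern.
RISKY_PATTERNS = {
    "drops a table": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "drops a column": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "deletes without a WHERE clause": re.compile(
        r"\bDELETE\s+FROM\s+\w+\s*(;|$)", re.IGNORECASE),
}


def scan_migration(sql):
    """Return (line number, finding) pairs for every risky-looking statement."""
    findings = []
    for line_no, line in enumerate(sql.splitlines(), start=1):
        for description, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, description))
    return findings


proposed_change = "\n".join([
    "ALTER TABLE users ADD COLUMN email TEXT;",
    "DELETE FROM sessions;",
    "DELETE FROM sessions WHERE expired = 1;",
    "ALTER TABLE users DROP COLUMN phone;",
])
findings = scan_migration(proposed_change)
for line_no, description in findings:
    print(f"line {line_no}: {description}")
```

A check like this could fail the pull request outright, or simply flag the lines so the DBA's review time goes to the changes that carry the most risk.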

Chris: Gotcha. One of the things that I really like to see, and that I really push people to think about, ties back to some of the topics that we spoke about in previous episodes: this idea of being able to test realistically. So, whether that's having data that is representative of what you'll be running in prod (obviously not prod data, because then you've got different challenges, but representative data). But also, representative configuration and representative scale as well. Because things like performance tests and chaos tests are absolutely things you can bring into that pipeline - into that journey to production. And, you need those at different stages along the way - but making sure that you do those in a representative way really helps reinforce those requirements that you're trying to design towards. Those will evolve over time, and fortunately DevOps sets us up in a nice way, from agile practices, to be able to change those over time and adjust for those. But that then gives us that platform and test bed to be able to go and validate that: do we reach that level of twenty thousand users that we need to reach? Or, do we need to do some tweaking and some performance enhancements as part of the sprint, as an emergency change for example? It gives you that validation of quality ahead of time, right?

Abel: Validation is absolutely vital. The other thing that I wanted to point out when you're talking about quality - because this is really interesting - one of the things that we started doing a lot more now, once we've gotten this DevOps process down, is we do a lot more testing in production. I'm not saying you should do all of your testing in production, but so much testing happens in production already unintentionally, right? Because ultimately, it is impossible to absolutely replicate your production environment in a QA environment. And, even if you can afford to have an identical physical or virtual environment, you still can't replicate the traffic, right. You still can't replicate that. There are all sorts of tools out there that try. But at the end of the day, it is just not going to be the same. So we end up testing - unintentionally - a lot of stuff in production. So then, that got us thinking too - like - why don't we do this purposefully? And so that's when we started adding a lot of feature flags to protect a lot of the new features that we write. Then we would open up that feature flag a little bit at a time, and just let a tiny bit of traffic through to see the feature, so that we can actually test in production while minimising the impact of the new code. So that has been incredibly important and incredibly useful too.
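
A feature flag that is opened "a little bit at a time" can be sketched as follows. This is a minimal Python illustration under stated assumptions: the flag name, the `checkout` function and the hashing scheme are all hypothetical, and real feature flag services add things like targeting rules and remote configuration. The sketch hashes each user into a stable bucket from 0 to 99, so a user who sees the new feature at 10% still sees it when the rollout widens to 50%.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in the range 0-99 for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """The flag is on for users whose bucket falls below the rollout percentage."""
    return bucket(user_id, flag_name) < rollout_percent


def checkout(user_id, rollout_percent):
    # The flag itself is just an if statement around the new code path.
    if is_enabled("new_checkout", user_id, rollout_percent):
        return "new checkout flow"
    return "existing checkout flow"


users = [f"user-{n}" for n in range(1000)]
at_10 = {u for u in users if is_enabled("new_checkout", u, 10)}
at_50 = {u for u in users if is_enabled("new_checkout", u, 50)}
print(f"{len(at_10)} users at 10%, {len(at_50)} users at 50%")
```

Setting the percentage back to zero turns the new code path off for everyone without a redeployment, which is what makes this a low-risk way to test in production.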

Chris: And, I know it's something a lot of the cloud providers use. This idea of progressive exposure. When you think of something like the different Azure regions, or think about the different Azure feature sets, or maybe even Windows 10 and the different types of insider rings that you have. That idea of progressive exposure, and bringing those features out to those different users at different points. None of them are less worthy than others. They are just willing to take a different amount of risk. They are just willing to potentially try some new shinier features which may be a little bit rougher. But, in the end - they get those features and they're able to experience those new features ahead of time compared to other users. And, that's something that they see as value to them, right - bringing us full circle to that comment earlier.

Abel: Yeah.

Chris: Awesome. Well, first off, there's a lot of ground that we've covered there, Abel, so thank you for that! Do you have maybe any kind of closing thoughts - any top tricks, tips or things that you commonly see people falling into, for example, that might be good to leave some of our listeners with?

Abel: I think I want to jump back to the whole testing in production thing and feature flags. We read about this a lot, and I've come across so many of our customers, or so many enterprises where - I run into these teams and they look at the feature flags, and they're like - “Ooh, this could be super useful. But, we're not ready to start using feature flags, our apps aren't even designed to use feature flags”. I want to point out to people that feature flags have such huge benefits to them. They are incredibly easy to write, because, literally - they're an if statement. So if you can write an if statement, and if your code is organised in a way that you can use if statements… then guess what? You can use feature flags. But, before I push feature flags too much - I will point out - that there is a cost associated with using feature flags. If you use feature flags, you're adding complexity to your code. Writing feature flags is easy. Maintaining those feature flags… it can start getting really messy really quickly. So, if you imagine - if you have nested feature flags, then that starts getting complicated, right? But, I will say this - I think the benefits to having feature flags far outweigh the tax that you have to pay to maintain and use feature flags. So, there is definitely a cost involved - but man - they provide such a huge benefit to the code that you write.
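To make the "literally an if statement" point concrete, here is a minimal Python sketch. The `FLAGS` dictionary, the `new-greeting` flag and the function names are hypothetical, chosen only for illustration; a real system would usually read flag state from configuration or a flag service rather than a module-level dictionary.

```python
# Hypothetical in-memory flag state; unknown flags count as "off".
FLAGS = {"new-greeting": False}


def is_enabled(flag_name: str) -> bool:
    """Look up a flag, defaulting to off so a missing flag is safe."""
    return FLAGS.get(flag_name, False)


def greet(name: str) -> str:
    # The feature flag itself is nothing more than this if statement.
    if is_enabled("new-greeting"):
        return f"Welcome back, {name}!"  # new code path
    return f"Hello, {name}."  # existing code path
```

The maintenance cost mentioned above shows up quickly: each additional flag doubles the number of possible code-path combinations, so two nested flags already mean four combinations to reason about and test.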

Chris: Gotcha. It's that opportunity cost idea, isn't it? Can you afford to? Can you afford not to? I think once you ask those two questions - can I afford to do feature flags? Well, yeah - there is this investment. Can I afford not to? What do I lose? Well, actually - I lose this opportunity to potentially experiment with some users who are really passionate about the software and the product that I'm building. That's something really big that I could be missing out on there.

Abel: Yeah, that's super valuable, and then you also start getting the ability to separate your deployments from your releases, right? Instead of having those two tied hand in hand. We've been asked this forever - which is “Oh, well, okay, you know, Team A may produce Feature A and Team B produced Feature B, and they merged their code and they tested it. But, we only want to push out Feature B - but not Feature A”. Okay, how are you supposed to do that? Back in the day, without feature flags - the answer was not very well. You could try to cherry pick, but it basically always turned into a colossal mess. With feature flags, that becomes ridiculously easy: turn off the feature flag for A, turn on the feature flag for B. Bam, you're done. So, it really separates the ability to deploy versus to release your features. Then, having the ability to test in production - that's freaking amazing as well, because then you can start doing all these different types of experimentation that you need to be able to do. You can experiment by saying that I'm just going to let 10% of the traffic use the new stuff. Validate that, make sure that this is really really good stuff - and - once it is, then I'll open it up to the rest of my people. If it's not, well guess what - we can roll that back out, then tweak it and do whatever we need to. By rolling it back out I just mean turning the flag off. So, it makes managing all of that stuff so so so much easier, and again - even if your app is totally not designed for feature flags… The very next new feature you write - wrap it in a flag, and guess what? You're on your way. You have a bug fix? Wrap your bug fix in a feature flag. Hooray for that, you're done. The only other thing that I'll say about feature flags is once you are done with the flag, remove your flag as soon as possible. It is very tempting just to leave it alone, but now you're leaving a lot of crud. You're leaving a lot of branches within your code, right? Ultimately, it's an if statement, so you need to clean up your code as soon as you possibly can.
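The Feature A / Feature B scenario can be sketched in a few lines of Python. This is an illustrative assumption rather than a description of any real tooling: `FlagStore` is a hypothetical in-memory stand-in for wherever flag state actually lives, and the flag names are made up. Both features ship in the same deployment; which ones are released is decided by flag state alone.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlagStore:
    """Hypothetical in-memory flag store; real state would live outside the app."""

    states: Dict[str, bool] = field(default_factory=dict)

    def turn_on(self, name: str) -> None:
        self.states[name] = True

    def turn_off(self, name: str) -> None:
        self.states[name] = False

    def is_on(self, name: str) -> bool:
        # Flags that were never set are treated as off.
        return self.states.get(name, False)


def render_home_page(flags: FlagStore) -> List[str]:
    """Both features are deployed; flags decide which ones are released."""
    sections = ["header"]
    if flags.is_on("feature-a"):
        sections.append("feature A panel")
    if flags.is_on("feature-b"):
        sections.append("feature B panel")
    return sections
```

Releasing only Feature B is then `flags.turn_on("feature-b")` with `feature-a` left off, and rolling it back is `flags.turn_off("feature-b")`, with no redeploy or cherry-picking in either case. Once a feature is permanently on, deleting its `if` branch and flag entry is the clean-up step described above.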

Chris: Mm, I'm with you. But, I guess there's a balance - isn't there? Because - one of the potential opportunities that feature flags give you - is if you want to do a temporary rollback. Maybe thinking again about what we said a bit earlier - there's some issue that we discover in production that we didn't quite catch along our testing process. We need to turn it off for all users. So, that could be a great reason to keep the feature flag, right?

Abel: Absolutely, that's a great reason to keep the feature flag. In fact, at Microsoft we keep our feature flags around until it's been a couple of sprints and we know we're never going to roll that feature back out again.

Chris: Gotcha. Awesome. Well, Abel - I think there's a ton of interesting insights that we've discussed there. I'm sure the listeners will be taking many notes, and have a number of different things that they can go and start trying as part of their processes. So, thank you once again for coming onto the episode today. Hopefully we'll have an opportunity to talk more about some of these topics in the future as well.

Abel: Absolutely. Thank you so much for having me, this has been a lot of fun!

Chris: Thanks Abel!

Wow, what a whistle stop tour that was! We covered many different topics, and as always - it was an absolute pleasure to chat with Abel! If you have any follow-on thoughts, or want to continue the discussion - please get in touch either on Facebook or Twitter @CloudWithChris!

If you enjoyed this session, please do let me know - so I can arrange for similar topics in the future, potentially even covering some of those topic areas in further depth. We have a growing backlog of guests and ideas coming in, but I'm always looking for more! If you'd like to speak on the show, please complete the form up at cloudwithchris.com and I'll be in touch!

Thank you again for listening, and until next time… Goodbye!

Guests

Abel Wang

Abel Wang is a Principal Cloud Advocate and DevOps Lead at Microsoft, specializing in DevOps and Azure with a background in application development.

Hosts

Chris Reddington

Welsh Tech Geek, Cloud Advocate, Musical Theatre Enthusiast and Improving Improviser!

Chris is currently a Senior Engineer on Microsoft's FastTrack for Azure team.