2 - Cost Control

Posted on Sunday, Mar 15, 2020
Let's introduce Episode #2! This was a suggestion from the community, so a big thank you to Garth Niblock for this one! Last time we talked about requirements, and cost is one dimension of that - but a very important one. When we move into a cloud world, not only does the technology change, but the way that we think about cost changes as well. So this episode will be called ‘Cost Control’. Let's listen in…

Show Notes

Hello and welcome back to Cloud with Chris! You're with me - Chris Reddington, and we'll be talking about all things cloud. Now, just as a reminder - I would love to have any and all topic suggestions, so please get in touch with me on either Facebook or Twitter, @CloudWithChris. I'm also working through inviting a number of people to come onto the show as guests. If you'd be interested in joining me for an episode, then please get in touch as well! I'm also looking to increase my presence on YouTube, so if you could subscribe to my channel - Cloud with Chris, that would be greatly appreciated!

Let's introduce Episode #2! This was a suggestion from the community, so a big thank you to Garth Niblock for this one! Last time we talked about requirements, and cost is one dimension of that - but a very important one. When we move into a cloud world, not only does the technology change, but the way that we think about cost changes as well. So this episode will be called ‘Cost Control’. Let's listen in…

Let's start by setting the scene. We've decided we're going to make the journey to the cloud. One of the first considerations on our mind will be cost, as it was likely a driver for our decision to move towards the cloud in the first place (as is the case for many people!)

The important thing to bear in mind is that you shouldn't use cost as the only basis for your architectural decisions. Even though it is likely a key driver to move towards a cloud-based solution, it is not the only requirement you need to focus on. We spoke about requirements last time, and the importance of defining those up front for your solution. Cost is one of those pillars of requirements, but don't let it solely guide you down a path that could eventually lead to more cost, more complexity and potentially even misleading decisions later on down the line.

What we will do throughout this episode is explore some of these mindset shifts and considerations that we need to make. Let's first start by looking into our current mindset.

If we think about an on-premises world, we think about cost in a slightly different way. We consider the costs up-front as capital expenditure, and think of that as an up-front investment (often thought of as a sunk cost). This is different in the cloud, because we now pay for our cloud infrastructure on an ongoing, regular basis (typically monthly), which we can instead consider as ongoing operational expenditure.

The stereotypical (or comparable) example is paying your gas or electricity bill (or any similar service where you pay based on what you consume). This concept of “Pay as you Go” reduces the strain of up-front investment, and shifts our infrastructure into a realm that resembles a commodity. In other words, we only pay for what we use.
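
To make that pay-as-you-go idea concrete, here's a minimal back-of-the-envelope sketch in Python. All of the figures (the server price, the VM rate, the utilisation levels) are made-up assumptions for illustration, not real cloud pricing.

```python
# Illustrative CapEx vs OpEx comparison. Every figure here is a made-up
# assumption for the sake of the example, not real cloud pricing.

UPFRONT_SERVER_COST = 12_000.0  # capital expenditure: buy the hardware outright
VM_HOURLY_RATE = 0.25           # assumed pay-as-you-go rate for a comparable VM
HOURS_PER_MONTH = 730           # average hours in a month

def monthly_cloud_cost(utilisation: float) -> float:
    """Cost for the month if we only run (and pay for) the VM
    for the fraction of time we actually need it."""
    return VM_HOURLY_RATE * HOURS_PER_MONTH * utilisation

for utilisation in (1.0, 0.5, 0.25):
    print(f"{utilisation:>4.0%} utilisation: ~£{monthly_cloud_cost(utilisation):,.2f}/month")

# At 100% utilisation the VM costs ~£182.50/month, so it would take about
# 65 months to reach the £12,000 upfront figure. At 25% utilisation, because
# we only pay for what we use, the same workload costs ~£45.63/month.
```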

Dwell on that thought for a few moments. Infrastructure as a commodity, which can be treated as an operational expense. Is that a good thing, or a bad thing? It's not necessarily either - from a business perspective, it introduces opportunities as well as challenges.

On the plus side,

  • Depending on the exact nature of your business, this could bring something valuable directly to your entire business model, potentially even acting as a disruptor to that model. For example, you could change your business model from a license-based approach to an ongoing / subscription-based model, where you then have a consistent, predictable and continuous revenue stream.
  • You could potentially waste less money, because each month you have the chance to assess the costs from the previous month and reduce any overheads, potentially optimising your costs.

However, that last point could also be considered a challenge. I often hear that this approach to cloud billing (i.e. pay per use, per month) introduces some challenges, as it is difficult to predict or estimate your potential cost. While it's a challenge, I wouldn't say it's impossible - if you can implement the right operational procedures from a resource governance and financial monitoring perspective, then the challenge becomes easier to solve.

Remember what we talked about last time? Requirements? Requirements drive everything. Think about it - one of the topics that we covered in the last episode was having undefined, or loosely defined, requirements. If we cannot clearly articulate the needs of the business, then the architects may overcomplicate the proposed solution to cover some kind of “worst case scenario” (a very highly available system, for example), and therefore over-engineer against those requirements, which drives up cost - potentially unnecessarily.

I said it enough times in the last episode, but to drive the point home - don't over-engineer services for requirements that are not clearly defined. Pause, sit back, and work with the stakeholders to determine what is more of a priority. (When you show them example costings of differing architectural approaches, they'll be able to work towards a decision and help you approach that problem!)

Let's think of a tangible example. We are an organisation that focuses on retail, and we have a system that is non-essential to ongoing business operations, with a low SLA and relaxed recovery objectives. In this scenario, the business has noted that cost-saving is a priority.

  • It would not be appropriate to go for some kind of deployment where we deploy our resources in a highly available (or active/active) approach across regions, as we would effectively be doubling the costs.
  • Instead, we may want to look at some kind of single-region setup, maybe an active/passive approach, or some kind of Infrastructure as Code disaster recovery approach where we can easily spin up new deployments in alternate zones or alternate regions if we need to.
  • If we over-complicated this and went for multiple regions, then we would need to factor in the additional costs of running those key resources, but also things like replication costs and data egress/bandwidth costs between regions (there's a rough worked example just after this list).
  • We haven't yet defined in this scenario whether there is a need to communicate back to on-premises, or potentially to another cloud as well. If those are requirements, then they would potentially drive up the bandwidth costs, further increasing the costs of running our solution.
  • That’s before we even talk about the compliance requirements, which could of course drive the overall solution in a given direction. They may in fact rule out certain services as well. For example, if we need to adhere to PCI certification, then that may rule out multi-tenant options and, as a side-effect, increase the cost of running our solution.
  • And as a final thought, we want to ensure that we're using the right tool or the right service for the job. Just because we can store many different datatypes in a SQL database, or in a document store, doesn't mean we should. We should choose the right tool for the job, because if we can optimise our usage, we can typically optimise our cost alongside it. Choosing the right data store for the right scenario is vital, and we'll talk more about that in another episode!
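
To put a rough number on the multi-region point in the list above, here's an illustrative sketch. The compute figure, replication volume and per-GB egress rate are all assumed values for the example, not any provider's actual prices.

```python
# Illustrative estimate of the extra monthly cost a second region can add.
# All rates and volumes below are assumptions, not real cloud pricing.

SECOND_REGION_COMPUTE = 1_500.0  # assumed monthly cost of duplicating key resources
EGRESS_RATE_PER_GB = 0.08        # assumed inter-region bandwidth charge per GB
REPLICATED_GB_PER_MONTH = 5_000  # assumed data replicated between the regions

def active_active_premium() -> float:
    """Extra monthly spend for an active/active second region:
    duplicated compute plus cross-region replication traffic."""
    replication = EGRESS_RATE_PER_GB * REPLICATED_GB_PER_MONTH
    return SECOND_REGION_COMPUTE + replication

print(f"Active/active premium: ~£{active_active_premium():,.2f}/month")

# For a non-essential workload with a low SLA, this ~£1,900/month premium is
# exactly the kind of cost that a single-region or Infrastructure as Code
# disaster recovery approach avoids.
```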

Now overall, what I have talked about there is how important it is to understand our solution. Understanding our solution and what it needs to achieve is pivotal. Remember what we said last time - the main message of that episode was that context is key! The decision to move towards the cloud, or to re-architect an existing cloud-based solution, gives us a great opportunity to evaluate how we are currently approaching the problem. We can then identify whether that approach is appropriate and adjust as needed. As we go through that analysis, we may want to consider some key questions -

  • Is the workload sized appropriately? Commonly, I see that customers migrating from on-premises servers find that they can reduce the specs (CPU / RAM) of the machines that they are running, because they are over-provisioned for the purpose. This wasn't a problem in an on-premises world (because we had already paid the cost up-front). However, because we pay that cost every month in the cloud, it is something that we now want to optimise for. You will typically find that optimising performance, or fine-tuning your specifications, could delight you come your next bill.
    • Consider the scenario where you have 10 virtual machines, or 10 instances of some Platform as a Service option, each running an individual service and nothing more. Could we leverage an architectural pattern known as Compute Resource Consolidation? Compute Resource Consolidation is a fancy way of saying: can we run the components on a smaller amount of compute, which would therefore cost us less to run the workload? So in this context, rather than 10 instances, could we run the 10 components across perhaps 2 instances, saving on cost through some kind of co-location?
    • Remember what I said earlier though - this needs to be taken into consideration with the wider requirements (not compromising any availability requirements as part of this decision). A concrete example of this would be to transition from 10 VMs/instances towards an approach where we run the 10 services on fewer VMs. Or, as an alternative, we could containerise the applications and run them as pods on a Kubernetes cluster, for example. Fundamentally, we are increasing the density of the components deployed on those virtual machines, thereby reducing cost.
  • Can the workload scale? Commonly, people think of “scalability” from a performance perspective. But think for a moment of an Infrastructure as a Service (IaaS) based workload or an on-premises workload. They tend to be quite static and designed to handle peak-load scenarios (so provisioned at a level where they can handle the full amount of expected load). Instead, what if we could optimise for the usage patterns of the application? These usage patterns could either be driven by a time factor (seasonality - so certain hours of the day, days of the week or times of the year), or by metrics generated by the underlying infrastructure (such as CPU, queue length, etc.)
  • Whilst we're covering scalability, we should cover off the difference between scaling up and down compared with scaling out and in.
    • Consider a workload that runs on an individual machine. That machine will have a certain amount of RAM and CPU associated with it. On-premises, we would commonly think about “scaling up” that machine (increasing the amount of RAM or CPU in the machine, because we only have a finite amount of space in our datacenter). As a result, we find that we love and care for those machines, just like we would a pet. This is the same as changing the SKU of the virtual machine that we are using in the cloud, or the SKU of a Platform as a Service option. Scaling down is the opposite concept: taking that larger machine size and reducing its capability (i.e. reducing the amount of CPU/RAM that the machine has, or changing to a lower SKU in the cloud).
    • In contrast, scaling out is the idea of adding more machines to solve a problem (also known as adding more instances). We therefore have many “workers” (or machines) performing the same task in parallel. Once the task is complete, we can then scale in (i.e. remove those instances), to ensure we have the appropriate number of instances for the current level of load.
    • This mindset shift is typically because of the constraints that we had in an on-premises environment, where we had fixed space in our datacenters. In the cloud, we have an abundance of compute available to us, so we have the luxury of being able to throw more machines at a problem. If this is for a small amount of time and the workload is quite bursty, then it's likely to be more cost-effective to scale out for the time where we need that extra compute, run the workload at that number of instances, and then scale back in, rather than running the workload at a higher SKU for an ongoing period of time.
    • This also allows us to start thinking about deploying our workload as a “stamp”, having different performance tiers of our application (bronze, silver, gold; or low, medium and high loads), being able to deploy those across different regions, and predictably determining the level of load that certain deployments may be able to handle.
  • Now let's revisit the retail scenario from earlier. Your workload may have some kind of usage pattern -
    • Do they commonly use the solution between 9 - 5 on business days in a certain geographic region? Are there spikes at certain times of the year (given that we're a retailer, Black Friday could be a prime example)? Or are there common peaks (for example, is it a payroll system where we see more users accessing their payslips around pay-day every month)?
    • Each of the cloud providers has some concept of “autoscale” - the idea that we can trigger a scaling event reacting to some kind of metric, or proactively scale the number of instances based upon the current day or time. This starts us along our journey away from treating infrastructure as “pets”, and towards thinking about infrastructure as ephemeral - or, as some refer to it, “treating the infrastructure as cattle”.
    • Given this “variable” nature of cloud deployments, we also need to be aware that the level of load on the application may grow over time. This will of course mean that our costs may increase in line with that, as we need more instances to deal with the increased load. This raises an interesting question: what happens if we get a Distributed Denial of Service (DDoS) attack? Those kinds of attacks can do indirect damage from a cost perspective (as well as the direct damage to the availability of a workload, and the brand impact for an organisation). This is why autoscale rules typically have a maximum number of instances, so that you can control the boundaries - the minimum and maximum number of instances that you would expect to run (there's a small sketch of this just after the list). It is also why the main cloud providers have some kind of DDoS protection service (some of which even have cost guarantees, so they are worth evaluating and checking out).
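
To illustrate the autoscale boundaries mentioned in that last point, here's a minimal sketch of the kind of decision an autoscale engine makes. The thresholds and instance limits are assumed values, and in practice each cloud provider exposes this as configurable rules rather than code you write yourself.

```python
# Minimal sketch of a metric-driven autoscale decision with hard bounds.
# Thresholds and limits are illustrative assumptions; in reality, the cloud
# provider's autoscale service evaluates rules like these for you.

MIN_INSTANCES = 2   # floor: keep enough instances running for availability
MAX_INSTANCES = 10  # ceiling: caps the bill, e.g. during a DDoS attack

def desired_instances(current: int, avg_cpu_percent: float) -> int:
    """Scale out when CPU is hot, scale in when it's idle,
    and always stay within the configured bounds."""
    if avg_cpu_percent > 75:
        target = current + 1  # scale out
    elif avg_cpu_percent < 25:
        target = current - 1  # scale in
    else:
        target = current      # steady state
    return max(MIN_INSTANCES, min(MAX_INSTANCES, target))

print(desired_instances(current=4, avg_cpu_percent=90))   # -> 5 (scale out)
print(desired_instances(current=2, avg_cpu_percent=10))   # -> 2 (floor holds)
print(desired_instances(current=10, avg_cpu_percent=99))  # -> 10 (ceiling caps cost)
```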

Let's pause for a moment and think what we have discussed so far. We've looked at the mindset shift, moving from a capital expenditure model to an operational expenditure approach. We have once again reviewed the importance of requirements, and also the value in assessing our workload's current state and how we can potentially optimise for a cloud cost model.

Once we apply those changes, we then need to understand how to estimate costs of running the workload, and then measure that on an ongoing basis.

  • Each of the cloud providers has some kind of pricing calculator. These tools are your friend, and are well worth familiarising yourself with! Use these tools in parallel when you are designing your solutions. Whilst they're not going to give you a 100% accurate cost, they will give you a rough order of magnitude for the costings of your application, and give you the ability to discuss a “Bronze, Silver, Gold” approach, to help identify with your key stakeholders which solution is right.
  • I often hear people say that they can't estimate their costs because of the degree of variability and the scalable nature of their solution. I agree that this can be challenging, but it is not impossible. If you are bringing an existing workload, then you will roughly know how you run it today, and can use that as an initial basis for cost (for example, the number of instances required and the types of SKUs that you may need). You may need to add extra services along the way, and this may evolve over time, as you might expect of your architecture.
  • Make sure that you factor any extra environments into your costings. Most people think about Dev/Test/Production, but what about the cost of any Disaster Recovery environments? This of course depends on whether you are running Active/Active, Active/Passive or deploying new infrastructure through code in a disaster scenario, but it must still be planned up front!
  • Operational procedures are crucial. Who is responsible for tracking your spend at the individual application level, at the wider project level, and at the overall organisation level? Each of those individuals will be looking at cost from a different viewpoint and a different perspective.
  • All of the cloud providers have some form of tooling that allows you to set budgets, and alerts on those budgets, so that you can proactively determine if you are approaching some kind of “limit” that you have set. (Whilst I use the word limit, these are generally soft limits, as cloud providers don't want to risk stopping your production workloads! So it's up to you to actually take action and stop the workload if you need to because of costs.)
  • That point leads us on to the importance of good governance practices. In particular, being able to associate your resources with some kind of metadata, so that you can report back on the metrics that are most important and most relevant to you - for example, environment (e.g. Dev, Test, Prod), which microservice costs the most to run, or which team is running up the highest bill (there's a small illustrative sketch just after this list). On that last point, just because a team has a high cost month-on-month doesn't mean they're doing a bad job. If they're running the core platform infrastructure that underpins a number of applications, you would likely expect it to cost more, given the resilience it needs to provide. Again, you need to look at this from the right perspective and in the right context.
  • Thinking on governance, we should also consider things like policies. Are there certain rules or items that we can enforce, or get the platform to automatically apply or audit, so that we can ensure a base level of information is being captured about our resources, or potentially restrict what our users can actually go and do?
  • Overall, when you think about the things that we are talking about here, we're taking some concepts that we know from DevOps - in particular, being data-driven. We are now billed every month, which means that if we see a spike in costs one month, we can take action, remediate and address any potential issues, so that the cost is optimised for the next month.
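
As a rough illustration of combining tagging with budgets and alerts, the sketch below groups some invented billing line items by their tags, and flags any environment approaching its budget. The data, budgets and alert threshold are all made up for the example; in a real setup, this information would come from your cloud provider's cost management tooling or billing exports.

```python
from collections import defaultdict

# Invented billing line items, each tagged with environment and team.
# Real data would come from your cloud provider's billing/cost export.
line_items = [
    {"cost": 420.0, "tags": {"env": "prod", "team": "platform"}},
    {"cost": 310.0, "tags": {"env": "prod", "team": "checkout"}},
    {"cost": 95.0,  "tags": {"env": "dev",  "team": "checkout"}},
    {"cost": 60.0,  "tags": {"env": "test", "team": "platform"}},
]

BUDGETS = {"prod": 800.0, "dev": 100.0, "test": 100.0}  # assumed soft limits
ALERT_THRESHOLD = 0.9  # warn when spend reaches 90% of a budget

def spend_by_tag(items: list[dict], tag_key: str) -> dict[str, float]:
    """Aggregate cost by the value of a given tag."""
    totals: defaultdict[str, float] = defaultdict(float)
    for item in items:
        totals[item["tags"][tag_key]] += item["cost"]
    return dict(totals)

print("Spend by team:", spend_by_tag(line_items, "team"))

for env, spent in spend_by_tag(line_items, "env").items():
    budget = BUDGETS[env]
    if spent >= ALERT_THRESHOLD * budget:
        # Remember: this is a soft limit - it's on us to act on the alert.
        print(f"ALERT: {env} is at £{spent:.2f} of its £{budget:.2f} budget")
```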

Before we wrap up - let's think of a few final quick-and-easy wins to reduce costs directly in our workload.

  • Is there the potential to optimise and use something like a low-priority or spot instance? If the workload doesn't need guaranteed compute, and can rely on these low-cost instance types to burst, then that could be a great way to save on costs.
  • Can you commit a certain amount of money up front? If you can predict the number of instances that you will need, then you may be able to use the kind of offer that cloud providers typically have around “Reserved Instances”, which can give you good discounts (see the rough comparison just after this list).
  • Typically, cloud providers also have some kind of deal or subscription for workloads which are not running in production, where you can shave some of those costs off as well. This does typically mean that you forfeit any formal SLA on resources being used in that context though, so do be mindful of that trade-off!
  • Can you transfer any kind of license benefits that you get elsewhere into your cloud deployment? For example, if you are running a virtual machine and the software that you are running requires some kind of license, could you potentially save some of that cost by porting existing licenses and just pay for the underlying infrastructure? Rather than the bundled license and infrastructure costs?
  • Remember to provision resources starting on a smaller tier, and scale up/out as needed - again, focusing on your requirements.
  • And on that note, regularly review the components that you are using and make sure that you are using them effectively.
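
To give a feel for how a couple of those levers stack up, here's an illustrative comparison of purchase options for a single VM. The baseline rate and the discount percentages are assumed round numbers, not any provider's actual rates.

```python
# Illustrative comparison of purchase options for one VM. The baseline rate
# and discounts are assumed round numbers, not real provider pricing.

ON_DEMAND_HOURLY = 0.20
HOURS_PER_MONTH = 730

# Multipliers applied to the on-demand baseline (assumed discounts).
options = {
    "on-demand (pay as you go)": 1.00,
    "reserved (assumed ~40% off)": 0.60,  # commit up front for a discount
    "spot (assumed ~70% off)": 0.30,      # cheap, but can be evicted
}

for name, multiplier in options.items():
    monthly = ON_DEMAND_HOURLY * HOURS_PER_MONTH * multiplier
    print(f"{name:<30} ~£{monthly:,.2f}/month")

# Spot only suits interruptible work (no guaranteed compute), and reserved
# pricing only pays off if you can predict the instances you'll need.
```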

We have covered a lot of ground on points to consider from a cost perspective -

  • Make sure that your requirements are driving your decisions - Remember that cost is just one of those pillars.
  • Make sure that you are treating your resources as a commodity, because that enables you to deeply understand your workload and the nuances of how you may be able to optimise it (such as right-sizing, or using scale-out and scale-in capabilities), rather than having some kind of static workload.
  • Be data-driven. Regularly review your usage, alongside your budgets and factor that into your backlog to optimise where appropriate.

Thank you for joining me on this episode. As always, I appreciate your feedback and would love to hear from you. If you have any suggestions for future episodes, or would like to join me as a guest - then please just get in touch! You can do that on either Twitter or Facebook, @CloudWithChris. And finally, please don't forget to check out and subscribe to my YouTube channel, Cloud with Chris. Until next time, goodbye!

Hosts

Chris Reddington


Welsh Tech Geek, Cloud Advocate, Musical Theatre Enthusiast and Improving Improviser!

Chris is currently a Senior Engineer on Microsoft's FastTrack for Azure team.