Red flag #1: You’re paying for AWS CloudTrail
If you find yourself paying for CloudTrail at all, you likely have an opportunity to save money on your cloud bill. CloudTrail delivers the first copy of management events in each region for free, so charges usually come from additional trails that record the same events. Ideally, your only trail should be at the AWS Organization level, because an Organization trail is automatically applied to all member accounts within the Organization.
Keep in mind that the new trail is created in addition to any existing trails in member accounts. Therefore, if you created separate trails in member accounts in the past, you have an opportunity to save money by removing them.
Creating your CloudTrail trail at the Organization level will help you uniformly apply and enforce your event logging strategy across the accounts in your organization, because the organization trail configuration propagates to all accounts. As a result, you should verify that the configuration of your Organization trail matches how you want trails configured for all accounts within it.
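As a rough sketch of the cleanup described above, the following Python function takes trail records shaped like the `trailList` entries returned by CloudTrail's DescribeTrails API (for example via boto3's `describe_trails()`) and separates the Organization trail from account-level trails that may be redundant. The function name and the sample data are hypothetical. Review each candidate before deleting anything, since an older trail may capture events the Organization trail is not configured to record.

```python
from typing import Dict, List


def split_trails(trails: List[Dict]) -> Dict[str, List[str]]:
    """Separate Organization trails from account-level trails.

    Each dict is assumed to look like an entry in the `trailList`
    returned by CloudTrail's DescribeTrails API.
    """
    result: Dict[str, List[str]] = {"organization": [], "review_for_removal": []}
    for trail in trails:
        # IsOrganizationTrail is absent or False on ordinary account trails.
        if trail.get("IsOrganizationTrail", False):
            result["organization"].append(trail["Name"])
        else:
            result["review_for_removal"].append(trail["Name"])
    # Only flag candidates when an Organization trail actually exists;
    # otherwise the account trails are the only logging in place.
    if not result["organization"]:
        result["review_for_removal"] = []
    return result


# Hypothetical sample data for illustration only.
sample_trails = [
    {"Name": "org-trail", "IsOrganizationTrail": True, "HomeRegion": "eu-west-1"},
    {"Name": "legacy-team-trail", "IsOrganizationTrail": False, "HomeRegion": "eu-west-1"},
]
report = split_trails(sample_trails)
print(report)
```

The function deliberately reports nothing for removal when no Organization trail is present, so it never suggests deleting the only trail an account has.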
Red flag #2: Storage costs are steadily increasing due to lack of data lifecycle policies
If you're seeing your cloud storage costs steadily increase over time, that might be a sign you don’t have the appropriate object lifecycle policies applied.
Object lifecycle policies automate the process of transitioning data to different storage classes, or deleting data based on predefined rules, aligning the storage cost with the value and accessibility of the data. This ensures that you're not overpaying for storing data that doesn't require immediate access or that has become obsolete.
Without lifecycle policies, you’ll end up with a build-up of data, an ever-growing log store, and/or excess snapshots. As a result your storage costs will trend upward, especially if older or less frequently accessed data remains in high-cost storage tiers.
Most of the time, it’ll suffice to transition or expire objects after 30-90 days. But the telltale sign that storage is worth examining more closely is costs that keep trending up.
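To make the 30-90 day guideline concrete, here is a minimal sketch that builds an S3 lifecycle configuration as a Python dict: objects move to an infrequent-access storage class after 30 days and are deleted after 90. The `logs/` prefix, the day values, and the helper name are assumptions for illustration. The resulting dict has the shape boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.

```python
import json
from typing import Dict


def build_lifecycle_config(prefix: str, transition_days: int, expire_days: int) -> Dict:
    """Build a lifecycle configuration dict in the shape S3 expects."""
    if expire_days <= transition_days:
        raise ValueError("expire_days must be greater than transition_days")
    return {
        "Rules": [
            {
                "ID": f"tier-then-expire-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Move objects to a cheaper, infrequent-access class first.
                "Transitions": [
                    {"Days": transition_days, "StorageClass": "STANDARD_IA"}
                ],
                # Then delete them once they are no longer needed.
                "Expiration": {"Days": expire_days},
            }
        ]
    }


config = build_lifecycle_config("logs/", transition_days=30, expire_days=90)
print(json.dumps(config, indent=2))
```

Choose the day values per data set: compliance logs may need to be retained far longer than build artifacts or temporary exports.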
Red flag #3: GetMetricData API costs from CloudWatch are high
Third-party services like New Relic and Datadog scan your cloud accounts — typically CloudWatch metrics — regularly to get up-to-date info on your usage.
However, many don't realize you also pay for the API requests these services make. These API requests are reflected in CloudWatch in the “GetMetricData” API SKU. If you’re not careful, you’ll end up paying a huge amount of money for CloudWatch because of these GetMetricData API calls coming from the third-party software.
As a result, you need to be mindful of:
The frequency of these API calls, and
What metrics and data are being scanned
For example, you may have a dev account with many resources where CloudWatch costs are high because API calls are being made every minute. In situations like this, ask yourself whether that frequency, and perhaps the granularity of the data, is really necessary.
To reduce the CloudWatch costs coming from these API calls, in many cases you can simply ask the third-party services to tweak the frequency and metrics being pulled for specific accounts/projects.
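A back-of-the-envelope estimate shows why polling frequency matters. The sketch below assumes GetMetricData is billed per 1,000 metrics requested and uses $0.01 as an assumed price; check the current CloudWatch pricing for your region. The metric count of 5,000 is also a hypothetical figure.

```python
def monthly_getmetricdata_cost(
    metric_count: int,
    poll_interval_minutes: float,
    price_per_1000_metrics: float = 0.01,  # assumed price; verify for your region
    days: int = 30,
) -> float:
    """Estimate the monthly cost of polling metrics at a fixed interval."""
    polls_per_month = (24 * 60 / poll_interval_minutes) * days
    metrics_requested = metric_count * polls_per_month
    return metrics_requested / 1000 * price_per_1000_metrics


# Hypothetical dev account with 5,000 metrics being scraped.
every_minute = monthly_getmetricdata_cost(5000, poll_interval_minutes=1)
every_five = monthly_getmetricdata_cost(5000, poll_interval_minutes=5)
print(f"1-minute polling: ${every_minute:,.2f}")
print(f"5-minute polling: ${every_five:,.2f}")
```

Under these assumptions, moving from one-minute to five-minute polling cuts the estimate by a factor of five, and dropping metrics nobody looks at reduces it further in direct proportion.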
Red flag #4: Logging costs are >20% of your cloud bill
While logging is essential for monitoring and troubleshooting, excessive logging can inflate your cloud bill.
Much like the advice on API calls in the previous red flag, you should ask yourself whether the frequency and metrics you’re gathering from your logging fit the use case. For example, if you’re feeding a dashboard with data from logs, you don’t necessarily need per-second updates; an update every five minutes may suffice.
As a rule of thumb, you shouldn’t be spending more than 20% of your cloud bill on logging. If you’re exceeding that 20% threshold, it’s a sign you should take a closer look at what makes up those logging costs. Ask the various teams using your logs what they’re using them for, and you might be able to determine where you can tweak the frequency or the logging metrics you’re gathering.
Additionally, be sure to take a close look at logging in non-production environments, as those aren’t necessarily making you money. You likely won’t need the same frequency or metrics that you would be tracking in production accounts. If something breaks in non-production environments, you can just switch logs on and off, unlike with production where you’ll need more information as to why something broke.
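The 20% rule of thumb is easy to check once you have costs grouped by service. This is a minimal sketch: the bill figures are invented, and which services count as "logging" is an assumption you would adjust to your own setup.

```python
from typing import Dict, Set

LOGGING_SHARE_THRESHOLD = 0.20  # the 20% rule of thumb from the text


def logging_share(costs_by_service: Dict[str, float], logging_services: Set[str]) -> float:
    """Return logging spend as a fraction of the total bill."""
    total = sum(costs_by_service.values())
    if total == 0:
        return 0.0
    logging_total = sum(
        cost for service, cost in costs_by_service.items()
        if service in logging_services
    )
    return logging_total / total


# Hypothetical monthly bill, grouped by service.
bill = {"EC2": 6000.0, "S3": 1500.0, "CloudWatch Logs": 2000.0, "CloudTrail": 500.0}
share = logging_share(bill, {"CloudWatch Logs", "CloudTrail"})
over_threshold = share > LOGGING_SHARE_THRESHOLD
print(f"Logging is {share:.0%} of the bill; over threshold: {over_threshold}")
```

Running the same calculation per account makes it easier to spot non-production environments that log as heavily as production.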
Red flag #5: Not cross-checking decisions
While this red flag isn’t a specific item you can spot in your bill or a cost and usage report, not cross-checking your technology and design decisions can cascade into cloud bill headaches and performance problems down the road.
A very common example of this is when one team purchases a commitment discount (e.g., a Savings Plan) in isolation, without taking the broader organizational strategy or other teams’ needs into account. The tech department may want to go serverless in the next two years, but suddenly someone purchases a 3-year Savings Plan, locking them into using VMs.
There is no “objectively right” decision when building in the cloud. Making cloud-related decisions in isolation can lead to inefficiencies and increased costs. Ask yourself how you are checking internally with peers and/or other teams that your decisions are sound. More tangibly, this could mean that you’re validating cloud infrastructure decisions against the engineering strategy and vision, or the relevant RFC/ADR documents.
Red flag #6: You’re not limiting regions and instance types with organizational policies
Organizational policies (the Organization Policy Service in Google Cloud; service control policies in AWS Organizations) help you define how your cloud users can access, utilize, and manage your company’s cloud resources.
From a cost optimization (and security) perspective, they’re especially useful for making sure that people don’t spin up services where they shouldn’t.
Specifically, without organizational policies in place to restrict instance types and regions, you're exposing your cloud infrastructure to security vulnerabilities and overspending risks. Malicious actors can exploit this lack of control to deploy instances in unused regions, evading detection and carrying out their malicious activities.
Limit instance types and regions to the ones you use so that nobody — whether it is done maliciously or by mistake — can spin up, for example, an x1 instance instead of a t4 in South America when all of your resources are in Europe.
By implementing org policies, you can effectively safeguard your cloud environment and optimize resource utilization.
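On AWS, restrictions like these are typically written as a service control policy (SCP). The sketch below generates one in Python using the `aws:RequestedRegion` and `ec2:InstanceType` condition keys. The allowed regions and instance types are assumptions for illustration, and the region statement is deliberately simplified: a blanket deny like this also blocks global services such as IAM, so real policies usually exempt those with a `NotAction` list.

```python
import json
from typing import Dict, List


def build_restriction_policy(
    allowed_regions: List[str], allowed_instance_types: List[str]
) -> Dict:
    """Build an SCP-style document denying unapproved regions and instance types."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Simplified: production policies usually exempt global services.
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            },
            {
                # Block launching any instance type that is not on the allow list.
                "Sid": "DenyOtherInstanceTypes",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    "StringNotEquals": {"ec2:InstanceType": allowed_instance_types}
                },
            },
        ],
    }


# Hypothetical allow lists for an organization running only in Europe.
policy = build_restriction_policy(
    allowed_regions=["eu-west-1", "eu-central-1"],
    allowed_instance_types=["t4g.micro", "t4g.small"],
)
print(json.dumps(policy, indent=2))
```

Generating the policy from allow lists keeps the list of approved regions and instance types in one reviewable place.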
Red flag #7: Excess API calls to storage buckets
Frequent and unnecessary API calls to storage buckets can inflate storage costs and affect performance. This red flag manifests itself in many different situations.
Your application(s) could be making frequent API calls to cloud storage buckets. This can be particularly problematic for applications that generate a high volume of data or perform frequent data transfers. Or perhaps it's a scheduled process that interacts with your storage buckets, slowly accumulating a significant number of API calls over time.
Beyond cost, frequent API calls can also hurt your application’s performance, causing slowdowns, timeouts, and even system outages.
Without proper monitoring, you can easily overspend without raising alarms until your bill comes in or a service quota limit is reached.
As a result, you should review and optimize your application code to minimize the number of API calls required for each operation. Additionally, implement caching mechanisms to store frequently accessed data in memory and reduce the need for repeated API calls to the storage buckets.
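As one possible shape for the caching mentioned above, here is a minimal in-memory cache with a time-to-live (TTL). It wraps any fetch function and counts how often the backend is actually called. The class, the `fake_fetch` stand-in for a storage read, and the 300-second TTL are all hypothetical; in a real application the fetch function would be your storage client's read call.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache fetched objects in memory for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.backend_calls = 0  # how many times we really hit storage

    def get(self, key: str) -> bytes:
        now = self._clock()
        cached = self._entries.get(key)
        # Serve from memory while the entry is still fresh.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        # Otherwise call the backend once and remember the result.
        self.backend_calls += 1
        value = self._fetch(key)
        self._entries[key] = (now, value)
        return value


def fake_fetch(key: str) -> bytes:
    """Stand-in for a storage bucket read; replace with a real client call."""
    return f"contents of {key}".encode()


cache = TTLCache(fake_fetch, ttl_seconds=300)
for _ in range(100):
    cache.get("config/settings.json")
print(cache.backend_calls)
```

In this example, 100 reads of the same object result in a single backend call. The trade-off is staleness: pick a TTL that matches how quickly the underlying object actually changes.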
Asking the right questions of your cloud bill
While we can provide you with a list of red flags to look for, the list of items to look at is endless. For this reason, it’s crucial in the long term to be curious about your cloud spend. Don’t just accept that your bill is the amount it is. Ask why, then ask why again.
For example, if S3 costs are increasing, ask which bucket(s) are driving that increase. Then ask which SKUs are responsible for the cost increase in the buckets. Then, when you discover that it was data transfer costs, ask yourself and your team whether that increase in data transfer costs for those buckets was anticipated or not. Perhaps the costs increased for a good reason, but you won’t know unless you ask.
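The "ask why, then ask why again" drill-down can be sketched in a few lines: compare two months, rank buckets by cost increase, then rank SKUs within the top bucket. The line items, bucket names, and SKU labels below are invented for illustration; in practice they would come from your cost and usage report.

```python
from collections import defaultdict
from typing import List, Tuple

# (month, bucket, sku, cost) -- hypothetical line items.
LineItem = Tuple[str, str, str, float]
BUCKET, SKU = 1, 2  # tuple positions used as grouping keys


def cost_increase_by(
    items: List[LineItem], key_index: int, previous: str, current: str
) -> List[Tuple[str, float]]:
    """Rank groups by how much their cost grew between two months."""
    deltas = defaultdict(float)
    for item in items:
        month, cost = item[0], item[3]
        if month == current:
            deltas[item[key_index]] += cost
        elif month == previous:
            deltas[item[key_index]] -= cost
    return sorted(deltas.items(), key=lambda pair: pair[1], reverse=True)


items: List[LineItem] = [
    ("2024-01", "assets", "storage", 100.0),
    ("2024-01", "assets", "data-transfer", 20.0),
    ("2024-01", "backups", "storage", 50.0),
    ("2024-02", "assets", "storage", 105.0),
    ("2024-02", "assets", "data-transfer", 95.0),
    ("2024-02", "backups", "storage", 52.0),
]

# First "why": which bucket drives the increase?
by_bucket = cost_increase_by(items, BUCKET, "2024-01", "2024-02")
top_bucket = by_bucket[0][0]

# Second "why": which SKU inside that bucket is responsible?
in_top_bucket = [item for item in items if item[BUCKET] == top_bucket]
by_sku = cost_increase_by(in_top_bucket, SKU, "2024-01", "2024-02")
print(top_bucket, by_sku[0])
```

The code only gets you to the question; whether the increase it surfaces was anticipated is still something to ask the team that owns the bucket.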
Over time, this contributes to a cost optimization culture across your company where everyone is aware of their contribution to the cloud bill and feels empowered to take action on it.
Each of these red flags underscores the importance of shared responsibility and being curious about your cloud bill. Remember, identifying these optimization opportunities shouldn't rest solely on the shoulders of the Head of Infrastructure or FinOps Lead. It's a journey that thrives on teamwork.