In the world of SRE, we talk a lot about observability and its three pillars: metrics, logs, and traces. In most cases, these pillars cover infrastructure or business signals, for example the current CPU usage of a Kubernetes pod or the number of videos played by online users per minute. Beyond these, another important area is being explored nowadays: FinOps.
“FinOps is an evolving cloud financial management discipline and cultural practice that enables organizations to get maximum business value by helping engineering, finance, technology and business teams to collaborate on data-driven spending decisions.” (FinOps Foundation Technical Advisory Council)
Adopting FinOps enables us to answer some key questions about our cloud investments:
Many cost management products are available for analyzing cloud costs. Generally speaking, they fall into two categories:
The tools in the first category usually support only one provider, since they exist to report that provider's own costs. Furthermore, cost information is displayed separately from other operational metrics. On AWS, for example, developers check business metrics in CloudWatch. If the cost of an EKS cluster spikes, they may have to switch back and forth between CloudWatch and Cost Explorer to correlate costs with system workload.
The tools in the second category are designed with multi-cloud in mind. For example, Datadog Cloud Cost Management currently supports AWS and Azure. Datadog and other vendors keep improving their products and will likely support more cloud providers over time. However, they have to start with the mainstream providers and cannot collect every kind of cloud cost for their users. There are also other challenges for an organization getting onboard:
Cost information is just like any other operational metric. It can be used to identify underutilized resources, track the cost of application changes, understand the impact of those changes on costs, and make informed decisions about future changes. When designing cost metrics, we recommend following FOCUS, a technical specification that aims to establish an open standard for cloud cost, usage, and billing data. When visualizing cost metrics, we recommend placing them on the same dashboard as operational metrics.
To provide a vendor-neutral solution, we would like to implement a set of cost exporters for different vendors. The idea is to use each vendor's cost-related APIs to fetch the cost data and then transform it into standard Prometheus metrics. In this way, we can address the challenges mentioned in the previous section:
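To illustrate what "following FOCUS" might look like when designing a cost metric, here is a minimal Python sketch. It is not taken from any exporter implementation. The billing row is fake, the column names (`ProviderName`, `ServiceName`, `RegionId`, `BillingCurrency`, `BilledCost`) reflect our reading of the FOCUS column set and should be checked against the version of the specification you target, and the `cloud_cost_` metric prefix and both helper functions are hypothetical.

```python
# Fake billing row keyed by FOCUS-style column names (assumed, verify against the spec).
focus_row = {
    "ProviderName": "AWS",
    "ServiceName": "Amazon Elastic Kubernetes Service",
    "RegionId": "us-east-1",
    "BillingCurrency": "USD",
    "BilledCost": 12.34,
}


def to_snake_case(name: str) -> str:
    """Convert a PascalCase column name into a snake_case label name."""
    out = []
    for i, ch in enumerate(name):
        if ch.isupper() and i > 0:
            out.append("_")
        out.append(ch.lower())
    return "".join(out)


def focus_row_to_sample(row: dict, value_column: str = "BilledCost"):
    """Split a row into (metric name, labels, value).

    The value column becomes the metric value; every other column
    becomes a label so the cost can be sliced on a dashboard.
    """
    labels = {
        to_snake_case(key): str(value)
        for key, value in row.items()
        if key != value_column
    }
    metric_name = "cloud_cost_" + to_snake_case(value_column)  # hypothetical prefix
    return metric_name, labels, float(row[value_column])


if __name__ == "__main__":
    print(focus_row_to_sample(focus_row))
```

Deriving label names from one shared column set means that costs from different vendors end up with the same labels, which is what makes it practical to put them next to operational metrics on a single dashboard.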
Here is an example that shows a visualization of cost metrics from different cloud providers in one Grafana dashboard (using fake data).
The following figure shows the overall design of this solution.
Here is a list of the implemented cost exporters. MongoDB Atlas Cost Exporter will be the next one :)
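To make the exporter idea concrete (fetch cost data from a vendor API, then expose it as Prometheus metrics), here is a minimal, standard-library-only Python sketch. It is an illustration under stated assumptions, not the code of any exporter listed here: `fetch_fake_costs` stands in for a real vendor API call and returns fake data, and the metric name `cloud_cost_daily` and its labels are hypothetical. The output follows the Prometheus text exposition format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class CostSample:
    provider: str
    service: str
    currency: str
    amount: float


def fetch_fake_costs() -> List[CostSample]:
    """Placeholder for a vendor cost API call; returns fake data."""
    return [
        CostSample("aws", "eks", "USD", 12.5),
        CostSample("azure", "aks", "USD", 7.25),
    ]


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_prometheus(samples: Iterable[CostSample],
                      metric: str = "cloud_cost_daily") -> str:
    """Render cost samples as one gauge in the text exposition format."""
    lines = [
        f"# HELP {metric} Daily cost reported by the vendor billing API.",
        f"# TYPE {metric} gauge",
    ]
    for s in samples:
        label_pairs = [
            ("provider", s.provider),
            ("service", s.service),
            ("currency", s.currency),
        ]
        labels = ",".join(
            '{}="{}"'.format(key, escape_label_value(val))
            for key, val in label_pairs
        )
        lines.append(f"{metric}{{{labels}}} {s.amount}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_prometheus(fetch_fake_costs()), end="")
```

A real exporter would serve this text on an HTTP `/metrics` endpoint, typically through an official Prometheus client library rather than hand-rolled formatting, and would refresh the samples on a schedule that suits how often the vendor updates its billing data.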