Build a solution to track DORA metrics #125

Open
opened 2024-03-12 10:23:21 -07:00 by Jafner · 4 comments
Owner

We want to track the four DORA metrics:

  • Deployment frequency.
  • Lead time for changes.
  • Time to restore service.
  • Change failure rate.

Implementation will require answering some questions specific to our little homelab environment.

  1. Do we track metrics for the entire "organization"? Or on a per-application basis?
  2. Do we filter our deployment frequency data to days with commits to code? Or do we include days, weeks, or months without any development?
  3. Where do we get data from? Where do we send it? And how do we present it?

https://docs.gitlab.com/ee/user/analytics/dora_metrics.html
https://cloud.google.com/blog/products/devops-sre/using-the-four-keys-to-measure-your-devops-performance
https://dora.dev/

Author
Owner

Metrics: Organization-wide, or per-application?

Given that DORA metrics are intended to predict the performance of an organization, and that each metric depends on a service getting regular development attention, tracking on a per-application basis would be both less useful and less feasible.

We'll track org-wide.

Author
Owner

Dataset filtering for "deployment frequency"

This "organization" often goes weeks or months without a commit to the repo. This would bring down the "deployment frequency" metric in a way that is not helpful, or in the spirit of the metric.

We'll filter our "deplyoment frequency" dataset to just "development days", where at least one commit is made to functional code (as opposed to documentation).

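A minimal sketch of how a "development day" could be identified from the Git history. The docs-only path patterns (`docs/`, `README`, `*.md`) are placeholders for whatever this repo actually treats as non-functional code.

```python
# Sketch: derive "development days" from git log output.
# Assumes a local clone; the documentation path patterns are illustrative placeholders.
import subprocess
from datetime import date

DOC_PREFIXES = ("docs/", "README")  # placeholder: paths that don't count as functional code

def development_days(repo_path: str, since: str = "90 days ago") -> set[date]:
    """Return the set of dates with at least one commit touching functional code."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--name-only", "--pretty=format:%H %ad", "--date=short"],
        capture_output=True, text=True, check=True,
    ).stdout
    days: set[date] = set()
    current_day = None
    for line in log.splitlines():
        if not line.strip():
            continue
        parts = line.split()
        if len(parts) == 2 and len(parts[0]) == 40:  # "<hash> <yyyy-mm-dd>" header line
            current_day = date.fromisoformat(parts[1])
        elif current_day and not line.endswith(".md") and not line.startswith(DOC_PREFIXES):
            days.add(current_day)  # a non-documentation file changed on this day
    return days
```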
Author
Owner

The Data

We need the following data in order to track these metrics:

  • Hashes and timestamps for commits which modify functional code. Should be available from the Git server platform.
  • Commit hashes, timestamps, and success/failure status for completed deployment pipelines triggered by changes to functional code. Should be available from the CI/CD platform.
  • Timestamps for service outages and restorations. Should be available from the observability platform.

From those, we can compute:

  • Deployment frequency as a count of completed (not necessarily successful) deployments to production over some period of time (usually a day).
  • Lead time for changes as the difference between the oldest commit timestamp in a deployment and the timestamp of the completion of the deployment.
  • Time to restore service as the difference between timestamps of the first service outage and the final service restoration (e.g. if service A goes down, then service B goes down, then service A is restored, then service B is restored, we would use the time between service A going down and service B being restored).
  • Change failure rate as the rate of deployments which either fail to deploy or immediately result in a service outage (how to correlate?).

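To make those definitions concrete, here's a rough sketch of the four computations over simple in-memory records. The record shapes (`Deployment`, `Outage`) are assumptions about what the webhooks will eventually deliver, not an existing schema.

```python
# Sketch of the four metric computations. Record shapes are assumptions,
# not a settled schema; timestamps are datetime objects.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    completed_at: datetime
    oldest_commit_at: datetime   # oldest commit included in the deployment
    succeeded: bool
    caused_outage: bool          # correlating deployments with outages is still an open question

@dataclass
class Outage:
    started_at: datetime
    restored_at: datetime

def deployment_frequency(deploys: list[Deployment], days: float) -> float:
    """Completed (not necessarily successful) deployments per day over the window."""
    return len(deploys) / days

def lead_time_for_changes(deploys: list[Deployment]) -> timedelta:
    """Mean of (deployment completion - oldest included commit) across deployments."""
    deltas = [d.completed_at - d.oldest_commit_at for d in deploys]
    return sum(deltas, timedelta()) / len(deltas)

def time_to_restore(incident: list[Outage]) -> timedelta:
    """First outage start to final restoration across one incident's (possibly overlapping) outages."""
    return max(o.restored_at for o in incident) - min(o.started_at for o in incident)

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that failed or immediately caused an outage."""
    failed = sum(1 for d in deploys if not d.succeeded or d.caused_outage)
    return failed / len(deploys)
```
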
As for ingesting, processing, and presenting the data:

  1. We can append webhooks to the end of our CD pipelines to pass information from our Git and CI/CD servers into the system. The webhooks should include: timestamp of pipeline completion, list of Git hashes included in the deployment, timestamp of oldest Git commit included in the deployment, return status of the deployment (success/fail).
  2. We can configure webhook notifications for our observability platform which include service ID, timestamp, and event type (outage/restoration).
  3. We're gonna roll our own solution for ingesting and processing the data. Not sure I want to build a frontend yet, so we'll need an API to query the current metrics over some time slice (a rough sketch of the endpoints is below).
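A rough sketch of what the ingest endpoints and the metrics query API could look like, assuming Flask; the route names, field names, and in-memory storage are placeholders, not a settled design.

```python
# Sketch of the ingest + query API, assuming Flask. Routes and field names are
# placeholders; storage is an in-memory list instead of a real database.
from datetime import datetime
from flask import Flask, request, jsonify

app = Flask(__name__)
deployments: list[dict] = []
outage_events: list[dict] = []

@app.post("/webhooks/deployment")
def ingest_deployment():
    # Expected payload (from the CD pipeline webhook), field names assumed:
    # {"completed_at": ..., "commit_hashes": [...], "oldest_commit_at": ..., "status": "success" | "fail"}
    deployments.append(request.get_json(force=True))
    return "", 204

@app.post("/webhooks/outage")
def ingest_outage():
    # Expected payload (from the observability platform), field names assumed:
    # {"service_id": ..., "timestamp": ..., "event_type": "outage" | "restoration"}
    outage_events.append(request.get_json(force=True))
    return "", 204

@app.get("/metrics")
def metrics():
    # e.g. GET /metrics?since=2024-01-01 — compute metrics over the requested time slice.
    since = datetime.fromisoformat(request.args["since"])
    window = [d for d in deployments if datetime.fromisoformat(d["completed_at"]) >= since]
    days = max((datetime.now() - since).days, 1)
    failed = sum(1 for d in window if d["status"] != "success")
    return jsonify({
        "deployment_frequency_per_day": len(window) / days,
        "change_failure_rate": failed / len(window) if window else None,
        # lead time and time-to-restore would be computed similarly from the stored events
    })
```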
Author
Owner

Maybe for vanity we can include a "rating system" based on Google's State of DevOps report to return an "Elite", "High", "Medium", or "Low" tier based on the computed metrics.

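Something like the following could map the computed metrics onto a tier. The thresholds here are illustrative placeholders only, not the actual State of DevOps report cutoffs, which would need to be pulled from the report itself; taking the worst tier across the four metrics is also just one possible aggregation.

```python
# Sketch of a vanity tier rating. Thresholds are illustrative placeholders;
# the real cutoffs should come from the State of DevOps report.
from datetime import timedelta

def rate(deploys_per_day: float, lead_time: timedelta,
         time_to_restore: timedelta, change_failure_rate: float) -> str:
    def tier(value, elite, high, medium, lower_is_better=True):
        for i, threshold in enumerate([elite, high, medium]):
            if (value <= threshold) if lower_is_better else (value >= threshold):
                return i  # 0 = Elite, 1 = High, 2 = Medium
        return 3          # 3 = Low

    scores = [
        tier(deploys_per_day, 1.0, 1 / 7, 1 / 30, lower_is_better=False),  # placeholder cutoffs
        tier(lead_time, timedelta(hours=1), timedelta(days=7), timedelta(days=30)),
        tier(time_to_restore, timedelta(hours=1), timedelta(days=1), timedelta(days=7)),
        tier(change_failure_rate, 0.15, 0.20, 0.30),
    ]
    # Rate the org at the worst (highest-numbered) tier across the four metrics.
    return ["Elite", "High", "Medium", "Low"][max(scores)]
```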