In this article, I discuss how to apply the golden signals to monitor services, recognise faults, and ideally identify the causes of those faults.
Recently, Christian Hardenberg shared Delivery Hero's Reliability Manifesto with the world. The manifesto lays out a set of rules for making sure that your services are fault tolerant. One rule of the Manifesto for increasing a service's resilience is:
“R-6 We track the golden signals: Every team needs to have a real time dashboard showing, at a minimum, requests per minute, error rate, server response time and a business metric that is highly correlated with system health.”
In Fintech, when we started developing our services, we wanted them to alert us when something goes wrong and, ideally, also tell us what went wrong. This ultimately lets engineers sleep peacefully at night, knowing that alerts are standing guard over their precious services.
So, today I will share with you how we broke down and applied the four golden signals of monitoring to our services, as part of our efforts to make them more reliable. These golden signals, when applied properly, allow us to observe the critical paths of a service.
Latency: Measure of time taken to serve a request
Latency is a pretty crucial factor when it comes to order checkout, and even more so when it is an order checkout for food. A restaurant only starts preparing food once the payment is reported as successful, so payment latency directly affects how long the delivery of food takes.
However, there are multiple types of latencies when it comes to a typical service.
1. Latency in HTTP requests:
This is probably the most basic latency measurement. The most common way of measuring HTTP request latencies in Spring Boot services is Micrometer's @Timed annotation.
How to use the @Timed annotation for all methods in a controller:
import java.util.List;

import io.micrometer.core.annotation.Timed;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Timed // <--- will measure the time taken by each method in this controller
public class MyObservabilityController {

    @GetMapping("/api/payments")
    public List<Payments> getPayments() {
        return List.of(); // fetch and return the payments (elided in the original)
    }
}
It is also useful to know that the @Timed annotation is not limited to controllers but can be used on any arbitrary method.
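For @Timed to work on arbitrary methods of Spring beans, Micrometer's AOP support has to be enabled by registering a TimedAspect bean. A minimal sketch, assuming spring-boot-starter-aop is on the classpath; the configuration class name is just an example:

import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MetricsConfig {

    // Enables @Timed on arbitrary Spring bean methods via AOP
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}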
2. Latency of order checkout:
HTTP request latencies, albeit useful, may not tell us much about the latency of our business logic. Let's take the example of order checkout. We may have multiple HTTP requests during the payment processing stage, and the latency of the entire business process will be the combined latency of all of these requests.
Latency of the business logic is the sum of the time taken by the two methods:
import io.micrometer.core.annotation.Timed;

public class MyBusinessLogic {

    @Timed
    public void authorizeCustomer() {
        // authenticate the customer
    }

    @Timed
    public void makePayment() {
        // book the money from the customer's account
    }

    /*
     * Total latency for the business logic of order processing =
     * time taken by authorizeCustomer() + time taken by makePayment()
     */
}
3. Latency due to third parties:
Any third-party dependency must be timed if it is part of a critical path. A third-party dependency automatically becomes part of the critical path if the service's core responsibility fails when that dependency fails. A good practice is to set connection and read timeouts on your clients, so you do not wait too long for the third party's reply. Further, you can measure how often your read and connection times exceed the set thresholds, and trigger an alert if your third party starts to misbehave.
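As an illustration, here is a minimal sketch of configuring such timeouts for a Spring RestTemplate (assuming Spring Boot 2.x's RestTemplateBuilder); the timeout values and class name are placeholders, not recommendations:

import java.time.Duration;

import org.springframework.boot.web.client.RestTemplateBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class ThirdPartyClientConfig {

    // Fail fast instead of blocking the critical path on a slow third party
    @Bean
    public RestTemplate thirdPartyRestTemplate(RestTemplateBuilder builder) {
        return builder
                .setConnectTimeout(Duration.ofMillis(500)) // placeholder threshold
                .setReadTimeout(Duration.ofSeconds(2))     // placeholder threshold
                .build();
    }
}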
4. Latency due to the database:
Database latency is a huge topic. Sometimes even a simple query can cause a lot of problems. 'EXPLAIN ANALYZE' helps in identifying inefficient queries. Database monitoring comes with a whole set of tools you can use to EXPLAIN queries, but we will not go into detail on how to use them here, as that deserves its own post.
Traffic: Measure of demand that is being placed on your services
A service with no traffic can trivially report zero errors, which is why we measure traffic alongside errors. We track two types of traffic: HTTP requests served per second and orders served per second; the latter is, of course, a business metric. We added Datadog Anomaly Monitors to our services, which I find pretty cool: they provide an out-of-the-box way of alerting when there is a drop in the HTTP requests served by our services, comparing the drop to similar times in the past. So it will not alert us and wake us up when zero requests are served in Taiwan simply because it is night time there and everyone is sleeping.
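As a rough sketch of what such a monitor query can look like (written in Datadog's query syntax; the metric suffix, service tag, and bounds here are assumptions for illustration, not our production configuration), the anomalies() function wraps the request-rate metric and flags deviations from the historical pattern:
"query": "anomalies(avg:http.server.requests.count{service:payments-service}.as_rate(), 'agile', 2)" // alert when the request rate falls outside the expected band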
Errors: Measure of rate of requests that fail
Much like latency, there are multiple facets to error monitoring.
1. HTTP error status codes: There are many solutions for counting the HTTP status codes that your service sends. For Spring, auto-configuration adds the status information as a tag on a metric with a default name (which you can override if you would like to).
Metric with default name:
http.server.requests
2. Non-200 statuses: Not all non-200 statuses are errors. Take the example of a third-party rejection, which can have many causes (fraud detection, blocked card, etc.): your service may send an HTTP status code of 40X, which is not a success code but is not a failure according to the business logic either. On the other hand, receiving anything but 200 from your health endpoint is definitely more concerning. This is why you should set different monitors and thresholds when dealing with non-200 status codes.
3. Business metric for errors (order errors): Much like with latency, we would also like to see whether our business logic is working. So, similar to HTTP status codes, in Fintech we also track order statuses, i.e., whether the multiple HTTP requests ultimately led to a successful checkout of an order or not.
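One way of emitting such a business metric (a sketch only; the metric name, tag, and class are made up for illustration and are not what we run in production) is a Micrometer counter tagged with the checkout outcome:

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class CheckoutMetrics {

    private final MeterRegistry registry;

    public CheckoutMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Record the final outcome of a checkout, e.g. "success" or "failed"
    public void recordCheckout(String outcome) {
        Counter.builder("checkout.orders") // hypothetical metric name
                .tag("outcome", outcome)
                .register(registry)
                .increment();
    }
}

A monitor on the rate of the "failed" outcome then plays the same role for business errors that the status tag of http.server.requests plays for HTTP errors.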
4. Exceptions: Additionally, you can have a monitor that keeps track of the overall exceptions and errors thrown by your services and alarms you when there is a sudden increase. Such spikes often occur right after a new release, which makes this monitor a pretty crucial one: it will surface problems with a new release very quickly.
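One possible source for such a monitor (a sketch under the assumption that unhandled exceptions bubble up to a global handler; the metric name and advice class are illustrative) is a controller advice that counts every unhandled exception, tagged by its class:

import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

@RestControllerAdvice
public class ExceptionMetricsAdvice {

    private final MeterRegistry registry;

    public ExceptionMetricsAdvice(MeterRegistry registry) {
        this.registry = registry;
    }

    // Count every unhandled exception by its class, so a sudden increase
    // after a release shows up on the dashboard immediately
    @ExceptionHandler(Exception.class)
    public ResponseEntity<String> handleUnhandled(Exception ex) {
        registry.counter("unhandled.exceptions", "exception", ex.getClass().getSimpleName())
                .increment();
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR).body("Internal error");
    }
}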
Saturation: Measure of load on your services
Your services use many resources, and if any of these resources becomes saturated, it has an immediate impact on the entire processing.
1. CPU: If the CPU of all your pods or of the whole cluster is getting maxed out, it could indicate serious issues: something eating away the processing power of your service, an incorrect resource configuration, an incorrect choice of Kubernetes node classes or sizes, etc. If your infrastructure runs on Kubernetes, like Delivery Hero's Fintech, you can observe your pods' and nodes' resources through the Metrics API. Datadog also provides a pretty extensive guide on how to observe kubelet statistics using free tools.
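As an illustration only (this assumes Datadog's Kubernetes integration, which reports CPU usage under kubernetes.cpu.usage.total in nanocores; the deployment tag is a placeholder), a query like the following can be put on a dashboard or monitor to watch a deployment's CPU consumption:
"query": "avg:kubernetes.cpu.usage.total{kube_deployment:payments-service}" // average CPU usage (nanocores) across the deployment's pods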
2. JVM resources: JVM resources like threads, memory, and heap usage can also become saturated. For example, if all your threads are busy for a very long time or stuck in a deadlock, your latency will increase, which, as already discussed, is problematic. An overview of heap and memory usage will point to OutOfMemory issues. Spring Boot's auto-configuration enables JVM metrics, which can be found under the following names:
jvm metrics:
jvm.threads.*
jvm.gc.*
jvm.memory.*
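For example, a monitor on the ratio of used to maximum heap can warn you before an OutOfMemoryError hits. A sketch in Datadog query form (the area:heap tag comes from Micrometer's JVM memory metrics; the exact query shape and threshold are assumptions):
"query": "avg:jvm.memory.used{area:heap} / avg:jvm.memory.max{area:heap}" // alert when this ratio stays close to 1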
Alternatively, you can also rely on container metrics. You can get the following Docker container metrics via the Datadog agent.
Docker metrics for thread and memory:
docker.thread.*
docker.mem.*
3. Pod availability: Since the Fintech infrastructure runs on Kubernetes, pod health and the availability of pods in the desired state is another crucial metric for us. The health endpoint is one of the things that directly impacts a pod's health: if /health or /liveness returns a non-200 status, the pod will eventually be terminated and restarted. If you use Spring Boot Actuator, you can run the following queries to monitor this.
Health check metric accumulation:
"query": "http.server.requests.avg{uri:/actuator/health,status:200}" // to show number of successful health calls
"query": "http.server.requests.avg{uri:/actuator/health,status:!200}" // to show number of failed health calls
Additionally, you do not want more than roughly 20% of your desired pods to be unavailable. You can monitor this with the following query, which uses Kubernetes state metrics.
Metric for pod availability:
//alert if following query results in a value greater than 0
"avg:kubernetes_state.deployment.replicas_unavailable" - 0.2*("avg:kubernetes_state.deployment.replicas_desired")
With that, I would like to wrap up this post. In Fintech, the four golden signals definitely help us understand what kind of monitoring we need to add in order to increase our reliability. I hope the above breakdown helps you cover your bases as well.
If you would like to contribute to our reliability and stability, Delivery Hero's Fintech is hiring. Feel free to drop me a message or check out our open Fintech roles directly. I am looking forward to hearing from you!
Thank you, Rishita!
As always, we are still hiring, so check out some of our latest openings, or join our Talent Community to stay up to date with what’s going on at Delivery Hero and receive customized job alerts!