On Logs Metrics and Events

Throughout my career so far, nearly every application I have come across logs far too much, and consistently at too high of a log level.

Lots of things which should not really be logs, but rather be either operational or business metrics, or may even be a domain event, are chucked in as a log.

Let’s categorise the different things which I have seen logged, and talk about how we should handle them instead. And finish up with what a valuable set of logs actually looks like.

Is my application working?

This is the pattern where you are logging all of the positive cases throughout your app. Is it writing to the DB? Is it writing to the event log? did the validation pass?

We are essentially stepping through the code, every branch has a log, and you can step through the exact path the program took just by looking at the logs.

This might seem nice, but it is actually incredibly noisy, and makes it difficult to debug when something really does go wrong.

This kind of log should at most be a Debug log, which is not turned on in production.
It is expensive (log ingest charges, and more logs to chunk through when you do a query/search on your logs)

I have found this pattern appears mostly when we lack confidence in our ability to reconstruct a flow through a system from a bug report. So we feel the need to leave breadcrumbs.

Logging without intent

A continuation of the above pattern, sometimes we log just because we think the log might be useful. It never really is. These logs accumulate, because people struggle to delete logs.

You find yourself asking, does anything alert on this log? Do we use this for anything? And it’s almost always impossible to tell. So do your best to only log explicitly with intent, you should know who will read this log and why.

Validation failures

We have had some bad data come into our API, and it’s failed validation! So we are going to return a 400 Bad Request response back, with a detailed body response following RFC 7807.

Very commonly in this kind of scenario, I see Engineers produce an Error log.

This is not an error. It is the program working exactly how it should.

Please, stop writing error logs when your application is doing exactly what it is meant to do. Error logs should be reserved for unexpected scenarios, ones which cannot be recovered from gracefully, and require human intervention.

You may actually want to log an Information log here, because a user might reach out wondering why their request was rejected. You can log the problem details too if you want, but be very careful not to log any PII. But it certainly is not an Error log, at most perhaps a Warning, but Information should be fine.

Unexpected failures

This is when you really do want to write an Error log. Like say, we were unable to connect to the database to perform an update, or we were unable to talk to an API which is required to fulfil the request.

The anti-pattern I most often see here, using C# as an example, looks something like this

try
{
  _thing.DoSomethingThatMightFail();
}
catch (Exception exception)
{
  _logger.Log(exception, “failed to do thing”);
  throw;
}

Do you see the problem?

In Go, it would look something like this

err := thing.DoSomethingThatMightFail()
if err != nil {
    logger.Error("failed to do thing", "err", err)
    return err
}

Hopefully one of those two examples has worked for you.

But the problem here is that the same error will wind up getting logged at multiple points in the code. So a single error, or bad thing happening, will be very noisy and hard to sift through.

You may find yourself wanting to attach some extra context or information to the logs which only this layer of the application knows about. There are better ways to do this, for example for our two languages it would be

C#:

try
{
    _thing.DoSomethingThatMightFail();
}
catch (Exception ex)
{
    throw new MySpecialCustomException(“Failed to do thing", ex);
}

Go:

if err := thing.DoSomethingThatMightFail(); err != nil {
    return fmt.Errorf("failed to do thing: %w", err)
}

We want to preserve the context, add important information we might need for debugging, but we don’t want to pollute the logs by being so noisy. Hopefully there is some way to apply this technique to your favourite language.

Over-wrapping can be just as harmful as over-logging, obscuring the original error and making stack traces harder to follow. In Go it might be more idiomatic, but in C# it is not so much. My philosophy here is to touch the error as little as possible, and have it bubble up to the highest level at which you can take an action about it. For an API that might be the controller where you will turn it into an error response.

The user performed some action which we want to know about

This is a Business Metric, possibly a Domain Event, maybe both.

The problem with putting these in the logs, is that you want to do quite different things with them compared to logs. You might want to graph them over time, correlate them with other things, attach a lot of context about a user or their tenant in the system to it. Or you might even want some action to occur based on it.

You might be able to do this with your logging system, but it will not be optimal.

You should certainly have a pipeline for Business Metrics. Rather than logging at various points in your app, and hoping that the business can understand the flow and get what they need out of it. Start from the other side. Engage product / business stakeholders while you are building, and talk to them about what kinds of reports they will want to generate relating to this flow. How will they know if it was a success? What do they want to know about the user and how they are using the app?

Your organisation should have a single place where all of these Business Metrics should flow into. This is because you will likely want to correlate across them.

You may choose to have a shared ingest pipeline for Domain Events and Business Events. This would work fine too. It can be consumed into some kind of datalake for crunching, but also exist on an Event Stream which could be subscribed to. Either way, it probably shouldn’t be a log.

This request took X seconds to process

You might want to record how long it took to process a request, or some part of it such as talking to the DB or another API.

This isn’t a log, it’s a metric.

OpenTelemetry has a separate concept for these, or you might use something like prometheus to expose these. You will likely want to be able to correlate these with individual logs or requests. OpenTelemetry is really your best bet at the moment for a really solid system which you can correlate across. It’s really quite good.

Logging as an Audit Trail

I am guilty of doing this one a couple of times. But have wound up having to go back and build a proper trail.

Audit logs often need immutability guarantees and strict retention policies, which most logging systems are not designed to provide.

There are a few problems here. An Audit Trail log might contain PII, which is likely not compatible with your logging system. You will also have to have quite a large log message to describe how the state has changed. And you will find your logs do not have the same rigid atomic promises that you might want from an Audit Trail. It is common for logs to be a little bit lossy for a performance tradeoff.

The best kind of Audit Trail is an Event Stream. Explicit audit tables, possibly built off something like SQL Server Temporal Tables. Or something custom that you roll yourself.

So, after all that, what does good logging look like?

Each request or action should have about one log. See stripe’s canonical log line technique for inspiration if your needs are more complex - https://stripe.com/blog/canonical-log-lines

I have built a system which did not log at all for the happy path. And we never had a problem with it in over 4 years of running in prod serving millions of requests a day. It only ever logs Errors, e.g. if it could not write to S3. A few people have looked at it and been a bit nervous, unable to tell if it’s working. But a quick look over at our Metrics tool quickly showed just how many requests it was successfully serving.

Think, do you really need to log that?

As for how to go about logging when you do. Most of us are already doing this, but Structured Logging and OpenTelemetry are an excellent default. It will have Traces, Metrics, and Logs all built in by default. When you do have Business Metrics or Domain Events you want to report on, I am a big fan of the Event Stream. Choose one with nice event retention, playback, and all those nice things. And project it into a nice datalake of your choosing, something like snowflake or even just S3 with Athena on top of it. Whatever makes sense for you.

Logs answer “what unexpected thing happened here?”
Metrics answer “how often / how much?”
Events answer “what meaningful thing occurred?”

Don’t log too much. Don’t log errors for normal behaviour. And always be clear about why a log exists.

Is my application working?#

Logging without intent#

Validation failures#

Unexpected failures#

The user performed some action which we want to know about#

This request took X seconds to process#

Logging as an Audit Trail#

So, after all that, what does good logging look like?#