Is OpenTelemetry Excessive?
This article is a brief account of my experience setting up, operating, and using OpenTelemetry on a very small software development project. I reached the surprising conclusion that it's probably worthwhile much earlier, and at much smaller scales, than you might expect.
The project in question was the back end for a proof-of-concept mobile app that I worked on as part of my day job. This wasn't even a Minimum Viable Product, more of an experiment to demonstrate what an MVP might look like. When I adopted OpenTelemetry I was worried that it would add needless complexity and overhead to a very basic app, but to my surprise and delight it paid for itself several times over.
OpenTelemetry
OpenTelemetry describes itself as
High-quality, ubiquitous, and portable telemetry to enable effective observability
It's pitched as a tool for tackling enterprise-grade-highly-distributed-microservice-enabled complexity: the sort of thing that Charity, Liz, and Jessica talk about on the O11ycast.
Concretely, it's a set of standards for
- adding diagnostic events to an application (called "instrumenting")
- filtering, transforming, and delivering those events to a variety of back ends
as well as
- open-source libraries implementing those standards for various programming languages and runtimes, databases, etc.
- open-source and proprietary tools for collecting and analyzing the diagnostic events your application is producing
Once it's set up, you can turn on "auto-instrumentation" for common software components; this ended up being very valuable for me.
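To give a feel for it, here's roughly what enabling auto-instrumentation looks like in a NodeJS app. Treat this as a sketch rather than gospel, since the packages and APIs have shifted over time; I'm assuming the `@opentelemetry/sdk-node` and `@opentelemetry/auto-instrumentations-node` packages here.

```typescript
// tracing.ts – import this before the rest of the app so the relevant
// libraries (http, express, pg, ...) get patched as they load.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // One call wires up instrumentation for dozens of common packages.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

With that in place (plus an exporter, more on that below), incoming and outgoing HTTP requests and database queries start producing spans without touching the rest of the codebase.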
What I put into it
Unfortunately, it's not all good news: setting up OpenTelemetry was more work than I was expecting. The NodeJS libraries are complex (and seem to be in a state of flux?). There's a lot of configuration and setup. The library's interface is also more complicated (and quite a bit more powerful) than `console.(log|info|error|debug)`, which is what I would usually reach for. All of this took work and precious time to learn.
I ended up sending logs to stdout as nicely formatted JSON. More sophisticated setups are available, but this 12-factor sort of approach served me well both in development (Docker Compose, where I could inspect the logs with `docker-compose logs`) and in production (SystemD services on EC2, where I used `journalctl`).
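My exact configuration isn't worth reproducing, but to show the shape of the stdout approach, here's a hypothetical exporter (the `JsonStdoutExporter` name and the chosen fields are my own invention) that writes one JSON object per span to stdout:

```typescript
import { SpanExporter, ReadableSpan } from '@opentelemetry/sdk-trace-base';
import { ExportResult, ExportResultCode } from '@opentelemetry/core';

// Hypothetical exporter: one JSON object per span, one per line, straight to
// stdout, so docker-compose logs and journalctl see the same stream.
class JsonStdoutExporter implements SpanExporter {
  export(spans: ReadableSpan[], resultCallback: (result: ExportResult) => void): void {
    for (const span of spans) {
      process.stdout.write(
        JSON.stringify({
          name: span.name,
          traceId: span.spanContext().traceId,
          spanId: span.spanContext().spanId,
          attributes: span.attributes,
          // HrTime is [seconds, nanoseconds]; convert to milliseconds.
          durationMs: span.duration[0] * 1e3 + span.duration[1] / 1e6,
          status: span.status.code,
        }) + '\n'
      );
    }
    resultCallback({ code: ExportResultCode.SUCCESS });
  }

  shutdown(): Promise<void> {
    return Promise.resolve();
  }
}
```

You'd hand something like this to the SDK as its trace exporter, and the logs end up wherever your process's stdout already goes.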
What I got out of it
Once I got the SDK configured properly and wrapped my head around how to use it, I was able to instrument my own code, which was valuable as expected. What I wasn't expecting was the comprehensive auto-instrumentation for things like NodeJS's HTTP stack and the Postgres client.
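Instrumenting my own code mostly meant wrapping interesting operations in spans via the `@opentelemetry/api` package. Something along these lines, where the span name, attribute, and function are invented for illustration:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('poc-backend'); // hypothetical service name

// Hypothetical handler: wrap the interesting work in a span and record
// the attributes you suspect you'll want to query on later.
async function syncUserProfile(userId: string): Promise<void> {
  await tracer.startActiveSpan('syncUserProfile', async (span) => {
    try {
      span.setAttribute('app.user_id', userId);
      // ... call the third-party service, write to Postgres, etc.
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

The try/catch/finally ceremony is part of what makes this heavier than `console.log`, but it's also what makes the resulting traces worth querying.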
The auto-instrumentation let me inspect the details of:
- every HTTP request that came into my app
- every HTTP request it sent to third-party services
- the content, parameters, and timing of every database query
- uncaught exceptions
This helped me catch and fix:
- several minor-but-subtle bugs and misconfigurations in my own code
- request parameter mismatches coming from the mobile app
- a catastrophic bug in my auth middleware
- problems in the SDKs for third-party services (I have no idea how I would have caught these without detailed HTTP tracing)
These were bugs that had slipped past a decent test suite and TypeScript annotations, and I diagnosed them without modifying my app. That's the promise of observability: you can't predict what you should be recording, but if you're disciplined and systematic about instrumenting your code, you'll be able to figure everything out once you discover what you need.
This seemed like common sense for big complicated distributed systems, but I might be starting to believe it for small straightforward greenfield projects as well.