Michael Haberman for Aspecto

Posted on Aug 23, 2022 • Originally published at aspecto.io

Jaeger Tracing: The Ultimate Guide

#opentelemetry #devops #monitoring #jaeger

In this guide, you’ll learn what Jaeger tracing is, what distributed tracing is, and how to set it up in your system. We’ll go over Jaeger’s UI and touch on advanced concepts such as sampling and deploying in production.

You’ll leave this guide knowing how to create spans with OpenTelemetry and send them to Jaeger tracing for visualization. All that, from scratch.

Jaeger Tracing: Contents

What is Distributed Tracing? Introduction
What is Jaeger Tracing?
Jaeger Tracing Architecture
Running Jaeger locally using Docker
Jaeger Tracing and OpenTelemetry
Jaeger Tracing Python Example
Jaeger Tracing UI Review
Jaeger Tracing Advanced Concepts
Jaeger Tracing Production Deployment
The Costs of Running Jaeger
Jaeger Tracing Glossary

What is Distributed Tracing? Introduction

Before we dive into explaining everything you need to know from 0 to 100 about Jaeger tracing, it’s important to understand the umbrella term that Jaeger is part of – distributed tracing.

In the world of microservices, most issues stem from networking problems and from the interactions between the different microservices.

A distributed architecture (as opposed to a monolith) makes it a lot harder to get to the root of an issue.

To resolve these issues, we need to see which service sent what parameters to another service or a component (a DB, queue, etc.).

Distributed tracing helps us achieve just that by collecting data from the different parts of our system, giving us the observability we need. You can think of it as ‘call-stacks’ for distributed services.

To learn more about the benefits of distributed tracing vs. logs, read this quick guide.

In addition, traces are a visual tool, allowing us to visualize our system to better understand the relationships between services, making it easier to investigate and pinpoint issues.

Distributed tracing visualization with Aspecto

What is Jaeger Tracing?

Now that you know what distributed tracing is, we can safely talk about Jaeger.

Jaeger is an open-source distributed tracing system created by Uber back in 2015.

The Jaeger data model is compatible with OpenTracing, a specification that defines what the collected tracing data should look like and that comes with implementation libraries in different languages.

(More on OpenTracing and OpenTelemetry later)

As in most distributed tracing systems, Jaeger works with spans and traces, as defined in the OpenTracing specification.

A span represents an action (an HTTP request, a call to a DB, etc.) and is Jaeger’s most basic unit of work. A span must have an operation name, a start time, and a duration.

A trace is a collection/list of spans connected in a child/parent relationship (and can also be thought of as a directed acyclic graph of spans). Traces specify how requests are propagated through our services and other components.
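To make the span/trace relationship concrete, here is a minimal sketch in plain Python. The `Span` class, its field names, and the sample operations are assumptions made for illustration; this is not Jaeger's actual data schema. It only shows how spans that point at a parent can be assembled into a trace tree.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Illustrative span record; field names are assumptions, not Jaeger's schema."""
    span_id: str
    operation_name: str
    start_time_ms: int
    duration_ms: int
    parent_id: Optional[str] = None  # None marks the root span


def render_trace(spans: list[Span]) -> list[str]:
    """Return the trace as indented lines, each parent listed before its children."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Visit siblings in start-time order, then recurse into their children.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_time_ms):
            lines.append(f"{'  ' * depth}{span.operation_name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical trace: an HTTP request that calls a service, which queries a DB.
trace = [
    Span("a", "GET /users/42", start_time_ms=0, duration_ms=120),
    Span("b", "user-service.get_user", start_time_ms=5, duration_ms=100, parent_id="a"),
    Span("c", "SELECT users", start_time_ms=20, duration_ms=60, parent_id="b"),
]
print("\n".join(render_trace(trace)))
```

The indentation in the printed result mirrors the child/parent nesting you later see in the Jaeger UI timeline.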

Jaeger Tracing Architecture

Here’s what the Jaeger architecture looks like:

It consists of a few parts, all of which I explain below:

Jaeger client: The Jaeger clients are libraries written in various programming languages, and are responsible for span creation. Note: they are being deprecated in favor of OpenTelemetry. Again, more on that later.
Jaeger Agent: Jaeger agent is a network daemon that listens for spans received from the Jaeger client over UDP. It gathers batches of them and then sends them together to the collector.
Jaeger Collector: The Jaeger collector is responsible for receiving traces from the Jaeger agent and performing validations and transformations on them. After that, it saves them to the selected storage backend.
Storage Backends: Jaeger supports various storage backends, which store the spans and traces so they can be retrieved later. Supported storage backends are In-Memory, Cassandra, Elasticsearch, and Kafka (only as a buffer in front of another storage such as Cassandra or Elasticsearch).
Jaeger Query: This is a service responsible for retrieving traces from the Jaeger storage backend and making them accessible to the Jaeger UI.
Jaeger UI: a React application that lets you visualize the traces and analyze them. Useful for debugging system issues.
Ingester: Ingester is relevant only if we use Kafka as a buffer between the collector and the storage backend. It is responsible for receiving data from Kafka and ingesting it into the storage backend. More info can be found in the official Jaeger Tracing docs.
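The flow from agent to collector to storage can be sketched as a toy pipeline. Everything below (class names, the dict-based span, the batch size) is hypothetical and greatly simplified; the real components talk over the network and do far more. The point is only to show the batching and hand-off described above.

```python
from typing import Callable


class BatchingAgent:
    """Toy stand-in for the agent: buffers spans and forwards them in batches."""

    def __init__(self, send_batch: Callable[[list[dict]], None], max_batch_size: int = 3):
        self._send_batch = send_batch
        self._max_batch_size = max_batch_size
        self._buffer: list[dict] = []

    def report(self, span: dict) -> None:
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        # Hand the current batch over and start a fresh buffer.
        if self._buffer:
            self._send_batch(self._buffer)
            self._buffer = []


class Collector:
    """Toy collector: validates spans, then appends them to a storage list."""

    def __init__(self, storage: list[dict]):
        self._storage = storage

    def receive(self, batch: list[dict]) -> None:
        for span in batch:
            if "operation_name" in span:  # minimal "validation" for the sketch
                self._storage.append(span)


storage: list[dict] = []  # stands in for the storage backend
collector = Collector(storage)
agent = BatchingAgent(collector.receive, max_batch_size=2)

for name in ["rootSpan", "childSpan", "otherSpan"]:
    agent.report({"operation_name": name})
agent.flush()  # push whatever is still sitting in the buffer

print([span["operation_name"] for span in storage])
```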

Running Jaeger locally using Docker

Jaeger comes with a ready-to-use all-in-one docker image that contains all the components necessary for Jaeger to run.

It’s really simple to get it up and running on your local machine:

docker run -d --name jaeger \
  -e COLLECTOR_ZIPKIN_HOST_PORT=:9411 \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14250:14250 \
  -p 14268:14268 \
  -p 14269:14269 \
  -p 9411:9411 \
  jaegertracing/all-in-one:1.30

Then you can simply open the Jaeger UI at http://localhost:16686.
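If you prefer to check from a script that the container is up, a small sketch like the following can help. It assumes the all-in-one container from the command above is running with the default port mapping; the function name `jaeger_ui_is_up` is made up for this example, and it only checks that the UI's root URL answers over HTTP.

```python
import urllib.error
import urllib.request


def jaeger_ui_is_up(url: str = "http://localhost:16686", timeout: float = 2.0) -> bool:
    """Return True if an HTTP GET to the given URL answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Jaeger UI reachable:", jaeger_ui_is_up())
```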

Below you’ll learn how to send data to it using Python and OpenTelemetry.

Jaeger Tracing and OpenTelemetry

Yes, you’re right. I did mention before that Jaeger’s data model is compatible with the OpenTracing specification.

You may already know that OpenTracing and OpenCensus have merged to form OpenTelemetry, and you may be wondering why Jaeger uses OpenTracing and whether you can use OpenTelemetry to report to Jaeger instead.

As to why Jaeger uses OpenTracing – well, the reason is that Jaeger predates the above-mentioned merger.

To get a full understanding of OpenTelemetry, what it is, its components, and how you can use it, read this comprehensive guide.

Deprecation of Jaeger Client in favor of OpenTelemetry Distro:

I also mentioned that the Jaeger clients are being deprecated.

You can find more info about this deprecation here, but essentially the idea is that you should now use the OpenTelemetry SDK in the programming language of your choice, alongside a Jaeger exporter.

This way, the spans you create are converted to a format Jaeger knows how to work with and are passed all the way through to the Jaeger collector and then to the storage backend.

At the time of writing this, the OpenTelemetry collector is not considered a replacement for the Jaeger collector [1]. In the future, the idea is that the OpenTelemetry collector would be used instead of the Jaeger collector that would be deprecated [2].

If you can’t wait and want to try using the OpenTelemetry collector with Jaeger now – see this guide.

Jaeger Tracing Python Example

Here is a Python example of creating spans and sending them to Jaeger (assuming you’re running Jaeger locally as shown above). Note that you could also use automatic instrumentation and still use the Jaeger exporter:

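# jaeger_tracing.py
# Requires the opentelemetry-sdk and opentelemetry-exporter-jaeger-thrift packages.
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register a tracer provider that tags every span with the service name.
trace.set_tracer_provider(
    TracerProvider(
        resource=Resource.create({SERVICE_NAME: "my-hello-service"})
    )
)

# Send spans over UDP to the Jaeger agent port exposed by the all-in-one container.
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Batch finished spans before handing them to the exporter.
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

# childSpan starts while rootSpan is current, so it becomes its child.
with tracer.start_as_current_span("rootSpan"):
    with tracer.start_as_current_span("childSpan"):
        print("Hello world!")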

This is what they would look like in the Jaeger UI:

To learn how to use OpenTelemetry in Python hands-on and from scratch, read this guide.

Jaeger Tracing UI Review

The Jaeger UI is a powerful tool for debugging and better understanding our distributed services.

Here’s what you need to know about it:

The search pane:

You can use the search pane to search for traces with specific properties: which service they come from, which operation was performed, specific tags included in the trace (for example, the HTTP status code), how far back in time to look, and how many results to return.

When you’re done defining your search in this pane, click on Find Traces.

The search results section:

In this example, I chose to query the jaeger-query service. I can see my traces on a timeline or as a list. Click on the desired trace to drill down into it.

The specific trace view:

When you find a specific trace where you think there might be an issue and click on it, you’ll see a screen that looks like this:

Here you can find specific information about execution times, which calls were made and their durations, specific properties like the HTTP status code and route path (in the case of an HTTP call), and more.

Feel free to play around and investigate for yourself with your actual data.

Jaeger Tracing Advanced Concepts

Jaeger Sampling

Sampling is a complex topic in its own right.

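You should know that there are 2 places where the sampling configuration can live: in the client code (at the distro level), and at the collector level, where strategies are defined centrally and served to the clients. Both are forms of head-based sampling, since the decision is made when the trace starts. Tail-based sampling, where the decision is made after the trace has completed, is a different technique.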

Sampling at the Distro (SDK) level

The Jaeger client (now being deprecated) has 4 sampling modes:

Remote: the default. It tells the Jaeger client to fetch its sampling strategy from the Jaeger backend. More on this in the collector-level sampling section below.
Constant: either take all traces or take none, with nothing in between. It receives 1 for all and 0 for none.
Rate limiting: choose how many traces are sampled per second.
Probabilistic: choose the fraction of traces to sample. For example, choose 0.1 to sample 1 out of every 10 traces.
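The three locally decided modes above can be sketched in a few lines of Python. This is an illustration of the ideas only, not code from the Jaeger client: the class names, the `is_sampled` method, and the token-bucket details are assumptions made for this example.

```python
import random
import time
from typing import Callable, Optional


class ConstSampler:
    """Constant: param 1 samples every trace, param 0 samples none."""

    def __init__(self, param: int):
        self._decision = param == 1

    def is_sampled(self) -> bool:
        return self._decision


class ProbabilisticSampler:
    """Probabilistic: sample each trace independently with probability `param`."""

    def __init__(self, param: float, rng: Optional[random.Random] = None):
        self._rate = param
        self._rng = rng or random.Random()

    def is_sampled(self) -> bool:
        return self._rng.random() < self._rate


class RateLimitingSampler:
    """Rate limiting: a token bucket that allows about `param` traces per second."""

    def __init__(self, param: float, clock: Callable[[], float] = time.monotonic):
        self._rate = param
        self._capacity = max(param, 1.0)  # allow at least one whole token
        self._clock = clock
        self._balance = self._capacity    # start with a full bucket
        self._last = clock()

    def is_sampled(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self._balance = min(self._capacity, self._balance + (now - self._last) * self._rate)
        self._last = now
        if self._balance >= 1.0:
            self._balance -= 1.0
            return True
        return False


# Roughly 1 in 10 traces should be kept with a probabilistic rate of 0.1.
sampler = ProbabilisticSampler(0.1, rng=random.Random(42))
sampled = sum(sampler.is_sampled() for _ in range(10_000))
print(f"probabilistic(0.1): {sampled} of 10000 traces sampled")
```

The rate limiter takes an injectable `clock` so its behavior can be checked without waiting for real time to pass.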

Collector level sampling

If we choose to enable sampling at the Jaeger collector, Jaeger supports 2 modes: file sampling and adaptive sampling.

File sampling – you give the collector the path to a configuration file that contains the per-service and per-operation sampling configuration.
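As a rough illustration, the snippet below builds a strategies document in the JSON shape described in the Jaeger sampling docs (`service_strategies`, `operation_strategies`, `default_strategy`) and prints it. The service name, operation name, and rates are made up, and `resolve_strategy` is a hypothetical helper that only mimics the "most specific match wins" idea; it is not the collector's own lookup code. In Jaeger 1.x the collector is pointed at such a file with the `--sampling.strategies-file` flag.

```python
import json

# Hypothetical services and rates, chosen only for illustration.
strategies = {
    "service_strategies": [
        {
            "service": "user-service",
            "type": "probabilistic",
            "param": 0.5,
            "operation_strategies": [
                {"operation": "GET /health", "type": "probabilistic", "param": 0.0},
            ],
        },
    ],
    "default_strategy": {"type": "probabilistic", "param": 0.1},
}


def resolve_strategy(config: dict, service: str, operation: str) -> dict:
    """Pick the most specific strategy: operation first, then service, then default."""
    for svc in config.get("service_strategies", []):
        if svc["service"] != service:
            continue
        for op in svc.get("operation_strategies", []):
            if op["operation"] == operation:
                return {"type": op["type"], "param": op["param"]}
        return {"type": svc["type"], "param": svc["param"]}
    return config["default_strategy"]


# This is the content you would save to a file and hand to the collector.
print(json.dumps(strategies, indent=2))
print(resolve_strategy(strategies, "user-service", "GET /health"))
```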

Adaptive sampling – let Jaeger learn the amount of traffic each endpoint receives and calculate the most appropriate rate for that endpoint. Note that at the time of writing only Memory and Cassandra backends support this.
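The intuition behind adaptive sampling can be shown with a deliberately simplified feedback rule: if an endpoint produces more sampled traces per second than you want, lower its probability, and raise it in the opposite case. The function below is a sketch under that assumption. Its name, parameters, and clamping bounds are invented for this example, and Jaeger's real calculation is more involved.

```python
def adjust_probability(
    current: float,
    observed_per_sec: float,
    target_per_sec: float,
    min_prob: float = 0.001,
    max_prob: float = 1.0,
) -> float:
    """Scale the sampling probability so sampled throughput moves toward the target.

    `observed_per_sec` is the number of traces per second actually sampled
    while `current` was the probability in effect.
    """
    if observed_per_sec <= 0:
        # Nothing was sampled: open up fully until traffic is observed.
        return max_prob
    proposed = current * (target_per_sec / observed_per_sec)
    return max(min_prob, min(max_prob, proposed))


# A busy endpoint sampled 10 traces/sec at probability 0.5, but we want about 1/sec.
print(adjust_probability(0.5, observed_per_sec=10, target_per_sec=1))
```

A quiet endpoint gets the opposite treatment: its probability is pushed up (to at most `max_prob`) so it still shows up in the collected traces.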

More info on Jaeger sampling can be found here: https://www.jaegertracing.io/docs/1.30/sampling/

Jaeger Tracing Production Deployment

All-in-one or separate containers?

Jaeger all-in-one is a pre-built Docker image containing all the Jaeger components needed to get up and running quickly with Jaeger tracing by launching a single command.

A lot of people (including my past self) ask themselves what the correct way to launch Jaeger in production is, and whether it’s safe to use Jaeger all-in-one in production.

While at the time of writing I could not find any official answer on whether or not to use it, I think the right answer is – you could, but you probably shouldn’t.

Using it in production means you have a single point of failure that is not distributed.

Theoretically, an issue even with the Jaeger UI might crash the entire container, and you wouldn’t be able to receive critical spans created by your system.

The best way to go about this would be to run each image separately, without the all-in-one.

The Costs of Running Jaeger (and why you may want an alternative)

As you can probably tell by now, Jaeger is a powerful but complex beast.

Deploying and maintaining it is time-consuming and can be costly, as it involves storage and compute and usually needs constant uptime for a production use case.

Of course, there is no single figure I could quote that fits everyone, as the cost varies depending on your cloud provider and infrastructure choices.

However, you may be glad to know you don’t have to do all this on your own.

Aspecto has a free-forever tier and provides everything included in Jaeger and more. Sort of like Jaeger on steroids.

Feel free to try the playground yourself and sign up for the free forever plan.

A quick look into Aspecto

If you have any questions, feel free to reach out to me on Twitter @thetomzach, and to join our #OpenTelemetry-Bootcamp slack channel to be on top of what’s happening in observability.

Jaeger Tracing Glossary

Span – a description of an action/operation that occurs in our system, such as an HTTP request or a database operation, that spans over time (it starts at X and has a duration of Y milliseconds). Usually, it will be the parent and/or child of another span.

Trace – a tree/list of spans representing the progression of a request as it is handled by the different services and components in our system. For example, sending an API call to user-service resulted in a DB query to users-db. Traces are ‘call-stacks’ for distributed services.

Observability – a measure of how well we can understand the internal states of a system based on its external outputs. When you have logs, metrics, and traces you have the “3 pillars of observability”.

OpenTelemetry – an open-source collection of tools, APIs, and SDKs, led by the CNCF (Cloud Native Computing Foundation). OpenTelemetry enables the automated collection and generation of traces, logs, and metrics with a single specification.

OpenTracing – an open-source project for distributed tracing. It was deprecated and “merged” into OpenTelemetry. OpenTelemetry offers backward compatibility for OpenTracing.
