Observability
Permission Service is instrumented end-to-end with OpenTelemetry for metrics and traces, and uses the Rust tracing framework for structured logs. This page describes what the service emits, how distributed tracing is wired up, which log levels are used, and how to configure all three through the Helm chart.
OpenTelemetry Overview
Metrics and traces are exported using the OTLP protocol over gRPC to a user-supplied collector (for example, the OpenTelemetry Collector). Each exporter is configured independently:
- Metrics are exported periodically from an SDK-managed PeriodicReader to the endpoint in observability.metrics.endpoint.
- Traces are exported in batches from an SDK-managed batch exporter to the endpoint in observability.tracing.endpoint.
If either endpoint is left empty, the corresponding exporter is disabled and the data is dropped locally (no collector required). Logs are always written to stdout regardless of OpenTelemetry configuration.
All exported telemetry carries a common OpenTelemetry Resource populated from observability.serviceName (as service.name) and the pod’s hostname (as service.instance.id), so data from individual pods can be distinguished in a multi-replica deployment.
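The two rules above (an empty endpoint disables its exporter, and every signal shares one resource) can be illustrated with a small sketch. This is not the service's code: the function names `build_resource` and `enabled_exporters` are hypothetical, and the sketch assumes that only an empty string counts as "left empty".

```python
import socket
from typing import Optional


def build_resource(service_name: str, hostname: Optional[str] = None) -> dict:
    """Return the resource attributes attached to all exported telemetry.

    service.name comes from observability.serviceName and
    service.instance.id from the pod's hostname.
    """
    return {
        "service.name": service_name,
        "service.instance.id": hostname or socket.gethostname(),
    }


def enabled_exporters(metrics_endpoint: str, tracing_endpoint: str) -> dict:
    """Each exporter is enabled independently, only if its endpoint is set."""
    return {
        "metrics": bool(metrics_endpoint),  # assumption: "" means disabled
        "traces": bool(tracing_endpoint),
    }
```

For example, `enabled_exporters("", "http://collector:4317")` reports that only traces are exported, while logs are unaffected because they always go to stdout.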
Metrics
The service exposes two kinds of metrics: service-specific metrics that describe authorization behavior and internal caches, and generic transport metrics that describe the REST and gRPC surfaces.
All metrics are OpenTelemetry instruments emitted through the global MeterProvider. Their names follow the OpenTelemetry semantic conventions where applicable.
Service-specific metrics
These metrics capture Permission Service's authorization decisions and the state of its in-memory caches. They are emitted by the service's own instrumentation, not by any transport layer.
| Name | Type | Unit | Description |
|---|---|---|---|
|  | Counter |  | Total number of authorization requests that resulted in an |
|  | Counter |  | Total number of authorization requests that resulted in a |
|  | Gauge | — | Current number of entries held in the authorization-result LRU cache. |
|  | Gauge | — | Current number of entries held in the service-metadata LRU cache. |
|  | Gauge | — | Current number of validated access tokens and principals held in the token cache. |
The three cache gauges are refreshed on every request, so their sampled values reflect live cache utilization.
Generic transport metrics
These metrics are emitted by the REST and gRPC middleware layers for every request the service handles. They are useful for general request-rate, latency, and error-rate dashboards regardless of which authorization endpoint is called.
REST metrics
The following REST metrics are emitted by an internal middleware and are attributed with the matched route path, the HTTP method, and the resolved principal (empty when authentication is disabled or the caller is unauthenticated). The error counter additionally carries the status_code attribute:
| Name | Type | Unit | Description |
|---|---|---|---|
|  | Counter |  | Total number of HTTP requests received. |
|  | Histogram |  | End-to-end latency of HTTP request handling. |
|  | Counter |  | Total number of HTTP requests that returned a 4xx or 5xx status code. |
In addition, the service mounts the axum-observability server metric layer, which emits metrics that follow the OpenTelemetry HTTP semantic conventions. These are attributed with http.request.method, http.route (when a matched path is available), http.response.status_code (for the duration metric), and network.protocol.version:
| Name | Type | Unit | Description |
|---|---|---|---|
|  | Histogram |  | Duration of inbound HTTP server requests, measured from call entry to the start of the response. |
|  | Histogram |  | Size of the request body consumed by the handler (headers and trailers excluded). |
|  | Histogram |  | Size of the response body observed by the client (headers and trailers excluded). |
gRPC metrics
The following metrics are emitted by an internal gRPC middleware and are attributed with rpc.service, rpc.method, and the final rpc.grpc.status_code. The naming follows the OpenTelemetry RPC semantic conventions:
| Name | Type | Unit | Description |
|---|---|---|---|
|  | Counter |  | Total number of gRPC server requests received. |
|  | Histogram |  | Duration of gRPC request handling. |
|  | Counter |  | Total number of gRPC requests that completed with a non-OK status (non-zero |
Tracing
Tracing is built on the Rust tracing ecosystem, bridged to OpenTelemetry via tracing-opentelemetry. Spans are created either implicitly by #[tracing::instrument] annotations on handler functions or explicitly by the transport layer middlewares.
For each inbound REST request the TraceLayer middleware creates an http_request span with the HTTP method, the matched route, and OpenTelemetry attributes (otel.name = "<METHOD> <ROUTE>", otel.kind = "server"). For each inbound gRPC request a grpc_request span is created with otel.name = "<path>", otel.kind = "server", and a trace_id field populated from the current OpenTelemetry context.
Child spans are emitted for individual handler functions (policy and metadata CRUD, authorization middleware, JWT validation, userinfo lookups, token caching, event publishing, and so on) so that each stage of request processing is visible in the trace tree.
Distributed tracing
Distributed tracing is supported. The service configures the W3C Trace Context propagator as the global OpenTelemetry text-map propagator, which means:
- When an incoming request arrives (either REST or gRPC), the service extracts the parent trace context from the standard traceparent/tracestate HTTP headers and makes it the parent of the server-side span. If no valid context is present, a new root trace is started.
- Trace IDs flow unchanged across the REST and gRPC surfaces, so a single client request that spans multiple services shows up as one end-to-end trace in the collector.
Callers that initiate traces upstream (for example, an API gateway, a client SDK, or another Omniverse service) only need to propagate the traceparent header for their traces to stitch together with Permission Service spans.
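For callers that do not use an OpenTelemetry SDK, the traceparent header can be handled by hand. The following is a minimal sketch of the W3C Trace Context version-00 layout (`version-traceid-parentid-flags`), not code taken from Permission Service. The names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical, and the sketch accepts only the four-field layout and ignores tracestate.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a traceparent header; return None if it is not valid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the header for an outgoing call that continues the same trace."""
    span_id = secrets.token_hex(8)  # new 16-hex-digit span ID for this hop
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

A caller would send the result of `child_traceparent` as the traceparent header on its request. The trace ID is unchanged, which is what lets the collector stitch the caller's spans together with the server-side spans.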
Sampling is driven by a TraceIdRatioBased sampler. The sampling ratio is taken from observability.tracing.samplingRatio (a value between 0.0 and 1.0); 1.0 samples every trace, 0.0 samples none.
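The idea behind a trace-ID-ratio sampler is that the decision is a deterministic function of the trace ID, so every service that sees the same trace with the same ratio makes the same choice. The sketch below illustrates that idea only. The exact arithmetic differs between OpenTelemetry SDKs, and the function name `should_sample` and the use of the low 64 bits are assumptions of this example, not a description of the Rust SDK's implementation.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide whether to sample a trace from its ID and a ratio in [0.0, 1.0]."""
    ratio = min(max(ratio, 0.0), 1.0)  # clamp out-of-range values
    if ratio >= 1.0:
        return True  # 1.0 samples every trace
    # Treat the low 64 bits of the trace ID as a uniformly distributed number
    # and keep the trace when it falls below ratio * 2^64.
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low64 < int(ratio * (1 << 64))
```

With a ratio of 0.0 the threshold is zero, so no trace is kept. Because the trace ID is the only input besides the ratio, repeated calls for the same trace always agree.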
Logging
Logs are emitted through tracing and written to stdout. Log filtering uses the standard tracing_subscriber::EnvFilter directives syntax (the same syntax used by RUST_LOG), so the log level can be set globally or per target — for example:
- info: enable info and above for every target.
- permission_service=debug,hyper=warn: debug logs for the service itself, warnings only for hyper.
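To make the directive syntax concrete, here is a deliberately simplified sketch of how comma-separated `target=level` directives resolve, with the most specific matching target winning. It is not how tracing_subscriber::EnvFilter is implemented: the real filter also supports span and field directives, `off`, and bare target names, none of which are handled here. The function names and the example targets (`permission_service::auth`, `hyper::client`, `tonic`) are hypothetical, and treating unmatched targets as disabled is an assumption of this sketch.

```python
LEVELS = ["error", "warn", "info", "debug", "trace"]  # least to most verbose


def parse_directives(spec: str) -> dict:
    """Map each target prefix to its maximum level; "" is the global default."""
    directives = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        target, sep, level = part.partition("=")
        if not sep:  # a bare level such as "info" applies to every target
            target, level = "", part
        level = level.lower()
        if level not in LEVELS:
            raise ValueError(f"unsupported directive: {part!r}")
        directives[target] = level
    return directives


def is_enabled(directives: dict, target: str, level: str) -> bool:
    """Return True if a record for `target` at `level` passes the filter."""
    matches = [
        t for t in directives
        if t == "" or target == t or target.startswith(t + "::")
    ]
    if not matches:
        return False  # assumption of this sketch: unmatched targets are off
    best = max(matches, key=len)  # the longest (most specific) prefix wins
    return LEVELS.index(level) <= LEVELS.index(directives[best])
```

With `permission_service=debug,info`, a debug record from `permission_service::auth` passes while a debug record from `tonic` is filtered out, because only the global `info` directive applies to it.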
Two output formats are supported and are chosen at startup:
- Pretty (default): human-readable, colorized, multi-line output for interactive and development use.
- Structured: single-line JSON records produced by tracing-subscriber's built-in JSON formatter, suitable for ingestion by log aggregators. Enabled by setting observability.logging.structured.enabled to true.
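Structured output is easy to post-process because each line is one JSON object. The sketch below filters records by severity. It assumes the default field layout of tracing-subscriber's JSON formatter (`timestamp`, `level`, `target`, and a `fields` object containing `message`); the sample records and their messages are invented for illustration and are not actual Permission Service output.

```python
import json

_ORDER = {"TRACE": 0, "DEBUG": 1, "INFO": 2, "WARN": 3, "ERROR": 4}

# Hypothetical sample lines, shaped like the assumed JSON layout.
SAMPLE_LINES = [
    '{"timestamp":"2024-01-01T00:00:00Z","level":"INFO",'
    '"fields":{"message":"example startup message"},"target":"permission_service"}',
    '{"timestamp":"2024-01-01T00:00:01Z","level":"ERROR",'
    '"fields":{"message":"example failure message"},"target":"permission_service"}',
    "not a json line",
]


def select_records(lines, min_level="WARN"):
    """Return parsed records at or above min_level, skipping non-JSON lines."""
    threshold = _ORDER[min_level]
    selected = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output on stdout
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        if _ORDER.get(level, -1) >= threshold:
            selected.append(record)
    return selected
```

Calling `select_records(SAMPLE_LINES)` keeps only the error record; passing `min_level="INFO"` keeps both JSON records.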
Standard log-crate records from dependencies are forwarded into the same pipeline, so third-party libraries honor the configured level and format.
Log levels
The service uses the five standard tracing levels with the following conventions:
- error: Unrecoverable or user-visible failures. Examples: failure to initialize authentication on startup, failure to verify a user JWT, failure to fetch or parse JWKS or userinfo responses, failure to refresh the service-identity access token, invalid access-token files, and malformed JWT headers or claims. These messages indicate a real problem to investigate.
- warn: Degraded but non-fatal conditions. Examples: authentication disabled via DISABLE_AUTH, missing or non-Bearer Authorization headers, non-2xx responses from the identity provider's JWKS or userinfo endpoints, missing jwks_uri on the discovery document, failure to connect to the Event Aggregation Service, and failures to publish policy-change events.
- info: Normal lifecycle events and operator-visible state changes. Examples: the effective service configuration at startup, “Listening HTTP on …” and “Listening gRPC on …” bind confirmations, successful JWKS pulls, assignment of system policies and system metadata, successful connection to the Event Aggregation Service, access-token file reads, and scheduled token refreshes. Each instrumented handler also emits info-level spans for incoming requests.
- debug: Verbose per-request diagnostics useful during troubleshooting. Examples: absence of an Authorization header, number of policies matched for an action, request payloads for outgoing notification events, successful publication of policy-change events, and successful refreshes of client-credentials tokens.
- trace: Reserved for very fine-grained diagnostics. The service itself does not currently emit trace-level records, but the level is wired up and can be enabled to pull trace-level output from dependencies through RUST_LOG-style directives.
Unless a specific subsystem needs a deeper view, info is the recommended level for steady-state operation and permission_service=debug is the recommended level for focused troubleshooting of the service itself.
Configuring Observability with the Helm Chart
All observability settings live under the observability key in values.yaml. The chart translates these values into environment variables on the deployment (RUST_LOG, STRUCTURED_LOGGING, TRACES_ENDPOINT, TRACE_SAMPLING_RATIO, METRICS_ENDPOINT, SERVICE_NAME, SERVICE_SCOPE).
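The translation from values to environment variables can be pictured as a simple mapping. The sketch below is an illustration, not the chart's template logic: the pairing of each value with an environment variable is inferred from the names alone, the function name `values_to_env` is hypothetical, and the fallbacks used for missing keys (empty strings, `false`) are placeholders rather than the chart's real defaults.

```python
def values_to_env(observability: dict) -> dict:
    """Map the `observability` values block to deployment environment variables."""
    logging = observability.get("logging", {})
    tracing = observability.get("tracing", {})
    metrics = observability.get("metrics", {})
    structured = logging.get("structured", {}).get("enabled", False)
    return {
        # Pairings below are assumptions based on matching names.
        "SERVICE_NAME": str(observability.get("serviceName", "")),
        "SERVICE_SCOPE": str(observability.get("serviceScope", "")),
        "RUST_LOG": str(logging.get("level", "")),
        "STRUCTURED_LOGGING": "true" if structured else "false",
        "TRACES_ENDPOINT": str(tracing.get("endpoint", "")),
        "TRACE_SAMPLING_RATIO": str(tracing.get("samplingRatio", "")),
        "METRICS_ENDPOINT": str(metrics.get("endpoint", "")),
    }
```

Reading the settings this way is useful when debugging a deployment: inspecting the pod's environment shows exactly which filter, endpoints, and sampling ratio the service started with.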
Helm values reference
| Value | Default | Description |
|---|---|---|
|  |  |  |
|  |  | Instrumentation scope name used on exported metrics. |
|  |  | Log level filter (maps to |
|  |  | Enables JSON-structured log output produced by |
|  |  | OTLP/gRPC endpoint that receives trace spans, for example |
|  |  | Head-based sampling ratio between |
|  |  | OTLP/gRPC endpoint that receives metrics, for example |
Example: export metrics, traces, and structured logs
The following values block points both exporters at an OpenTelemetry Collector reachable inside the cluster, raises the service log level to debug, and switches log output to JSON so that a log pipeline can parse it:
```yaml
observability:
  serviceName: "permission-service"
  serviceScope: "permission_service"
  logging:
    level: "permission_service=debug,info"
    structured:
      enabled: true
  tracing:
    endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
    samplingRatio: "0.1"
  metrics:
    endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
```
Additional environment variables
The OpenTelemetry Rust SDK also honors its standard environment variables (for example OTEL_EXPORTER_OTLP_HEADERS, OTEL_RESOURCE_ATTRIBUTES, or OTEL_SDK_DISABLED). These can be injected through the chart’s extraVars map when a deployment needs to authenticate to a managed collector or add extra resource attributes:
```yaml
extraVars:
  OTEL_EXPORTER_OTLP_HEADERS: "authorization=Bearer <token>"
  OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production"
```