Observability

Permission Service is instrumented end-to-end with OpenTelemetry for metrics and traces, and uses the Rust tracing framework for structured logs. This page describes what the service emits, how distributed tracing is wired up, which log levels are used, and how to configure all three through the Helm chart.

OpenTelemetry Overview

Metrics and traces are exported over OTLP/gRPC to a user-supplied collector (for example, the OpenTelemetry Collector). Each exporter is configured independently:

  • Metrics are exported periodically from an SDK-managed PeriodicReader to the endpoint in observability.metrics.endpoint.

  • Traces are exported in batches from an SDK-managed batch exporter to the endpoint in observability.tracing.endpoint.

If either endpoint is left empty, the corresponding exporter is disabled and that signal is discarded in-process, so no collector is required. Logs are always written to stdout, regardless of the OpenTelemetry configuration.

All exported telemetry carries a common OpenTelemetry Resource populated from observability.serviceName (as service.name) and the pod’s hostname (as service.instance.id), so data from individual pods can be distinguished in a multi-replica deployment.
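
As a rough illustration of the two resource attributes described above, the sketch below builds them as a plain dictionary. It is not the service's code: the function name is hypothetical, and it assumes that SERVICE_NAME carries observability.serviceName and that the pod hostname is available through HOSTNAME or the operating system.

```python
import os
import socket


def build_resource_attributes(env=None):
    """Return the two resource attributes described above as a plain dict."""
    env = os.environ if env is None else env
    return {
        # Assumption: SERVICE_NAME is populated from observability.serviceName;
        # the fallback mirrors the chart default.
        "service.name": env.get("SERVICE_NAME", "permission-service"),
        # Assumption: the pod hostname is exposed via HOSTNAME, with the
        # OS-reported hostname as a fallback.
        "service.instance.id": env.get("HOSTNAME") or socket.gethostname(),
    }


# Hypothetical pod name, used only to show the shape of the result.
print(build_resource_attributes({"SERVICE_NAME": "permission-service", "HOSTNAME": "example-pod-0"}))
```

Because service.instance.id differs per pod, dashboards can group or filter by it to compare replicas.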

Metrics

The service exposes two kinds of metrics: service-specific metrics that describe authorization behavior and internal caches, and generic transport metrics that describe the REST and gRPC surfaces.

All metrics are OpenTelemetry instruments emitted through the global MeterProvider. Their names follow the OpenTelemetry semantic conventions where applicable.

Service-specific metrics

These metrics capture Permission Service’s authorization decisions and the state of its in-memory caches. They are emitted from the service’s own instrumentation, not from any transport layer.

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| authorization.request.allowed | Counter | requests | Total number of authorization requests that resulted in an allow decision. |
| authorization.request.denied | Counter | requests | Total number of authorization requests that resulted in a deny decision. |
| evaluation.cache.size | Gauge | | Current number of entries held in the authorization-result LRU cache. |
| metadata.cache.size | Gauge | | Current number of entries held in the service-metadata LRU cache. |
| token.cache.size | Gauge | | Current number of validated access tokens and principals held in the token cache. |

The three cache gauges are refreshed on every request, so their sampled values reflect live cache utilization.

Generic transport metrics

These metrics are emitted by the REST and gRPC middleware layers for every request the service handles. They are useful for general request-rate, latency, and error-rate dashboards regardless of which authorization endpoint is called.

REST metrics

The following REST metrics are emitted by an internal middleware and are attributed with the matched route path, the HTTP method, and the resolved principal (empty when authentication is disabled or the caller is unauthenticated). The error counter additionally carries the status_code attribute:

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| http.request.count | Counter | requests | Total number of HTTP requests received. |
| http.request.duration | Histogram | ms | End-to-end latency of HTTP request handling. |
| http.request.errors | Counter | errors | Total number of HTTP requests that returned a 4xx or 5xx status code. |

In addition, the service mounts the axum-observability server metric layer, which emits metrics that follow the OpenTelemetry HTTP semantic conventions. These are attributed with http.request.method, http.route (when a matched path is available), http.response.status_code (for the duration metric), and network.protocol.version:

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| http.server.request.duration | Histogram | s | Duration of inbound HTTP server requests, measured from call entry to the start of the response. |
| http.server.request.body.size | Histogram | By | Size of the request body consumed by the handler (headers and trailers excluded). |
| http.server.response.body.size | Histogram | By | Size of the response body observed by the client (headers and trailers excluded). |

gRPC metrics

The following metrics are emitted by an internal gRPC middleware and are attributed with rpc.service, rpc.method, and the final rpc.grpc.status_code. The naming follows the OpenTelemetry RPC semantic conventions:

| Name | Type | Unit | Description |
| --- | --- | --- | --- |
| rpc.server.request.count | Counter | requests | Total number of gRPC server requests received. |
| rpc.server.duration | Histogram | ms | Duration of gRPC request handling. |
| rpc.server.request.error.count | Counter | requests | Total number of gRPC requests that completed with a non-OK status (non-zero grpc-status, or a transport failure reported as UNKNOWN). |
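
The error-counting rule for rpc.server.request.error.count can be pictured with the short sketch below. It only models the rule as stated here and is not the middleware's implementation; the function names are hypothetical, and representing a transport failure as None is an assumption made for the example.

```python
from typing import Optional

GRPC_OK = 0
GRPC_UNKNOWN = 2  # numeric value of the UNKNOWN status code in gRPC


def final_status_code(grpc_status: Optional[int]) -> int:
    """Pick the status code to record for a finished call.

    None stands for a transport failure that produced no grpc-status
    (a representation chosen for this sketch only).
    """
    return GRPC_UNKNOWN if grpc_status is None else grpc_status


def counts_as_error(grpc_status: Optional[int]) -> bool:
    """True when the call would increment the error counter."""
    return final_status_code(grpc_status) != GRPC_OK


for status in (0, 7, None):
    print(status, final_status_code(status), counts_as_error(status))
```

Under this rule every call increments rpc.server.request.count, and only calls with a non-OK final code also increment the error counter, so the ratio of the two gives an error rate per rpc.service and rpc.method.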

Tracing

Tracing is built on the Rust tracing ecosystem, bridged to OpenTelemetry via tracing-opentelemetry. Spans are created either implicitly by #[tracing::instrument] annotations on handler functions or explicitly by the transport-layer middleware.

For each inbound REST request the TraceLayer middleware creates an http_request span with the HTTP method, the matched route, and OpenTelemetry attributes (otel.name = "<METHOD> <ROUTE>", otel.kind = "server"). For each inbound gRPC request a grpc_request span is created with otel.name = "<path>", otel.kind = "server", and a trace_id field populated from the current OpenTelemetry context.

Child spans are emitted for individual handler functions (policy and metadata CRUD, authorization middleware, JWT validation, userinfo lookups, token caching, event publishing, and so on) so that each stage of request processing is visible in the trace tree.

Distributed tracing

Distributed tracing is supported. The service configures the W3C Trace Context propagator as the global OpenTelemetry text-map propagator, which means:

  • When an incoming request arrives (either REST or gRPC), the service extracts the parent trace context from the standard traceparent / tracestate HTTP headers and makes it the parent of the server-side span. If no valid context is present, a new root trace is started.

  • Trace IDs flow unchanged across the REST and gRPC surfaces, so a single client request that spans multiple services shows up as one end-to-end trace in the collector.

Callers that initiate traces upstream (for example, an API gateway, a client SDK, or another Omniverse service) only need to propagate the traceparent header for their traces to stitch together with Permission Service spans.
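
For callers that do not use an OpenTelemetry SDK, the traceparent value can be produced and checked by hand. The sketch below follows the version-00 layout of the W3C Trace Context header (version, trace ID, parent span ID, flags); the function names are hypothetical, and an instrumented client would normally let its SDK's propagator do this instead.

```python
import re
import secrets

# version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a version-00 traceparent value for a new root trace."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(value: str):
    """Return (trace_id, parent_span_id, sampled), or None if the value is invalid."""
    match = _TRACEPARENT.match(value.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)


headers = {"traceparent": new_traceparent()}
print(headers, parse_traceparent(headers["traceparent"]))
```

A caller would attach the headers dictionary to its REST request (or send the same key as gRPC metadata). When the value is rejected by a check like parse_traceparent, the behavior described above applies and a new root trace is started.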

Sampling is driven by a TraceIdRatioBased sampler. The sampling ratio is taken from observability.tracing.samplingRatio (a value between 0.0 and 1.0); 1.0 samples every trace, 0.0 samples none.
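
To make the effect of the ratio concrete, the sketch below shows one common way a trace-ID-ratio sampler reaches its decision: it derives a number from the trace ID and compares it with a threshold proportional to the ratio. This illustrates the idea only; the exact arithmetic inside the SDK's sampler may differ, and should_sample is a hypothetical name.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide deterministically from the trace ID whether a trace is sampled."""
    ratio = min(max(ratio, 0.0), 1.0)        # clamp to the documented range
    upper_bound = int(ratio * (1 << 63))     # threshold proportional to the ratio
    low_64 = int(trace_id_hex[16:], 16)      # lower 8 bytes of the 16-byte trace ID
    return (low_64 >> 1) < upper_bound       # 63-bit value compared with the threshold


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for ratio in (0.0, 0.5, 0.75, 1.0):
    print(ratio, should_sample(trace_id, ratio))
```

Because the decision depends only on the trace ID and the ratio, every span of a given trace receives the same verdict, and a trace sampled at a low ratio remains sampled at any higher ratio.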

Logging

Logs are emitted through tracing and written to stdout. Log filtering uses the standard tracing_subscriber::EnvFilter directive syntax (the same syntax used by RUST_LOG), so the log level can be set globally or per target. For example:

  • info — enable info and above for every target.

  • permission_service=debug,hyper=warn — debug logs for the service itself, warnings only for hyper.
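
The way such a directive string resolves to a decision for a single log record can be modeled in a few lines. The sketch below is a simplified model, not EnvFilter itself: it handles only bare levels and target=level pairs, matches targets by plain prefix with the longest match winning, and treats a target with no matching directive and no global level as disabled. The module path permission_service::auth is a hypothetical target name.

```python
LEVELS = ["trace", "debug", "info", "warn", "error"]  # least to most severe


def parse_directives(spec: str):
    """Split a directive string into (default_level, {target: level})."""
    default, per_target = None, {}
    for part in filter(None, (p.strip() for p in spec.split(","))):
        if "=" in part:
            target, level = part.split("=", 1)
            per_target[target] = level.lower()
        else:
            default = part.lower()
    return default, per_target


def is_enabled(spec: str, target: str, level: str) -> bool:
    """True when a record at `level` from `target` passes the filter."""
    default, per_target = parse_directives(spec)
    # The most specific (longest) matching target prefix wins.
    matches = [t for t in per_target if target.startswith(t)]
    threshold = per_target[max(matches, key=len)] if matches else default
    if threshold is None:
        return False
    return LEVELS.index(level) >= LEVELS.index(threshold)


spec = "permission_service=debug,hyper=warn"
print(is_enabled(spec, "permission_service::auth", "debug"))
print(is_enabled(spec, "hyper::client", "info"))
```

Adding a bare level to the string, as in permission_service=debug,info, gives every other target a global threshold while keeping the per-target override.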

Two output formats are supported and are chosen at startup:

  • Pretty (default) — human-readable, colorized, multi-line output for interactive and development use.

  • Structured — single-line JSON records produced by tracing-subscriber’s built-in JSON formatter, suitable for ingestion by log aggregators. Enabled by setting observability.logging.structured.enabled to true.

Standard log-crate records from dependencies are forwarded into the same pipeline, so third-party libraries honor the configured level and format.
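
When structured output is enabled, each line on stdout can be parsed independently. The sketch below counts records per level and collects warning and error messages. It assumes the default layout of tracing-subscriber's JSON formatter, with top-level level and target keys and the message under fields.message; the sample lines are invented, so check a real line from the deployment before relying on these key names.

```python
import json
from collections import Counter

# Invented sample lines; real records carry the service's own messages.
SAMPLE_LINES = [
    '{"timestamp":"2024-01-01T00:00:00Z","level":"INFO","fields":{"message":"example startup message"},"target":"permission_service"}',
    '{"timestamp":"2024-01-01T00:00:01Z","level":"WARN","fields":{"message":"example warning"},"target":"permission_service"}',
    "a stray line that is not JSON",
]


def summarize(lines):
    """Count records per level and collect WARN/ERROR messages; skip non-JSON lines."""
    levels, problems = Counter(), []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        level = str(record.get("level", "")).upper()
        levels[level] += 1
        if level in ("WARN", "ERROR"):
            problems.append(record.get("fields", {}).get("message", ""))
    return levels, problems


print(summarize(SAMPLE_LINES))
```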

Log levels

The service uses the five standard tracing levels with the following conventions:

  • error — Unrecoverable or user-visible failures. Examples: failure to initialize authentication on startup, failure to verify a user JWT, failure to fetch or parse JWKS or userinfo responses, failure to refresh the service-identity access token, invalid access-token files, and malformed JWT headers or claims. These messages indicate a real problem to investigate.

  • warn — Degraded but non-fatal conditions. Examples: authentication disabled via DISABLE_AUTH, missing or non-Bearer Authorization headers, non-2xx responses from the identity provider’s JWKS or userinfo endpoints, missing jwks_uri on the discovery document, failure to connect to the Event Aggregation Service, and failures to publish policy-change events.

  • info — Normal lifecycle events and operator-visible state changes. Examples: the effective service configuration at startup, “Listening HTTP on …” and “Listening gRPC on …” bind confirmations, successful JWKS pulls, assignment of system policies and system metadata, successful connection to the Event Aggregation Service, access-token file reads, and scheduled token refreshes. Each instrumented handler also emits info-level spans for incoming requests.

  • debug — Verbose per-request diagnostics useful during troubleshooting. Examples: absence of an Authorization header, number of policies matched for an action, request payloads for outgoing notification events, successful publication of policy-change events, and successful refreshes of client-credentials tokens.

  • trace — Reserved for very fine-grained diagnostics. The service itself does not currently emit trace-level records, but the level is wired up and can be enabled to pull trace-level output from dependencies through RUST_LOG-style directives.

Unless a specific subsystem needs a deeper view, info is the recommended level for steady-state operation and permission_service=debug is the recommended level for focused troubleshooting of the service itself.

Configuring Observability with the Helm Chart

All observability settings live under the observability key in values.yaml. The chart translates these values into environment variables on the deployment (RUST_LOG, STRUCTURED_LOGGING, TRACES_ENDPOINT, TRACE_SAMPLING_RATIO, METRICS_ENDPOINT, SERVICE_NAME, SERVICE_SCOPE).

Helm values reference

| Value | Default | Description |
| --- | --- | --- |
| observability.serviceName | "permission-service" | service.name resource attribute attached to all exported metrics and traces, and used as the tracer/meter name. |
| observability.serviceScope | "permission_service" | Instrumentation scope name used on exported metrics. |
| observability.logging.level | "info" | Log level filter (maps to RUST_LOG). Supports the full tracing directives syntax, for example info or permission_service=debug,hyper=warn. |
| observability.logging.structured.enabled | false | Enables JSON-structured log output produced by tracing-subscriber’s JSON formatter. Leave disabled for human-readable pretty logs. |
| observability.tracing.endpoint | "" | OTLP/gRPC endpoint that receives trace spans, for example http://otel-collector:4317. Leaving this empty disables trace export. |
| observability.tracing.samplingRatio | "1.0" | Head-based sampling ratio between 0.0 and 1.0. 1.0 samples every trace; lower values proportionally reduce the exported volume. |
| observability.metrics.endpoint | "" | OTLP/gRPC endpoint that receives metrics, for example http://otel-collector:4317. Leaving this empty disables metrics export. |

Example: export metrics, traces, and structured logs

The following values block points both exporters at an OpenTelemetry Collector reachable inside the cluster, samples 10% of traces, raises the service’s own log level to debug while keeping every other target at info, and switches log output to JSON so that a log pipeline can parse it:

observability:
  serviceName: "permission-service"
  serviceScope: "permission_service"

  logging:
    level: "permission_service=debug,info"
    structured:
      enabled: true

  tracing:
    endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
    samplingRatio: "0.1"

  metrics:
    endpoint: "http://otel-collector.observability.svc.cluster.local:4317"
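
A values block like this can be sanity-checked before deployment. The sketch below is a hypothetical pre-deploy check, not part of the chart: it takes the values as an already-parsed dictionary, treats an empty endpoint as valid (the exporter is simply disabled), assumes endpoints are http or https URLs as in the examples on this page, and verifies that the sampling ratio lies between 0.0 and 1.0.

```python
from urllib.parse import urlparse


def check_observability(values: dict) -> list:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    obs = values.get("observability", {})
    for section in ("tracing", "metrics"):
        endpoint = obs.get(section, {}).get("endpoint", "")
        if endpoint:  # an empty endpoint is valid: it disables that exporter
            parsed = urlparse(endpoint)
            if parsed.scheme not in ("http", "https") or not parsed.hostname:
                problems.append(f"{section}.endpoint is not an http(s) URL: {endpoint!r}")
    raw_ratio = obs.get("tracing", {}).get("samplingRatio", "1.0")
    try:
        ratio = float(raw_ratio)
        if not 0.0 <= ratio <= 1.0:
            problems.append(f"tracing.samplingRatio out of range: {raw_ratio!r}")
    except (TypeError, ValueError):
        problems.append(f"tracing.samplingRatio is not a number: {raw_ratio!r}")
    return problems


example = {
    "observability": {
        "tracing": {
            "endpoint": "http://otel-collector.observability.svc.cluster.local:4317",
            "samplingRatio": "0.1",
        },
        "metrics": {"endpoint": "http://otel-collector.observability.svc.cluster.local:4317"},
    }
}
print(check_observability(example))
```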

Additional environment variables

The OpenTelemetry Rust SDK also honors its standard environment variables (for example OTEL_EXPORTER_OTLP_HEADERS, OTEL_RESOURCE_ATTRIBUTES, or OTEL_SDK_DISABLED). These can be injected through the chart’s extraVars map when a deployment needs to authenticate to a managed collector or add extra resource attributes:

extraVars:
  OTEL_EXPORTER_OTLP_HEADERS: "authorization=Bearer <token>"
  OTEL_RESOURCE_ATTRIBUTES: "deployment.environment=production"
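
Both variables above hold comma-separated key=value pairs. The sketch below shows how such a string breaks down into individual entries, which can help when checking a value before adding it to extraVars. The function name is hypothetical, and percent-decoding the values is an assumption based on the OpenTelemetry environment-variable specification rather than on anything specific to this service.

```python
from urllib.parse import unquote


def parse_kv_list(value: str) -> dict:
    """Parse 'k1=v1,k2=v2' into a dict, skipping empty or malformed items."""
    pairs = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, sep, val = item.partition("=")
        if not sep or not key.strip():
            continue
        pairs[key.strip()] = unquote(val.strip())
    return pairs


print(parse_kv_list("authorization=Bearer <token>"))
print(parse_kv_list("deployment.environment=production"))
```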