Observability (Metrics, Traces, Logs)#

This document describes Helm configuration and metric definitions for each service. Storage Service includes the Storage Core component; Event Aggregation Service and Event Consumer Service use the shared Notification Common library for telemetry.

Part I – Helm configuration by service#

Storage Service#

| Helm / behavior | Details |
| --- | --- |
| Metrics | OTEL_METRICS_EXPORTER is set to prometheus. OTEL_EXPORTER_PROMETHEUS_PORT is set from service.metrics.port. |
| Ports | gRPC: service.grpc.port (default 8011). REST: service.rest.port (default 8012). Metrics: service.metrics.port (default 8013); Prometheus scrapes GET /metrics on this port. |
| Logging | RUST_LOG is built from config.logging.level and optional config.logging.extra_targets. RUST_BACKTRACE comes from config.logging.backtrace. |
| Traces | Not configured by default in Helm. OTLP tracing can be enabled by setting the appropriate OTEL environment variables (e.g. traces exporter and endpoint). |
| Logs | Logs go to stdout only. |
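
The Logging entry above says RUST_LOG is assembled from a base level plus optional extra targets. The following is a minimal sketch of one way that composition could work. It assumes extra_targets is a list of `target=level` directives and that the result uses the usual comma-separated RUST_LOG syntax; the function name `build_rust_log` and the target `storage_core` are hypothetical, and the actual Helm template may differ.

```python
from typing import Optional, Sequence


def build_rust_log(level: str, extra_targets: Optional[Sequence[str]] = None) -> str:
    """Join a default level and optional per-target directives into a RUST_LOG value.

    Assumption: each entry in extra_targets already has the form "target=level".
    """
    directives = [level.strip()]
    for target in extra_targets or []:
        target = target.strip()
        if target:  # skip empty entries so the value has no stray commas
            directives.append(target)
    return ",".join(directives)


if __name__ == "__main__":
    # "storage_core" is a hypothetical target name used only for illustration.
    print(build_rust_log("info", ["hyper=warn", "storage_core=debug"]))
```

With no extra targets the value is just the base level, which keeps the default configuration equivalent to setting a single global log level.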

Event Aggregation Service#

| Helm / behavior | Details |
| --- | --- |
| Telemetry | When telemetry.enabled is true, the service exports traces, metrics, and logs via OTLP. |
| OTEL endpoints | telemetry.otlp_tracing_endpoint, telemetry.otlp_metrics_endpoint, telemetry.otlp_logs_endpoint. Default: otel-collector.observability.svc.cluster.local:4317. |
| ConfigMap env | Values are passed into the pod as OMNI_EVENTS_TELEMETRY_ENABLED, OMNI_EVENTS_OTLP_TRACING_ENDPOINT, OMNI_EVENTS_OTLP_METRICS_ENDPOINT, OMNI_EVENTS_OTLP_LOGS_ENDPOINT, OMNI_EVENTS_LOG_LEVEL, OMNI_EVENTS_GRPC_PORT, and related variables. |
| Logging | logging.level (e.g. INFO) maps to the service log level. |
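
To make the mapping from Helm values to pod environment concrete, here is a rough sketch of how the telemetry and logging values could translate into the OMNI_EVENTS_ variables named above. The function `render_env`, the "true"/"false" string encoding, and the fallback values for telemetry.enabled and logging.level are assumptions for illustration; only the variable names and the default endpoint come from this document.

```python
from typing import Any, Dict

# Default OTLP endpoint as documented for the event services.
DEFAULT_ENDPOINT = "otel-collector.observability.svc.cluster.local:4317"


def render_env(values: Dict[str, Any]) -> Dict[str, str]:
    """Translate a Helm-values-like dict into OMNI_EVENTS_* environment variables."""
    telemetry = values.get("telemetry", {})
    logging_cfg = values.get("logging", {})
    # Assumption: booleans are rendered as lowercase "true"/"false" strings.
    enabled = str(bool(telemetry.get("enabled", False))).lower()
    return {
        "OMNI_EVENTS_TELEMETRY_ENABLED": enabled,
        "OMNI_EVENTS_OTLP_TRACING_ENDPOINT": telemetry.get("otlp_tracing_endpoint", DEFAULT_ENDPOINT),
        "OMNI_EVENTS_OTLP_METRICS_ENDPOINT": telemetry.get("otlp_metrics_endpoint", DEFAULT_ENDPOINT),
        "OMNI_EVENTS_OTLP_LOGS_ENDPOINT": telemetry.get("otlp_logs_endpoint", DEFAULT_ENDPOINT),
        # Assumption: INFO is used when logging.level is not set.
        "OMNI_EVENTS_LOG_LEVEL": logging_cfg.get("level", "INFO"),
    }


if __name__ == "__main__":
    for name, value in render_env({"telemetry": {"enabled": True}}).items():
        print(f"{name}={value}")
```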

Event Consumer Service#

| Helm / behavior | Details |
| --- | --- |
| Telemetry | Same pattern as Event Aggregation Service: when telemetry.enabled is true, the service exports traces, metrics, and logs via OTLP. |
| OTEL endpoints | telemetry.otlp_tracing_endpoint, telemetry.otlp_metrics_endpoint, telemetry.otlp_logs_endpoint. Default: otel-collector.observability.svc.cluster.local:4317. |
| ConfigMap env | Same env vars as Event Aggregation Service (OMNI_EVENTS_TELEMETRY_ENABLED, OMNI_EVENTS_OTLP_*_ENDPOINT, etc.). |

Notification Common (shared by Event Aggregation Service and Event Consumer Service)#

| Helm / behavior | Details |
| --- | --- |
| Env prefix | Configuration for the event services uses the OMNI_EVENTS_ prefix. |
| Settings | Telemetry is controlled by the same Helm values as Event Aggregation Service and Event Consumer Service: telemetry enabled flag and OTLP endpoints for traces, metrics, and logs. |

Part II – Metric explanations by service#

Storage Service#

Common attributes include method, API version, storage backend, pod name, package version, and result.

| Metric name | Type | Description |
| --- | --- | --- |
| storage.requests | Counter | Total storage API requests |
| storage.request.duration | Histogram | Request duration (seconds) |
| storage.sdk.requests | Counter | Requests to the storage backend SDK |
| storage.sdk.request.duration | Histogram | Backend SDK call duration |
| storage.enumeration.items | Histogram (U64) | Items returned per enumeration (list, list_stat, enumerate, enumerate_versions) |
| storage.read.redirects | Counter | Redirect URLs returned for read operations |
| storage.read.chunk.size | Histogram (U64) | Chunk size from backend (bytes) |
| storage.read.object.size | Histogram (U64) | Total size of read objects (bytes) |
| storage.write.operations | Counter | Write operations by upload method (body, redirect, multipart) |
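
Because the Storage Service exposes these metrics through a Prometheus exporter, the dotted names listed here will typically show up on /metrics with underscores instead. The sketch below applies only the classic Prometheus name grammar (`[a-zA-Z_:][a-zA-Z0-9_:]*`); the helper `to_prometheus_name` is hypothetical, and depending on exporter version and settings a real exporter may also append suffixes such as `_total` or a unit, so treat the output as an approximation of the scraped name.

```python
import re


def to_prometheus_name(otel_name: str) -> str:
    """Replace characters outside the classic Prometheus name grammar with underscores."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # A Prometheus metric name may not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name


if __name__ == "__main__":
    for dotted in ("storage.requests", "storage.request.duration", "storage.read.chunk.size"):
        print(dotted, "->", to_prometheus_name(dotted))
```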

gRPC server metrics#

| Metric name | Type | Description |
| --- | --- | --- |
| grpc.server.call.started | Counter | Server calls started |
| grpc.server.call.rcvd_total_compressed_message_size | Histogram (U64) | Compressed bytes received per RPC |
| grpc.server.call.sent_total_compressed_message_size | Histogram (U64) | Compressed bytes sent per call |
| grpc.server.call.duration | Histogram (f64) | Call duration (seconds). Attributes: grpc.method, grpc.service, grpc.name, grpc.status |
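
Histogram metrics such as grpc.server.call.duration are usually consumed as percentiles. As an illustration of how a percentile can be estimated from cumulative bucket counts, the sketch below uses linear interpolation within the matching bucket, in the style of Prometheus's `histogram_quantile`. The bucket boundaries and counts in the example are made up, and the function is a simplified stand-in for what a query engine does, not part of any service.

```python
import math
from typing import List, Optional, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The last bucket is expected to have an upper bound of +Inf.
    """
    if not 0.0 <= q <= 1.0 or len(buckets) < 2:
        return None
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total <= 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the overflow bucket: report the highest finite bound.
                return buckets[-2][0]
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return None


if __name__ == "__main__":
    # Hypothetical cumulative counts for call durations in seconds.
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

The estimate is only as precise as the bucket layout: a quantile that lands in a wide bucket can be off by up to that bucket's width.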

Metadata cache#

| Metric name | Type | Description |
| --- | --- | --- |
| storage_metadata_cache_cache_access | Counter | Cache access count (attributes: pod, name) |
| storage_metadata_cache_cache_miss | Counter | Cache miss count |
| storage_metadata_cache_entries | Gauge | Current cache entry count |
| storage_metadata_cache_size | Gauge | Cache memory footprint (bytes) |
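
The access and miss counters are most useful when combined into a hit ratio. A minimal sketch follows, assuming storage_metadata_cache_cache_access counts every lookup (hits and misses together); if it counts hits only, the formula would need to change. The function name `cache_hit_ratio` is illustrative.

```python
from typing import Optional


def cache_hit_ratio(accesses: float, misses: float) -> Optional[float]:
    """Return the fraction of cache lookups served from the cache, or None if there were none.

    Assumption: `accesses` includes both hits and misses.
    """
    if accesses <= 0:
        return None
    # Clamp at zero in case the two counters were sampled at slightly different moments.
    return max(0.0, 1.0 - misses / accesses)


if __name__ == "__main__":
    # Example values as they might be read from the two counters.
    print(cache_hit_ratio(accesses=200, misses=50))
```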

Metadata backends#

S3: storage_s3_metadata_list_objects_latency, storage_s3_metadata_get_object_latency, storage_s3_metadata_put_object_latency, storage_s3_metadata_delete_object_latency (histograms).

DynamoDB: storage_aws_dynamodb_metadata_put_item_latency, storage_aws_dynamodb_metadata_delete_item_latency, storage_aws_dynamodb_metadata_batch_get_item_latency, storage_aws_dynamodb_metadata_batch_get_item_request_items, storage_aws_dynamodb_metadata_query_latency.

Azure Blob: storage_azure_blob_metadata_list_blobs_latency, storage_azure_blob_metadata_get_blob_latency, storage_azure_blob_metadata_put_blob_latency, storage_azure_blob_metadata_delete_blob_latency.

Azure Table: storage_azure_table_metadata_query_latency, storage_azure_table_metadata_query_response_size, storage_azure_table_metadata_delete_latency, storage_azure_table_metadata_update_latency, storage_azure_table_metadata_insert_or_update_latency.

Each backend call also records its HTTP status, broken down by status code.

Pub/sub#

SQS: storage_sqs_delete_message_batch_latency, storage_sqs_delete_message_batch_batch_size, storage_sqs_delete_message_batch_active_requests, storage_sqs_receive_message_latency, storage_sqs_receive_message_batch_size, storage_sqs_receive_message_active_requests, storage_sqs_events_per_message.

Azure Service Bus: storage_azure_service_bus_receive_messages_latency, storage_azure_service_bus_receive_messages_batch_size, storage_azure_service_bus_receive_messages_active_requests, storage_azure_service_bus_complete_message_latency, storage_azure_service_bus_complete_message_active_requests.

Notification client: storage_notification_service_publish_latency, storage_notification_service_publish_active_requests.

OAuth2: storage_oauth2_client_credentials_provider_get_token_latency, storage_oauth2_client_credentials_provider_get_token_results.

Event Aggregation Service#

Event Aggregation Service emits the Notification Common gRPC server metrics, which it shares with Event Consumer Service and which are listed in the Event Consumer Service section. It does not define additional service-specific metrics.

Event Consumer Service#

Notification Common gRPC metrics (shared with Event Aggregation Service)#

| Metric name | Type | Description |
| --- | --- | --- |
| rpc.server.active_requests | UpDownCounter | In-flight requests |
| rpc.server.requests_per_rpc | Histogram | Requests per RPC |
| rpc.server.request_size | Histogram | Request size |
| rpc.server.response_size | Histogram | Response size |
| rpc.server.responses_per_rpc | Histogram | Responses per RPC |
| rpc.server.calls | Counter | Total gRPC calls |
| rpc.server.active_methods | UpDownCounter | In-flight methods |
| rpc.server.active_responses | UpDownCounter | In-flight responses |
| rpc.server.duration | Histogram | RPC duration |

Consumer metrics (Event Consumer Service only)#

| Metric name | Type | Description |
| --- | --- | --- |
| events_by_type_total | Counter | Events processed by event type (label: event_type) |
| events_processed_successfully_total | Counter | Successfully processed events |
| events_processing_failed_total | Counter | Events that failed during processing |
| event_processing_duration_ms | Histogram | Time spent processing each event (ms) |
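
The success and failure counters can be turned into a failure ratio over a time window. The sketch below takes two samples of each counter and treats a decrease as a counter reset (for example after a pod restart), which is the usual convention for monotonic counters. It assumes the two counters are disjoint and together cover every processed event; the function names are illustrative.

```python
from typing import Optional


def counter_increase(previous: float, current: float) -> float:
    """Increase of a monotonic counter between two samples, tolerating one reset."""
    # A lower current value means the counter restarted from zero.
    return current if current < previous else current - previous


def failure_ratio(prev_ok: float, cur_ok: float,
                  prev_failed: float, cur_failed: float) -> Optional[float]:
    """Share of events that failed within the window, or None if nothing was processed."""
    ok = counter_increase(prev_ok, cur_ok)
    failed = counter_increase(prev_failed, cur_failed)
    total = ok + failed
    return failed / total if total else None


if __name__ == "__main__":
    # Samples of events_processed_successfully_total and events_processing_failed_total
    # taken at the start and end of a window (values are made up).
    print(failure_ratio(prev_ok=100, cur_ok=195, prev_failed=10, cur_failed=15))
```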

SSE metrics (Event Consumer Service only)#

| Metric name | Type | Description |
| --- | --- | --- |
| sse.server.active_connections | UpDownCounter | Active SSE connections |
| sse.server.connections_total | Counter | Total SSE connections established |
| sse.server.connection_duration | Histogram | Connection duration (ms) |
| sse.server.events_sent_total | Counter | Events sent via SSE |
| sse.server.events_per_connection | Histogram | Events sent per SSE connection |
| sse.server.connection_errors_total | Counter | SSE connection errors |

Labels include endpoint and, where applicable, event_type.