Observability (Metrics, Traces, Logs)

This document describes Helm configuration and metric definitions for each service. Storage Service includes the Storage Core component; Event Aggregation Service and Event Consumer Service use the shared Notification Common library for telemetry.

Part I – Helm configuration by service

Storage Service

Helm / behavior	Details
Metrics	`OTEL_METRICS_EXPORTER` is set to `prometheus`. `OTEL_EXPORTER_PROMETHEUS_PORT` is set from `service.metrics.port`.
Ports	gRPC: `service.grpc.port` (default 8011). REST: `service.rest.port` (default 8012). Metrics: `service.metrics.port` (default 8013); Prometheus scrapes `GET /metrics` on this port.
Logging	`RUST_LOG` is built from `config.logging.level` and optional `config.logging.extra_targets`. `RUST_BACKTRACE` comes from `config.logging.backtrace`.
Traces	Not configured by default in Helm. OTLP tracing can be enabled by setting the appropriate OTEL environment variables (e.g. traces exporter and endpoint).
Logs	Logs go to stdout only.
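
As an illustration of the logging row above, the sketch below shows one way a default level and per-target overrides could be combined into a `RUST_LOG` value. The comma-separated `target=level` directive syntax is the standard `RUST_LOG` format; the function name `build_rust_log`, the mapping shape used for `extra_targets`, and the target names are assumptions made for this example, and the chart's actual template may differ.

```python
def build_rust_log(level, extra_targets=None):
    """Join a default level and optional per-target overrides into one RUST_LOG string.

    `extra_targets` is assumed here to be a mapping of target name to level;
    the real shape of `config.logging.extra_targets` is not documented above.
    """
    directives = [level]
    for target, target_level in (extra_targets or {}).items():
        directives.append(f"{target}={target_level}")
    return ",".join(directives)


# Hypothetical target names, used only for illustration.
print(build_rust_log("info"))
print(build_rust_log("info", {"hyper": "warn", "storage_core": "debug"}))
```

With only a level set, the value is just that level; each extra target adds one `target=level` directive after it.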

Event Aggregation Service

Helm / behavior	Details
Telemetry	When `telemetry.enabled` is true, the service exports traces, metrics, and logs via OTLP.
OTEL endpoints	`telemetry.otlp_tracing_endpoint`, `telemetry.otlp_metrics_endpoint`, `telemetry.otlp_logs_endpoint`. Default: `otel-collector.observability.svc.cluster.local:4317`.
ConfigMap env	Values are passed into the pod as `OMNI_EVENTS_TELEMETRY_ENABLED`, `OMNI_EVENTS_OTLP_TRACING_ENDPOINT`, `OMNI_EVENTS_OTLP_METRICS_ENDPOINT`, `OMNI_EVENTS_OTLP_LOGS_ENDPOINT`, `OMNI_EVENTS_LOG_LEVEL`, `OMNI_EVENTS_GRPC_PORT`, and related variables.
Logging	`logging.level` (e.g. INFO) maps to the service log level.
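
To make the ConfigMap row concrete, here is a minimal sketch of reading the `OMNI_EVENTS_` variables inside a pod. It is written in Python purely for illustration and is not the service's own configuration loader. The environment variable names and the default endpoint come from the table above; the `TelemetrySettings` type, the accepted boolean spellings, and the fallback values for the enabled flag and log level are assumptions.

```python
from dataclasses import dataclass

# Default taken from the "OTEL endpoints" row above.
DEFAULT_OTLP_ENDPOINT = "otel-collector.observability.svc.cluster.local:4317"


@dataclass
class TelemetrySettings:
    enabled: bool
    tracing_endpoint: str
    metrics_endpoint: str
    logs_endpoint: str
    log_level: str


def load_telemetry_settings(env):
    """Build settings from a mapping of environment variables (e.g. os.environ)."""

    def endpoint(signal):
        return env.get(f"OMNI_EVENTS_OTLP_{signal}_ENDPOINT", DEFAULT_OTLP_ENDPOINT)

    # Assumption: the flag is a string such as "true"; the real parser may differ.
    raw_enabled = env.get("OMNI_EVENTS_TELEMETRY_ENABLED", "false")
    return TelemetrySettings(
        enabled=raw_enabled.strip().lower() in ("1", "true", "yes"),
        tracing_endpoint=endpoint("TRACING"),
        metrics_endpoint=endpoint("METRICS"),
        logs_endpoint=endpoint("LOGS"),
        log_level=env.get("OMNI_EVENTS_LOG_LEVEL", "INFO"),  # assumed fallback
    )


settings = load_telemetry_settings({"OMNI_EVENTS_TELEMETRY_ENABLED": "true"})
print(settings.enabled, settings.metrics_endpoint)
```

Passing a plain dictionary instead of `os.environ` keeps the sketch easy to test; in a pod the same function could be called with `os.environ`.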

Event Consumer Service

Helm / behavior	Details
Telemetry	Same pattern as Event Aggregation Service: when `telemetry.enabled` is true, the service exports traces, metrics, and logs via OTLP.
OTEL endpoints	`telemetry.otlp_tracing_endpoint`, `telemetry.otlp_metrics_endpoint`, `telemetry.otlp_logs_endpoint`. Default: `otel-collector.observability.svc.cluster.local:4317`.
ConfigMap env	Same env vars as Event Aggregation Service (`OMNI_EVENTS_TELEMETRY_ENABLED`, `OMNI_EVENTS_OTLP_*_ENDPOINT`, etc.).

Notification Common (shared by Event Aggregation Service and Event Consumer Service)

Helm / behavior	Details
Env prefix	Configuration for the event services uses the `OMNI_EVENTS_` prefix.
Settings	Telemetry is controlled by the same Helm values as Event Aggregation Service and Event Consumer Service: the `telemetry.enabled` flag and the `telemetry.otlp_*_endpoint` values for traces, metrics, and logs.

Part II – Metric explanations by service

Storage Service

Common attributes include method, API version, storage backend, pod name, package version, and result.

Metric name	Type	Description
`storage.requests`	Counter	Total storage API requests
`storage.request.duration`	Histogram	Request duration (seconds)
`storage.sdk.requests`	Counter	Requests to the storage backend SDK
`storage.sdk.request.duration`	Histogram	Backend SDK call duration
`storage.enumeration.items`	Histogram (U64)	Items returned per enumeration (list, list_stat, enumerate, enumerate_versions)
`storage.read.redirects`	Counter	Redirect URLs returned for read operations
`storage.read.chunk.size`	Histogram (U64)	Chunk size from backend (bytes)
`storage.read.object.size`	Histogram (U64)	Total size of read objects (bytes)
`storage.write.operations`	Counter	Write operations by upload method (body, redirect, multipart)
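
Because Storage Service exposes these metrics through the Prometheus exporter, the dotted names above will typically not appear verbatim on the `/metrics` endpoint. The sketch below approximates the usual translation: characters that are invalid in Prometheus names become underscores, and counters gain a `_total` suffix. The helper name `prometheus_name` is hypothetical, and unit suffixes (for example `_seconds`) that some exporter versions append are deliberately not modeled, so the actual scrape output should be treated as authoritative.

```python
import re


def prometheus_name(otel_name, is_counter=False):
    """Approximate the Prometheus series name for an OpenTelemetry instrument name."""
    # Prometheus metric names may only contain letters, digits, '_' and ':'.
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Monotonic counters are conventionally exposed with a `_total` suffix.
    if is_counter and not name.endswith("_total"):
        name += "_total"
    return name


print(prometheus_name("storage.requests", is_counter=True))  # storage_requests_total
print(prometheus_name("storage.request.duration"))           # storage_request_duration
```

Histograms are additionally exposed as `_bucket`, `_sum`, and `_count` series under the translated base name.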

gRPC server metrics

Metric name	Type	Description
`grpc.server.call.started`	Counter	Server calls started
`grpc.server.call.rcvd_total_compressed_message_size`	Histogram (U64)	Compressed bytes received per call
`grpc.server.call.sent_total_compressed_message_size`	Histogram (U64)	Compressed bytes sent per call
`grpc.server.call.duration`	Histogram (f64)	Call duration (seconds). Attributes: grpc.method, grpc.service, grpc.name, grpc.status

Metadata cache

Metric name	Type	Description
`storage_metadata_cache_cache_access`	Counter	Cache access count (attributes: pod, name)
`storage_metadata_cache_cache_miss`	Counter	Cache miss count
`storage_metadata_cache_entries`	Gauge	Current cache entry count
`storage_metadata_cache_size`	Gauge	Cache memory footprint (bytes)
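
A common use of the access and miss counters is a cache hit ratio. The sketch below assumes that `storage_metadata_cache_cache_access` counts every lookup (hits and misses together), which the table does not state explicitly; the function name `cache_hit_ratio` is illustrative.

```python
def cache_hit_ratio(access_count, miss_count):
    """Return the fraction of cache lookups served from cache, or None if there were none.

    Assumes `access_count` includes both hits and misses.
    """
    if access_count <= 0:
        return None
    return 1.0 - miss_count / access_count


print(cache_hit_ratio(200, 50))  # 0.75
print(cache_hit_ratio(0, 0))     # None
```

In practice the inputs would be counter increases over a time window rather than raw totals, so that the ratio reflects recent behavior.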

Metadata backends

S3: `storage_s3_metadata_list_objects_latency`, `storage_s3_metadata_get_object_latency`, `storage_s3_metadata_put_object_latency`, `storage_s3_metadata_delete_object_latency` (histograms).

DynamoDB: `storage_aws_dynamodb_metadata_put_item_latency`, `storage_aws_dynamodb_metadata_delete_item_latency`, `storage_aws_dynamodb_metadata_batch_get_item_latency`, `storage_aws_dynamodb_metadata_batch_get_item_request_items`, `storage_aws_dynamodb_metadata_query_latency`.

Azure Blob: `storage_azure_blob_metadata_list_blobs_latency`, `storage_azure_blob_metadata_get_blob_latency`, `storage_azure_blob_metadata_put_blob_latency`, `storage_azure_blob_metadata_delete_blob_latency`.

Azure Table: `storage_azure_table_metadata_query_latency`, `storage_azure_table_metadata_query_response_size`, `storage_azure_table_metadata_delete_latency`, `storage_azure_table_metadata_update_latency`, `storage_azure_table_metadata_insert_or_update_latency`.

HTTP status codes are also recorded for each backend call.

Pub/sub

SQS: `storage_sqs_delete_message_batch_latency`, `storage_sqs_delete_message_batch_batch_size`, `storage_sqs_delete_message_batch_active_requests`, `storage_sqs_receive_message_latency`, `storage_sqs_receive_message_batch_size`, `storage_sqs_receive_message_active_requests`, `storage_sqs_events_per_message`.

Azure Service Bus: `storage_azure_service_bus_receive_messages_latency`, `storage_azure_service_bus_receive_messages_batch_size`, `storage_azure_service_bus_receive_messages_active_requests`, `storage_azure_service_bus_complete_message_latency`, `storage_azure_service_bus_complete_message_active_requests`.

Notification client: `storage_notification_service_publish_latency`, `storage_notification_service_publish_active_requests`.

OAuth2: `storage_oauth2_client_credentials_provider_get_token_latency`, `storage_oauth2_client_credentials_provider_get_token_results`.

Event Aggregation Service

Event Aggregation Service emits the Notification Common gRPC server metrics, which it shares with Event Consumer Service and which are listed in the Event Consumer Service section below. It defines no additional service-specific metrics.

Event Consumer Service

Notification Common gRPC metrics (shared with Event Aggregation Service)

Metric name	Type	Description
`rpc.server.active_requests`	UpDownCounter	In-flight requests
`rpc.server.requests_per_rpc`	Histogram	Requests per RPC
`rpc.server.request_size`	Histogram	Request size
`rpc.server.response_size`	Histogram	Response size
`rpc.server.responses_per_rpc`	Histogram	Responses per RPC
`rpc.server.calls`	Counter	Total gRPC calls
`rpc.server.active_methods`	UpDownCounter	In-flight methods
`rpc.server.active_responses`	UpDownCounter	In-flight responses
`rpc.server.duration`	Histogram	RPC duration

Consumer metrics (Event Consumer Service only)

Metric name	Type	Description
`events_by_type_total`	Counter	Events processed by event type (label: event_type)
`events_processed_successfully_total`	Counter	Successfully processed events
`events_processing_failed_total`	Counter	Events that failed during processing
`event_processing_duration_ms`	Histogram	Time spent processing each event (ms)
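
The success and failure counters above can be combined into a failure ratio over a time window. The sketch below works on two samples of the counters and assumes that every processed event increments exactly one of `events_processed_successfully_total` or `events_processing_failed_total`; the function name and the `(successful, failed)` tuple layout are hypothetical.

```python
def failure_ratio(previous, current):
    """Fraction of events that failed between two counter samples.

    Each sample is a (successful, failed) tuple of cumulative counter values.
    Returns None when no events were processed or when a counter went backwards,
    which usually indicates a pod restart.
    """
    succeeded = current[0] - previous[0]
    failed = current[1] - previous[1]
    if succeeded < 0 or failed < 0:
        return None
    total = succeeded + failed
    if total == 0:
        return None
    return failed / total


print(failure_ratio((100, 5), (190, 15)))  # 0.1
```

Working on deltas rather than raw totals keeps a long-running pod's history from masking a recent spike in failures.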

SSE metrics (Event Consumer Service only)

Metric name	Type	Description
`sse.server.active_connections`	UpDownCounter	Active SSE connections
`sse.server.connections_total`	Counter	Total SSE connections established
`sse.server.connection_duration`	Histogram	Connection duration (ms)
`sse.server.events_sent_total`	Counter	Events sent via SSE
`sse.server.events_per_connection`	Histogram	Events sent per SSE connection
`sse.server.connection_errors_total`	Counter	SSE connection errors

Labels include endpoint and, where applicable, event_type.