Observability (Metrics, Traces, Logs)
This document describes Helm configuration and metric definitions for each service. Storage Service includes the Storage Core component; Event Aggregation Service and Event Consumer Service use the shared Notification Common library for telemetry.
Part I – Helm configuration by service
Storage Service
| Helm / behavior | Details |
|---|---|
| Metrics | |
| Ports | gRPC: |
| Logging | |
| Traces | Not configured by default in Helm. OTLP tracing can be enabled by setting the appropriate OTEL environment variables (e.g. traces exporter and endpoint). |
| Logs | Logs go to stdout only. |
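As a rough illustration of the Traces row, the sketch below builds an environment variable set for OTLP trace export. `OTEL_TRACES_EXPORTER`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `OTEL_SERVICE_NAME` are standard OpenTelemetry variable names. Whether Storage Service honors each of them is an assumption, and the collector address and service name are hypothetical values, not taken from the chart.

```python
def otlp_trace_env(endpoint: str, service_name: str) -> dict[str, str]:
    """Return standard OpenTelemetry env vars for OTLP trace export."""
    if not endpoint.startswith(("http://", "https://")):
        raise ValueError("endpoint must be an http(s) URL")
    return {
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_SERVICE_NAME": service_name,
    }


def to_container_env(env: dict[str, str]) -> list[dict[str, str]]:
    """Convert the mapping to the name/value list used in Kubernetes container specs."""
    return [{"name": key, "value": value} for key, value in sorted(env.items())]


# Hypothetical collector address and service name, for illustration only.
env = otlp_trace_env("http://otel-collector:4317", "storage-service")
for entry in to_container_env(env):
    print(f"{entry['name']}={entry['value']}")
```

The resulting name/value pairs would typically be supplied through whatever extra-environment mechanism the chart exposes; this document does not say which values key that is.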
Event Aggregation Service
| Helm / behavior | Details |
|---|---|
| Telemetry | When |
| OTEL endpoints | |
| ConfigMap env | Values are passed into the pod as |
| Logging | |
Event Consumer Service
| Helm / behavior | Details |
|---|---|
| Telemetry | Same pattern as Event Aggregation Service: when |
| OTEL endpoints | |
| ConfigMap env | Same env vars as Event Aggregation Service ( |
Part II – Metric explanations by service
Storage Service
Common attributes include method, API version, storage backend, pod name, package version, and result.
| Metric name | Type | Description |
|---|---|---|
| | Counter | Total storage API requests |
| | Histogram | Request duration (seconds) |
| | Counter | Requests to the storage backend SDK |
| | Histogram | Backend SDK call duration |
| | Histogram (U64) | Items returned per enumeration (list, list_stat, enumerate, enumerate_versions) |
| | Counter | Redirect URLs returned for read operations |
| | Histogram (U64) | Chunk size from backend (bytes) |
| | Histogram (U64) | Total size of read objects (bytes) |
| | Counter | Write operations by upload method (body, redirect, multipart) |
gRPC server metrics
| Metric name | Type | Description |
|---|---|---|
| | Counter | Server calls started |
| | Histogram (U64) | Compressed bytes received per RPC |
| | Histogram (U64) | Compressed bytes sent per call |
| | Histogram (f64) | Call duration (seconds). Attributes: grpc.method, grpc.service, grpc.name, grpc.status |
Metadata cache
| Metric name | Type | Description |
|---|---|---|
| | Counter | Cache access count (attributes: pod, name) |
| | Counter | Cache miss count |
| | Gauge | Current cache entry count |
| | Gauge | Cache memory footprint (bytes) |
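The access and miss counters are most useful together, as a hit ratio. The sketch below shows that arithmetic. It assumes the access counter counts every lookup, misses included, which this document does not state explicitly. The function name and the sample values are made up for illustration.

```python
def cache_hit_ratio(accesses: int, misses: int) -> float:
    """Fraction of cache lookups served from the cache, from two cumulative counters."""
    if accesses < 0 or misses < 0:
        raise ValueError("counter values must be non-negative")
    if misses > accesses:
        raise ValueError("misses cannot exceed accesses")
    if accesses == 0:
        return 0.0  # no traffic yet; report 0 rather than divide by zero
    return (accesses - misses) / accesses


# Made-up sample values, not measurements from a real deployment.
print(cache_hit_ratio(accesses=1000, misses=250))  # 0.75
```

A falling hit ratio alongside a steady entry-count gauge can hint that the working set no longer fits the cache, though the gauges alone do not prove it.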
Metadata backends
S3: storage_s3_metadata_list_objects_latency, storage_s3_metadata_get_object_latency, storage_s3_metadata_put_object_latency, storage_s3_metadata_delete_object_latency (histograms).
DynamoDB: storage_aws_dynamodb_metadata_put_item_latency, storage_aws_dynamodb_metadata_delete_item_latency, storage_aws_dynamodb_metadata_batch_get_item_latency, storage_aws_dynamodb_metadata_batch_get_item_request_items, storage_aws_dynamodb_metadata_query_latency.
Azure Blob: storage_azure_blob_metadata_list_blobs_latency, storage_azure_blob_metadata_get_blob_latency, storage_azure_blob_metadata_put_blob_latency, storage_azure_blob_metadata_delete_blob_latency.
Azure Table: storage_azure_table_metadata_query_latency, storage_azure_table_metadata_query_response_size, storage_azure_table_metadata_delete_latency, storage_azure_table_metadata_update_latency, storage_azure_table_metadata_insert_or_update_latency.
The HTTP status of each backend call is also recorded, broken down by status code.
Pub/sub
SQS: storage_sqs_delete_message_batch_latency, storage_sqs_delete_message_batch_batch_size, storage_sqs_delete_message_batch_active_requests, storage_sqs_receive_message_latency, storage_sqs_receive_message_batch_size, storage_sqs_receive_message_active_requests, storage_sqs_events_per_message.
Azure Service Bus: storage_azure_service_bus_receive_messages_latency, storage_azure_service_bus_receive_messages_batch_size, storage_azure_service_bus_receive_messages_active_requests, storage_azure_service_bus_complete_message_latency, storage_azure_service_bus_complete_message_active_requests.
Notification client: storage_notification_service_publish_latency, storage_notification_service_publish_active_requests.
OAuth2: storage_oauth2_client_credentials_provider_get_token_latency, storage_oauth2_client_credentials_provider_get_token_results.
Event Aggregation Service
This service uses the same gRPC server metrics as Event Consumer Service (Notification Common), listed in the Event Consumer Service section. It does not define additional service-specific metrics.
Event Consumer Service
Consumer metrics (Event Consumer Service only)
| Metric name | Type | Description |
|---|---|---|
| | Counter | Events processed by event type (label: event_type) |
| | Counter | Successfully processed events |
| | Counter | Events that failed during processing |
| | Histogram | Time spent processing each event (ms) |
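Because the success and failure counters are cumulative, a failure ratio for a given period comes from the difference between two readings rather than from the raw values. The following is a minimal sketch of that calculation. The `ConsumerSnapshot` type, its field names, and the sample numbers are hypothetical, and the reset handling is one possible policy, not necessarily what a given monitoring backend does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumerSnapshot:
    """Cumulative counter values read at one scrape (field names are hypothetical)."""
    succeeded: int
    failed: int


def failure_ratio(prev: ConsumerSnapshot, curr: ConsumerSnapshot) -> float:
    """Share of events that failed between two scrapes."""
    d_ok = curr.succeeded - prev.succeeded
    d_bad = curr.failed - prev.failed
    if d_ok < 0 or d_bad < 0:
        # A counter went backwards, e.g. the pod restarted; treat curr as a fresh start.
        d_ok, d_bad = curr.succeeded, curr.failed
    total = d_ok + d_bad
    return d_bad / total if total else 0.0


# Made-up readings: 480 successes and 20 failures happened between the two scrapes.
earlier = ConsumerSnapshot(succeeded=900, failed=100)
later = ConsumerSnapshot(succeeded=1380, failed=120)
print(failure_ratio(earlier, later))
```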
SSE metrics (Event Consumer Service only)
| Metric name | Type | Description |
|---|---|---|
| | UpDownCounter | Active SSE connections |
| | Counter | Total SSE connections established |
| | Histogram | Connection duration (ms) |
| | Counter | Events sent via SSE |
| | Histogram | Events sent per SSE connection |
| | Counter | SSE connection errors |
Labels include endpoint and, where applicable, event_type.
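The difference between the UpDownCounter and the Counter in the SSE table is easiest to see in a small simulation. The sketch below is an in-memory stand-in, not the service's instrumentation. The class and attribute names are hypothetical, and it assumes the duration and per-connection histograms are recorded once, when a connection closes.

```python
class SseConnectionStats:
    """In-memory stand-in for the SSE instruments (names are hypothetical)."""

    def __init__(self) -> None:
        self.active = 0  # UpDownCounter: rises on connect, falls on disconnect
        self.total = 0   # Counter: only ever rises
        self.durations_ms: list[float] = []         # histogram samples
        self.events_per_connection: list[int] = []  # histogram samples

    def opened(self) -> None:
        self.active += 1
        self.total += 1

    def closed(self, duration_ms: float, events_sent: int) -> None:
        if self.active == 0:
            raise RuntimeError("closed() called with no active connection")
        self.active -= 1
        self.durations_ms.append(duration_ms)
        self.events_per_connection.append(events_sent)


stats = SseConnectionStats()
stats.opened()
stats.opened()
stats.closed(duration_ms=1500.0, events_sent=12)
print(stats.active, stats.total)  # 1 2
```

After two connections open and one closes, the active value drops back to 1 while the total stays at 2, which is why the active-connections metric needs an instrument that can decrease.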