Metrics#
Overview#
The Omniverse DGX Cloud Portal (version 1.2.0 and later) includes built-in OpenTelemetry (OTel) metrics export for session monitoring and observability. This page describes the exported metrics and the configuration required to collect them.
Metrics Exported#
| Metric | Description |
|---|---|
| sessions.active.count | Current number of active streaming sessions |
| sessions.start.count | Total number of sessions started |
| sessions.end.count | Total number of sessions ended; supports completion rate analysis and session lifecycle tracking |
| sessions.duration | Session duration in seconds, with histogram buckets |
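To illustrate how the four metrics relate to one another, the following is a minimal model written with the Python standard library only. It does not use the OpenTelemetry SDK and is not the Portal's implementation; the class name, method names, and bucket boundaries are hypothetical and chosen only for illustration.

```python
import bisect
from dataclasses import dataclass, field

# Hypothetical bucket upper bounds in seconds (assumption: the Portal's
# actual histogram boundaries are not listed in this documentation).
BUCKET_BOUNDS = [60, 300, 900, 3600]


@dataclass
class SessionMetrics:
    active: int = 0   # models sessions.active.count (goes up and down)
    started: int = 0  # models sessions.start.count (only increases)
    ended: int = 0    # models sessions.end.count (only increases)
    # One slot per bound plus a final overflow slot; models sessions.duration.
    duration_buckets: list = field(
        default_factory=lambda: [0] * (len(BUCKET_BOUNDS) + 1)
    )

    def session_started(self) -> None:
        self.started += 1
        self.active += 1

    def session_ended(self, duration_seconds: float) -> None:
        self.ended += 1
        self.active -= 1
        # bisect_left puts a value equal to a bound into that bound's bucket.
        index = bisect.bisect_left(BUCKET_BOUNDS, duration_seconds)
        self.duration_buckets[index] += 1


metrics = SessionMetrics()
metrics.session_started()
metrics.session_started()
metrics.session_ended(duration_seconds=120.0)
```

In this model, the active count always equals the number of starts minus the number of ends, which is a useful sanity check when comparing the exported counters.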
Dimensional Data#
Each Portal Sample metric includes the following attributes for filtering and analysis:
| Attribute | Description |
|---|---|
| session.id | Unique session identifier |
| session.username | Name of the user who initiated the session |
| session.user | User ID |
| session.app | Name of the Kit App streamed |
| nvcf.function_id | NVIDIA Cloud Function ID |
| nvcf.function_version_id | NVIDIA Cloud Function Version |
| session.duration.seconds | Session duration |
Prerequisites#
A Portal deployment running as a container in a Kubernetes cluster or on a standalone instance
An OTel collector instance available to your deployment, with TCP ports 4317 (gRPC) and 4318 (HTTP) open and accessible
Network connectivity between the Portal Sample and the collector instance
An observability platform such as Azure Monitor, Datadog, or Grafana Cloud
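The port and connectivity prerequisites can be checked from the Portal Sample host before configuring anything else. The sketch below is one possible approach using only the Python standard library; the function names are hypothetical, and a successful TCP connection only shows that the port is reachable, not that a collector is correctly configured behind it.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_collector(host: str) -> dict:
    """Check the OTLP gRPC (4317) and HTTP (4318) ports on the collector host."""
    return {port: port_is_open(host, port) for port in (4317, 4318)}
```

For example, `check_collector("<IP_OF_OTEL_INSTANCE>")`, with the placeholder replaced by your collector's address, returns a dictionary mapping each port to `True` or `False`.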
OTel Config For Azure Monitor#
The OTel collector can be hosted on the same instance as the portal or on a separate host. Configure the collector to receive metrics from the portal, process them (add labels, batch), and forward them to Azure Monitor.
Create a directory for the OTel collector configuration and change into it:
mkdir observability && cd observability
Create the OTel collector config file:
touch otel-collector-config.yaml
Copy the following configuration into otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  # Add resource attributes to identify the source
  resource:
    attributes:
      - key: service.name
        value: "ov-dgxc-portal"
        action: upsert
      - key: service.version
        value: "1.0.0"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
  # Batch processor for efficient export
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
  # Memory limiter to prevent OOM
  memory_limiter:
    limit_mib: 256
    check_interval: 1s
exporters:
  debug:
    verbosity: detailed
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug, azuremonitor]
The azuremonitor exporter above reads the Application Insights connection string from the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable. For more information, see https://opentelemetry.io/docs/collector/
Azure Monitor Setup#
Use an existing Azure Monitor/Application Insights instance, or create a new one. After creation, open the resource -> Overview -> JSON View and copy the ConnectionString value. It should resemble the following:
InstrumentationKey=xxxxxxxxxx;IngestionEndpoint=https://xxxxx.applicationinsights.azure.com/;LiveEndpoint=https://xxxxx.monitor.azure.com/;ApplicationId=xxxxxxx
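Because the connection string is a semicolon-separated list of `Key=Value` pairs, a copy-and-paste mistake can be caught before the string is handed to the collector. The following is a small sketch of such a check; the function names are hypothetical, and the choice of `InstrumentationKey` and `IngestionEndpoint` as the required keys is an assumption made for this example.

```python
def parse_connection_string(raw: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs into a dictionary."""
    parts = {}
    for segment in raw.strip().split(";"):
        if not segment:
            continue  # tolerate a trailing semicolon
        key, separator, value = segment.partition("=")
        if not separator:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip()] = value.strip()
    return parts


def missing_keys(parts: dict, required=("InstrumentationKey", "IngestionEndpoint")) -> list:
    """Return the required keys that are absent or empty (assumed key set)."""
    return [key for key in required if not parts.get(key)]


example = (
    "InstrumentationKey=xxxxxxxxxx;"
    "IngestionEndpoint=https://xxxxx.applicationinsights.azure.com/;"
    "LiveEndpoint=https://xxxxx.monitor.azure.com/;"
    "ApplicationId=xxxxxxx"
)
parsed = parse_connection_string(example)
```

Here `missing_keys(parsed)` returns an empty list, while a string truncated after the first pair would report `IngestionEndpoint` as missing.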
Create the OTel Collector#
Launch the collector as a Docker container so it can receive metrics from the Portal Sample and forward them to Azure Monitor:
docker run -d \
  --name otel-collector \
  -p 4317:4317 \
  -p 4318:4318 \
  -e APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx;IngestionEndpoint=https://xxxx.in.applicationinsights.azure.com/;LiveEndpoint=https://xxxxxx.livediagnostics.monitor.azure.com/;ApplicationId=xxxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxxxx" \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest
Export Environment Variables#
Configure the portal to send built-in OTel metrics to your collector by setting these environment variables on the Portal Sample instance:
export OTEL_EXPORTER_OTLP_ENDPOINT="http://<IP_OF_OTEL_INSTANCE>:4317"
export OTEL_SERVICE_NAME="web-streaming-backend"
Note
The OTel exporter endpoint must be an IP address or a fully qualified DNS name.
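A quick check of the endpoint value can catch an unreplaced placeholder or a missing port before the Portal starts. This is a minimal sketch using the Python standard library; the function name is hypothetical, and requiring an explicit port and an http or https scheme is an assumption based on the example above rather than a documented Portal rule.

```python
from urllib.parse import urlsplit


def validate_otlp_endpoint(endpoint: str) -> tuple:
    """Return (host, port) for a usable endpoint, or raise ValueError."""
    if "<" in endpoint or ">" in endpoint:
        raise ValueError("endpoint still contains a placeholder such as <IP_OF_OTEL_INSTANCE>")
    parts = urlsplit(endpoint)
    if parts.scheme not in ("http", "https"):
        raise ValueError("endpoint must start with http:// or https://")
    if not parts.hostname:
        raise ValueError("endpoint must include an IP address or DNS name")
    # Accessing .port raises ValueError on its own if the port is not numeric.
    if parts.port is None:
        raise ValueError("endpoint must include a port, for example 4317")
    return parts.hostname, parts.port
```

In practice the value would come from `os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]` after the variables above have been exported.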
Metrics Export Verification#
Check collector logs:
docker logs -f otel-collector
Test Metrics Export (Portal Sample)#
On the Portal Sample instance, run the test script (example):
cd ov-dgxc-portal-sample/backend
poetry run test-metrics
Expected output:
Testing OpenTelemetry metrics...
Recording session start...
Incrementing active sessions...
Recording session end...
Decrementing active sessions...
Metrics recorded. Check your collector/backend for the data.
Waiting 10 seconds to ensure export...
To generate session activity from the Portal Sample, start a streaming session.
Confirm Telemetry on Azure Monitor#
In the Azure Portal, browse to Application Insights for the resource receiving telemetry. Open Monitoring -> Metrics and verify that custom metrics appear in the metrics dropdown.
Sample Azure Monitor Queries#
Active Sessions Monitoring:
customMetrics
| where name == "sessions.active.count"
| extend session_app = tostring(customDimensions.session_app)
| extend session_user = tostring(customDimensions.session_user)
| extend nvcf_function_id = tostring(customDimensions.nvcf_function_id)
| project timestamp, name, value, session_app, session_user, nvcf_function_id
Active Session Duration:
customMetrics
| where name == "sessions.duration"
| extend session_app = tostring(customDimensions.session_app)
| extend session_user = tostring(customDimensions.session_user)
| extend nvcf_function_id = tostring(customDimensions.nvcf_function_id)
| project timestamp, name, value, session_app, session_user, nvcf_function_id
Usage Trends over time:
customMetrics
| where name == "sessions.start.count"
| extend session_app = tostring(customDimensions.session_app)
| summarize session_starts = count() by bin(timestamp, 1h), session_app
| render timechart
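For readers less familiar with KQL, the usage trends query groups records into one-hour bins per application and counts the records in each group. The sketch below reproduces that grouping in plain Python on made-up sample data; the function name, timestamps, and application names are hypothetical and exist only to illustrate the aggregation.

```python
from collections import Counter
from datetime import datetime


def hourly_session_starts(records) -> dict:
    """Count records per (hour, session_app), like bin(timestamp, 1h) with count()."""
    counts = Counter()
    for timestamp, session_app in records:
        # Truncate to the start of the hour, which is what a 1h bin does.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(hour, session_app)] += 1
    return dict(counts)


# Hypothetical sample data: (timestamp, session_app) pairs.
sample = [
    (datetime(2025, 1, 1, 9, 5), "app-a"),
    (datetime(2025, 1, 1, 9, 40), "app-a"),
    (datetime(2025, 1, 1, 10, 2), "app-a"),
    (datetime(2025, 1, 1, 9, 15), "app-b"),
]
trend = hourly_session_starts(sample)
```

Each key in `trend` corresponds to one point on the rendered timechart: an hour, an application, and the number of matching records in that hour.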