Metrics#

Overview#

The Omniverse DGX Cloud Portal (version 1.2.0 and later) includes built-in OpenTelemetry (OTel) metrics export for session monitoring and observability. This page lists the exported metrics and describes the required configuration.

Metrics Exported#

Metric                   Description
-----------------------  ------------------------------------------------------------
sessions.active.count    Current number of active streaming sessions
sessions.start.count     Total number of sessions started
sessions.end.count       Total number of sessions ended (supports completion rate analysis and session lifecycle tracking)
sessions.duration        Session duration in seconds, with histogram buckets

Dimensional Data#

Each Portal Sample metric includes the following attributes for filtering and analysis:

Attribute                  Description
-------------------------  ------------------------------------------
session.id                 Unique session identifier
session.username           Name of the user who initiated the session
session.user               User ID
session.app                Name of the Kit App streamed
nvcf.function_id           NVIDIA Cloud Function ID
nvcf.function_version_id   NVIDIA Cloud Function Version
session.duration.seconds   Session duration

Prerequisites#

  • A Portal deployment running as a container in a Kubernetes cluster or on a standalone instance

  • An OTel collector instance available to your deployment, with TCP ports 4317 (gRPC) and 4318 (HTTP) open and accessible on the collector instance

  • Network connectivity between the Portal Sample and the collector instance

  • An observability platform such as Azure Monitor, Datadog, or Grafana Cloud
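The port and connectivity prerequisites above can be checked from the Portal host before any further configuration. The following is a minimal sketch, assuming bash and the coreutils timeout command are available; check_port and the COLLECTOR_HOST variable are hypothetical names used only for illustration and are not part of the Portal.

```shell
# Returns 0 if a TCP connection to HOST:PORT succeeds within 3 seconds.
# Relies on bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
check_port() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# COLLECTOR_HOST is a placeholder; set it to the collector's IP or DNS name.
if [ -n "${COLLECTOR_HOST:-}" ]; then
  for port in 4317 4318; do
    if check_port "$COLLECTOR_HOST" "$port"; then
      echo "port $port on $COLLECTOR_HOST is reachable"
    else
      echo "port $port on $COLLECTOR_HOST is NOT reachable"
    fi
  done
fi
```

A successful TCP connection only shows that the port is open; it does not confirm that an OTel collector is the process listening on it.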

OTel Config For Azure Monitor#

The OTel collector can be hosted on the same instance as the portal or on a separate host. Configure the collector to receive metrics from the portal, process them (add labels, batch), and forward them to Azure Monitor.

Create a directory for the OTel collector configuration and change into it:

mkdir observability
cd observability

Create the OTel collector config file:

touch otel-collector-config.yaml

Copy the following configuration into otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Add resource attributes to identify the source
  resource:
    attributes:
      - key: service.name
        value: "ov-dgxc-portal"
        action: upsert
      - key: service.version
        value: "1.0.0"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert

  # Batch processor for efficient export
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

  # Memory limiter to prevent OOM
  memory_limiter:
    limit_mib: 256
    check_interval: 1s

exporters:
  # Prints received metrics to the collector log for verification
  debug:
    verbosity: detailed
  # Forwards metrics to Application Insights using the full connection string
  azuremonitor:
    connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug, azuremonitor]

The azuremonitor exporter above reads the Application Insights connection string from the APPLICATIONINSIGHTS_CONNECTION_STRING environment variable. For additional information, see: https://opentelemetry.io/docs/collector/

Azure Monitor Setup#

Use an existing Azure Monitor/Application Insights instance, or create a new one. After creation, open the resource -> Overview -> JSON View and copy the ConnectionString value. It should resemble the following:

InstrumentationKey=xxxxxxxxxx;IngestionEndpoint=https://xxxxx.applicationinsights.azure.com/;LiveEndpoint=https://xxxxx.monitor.azure.com/;ApplicationId=xxxxxxx
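Because the connection string is a semicolon-separated list of key=value pairs, a quick check that the copied value contains the expected fields can catch truncated pastes early. This is a minimal sketch; get_conn_field is a hypothetical helper, and the sample value is the placeholder string shown above, not a real credential.

```shell
# Prints the value of one field from a semicolon-separated connection string.
# $1 = connection string, $2 = field name (for example InstrumentationKey)
get_conn_field() {
  printf '%s\n' "$1" | tr ';' '\n' | sed -n "s/^$2=//p"
}

# Placeholder value; replace it with the string copied from the JSON View.
conn='InstrumentationKey=xxxxxxxxxx;IngestionEndpoint=https://xxxxx.applicationinsights.azure.com/'

echo "key:      $(get_conn_field "$conn" InstrumentationKey)"
echo "endpoint: $(get_conn_field "$conn" IngestionEndpoint)"
```

An empty result for either field suggests the value was not copied in full.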

Create the OTel Collector#

Launch the collector as a Docker container so it can receive metrics from the Portal Sample and forward them to Azure Monitor:

docker run -d \
        --name otel-collector \
        -p 4317:4317 \
        -p 4318:4318 \
        -e APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxx;IngestionEndpoint=https://xxxx.in.applicationinsights.azure.com/;LiveEndpoint=https://xxxxxx.livediagnostics.monitor.azure.com/;ApplicationId=xxxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxxxx" \
        -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
        otel/opentelemetry-collector-contrib:latest
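As an optional variation on the command above, the connection string can be kept out of the shell history and process list by passing it through Docker's --env-file option. This is a sketch under assumptions: the file name otel-collector.env and the helper names write_env_file and start_collector are hypothetical, and the rest of the command mirrors the one shown above.

```shell
# Hypothetical file name; any path readable by the Docker client works.
ENV_FILE="${ENV_FILE:-otel-collector.env}"

# Writes the connection string to ENV_FILE, readable only by the current user.
# $1 = full connection string copied from Azure
write_env_file() {
  ( umask 077; printf 'APPLICATIONINSIGHTS_CONNECTION_STRING=%s\n' "$1" > "$ENV_FILE" )
}

# Same container launch as above, with --env-file replacing the inline -e flag.
start_collector() {
  docker run -d \
    --name otel-collector \
    -p 4317:4317 \
    -p 4318:4318 \
    --env-file "$ENV_FILE" \
    -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
    otel/opentelemetry-collector-contrib:latest
}
```

Call write_env_file with the connection string first, then start_collector. Docker reads each line of the env file literally, so the value should not be wrapped in quotes.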

Export Environment Variables#

Configure the portal to send built-in OTel metrics to your collector by setting these environment variables on the Portal Sample instance:

export OTEL_EXPORTER_OTLP_ENDPOINT="http://<IP_OF_OTEL_INSTANCE>:4317"
export OTEL_SERVICE_NAME="web-streaming-backend"

Note

The OTel exporter endpoint must use an IP address or a fully qualified DNS name.
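The endpoint requirement in the note can be checked with a small shell function before starting the portal. This is only a sketch of one possible interpretation: is_valid_otlp_endpoint is a hypothetical helper that accepts http(s)://HOST:PORT where HOST is an IPv4 address or a dotted DNS name, so single-label names such as localhost are rejected by design.

```shell
# Returns 0 if $1 looks like http(s)://HOST:PORT with HOST being either an
# IPv4 address or a DNS name that contains at least one dot.
is_valid_otlp_endpoint() {
  printf '%s\n' "$1" | grep -Eq \
    '^https?://([0-9]{1,3}(\.[0-9]{1,3}){3}|[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+):[0-9]+/?$'
}

if is_valid_otlp_endpoint "${OTEL_EXPORTER_OTLP_ENDPOINT:-}"; then
  echo "OTEL_EXPORTER_OTLP_ENDPOINT looks well-formed"
else
  echo "OTEL_EXPORTER_OTLP_ENDPOINT is unset or not in http(s)://HOST:PORT form"
fi
```

The check is purely syntactic; it does not resolve the name or contact the collector.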

Metrics Export Verification#

Check collector logs:

docker logs -f otel-collector
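With the debug exporter set to detailed verbosity, the collector log can be noisy. A small filter that keeps only lines mentioning the four session metrics makes it easier to spot them. This is a sketch; filter_session_metrics is a hypothetical helper and it makes no assumption about the exact log line format.

```shell
# Reads log text on stdin and prints only the lines that mention one of the
# exported session metric names.
filter_session_metrics() {
  grep -E 'sessions\.(active\.count|start\.count|end\.count|duration)'
}

# Example usage (stderr is merged in case the collector logs there):
#   docker logs otel-collector 2>&1 | filter_session_metrics
```

If the filter prints nothing after a session has been started, the portal is likely not reaching the collector.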

Test Metrics Export (Portal Sample)#

On the Portal Sample instance, run the test script, for example:

cd ov-dgxc-portal-sample/backend
poetry run test-metrics

Expected output:

Testing OpenTelemetry metrics...
Recording session start...
Incrementing active sessions...
Recording session end...
Decrementing active sessions...
Metrics recorded. Check your collector/backend for the data.
Waiting 10 seconds to ensure export...

To generate session activity from the Portal Sample, start a streaming session.

Confirm Telemetry on Azure Monitor#

In the Azure Portal, browse to Application Insights for the resource receiving telemetry. Open Monitoring -> Metrics and verify that custom metrics appear in the metrics dropdown.

../_images/portal_metrics_azure_1.png

../_images/portal_metrics_azure_2.png

Sample Azure Monitor Queries#

Active Sessions Monitoring:

customMetrics
| where name == "sessions.active.count"
| extend session_app = tostring(customDimensions.session_app)
| extend session_user = tostring(customDimensions.session_user)
| extend nvcf_function_id = tostring(customDimensions.nvcf_function_id)
| project timestamp, name, value, session_app, session_user, nvcf_function_id

Active Session Duration:

customMetrics
| where name == "sessions.duration"
| extend session_app = tostring(customDimensions.session_app)
| extend session_user = tostring(customDimensions.session_user)
| extend nvcf_function_id = tostring(customDimensions.nvcf_function_id)
| project timestamp, name, value, session_app, session_user, nvcf_function_id

Usage Trends over time:

customMetrics
| where name == "sessions.start.count"
| extend session_app = tostring(customDimensions.session_app)
| summarize session_starts = count() by bin(timestamp, 1h), session_app
| render timechart