Advanced Configuration#

This section describes implementing health checks and automated Vector restarts to improve application reliability and tracking. It also shows how to retrieve a streaming session identifier to tie a user to their application invocation logs.

Vector Health Check#

Create a new Health Check script:

nano vector_health_monitor.sh

Script overview#

  • PID-Based Monitoring: Uses process ID tracking to monitor Vector’s health status.

  • Automatic Restart: Restarts failed Vector processes automatically.

  • Health Check Loop: Continuously monitors Vector every 30 seconds using kill -0.

  • Startup Validation: Ensures Vector stabilizes for 15 seconds after restart before marking healthy.

  • Dual Logging: Outputs to both stdout and /tmp/kit_structured_logs.log.
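The PID-based check in the overview relies on `kill -0`, which delivers no signal and only reports whether the process exists and can be signaled. A minimal sketch of that check, using a background `sleep` as a stand-in for Vector (the `is_running` helper and `demo_pid` variable are hypothetical names, not part of the script):

```shell
#!/bin/bash
# Returns 0 if a process with the given PID exists and can be signaled.
# kill -0 performs only the existence/permission check; no signal is sent.
is_running() {
    kill -0 "$1" 2>/dev/null
}

# Stand-in for Vector: a background process we control (assumption for the demo)
sleep 30 &
demo_pid=$!

if is_running "$demo_pid"; then
    echo "PID $demo_pid is running"
fi

# Terminate the stand-in and reap it so the PID is really gone
kill "$demo_pid"
wait "$demo_pid" 2>/dev/null

if ! is_running "$demo_pid"; then
    echo "PID $demo_pid is no longer running"
fi
```

Note that this confirms the process is alive, not that it is processing logs; a process that is hung but still present would pass.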

Configuration variables (defaults shown):

  • VECTOR_HEALTH_CHECK_INTERVAL = 30 (seconds): health check frequency

  • VECTOR_MAX_RESTART_ATTEMPTS = 3: maximum restart attempts before giving up

  • VECTOR_RESTART_COOLDOWN = 60 (seconds): period after a restart attempt during which health checks are skipped

  • VECTOR_CONFIG_PATH = "/tmp/vector.toml": path to the Vector config

  • VECTOR_BINARY_PATH = "/opt/vector/bin/vector": path to the Vector executable
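The script sets these values as fixed assignments. If you would rather override them from the container environment, one possible approach (an assumption, not something the script does as written) is bash default-value expansion, which keeps the documented defaults when a variable is unset or empty:

```shell
#!/bin/bash
# Keep any value already present in the environment; otherwise use the documented default.
VECTOR_HEALTH_CHECK_INTERVAL="${VECTOR_HEALTH_CHECK_INTERVAL:-30}"
VECTOR_MAX_RESTART_ATTEMPTS="${VECTOR_MAX_RESTART_ATTEMPTS:-3}"
VECTOR_RESTART_COOLDOWN="${VECTOR_RESTART_COOLDOWN:-60}"
VECTOR_CONFIG_PATH="${VECTOR_CONFIG_PATH:-/tmp/vector.toml}"
VECTOR_BINARY_PATH="${VECTOR_BINARY_PATH:-/opt/vector/bin/vector}"

# Print the effective settings so they show up in the container logs
echo "interval=${VECTOR_HEALTH_CHECK_INTERVAL}s attempts=${VECTOR_MAX_RESTART_ATTEMPTS} cooldown=${VECTOR_RESTART_COOLDOWN}s"
echo "config=${VECTOR_CONFIG_PATH} binary=${VECTOR_BINARY_PATH}"
```

These lines would replace the five plain assignments at the top of the script; the rest of the script reads the same variable names.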

Health monitor script (copy into vector_health_monitor.sh):

#!/bin/bash

# Vector Health Monitor - PID-based health check service
# This script runs independently and monitors Vector's health via process checking

# Health check configuration
VECTOR_HEALTH_CHECK_INTERVAL=30  # Check every 30 seconds
VECTOR_MAX_RESTART_ATTEMPTS=3    # Maximum restarts before giving up
VECTOR_RESTART_COOLDOWN=60       # Wait 60 seconds between restarts
VECTOR_CONFIG_PATH="/tmp/vector.toml"
VECTOR_BINARY_PATH="/opt/vector/bin/vector"

# Logging function - outputs to both stdout and log file
log() {
    local message="[$(date '+%Y-%m-%d %H:%M:%S')] [HealthMonitor] $1"
    echo "$message"
    echo "$message" >> /tmp/kit_structured_logs.log
}

# Vector health check function - PID-based
vector_health_check() {
    local vector_pid=$(get_vector_pid)

    if [ -n "$vector_pid" ]; then
        # kill -0 sends no signal; it only confirms the process exists and can be signaled
        if kill -0 "$vector_pid" 2>/dev/null; then
            log "Vector health check PASSED - PID $vector_pid is running"
            return 0
        else
            log "Vector health check FAILED - PID $vector_pid is not responding"
            return 1
        fi
    else
        log "Vector health check FAILED - No Vector process found"
        return 1
    fi
}

# Get Vector PID
get_vector_pid() {
    pgrep -f "vector --config $VECTOR_CONFIG_PATH" | head -1
}

# Stop Vector process
stop_vector() {
    local vector_pid=$(get_vector_pid)

    if [ -n "$vector_pid" ]; then
        log "Stopping Vector process (PID: $vector_pid)"
        kill "$vector_pid" 2>/dev/null
        sleep 5

        # Force kill if still running
        if kill -0 "$vector_pid" 2>/dev/null; then
            log "Force killing Vector process"
            kill -9 "$vector_pid" 2>/dev/null
        fi

        log "Vector process stopped"
    else
        log "No Vector process found to stop"
    fi
}

# Start Vector process
start_vector() {
    if [ ! -f "$VECTOR_CONFIG_PATH" ]; then
        log "ERROR: Vector config file not found at $VECTOR_CONFIG_PATH"
        return 1
    fi

    if [ ! -x "$VECTOR_BINARY_PATH" ]; then
        log "ERROR: Vector binary not found or not executable at $VECTOR_BINARY_PATH"
        return 1
    fi

    log "Starting Vector process..."
    "$VECTOR_BINARY_PATH" --config "$VECTOR_CONFIG_PATH" &
    local new_pid=$!

    log "Vector started with PID: $new_pid"

    # Wait for Vector to start up and stabilize
    log "Waiting for Vector to stabilize..."
    local startup_timeout=15
    local elapsed=0

    # Require the process to stay alive for the whole startup window
    while [ $elapsed -lt $startup_timeout ]; do
        if ! kill -0 "$new_pid" 2>/dev/null; then
            log "Vector startup failed or process died during startup"
            return 1
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done

    log "Vector startup successful and stable"
    return 0
}

# Vector restart function
restart_vector() {
    local restart_count="$1"

    log "Attempting to restart Vector (attempt $restart_count/$VECTOR_MAX_RESTART_ATTEMPTS)"

    # Stop Vector
    stop_vector

    # Start Vector
    if start_vector; then
        log "Vector restart successful"
        return 0
    else
        log "Vector restart failed"
        return 1
    fi
}

# Wait for Vector to be initially available
wait_for_vector() {
    log "Waiting for Vector to be available..."
    local wait_timeout=60
    local elapsed=0

    while [ $elapsed -lt $wait_timeout ]; do
        local vector_pid=$(get_vector_pid)
        if [ -n "$vector_pid" ] && kill -0 "$vector_pid" 2>/dev/null; then
            log "Vector is available (PID: $vector_pid), starting health monitoring"
            return 0
        fi
        sleep 2
        elapsed=$((elapsed + 2))
    done

    log "Vector is not available after $wait_timeout seconds"
    return 1
}

# Main monitoring loop
monitor_vector_health() {
    local restart_count=0
    local last_restart_time=0

    log "Starting Vector health monitoring - PID-based (checking every ${VECTOR_HEALTH_CHECK_INTERVAL}s)"

    while true; do
        sleep $VECTOR_HEALTH_CHECK_INTERVAL

        # Skip health check if Vector was just restarted
        local current_time=$(date +%s)
        if [ $((current_time - last_restart_time)) -lt $VECTOR_RESTART_COOLDOWN ]; then
            log "Skipping health check due to recent restart cooldown"
            continue
        fi

        if ! vector_health_check; then
            log "Vector health check failed!"

            # Check if we have exceeded max restart attempts
            if [ $restart_count -ge $VECTOR_MAX_RESTART_ATTEMPTS ]; then
                log "CRITICAL: Maximum restart attempts ($VECTOR_MAX_RESTART_ATTEMPTS) exceeded!"
                log "Vector health monitoring disabled. Manual intervention required."
                # Continue monitoring but don't restart
                sleep 300  # Wait 5 minutes before next check
                continue
            fi

            restart_count=$((restart_count + 1))
            last_restart_time=$(date +%s)

            if restart_vector $restart_count; then
                log "Vector restart successful"
                # Reset restart count on successful restart
                restart_count=0
            else
                log "Vector restart failed"
            fi
        else
            # Reset restart count on successful health check
            if [ $restart_count -gt 0 ]; then
                log "Vector health restored, resetting restart counter"
                restart_count=0
            fi
        fi
    done
}

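# Signal handlers
cleanup() {
    log "Health monitor shutting down..."
}

# Log once on any exit. INT/TERM exit cleanly and fall through to the EXIT trap.
# cleanup must not call exit itself, or it would turn the exit 1 below into exit 0.
trap cleanup EXIT
trap 'exit 0' INT TERM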

# Main execution
log "Vector Health Monitor starting (PID-based monitoring)..."

# Wait for Vector to be available initially
if ! wait_for_vector; then
    log "ERROR: Vector is not available for initial health check"
    exit 1
fi

# Start health monitoring
monitor_vector_health
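The restart bookkeeping in monitor_vector_health can be traced without running Vector. The sketch below replays a sequence of health results through the same counter rules, assuming every restart attempt itself fails (the worst case). The names `simulate` and `MAX_ATTEMPTS` are hypothetical and not part of the script:

```shell
#!/bin/bash
MAX_ATTEMPTS=3   # mirrors VECTOR_MAX_RESTART_ATTEMPTS

# Replays health results ("ok" or "fail") and prints the action the loop would take.
simulate() {
    local restart_count=0 result
    for result in "$@"; do
        if [ "$result" = "ok" ]; then
            restart_count=0                        # a passing check resets the counter
            echo "healthy"
        elif [ "$restart_count" -ge "$MAX_ATTEMPTS" ]; then
            echo "give-up"                         # budget exhausted, no restart attempted
        else
            restart_count=$((restart_count + 1))
            echo "restart-$restart_count"          # restart attempted (assumed to fail)
        fi
    done
}

simulate fail fail fail fail ok fail
```

With this input the sketch prints restart-1, restart-2, restart-3, give-up, healthy, restart-1: the monitor stops restarting after three consecutive failed attempts and resumes only once a health check passes again.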

Include the Health Check Monitor in the Entrypoint#

Update entrypoint_vector_dev.sh to start Vector independently and launch the health monitor (example below).

Edit the entrypoint:

nano entrypoint_vector_dev.sh

Replace or update with the following (copy into entrypoint_vector_dev.sh):

#!/bin/bash

echo "[entrypoint_vector_dev.sh] Starting container..."

# Set required environment variables for Kit
export USER="ubuntu"
export LOGNAME="ubuntu"

# Check if Vector OTEL processing is enabled
if [ "$VECTOR_OTEL_ACTIVE" = "TRUE" ]; then
    echo "[Vector] Vector OTEL processing is ENABLED (VECTOR_OTEL_ACTIVE=TRUE)"
    echo "[Vector] Health check setting: VECTOR_HEALTH_CHECK=${VECTOR_HEALTH_CHECK:-"(not set)"}"

    # Create log files if they do not exist
    echo "[Vector] Setting up log files..."
    touch /tmp/kit_structured_logs.log
    chmod 666 /tmp/kit_structured_logs.log

    # Validate OTEL endpoint
    if [ -z "$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT" ]; then
        echo "[Vector] ERROR: OTEL_EXPORTER_OTLP_LOGS_ENDPOINT is not set!"
        exit 1
    fi

    if [[ ! "$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT" =~ ^https?:// ]]; then
        echo "[Vector] ERROR: Invalid OTEL endpoint format. Must start with http:// or https://"
        exit 1
    fi

    echo "[Vector] Using OTEL endpoint: $OTEL_EXPORTER_OTLP_LOGS_ENDPOINT"

    # Determine which Vector configuration to use
    if [ ! -z "$VECTOR_CONF_B64" ]; then
        echo "[Vector] Custom Vector configuration provided via VECTOR_CONF_B64"
        echo "[Vector] Decoding and using customer-provided configuration..."

        # Decode Vector config
        echo "$VECTOR_CONF_B64" | base64 -d > /tmp/vector_raw.toml

        # Replace OTEL endpoint
        sed "s|PLACEHOLDER_OTEL_ENDPOINT|$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT|g" /tmp/vector_raw.toml > /tmp/vector.toml

        echo "[Vector] Using CUSTOM Vector configuration (from VECTOR_CONF_B64)"
    else
        echo "[Vector] No custom configuration provided. Using static/default Vector configuration..."

        # Copy static configuration and replace OTEL endpoint
        cp /opt/vector/static_config.toml /tmp/vector_raw.toml
        sed "s|PLACEHOLDER_OTEL_ENDPOINT|$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT|g" /tmp/vector_raw.toml > /tmp/vector.toml

        echo "[Vector] Using STATIC Vector configuration (from /opt/vector/static_config.toml)"
    fi

    # Show the first few lines of the config for debug
    echo "[Vector] First 10 lines of /tmp/vector.toml:" && head -n 10 /tmp/vector.toml

    # Validate Vector config
    echo "[Vector] Verifying Vector configuration..."
    if [ -x "/opt/vector/bin/vector" ]; then
        /opt/vector/bin/vector validate /tmp/vector.toml
        if [ $? -ne 0 ]; then
            echo "[Vector] ERROR: Vector configuration validation failed!"
            exit 1
        fi
    fi

    # Test OTEL endpoint connectivity
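    # Strip the scheme and any path, leaving host[:port]
    OTEL_HOSTPORT=$(echo "$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT" | sed -E 's|^https?://||' | cut -d'/' -f1)
    OTEL_HOST=${OTEL_HOSTPORT%%:*}
    OTEL_PORT=${OTEL_HOSTPORT##*:}
    if [ "$OTEL_PORT" = "$OTEL_HOSTPORT" ]; then
        # No explicit port in the endpoint: fall back to the URL scheme default
        case "$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT" in
            https://*) OTEL_PORT=443 ;;
            *)         OTEL_PORT=80 ;;
        esac
    fi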
    echo "[Vector] Testing connectivity to $OTEL_HOST:$OTEL_PORT"
    if command -v nc >/dev/null 2>&1; then
        timeout 5 nc -zv "$OTEL_HOST" "$OTEL_PORT" && echo "[Vector] Network connectivity: SUCCESS" || echo "[Vector] Network connectivity: FAILED"
    fi

    # Start Vector as completely independent background process
    echo "[Vector] Starting Vector as independent background process..."
    # Run Vector in background; stdout carries the transformed logs, stderr is discarded
    /opt/vector/bin/vector --config /tmp/vector.toml 2>/dev/null &
    VECTOR_PID=$!
    echo "[Vector] Vector started independently with PID: $VECTOR_PID"

    # Start health monitor as completely independent background process (if enabled)
    if [ "$VECTOR_HEALTH_CHECK" = "TRUE" ]; then
        echo "[Vector] Health check is ENABLED (VECTOR_HEALTH_CHECK=TRUE)"
        if [ -x "/vector_health_monitor.sh" ]; then
            echo "[Vector] Starting health monitor as independent background process..."
            nohup /vector_health_monitor.sh 2>&1 &
            HEALTH_MONITOR_PID=$!
            echo "[Vector] Health monitor started independently with PID: $HEALTH_MONITOR_PID"
        else
            echo "[Vector] Health monitor not available at /vector_health_monitor.sh"
        fi
    else
        echo "[Vector] Health check is DISABLED (VECTOR_HEALTH_CHECK not set to TRUE)"
        echo "[Vector] Vector will run without health monitoring"
    fi

    # Give Vector a moment to start up
    sleep 2

    echo "[Vector] Starting Kit application - completely independent of Vector..."
    echo "[Vector] Kit app success/failure will not be affected by Vector issues"

    # Run Kit app completely independently - pipe logs to file for Vector to pick up

    # Kit app exit code is what matters for the container
    stdbuf -oL /entrypoint.sh 2>&1 | stdbuf -oL tee -a /tmp/kit_structured_logs.log

    # Capture Kit app exit code (first pipeline stage; plain $? would be tee's status)
    KIT_EXIT_CODE=${PIPESTATUS[0]}

    echo "[Vector] Kit application completed with exit code: $KIT_EXIT_CODE"
    echo "[Vector] Vector and health monitor continue running independently"

    # Exit with Kit's exit code - Vector issues do not affect this
    exit $KIT_EXIT_CODE
else
    echo "[Vector] Vector OTEL processing is DISABLED (VECTOR_OTEL_ACTIVE=FALSE or not set)"
    echo "[Vector] Running Kit without log processing."
    exec /entrypoint.sh
fi
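The entrypoint pipes Kit's output through tee. In bash, `$?` after a pipeline holds the exit status of the last stage, while the `PIPESTATUS` array holds one status per stage, so the application's own status is `${PIPESTATUS[0]}`. A minimal sketch, using `false` as a hypothetical stand-in for a Kit entrypoint that exits non-zero:

```shell
#!/bin/bash
# 'false' stands in for an application that fails; tee succeeds regardless.
false | tee /dev/null

# Copy immediately: the next command overwrites PIPESTATUS
statuses=("${PIPESTATUS[@]}")

echo "first stage (application): ${statuses[0]}"
echo "last stage (tee):          ${statuses[1]}"
```

Here the application status is 1 and the tee status is 0, which is why the container should exit with the first element rather than with the pipeline's overall status.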

Dockerfile Updates (include health check script)#

Update your Dockerfile to copy the new health monitor and install netcat for connectivity tests. Here is an example Dockerfile snippet:

FROM kit_app_template:latest

USER root

RUN apt-get update && \
    apt-get install -y curl netcat-openbsd && \
    mkdir -p /opt/vector && \
    curl -L https://packages.timber.io/vector/0.46.1/vector-0.46.1-x86_64-unknown-linux-gnu.tar.gz -o /tmp/vector.tar.gz && \
    tar -xzf /tmp/vector.tar.gz -C /opt/vector --strip-components=2 && \
    rm -rf /tmp/vector*

RUN mkdir -p /logs

# Ensure ubuntu home directory exists for NVCF compatibility (user already exists in base image)
RUN mkdir -p /home/ubuntu && \
    chown -R ubuntu:ubuntu /home/ubuntu

# Create Vector data directory and give ubuntu user access
RUN mkdir -p /var/lib/vector && \
    chown -R ubuntu:ubuntu /var/lib/vector

COPY entrypoint_vector_dev.sh /entrypoint_vector_dev.sh
COPY vector_health_monitor.sh /vector_health_monitor.sh
COPY vector.toml /opt/vector/static_config.toml
RUN chmod +x /entrypoint.sh /entrypoint_vector_dev.sh /vector_health_monitor.sh

# Switch back to ubuntu user for runtime
USER ubuntu

ENTRYPOINT ["/entrypoint_vector_dev.sh"]

Build the Kit Vector Container#

Verify the required files exist in the working directory:

ls -la

You should see the following files (example):

total 28
drwxrwxr-x 2 horde horde 4096 Jan 15 14:32 .
drwxrwxr-x 8 horde horde 4096 Jan 15 14:15 ..
-rw-rw-r-- 1 horde horde 1284 Jan 15 14:28 Dockerfile
-rwxrwxr-x 1 horde horde 4856 Jan 15 14:22 entrypoint_vector_dev.sh
-rw-rw-r-- 1 horde horde 6042 Jan 15 14:24 vector_health_monitor.sh
-rw-rw-r-- 1 horde horde 2847 Jan 15 14:25 vector.toml

Build your enhanced Kit container with Vector integration:

docker build -t byoo_kit_vector:latest .

Push the container and create the function following the Container to Function (DRAFT) instructions.

Environment Variables#

Vector log processing and health check behavior are controlled by the following environment variables:

  • VECTOR_OTEL_ACTIVE (TRUE | FALSE or not set): When TRUE, the container uses Vector for log processing and forwarding to the NVCF collector. When FALSE or unset, the container bypasses Vector and runs Kit directly via /entrypoint.sh.

  • VECTOR_CONF_B64 (base64-encoded string): Provides a custom Vector configuration. If provided, the entrypoint decodes and uses it; otherwise the default static vector.toml is used.

  • VECTOR_HEALTH_CHECK (TRUE | FALSE or not set): When TRUE, enables Vector health monitoring with automatic restart. When FALSE or unset, Vector runs without health monitoring (no automatic recovery).
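To illustrate how a VECTOR_CONF_B64 value relates to the file the entrypoint ends up using, the sketch below encodes a small file, then decodes it and substitutes PLACEHOLDER_OTEL_ENDPOINT the same way the entrypoint does. The file content is a demo fragment rather than a valid Vector configuration, and the endpoint value and scratch paths are placeholders chosen for the example:

```shell
#!/bin/bash
# Work in a scratch directory so nothing outside it is touched
workdir=$(mktemp -d)

# Demo fragment only; a real value would be your complete vector.toml
cat > "$workdir/vector.toml" <<'EOF'
# demo fragment, not a complete Vector config
example_endpoint = "PLACEHOLDER_OTEL_ENDPOINT"
EOF

# Encode as a single line, suitable for an environment variable
VECTOR_CONF_B64=$(base64 < "$workdir/vector.toml" | tr -d '\n')

# Placeholder endpoint for the demo
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT="http://collector.example:4318"

# Decode and substitute, mirroring the entrypoint's two steps
echo "$VECTOR_CONF_B64" | base64 -d > "$workdir/vector_raw.toml"
sed "s|PLACEHOLDER_OTEL_ENDPOINT|$OTEL_EXPORTER_OTLP_LOGS_ENDPOINT|g" \
    "$workdir/vector_raw.toml" > "$workdir/vector_final.toml"

cat "$workdir/vector_final.toml"
```

Because the substitution uses `|` as the sed delimiter, endpoints containing `/` work unchanged, but an endpoint containing `|` or `&` would need escaping.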

Get Session UUID#

To tie a user to their application invocation logs, use the Portal Sample or application-provided session identifiers (for example, session.id) and include them as log attributes for correlation.

Sample KQL query for troubleshooting:

AppTraces
| where Properties.function_id == "xxxxxxxxxxxx"