Logging Guidelines for Carbonite and Kit Applications#

Overview#

This document provides practical logging guidelines for developers working on Carbonite and Kit-based applications. Good logging is essential for troubleshooting production issues, especially in environments where reproduction and live debugging are not feasible. The goal is to make it possible to root-cause issues from logs alone, while keeping the log volume manageable and any performance impact minimal.

Logging functionality will always be enabled by default in all app builds, including production builds. Because of this, it is important to consider how much logging a given system or app will emit, whether it is part of a desktop app, streaming app, headless app, or cloud service. Excessive logging can also degrade app or service performance if not used appropriately. Effective use of log levels, collecting information in memory to log later, and ensuring log messages include enough information to be a useful postmortem debugging tool are among the best practices outlined here.

Log messages are intended for developers as their audience. It can be assumed that in most cases end users will neither see log messages nor have a need to see them. At least in the case of services, an end user will never see log messages from the service. Service log output must be written with this assumption in mind - end users will never be able to access log content for the service instance they interacted with and will therefore never be able to provide it to developers when a problem occurs. At best, the end user may get a simple failure message or error code related to their failure. It is the developer's responsibility to correlate that failed instance to a set of log messages that have been collected internally. Note that this applies not only to NVIDIA hosted services, but also to BYOC (Bring Your Own Cluster) or custom deployment services.

Table of Contents#

  1. Log Levels and When to Use Them

  2. Using Logging Macros and Functions

  3. What to Include in Log Messages

  4. Managing Log Volume

  5. Performance Considerations

  6. Common Patterns and Best Practices

  7. Examples: Bad vs Good Logging

  8. Configuration Guidelines

  9. Debugging with Logs

Log Levels and When to Use Them#

FATAL#

  • Use for: Conditions that require an immediate halt to the entire app or service. This level should only be used for conditions where all forward progress in the process must be stopped immediately or it is not safe to continue. This condition must be truly fatal and unrecoverable to the entire app, not just a single system or component of it.

  • Examples: crashes, system call failures that cannot be recovered from in any way, out of memory situations that cannot be avoided and prevent further processing from occurring, etc.

  • Rate: Should be used as sparingly as possible. All error conditions in the app should try their best to recover from failures, including system call failures, as gracefully as possible. In the worst case, if an error situation cannot be recovered from and the failure makes it either impossible or unsafe for any other portion of the app or service to continue, a fatal log message detailing the problem should be output, followed by aborting the process.

  • Notes:

    • Emitting a fatal log message will not automatically terminate or abort the process. This must be done manually in the process since that is typically the only remaining course of action after a fatal failure occurs and the log message is emitted.

    • If the crash reporter plugin is enabled and configured for the app, aborting or terminating the process will result in a crash report being generated and potentially uploaded. This would often be a desirable action to take when a fatal condition is encountered.
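
For illustration, a minimal sketch of handling a truly fatal condition (the variable names are hypothetical): emit the fatal message with as much context as possible, then terminate the process manually, since the log macro itself will not do so. std::abort() is used here as the termination call.

// A required GPU context could not be created and no fallback exists, so no
// part of the process can make forward progress.  Log the full context of the
// failure, then abort.  If the crash reporter plugin is enabled, aborting here
// will also generate a crash report.
if (gpuContext == nullptr)
{
    CARB_LOG_FATAL("Failed to create a GPU context and no fallback is available: "
                   "device='%s', driverVersion='%s', error=0x%08x",
                   deviceName, driverVersion, errorCode);
    std::abort();
}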

ERROR#

  • Use for: Conditions that impact functionality, crashes in helper child processes, non-continuable assertion failures. In most cases, error log messages should only be used when a truly unexpected failure is encountered that cannot be recovered from, but is not seen as fatal to the entire process. These will often be failures that are in no way due to user, client, or caller input errors, but rather cases where something that should have otherwise worked as expected failed. This level may be used in cases where the failure stops progress in only a handful of components of the app, while other components can continue functioning. In other words, the process or service is still in a healthy state but has experienced an unexpected problem that caused an operation to fail. In this case, whether to continue the process should be left as a decision for the user(s).

  • Examples: Unrecoverable file corruption, network timeouts that break functionality, assertion failures, unexpected system call failures, exhausted system resources (ie: run out of file descriptors, file watch handles, synchronization primitives, etc).

  • Rate: Should be low enough to always be available in production. At default log level configurations, error log messages will always be output.

  • Notes:

    • Any condition that could result in an unexpected failure must log an error message explaining the failure reason and any known conditions that may have led to it. This includes system call failures, unexpected process exits, allocation failures, assertion failures, etc.

    • When available, any associated system level error codes (ie: errno, GetLastError(), returned error/status codes) must be included in the error log message.

    • For conditions such as allocation failures, file corruption, etc, any associated information such as sizes, file paths, bytes written, etc must be included with the error log message.

    • In general, if the process or service has the possibility of continuing with the same inputs just at a later time with no additional changes in state, consider using a warning message instead of an error.
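
As a hedged sketch of these notes (the names and Result values are illustrative), an unexpected system call failure should be logged with the system error code and all associated context:

// this write should never fail under normal conditions.  If it does, include
// errno, the file path, and the byte counts so the failure can be diagnosed
// from the log alone.
ssize_t written = write(fd, buffer, bufferSize);
if (written < 0)
{
    CARB_LOG_ERROR("Failed to write the state file: file='%s', bytesRequested=%zu, "
                   "error='%s' (%d)",
                   stateFilePath, bufferSize, strerror(errno), errno);
    return Result::WriteFailed;
}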

WARNING#

  • Use for: Conditions that might become errors but are recoverable. These log messages should typically only be about problems that have a local effect and do not necessarily make some functionality unavailable. These warnings are often recoverable by retrying the operation, or may simply note that the full functionality of a user request is not presently available.

  • Examples: Transient network errors, resource thresholds exceeded, deprecated API usage, file/resource access issues, delayed resource access, etc.

  • Rate: Low enough for production, but higher than ERROR.

  • Notes:

    • Warning messages should contain enough information to be actionable as a postmortem debugging tool. This should include any applicable error or status codes, context of how and where the warning occurred, and any local values that could have led to the failure condition being hit.

    • By default, warning messages will be the least severe log messages emitted from a typical app or service. The developer should emit enough information with the warning message that a good idea of what went wrong can be determined without needing to reproduce the issue or rely on additional info or verbose log messages to figure out most of a problem. This will not always be possible, but should be seen as a best practice when emitting log messages.
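
For example, a sketch of a warning that is actionable on its own (the names and values are illustrative) - it records what failed, the local state that led to it, and what will happen next:

// a transient network timeout; the download will be retried, so a warning is
// appropriate rather than an error.  Include the retry state and status code
// so the log alone shows both what happened and what will happen next.
if (status == Status::Timeout && attempt < maxAttempts)
{
    CARB_LOG_WARN("Asset download timed out, will retry: url='%s', attempt=%d/%d, "
                  "timeout=%dms, status=%d",
                  assetUrl, attempt, maxAttempts, timeoutMs, (int)status);
}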

INFO#

  • Use for: Important application flow and decision points. This allows the log to have a record of some key code paths that were taken and why, and should contain enough information to follow through the code and determine what happened to a reasonable degree.

  • Examples: Service startup/shutdown, user session creation, major state transitions, incoming service function requests, service function results, major system entry/exit events, input validation failures, etc.

  • Rate: Must be sustainable in production (less than 10 lines/second per process).

  • Notes:

    • Info log messages are emitted to a local log file by default in Kit-based apps and services. These will be filtered to the level specified in /observability/logs/emitLevel, which defaults to warn.

    • Info messages are intended to be used to provide more context to how an app or service is behaving, but should not be necessary in order to triage an issue.

VERBOSE#

  • Use for: Detailed diagnostic information. These log messages may contain significant details of how the app is behaving and be emitted frequently. By default, verbose log messages will not be transmitted over OTLP.

  • Examples: Function entry/exit (including internal-only functions), parameter values, internal state changes, resource loading and unload behavior, etc.

  • Rate: May not be available in production, can be more verbose.

  • Notes:

    • Developers may freely use verbose log messages to provide detailed context and state information about the app or service. These will not be emitted by default.

    • Developers should never assume verbose log messages will be collected or transmitted. In general verbose log messages will only be used for local, non-production debugging.

Assertions#

Assertions such as CARB_ASSERT() and assert() are a helpful debugging tool but can be very easily misused. An assertion must never be used as a replacement for robust error checking and handling, especially since that check would often disappear in non-debug builds. In general, assertions should be used sparingly and ideally only in situations where the rules of the universe would need to be violated first in order for the assertion to fail. Assertions can be used to verify invariant conditions, call parameter validity, etc, if and only if they are followed by robust handling of those error conditions.

Assertion failures should always be considered to be continuable. If an assertion fails repeatedly in a process, it should be considered a bug and fixed. Since assertions are often only functional in debug builds, they should also never be used in a way that behavior could change between a debug and a non-debug build with the same inputs.
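
A short sketch of this pattern (the function and types are hypothetical) - the assertion documents the invariant for debug builds, while the explicit check handles the bad input robustly in all builds so behavior does not differ between debug and release:

void submitWork(WorkItem* item)
{
    // the assertion loudly flags the broken invariant for developers in debug
    // builds...
    CARB_ASSERT(item != nullptr);

    // ...but the explicit check still handles the condition in all builds.
    if (item == nullptr)
    {
        CARB_LOG_ERROR("submitWork() called with a null work item");
        return;
    }

    // process the work item...
}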

Best Practices When Printing Values#

When printing values in a log message, it is highly suggested that the log message content be as easy to read, concise, and informational as possible. Some best practices to help with this are listed below; a combined example follows the list:

  • Always include any local or object level values in the log message that would help in analyzing the state of the code at the point of the log message. This aids in debugging efforts by providing hints about where control flow would have gone in the function.

  • Provide any data values in the log message in a structured format. This should include the variable names or some obvious high level names for the value to make it obvious to connect where a given value came from in the log message and how it would have affected behaviour. This structured output can take the form of a series of comma separated key/value pairs, where each key and its corresponding value is separated by an equal sign (‘=’).

  • Always quote or delimit printed string values in log messages. Since strings can contain blank space, it is not always clear in a log message where the string starts or ends if it is not delimited in some way. Using a format string such as '%s' or [%s] for string values can make the output far more readable and easier to spot non-printable characters in a string.

  • Always prefix numerical values appropriately. If something is printed as hexadecimal or octal, always prefix the value with ‘0x’ or ‘0’ respectively. This prefixing is not always done automatically with all format strings (especially on Windows), so be wary of this potential difference.

  • Always print numerical values in a format that is easiest to compare to other values. For example, memory address values are almost always seen in hexadecimal format. Printing an address in decimal makes it very difficult to compare values between log messages or within a debugger since the printed value would need to be converted. If a value is often expected to be printed in a certain way (ie: base 16 vs 10, three decimal places versus scientific notation, etc) or is often seen in that format, print those types of values in a consistent format across the board.

  • Lists of values should be separated by a common delimiting character such as a comma (‘,’).

  • In general, consistency in log output formatting is important and makes it easier for developers to identify and compare values at a glance.

  • Whenever possible, use the provided helper functions and macros for logging. These provide a lot of additional functionality and boilerplate code that can make adding extra information to log messages in code much easier.

  • When including a timestamp in a log message, always format it using a consistent epoch, resolution, and time zone. Ideally all timestamps should use UTC time and be clearly suffixed to indicate the time zone that is being used (ie: ‘Z’ for UTC or an hours offset for non-UTC times). When possible, the ISO8601 or RFC3339 timestamp format should be used.

  • When adding a log message to code, take into account the rate at which it might be emitted when trying to determine which log level to use or how frequently to emit it. For example:

    • Logging an entry event for an incoming work request could be at either info or verbose level depending on the call rate.

    • Logging periodically in tight loops should prefer logging periodically in time rather than every so many iterations. The logging rate from the former will stay steady whereas the latter could scale with loop performance.

  • When adding a failure log message to code, take into account the overall severity of the failure for the entire app or service, not just the component that encountered the failure. For example:

    • If the failure can be recovered from immediately within the same component and the process is still healthy, consider using an info level or lower log message.

    • If the failure affects the functionality of other components and the process is otherwise healthy, consider using a warning level log message.

    • If the failure affects the app or service as a whole and the process is otherwise healthy, consider using an error level log message.

    • If the failure affects the app or service as a whole and the process is no longer healthy, a fatal level log message should be used. Fatal log messages in general should only be used when the only remaining possibility is to terminate the process.
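
Taken together, a hypothetical log message following the practices above might look like this - structured key=value pairs, a quoted string, a ‘0x’ prefixed hexadecimal handle, and a delimited list of values:

CARB_LOG_INFO("Texture upload complete: name='%s', handle=0x%016llx, size=%zu bytes, "
              "mipLevels=%d, formats=[%s]",
              textureName, (unsigned long long)textureHandle, textureSize,
              mipLevels, supportedFormatsCsv);

// example output:
//   Texture upload complete: name='grass_albedo', handle=0x00007f3a12040000,
//   size=4194304 bytes, mipLevels=11, formats=[BC7, RGBA8, RGB10A2]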

Using Logging Macros and Functions#

Basic Usage#

// Use appropriate log level
CARB_LOG_FATAL("No GPU devices available: 0x%08x {driverVersion = '%s'}", hresult, driverVersion);
CARB_LOG_ERROR("Failed to load configuration file: '%s'", filename);
CARB_LOG_WARN("Connection retry attempt %d/%d", attempt, maxAttempts);
CARB_LOG_INFO("User session started: sessionId='%s'", sessionId);
CARB_LOG_VERBOSE("Processing item %d of %d", current, total);

Include Context and Identifiers#

// Good: Include relevant identifiers and context
CARB_LOG_INFO("File operation completed: operation=save, file='%s', size=%zu bytes, "
              "duration=%dms", filename, fileSize, durationMs);

// Bad: Vague message without context
CARB_LOG_INFO("Operation completed");

Use Structured Information#

// Good: Structured, searchable information
CARB_LOG_ERROR("Database connection failed: host='%s', port=%d, error='%s', retryCount=%d",
               host, port, errorMsg, retryCount);

// Bad: Unstructured prose
CARB_LOG_ERROR("Could not connect to the database on %s because %s", host, errorMsg);

What to Include in Log Messages#

Essential Elements#

  1. Identifiers: IDs, names, keys, handles that help correlation.

  2. Parameters: Input values, configuration settings, state information.

  3. Context: Thread ID, session ID, request ID when available.

  4. Timing: Duration for operations, timestamps for state changes.

  5. Results: Return codes, status, outcomes.

Security Considerations#

  • Never log: Passwords, API keys, authentication tokens, encryption keys.

  • Be careful with: User data that might contain personally identifiable information (PII), intellectual property, file contents, network payloads. Note that full filenames and their paths can include PII such as usernames, and intellectual property such as internal project or product names.

  • Safe to log: Resource IDs, non-sensitive configuration, error codes, timing information.
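
As one hedged example of the PII note above (assuming std::filesystem is available), logging only the final filename component rather than the full path avoids leaking a username that commonly appears in home directory paths:

#include <filesystem>

// the full path may contain PII (e.g. '/home/<username>/projects/...'), so
// only the final filename component is logged along with a non-identifying
// size value.
std::filesystem::path fullPath(configFilePath);
CARB_LOG_INFO("Configuration loaded: file='%s', size=%zu bytes",
              fullPath.filename().string().c_str(), configSize);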

Example: Comprehensive Log Message#

CARB_LOG_INFO("Asset load completed: assetId='%s', type='%s', size=%zu, "
              "cacheHit='%s', loadTime=%dms, memoryUsed=%zuMB",
              assetId, assetType, assetSize,
              cacheHit ? "true" : "false", loadTimeMs, memoryMB);

Managing Log Volume#

Rate Limiting Strategies#

1. Sampling for High-Frequency Operations#

// Log every 60 seconds instead of every call
static auto lastLogTime = std::chrono::steady_clock::now();
auto now = std::chrono::steady_clock::now();
if (now - lastLogTime > std::chrono::seconds(60))
{
    CARB_LOG_INFO("High frequency operation status: success=%d, failed=%d, avgTime=%dms",
                  successCount, failureCount, avgTimeMs);
    lastLogTime = now;
}
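
Note that the static variables in this sketch are not synchronized. If the operation can run on multiple threads, either guard the timestamp (for example with std::atomic or a mutex) or accept that racing threads may occasionally emit a duplicate summary line.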

2. Conditional Logging Based on Outliers#

// Only log slow operations
if (operationTimeMs > SLOW_OPERATION_THRESHOLD)
{
    CARB_LOG_WARN("Slow operation detected: operation='%s', time=%dms, threshold=%dms",
                  operationName, operationTimeMs, SLOW_OPERATION_THRESHOLD);
}

3. Bucketing Similar Operations#

// Use different log levels based on frequency
if (isRareError)
{
    CARB_LOG_WARN("Rare validation error: %s", errorDetails);
}
else
{
    CARB_LOG_INFO("Common validation error: %s", errorDetails);
}

4. Emitting Frequent Messages Once#

// only emit this common error once for the process.
CARB_LOG_WARN_ONCE("failed to open the process status file '%s' {errno = %d}", filename, errno);

// only emit this message once per service call.
static bool fileWarningShown = false;

// reset the flag at the start of each new service call so the warning can be
// emitted once per call.
if (newServiceCall)
{
    fileWarningShown = false;
}

if (!fileWarningShown)
{
    CARB_LOG_WARN("failed to open the process status file '%s' {errno = %d}", filename, errno);
    fileWarningShown = true;
}

State Change Logging#

// Log state transitions with context
CARB_LOG_INFO("State transition: entity='%s', from='%s', to='%s', reason='%s', duration=%dms",
              entityId, oldState, newState, reason, transitionTime);

Performance Considerations#

Where to Avoid Logging#

High-Frequency Loops#

// BAD: Logging in tight loops.  This produces a lot of unnecessary log output.  In most
//      tight loops, a summary at the end of the loop will often suffice.  In cases where
//      a loop is long running, reporting a collected status summary of the loop every <n>
//      seconds or <m> iterations should be used.
for (int i = 0; i < 1000000; ++i)
{
    processItem(items[i]);
    CARB_LOG_VERBOSE("Processed item %d", i); // Don't do this!
}

// GOOD: Log summary or use sampling.  Add a summary report after the loop to reduce overall
//       log output.
auto startTime = getCurrentTime();
for (int i = 0; i < itemCount; ++i)
{
    processItem(items[i]);
}
auto duration = getCurrentTime() - startTime;
CARB_LOG_INFO("Batch processing completed: items=%d, duration=%dms, avgPerItem=%.2fms",
              itemCount, duration, (float)duration / itemCount);

Performance-Critical Paths#

// BAD: Logging in render loop.  Doing this will negatively impact rendering performance and
//      should be avoided at all costs.
void renderFrame()
{
    CARB_LOG_VERBOSE("Starting frame render"); // Too frequent!
    // render logic...
    CARB_LOG_VERBOSE("Frame render complete"); // Too frequent!
}

// GOOD: Log frame statistics periodically.  Logging the frame rate information every <n>
//       frames or <m> seconds is far more efficient and still provides useful information.
//       Note however that emitting a log message every <n> frames would also scale with
//       frame rate or loop iteration speed, and would still lead to a higher volume of log
//       messages.  In timing or performance related loops such as this, emitting a log
//       message with a summary every <m> seconds is much preferred since that won't scale
//       with performance.
void renderFrame()
{
    // render logic...
    static int frameCount = 0;
    frameCount++;

    static auto lastStatsTime = getCurrentTime();
    auto now = getCurrentTime();
    if (now - lastStatsTime > 5000) // Every 5 seconds
    {
        CARB_LOG_INFO("Render stats: fps=%.1f, avgFrameTime=%.2fms, frameCount=%d",
                      frameCount / 5.0f, 5000.0f / frameCount, frameCount);
        frameCount = 0;
        lastStatsTime = now;
    }
}

Common Patterns and Best Practices#

Error Logging#

// Log errors once at the point they occur.  Deferring logging an error until a call further up
// the call stack makes it harder to track down the actual failure point and reason from the log,
// and also makes the log output unnecessarily noisy in most cases.
Result openProcessStatusFile()
{
    std::string filename = "/proc/" + std::to_string(getpid()) + "/status";
    FILE* file = fopen(filename.c_str(), "r");
    if (!file)
    {
        // in theory this open attempt should never fail.  If it ever does, something serious
        // has already gone wrong such as we've run out of file descriptors for the process.
        // In this case we'll want to log an error when failing to open the file.
        CARB_LOG_ERROR("Failed to open the process status file: file='%s', error='%s' (%d)",
                       filename.c_str(), strerror(errno), errno);
        return Result::FileNotFound;
    }

    // ... use the file, then close it ...
    fclose(file);

    // Don't log the same error again when propagating up the call stack
    return Result::Success;
}

Request/Response Logging#

// Log request attempts and results.  This helps to track the high level behaviour of the app
// or service.
void handleApiRequest(const Request& request)
{
    // Note that this could also be a verbose level log message if this type of request is
    //      called at a high rate.
    CARB_LOG_INFO("API request started: endpoint='%s', method='%s', clientId='%s', requestId='%s'",
                  request.endpoint, request.method, request.clientId, request.requestId);

    auto startTime = getCurrentTime();
    auto result = processRequest(request);
    auto duration = getCurrentTime() - startTime;

    if (result.isSuccess())
    {
        CARB_LOG_INFO("API request completed: requestId='%s', status=success, duration=%dms, "
                      "responseSize=%zu", request.requestId, duration, result.responseSize);
    }
    else
    {
        CARB_LOG_ERROR("API request failed: requestId='%s', status=error, error='%s', "
                       "duration=%dms", request.requestId, result.errorMessage, duration);
    }
}

Background Task Logging#

// Generate unique IDs for background tasks.  Since a task _could_ execute on multiple threads
// during its lifetime, it is a good practice to log a unique task ID to track its behaviour
// in the logs instead of depending on something like a thread ID that may change during its
// execution.
void startBackgroundTask(const TaskParams& params)
{
    std::string taskId = generateUniqueId();
    CARB_LOG_INFO("Background task started: taskId='%s', type='%s', priority=%d",
                  taskId.c_str(), params.type, params.priority);

    // Store taskId in task context for later logging
    executeAsyncTask([taskId, params]()
    {
        auto result = performTask(params);
        if (result.isSuccess())
        {
            CARB_LOG_INFO("Background task completed: taskId='%s', duration=%dms, "
                          "itemsProcessed=%d", taskId.c_str(), result.duration,
                          result.itemsProcessed);
        }
        else
        {
            CARB_LOG_ERROR("Background task failed: taskId='%s', error='%s', itemsProcessed=%d",
                           taskId.c_str(), result.error, result.itemsProcessed);
        }
    });
}

Examples: Bad vs Good Logging#

Example 1: Function Entry/Exit Logging#

Bad Approach#

This logs several messages during the execution of a function. This leads to an unnecessary level of detail in the logged messages but doesn’t provide any useful state context for the call.

void processUserData(const UserData& userData)
{
    CARB_LOG_VERBOSE("Entering processUserData");

    CARB_LOG_VERBOSE("Validating user data");
    if (!validateUserData(userData))
    {
        CARB_LOG_VERBOSE("User data validation failed");
        CARB_LOG_VERBOSE("Exiting processUserData");
        return;
    }
    CARB_LOG_VERBOSE("User data validation passed");

    CARB_LOG_VERBOSE("Processing user data");
    auto result = performProcessing(userData);
    CARB_LOG_VERBOSE("User data processing complete");

    CARB_LOG_VERBOSE("Exiting processUserData");
}

Problems: Too verbose, no useful information, high log volume.

Good Approach#

Log a single summary at the end of the operation and ensure all early return points have a log message that can indicate that at least part of the operation was skipped. Also make sure to provide enough state context in the log message to be able to track back to what may have happened in the code during execution.

void processUserData(const UserData& userData)
{
    auto startTime = getCurrentTime();

    if (!validateUserData(userData))
    {
        CARB_LOG_WARN("User data validation failed: userId='%s', reason='%s'",
                      userData.userId, getValidationError());
        return;
    }

    auto result = performProcessing(userData);
    auto duration = getCurrentTime() - startTime;

    CARB_LOG_INFO("User data processed: userId='%s', itemCount=%d, duration=%dms, status='%s'",
                  userData.userId, userData.itemCount, duration,
                  result.isSuccess() ? "success" : "failed");
}

Benefits: Includes identifiers, timing, outcome; much lower volume; more useful for debugging.

Example 2: Error Handling#

Bad Approach#

This example shows an error message that is reported multiple times along the call chain of a function as the error is handled. This is undesirable since it outputs what is essentially the same message multiple times in multiple stack frames.

Result connectToDatabase()
{
    CARB_LOG_INFO("Trying to connect to database");
    auto connection = database::connect(connectionString);
    if (!connection)
    {
        CARB_LOG_ERROR("Failed to connect to database");
        return Result::Failed;
    }
    CARB_LOG_INFO("Connected to database successfully");
    return Result::Success;
}

void initializeApp()
{
    CARB_LOG_INFO("Initializing application");
    auto result = connectToDatabase();
    if (result != Result::Success)
    {
        CARB_LOG_ERROR("Database connection failed during app initialization");
        // Log the same error again!
    }
}

Problems: Same error logged multiple times, no useful details, redundant messages.

Good Approach#

An error message should only be logged at the point where the error occurs, not in each stack frame of its call chain. It is however acceptable to log the error a second time if the call chain propagates through multiple subsystems or even past a process boundary. In those cases, the error could be logged again at each subsystem or process boundary to make it more clear which systems the call and error propagated through.

Result connectToDatabase()
{
    auto startTime = getCurrentTime();
    auto connection = database::connect(connectionString);
    auto duration = getCurrentTime() - startTime;

    if (!connection)
    {
        CARB_LOG_ERROR("Database connection failed: host='%s', port=%d, timeout=%dms, error='%s'",
                       dbConfig.host, dbConfig.port, duration, database::getLastError());
        return Result::Failed;
    }

    CARB_LOG_INFO("Database connected: host='%s', port=%d, connectTime=%dms, version='%s'",
                  dbConfig.host, dbConfig.port, duration, connection->getVersion());
    return Result::Success;
}

void initializeApp()
{
    auto result = connectToDatabase();
    if (result != Result::Success)
    {
        // Don't log the error again, just handle it
        showUserErrorMessage("Database unavailable.  Please check configuration.");
        return;
    }

    CARB_LOG_INFO("Application initialization completed successfully");
}

Benefits: Error logged once with details, no duplication, useful diagnostic information.

Example 3: Loop Processing#

Bad Approach#

This loop has a variable length that could also depend on user input. Logging at this granularity would result in a large number of log messages that scale with the input size. This can overwhelm the logging data backend as it tries to keep up with the messages.

void processFileList(const std::vector<std::string>& files)
{
    CARB_LOG_INFO("Starting to process file list");

    for (const auto& filename : files)
    {
        CARB_LOG_INFO("Processing file: %s", filename.c_str());

        auto result = processFile(filename);
        if (result.isSuccess())
        {
            CARB_LOG_INFO("Successfully processed file: %s", filename.c_str());
        }
        else
        {
            CARB_LOG_ERROR("Failed to process file: %s", filename.c_str());
        }
    }

    CARB_LOG_INFO("Finished processing file list");
}

Problems: Log volume scales with input size, overwhelming in production.

Good Approach#

Reporting a summary of what happened in the processing loop after it has completed can help reduce the overall logging footprint of the function and also simplify the output.

void processFileList(const std::vector<std::string>& files)
{
    auto startTime = getCurrentTime();
    int successCount = 0;
    int failureCount = 0;
    std::vector<std::string> failedFiles;

    for (const auto& filename : files)
    {
        auto result = processFile(filename);
        if (result.isSuccess())
        {
            successCount++;
        }
        else
        {
            failureCount++;
            failedFiles.push_back(filename);

            // Only log individual failures for important errors
            if (result.errorCode == ErrorCode::Corruption)
            {
                CARB_LOG_ERROR("File corruption detected: file='%s', error='%s'",
                               filename.c_str(), result.errorMessage);
            }
        }
    }

    auto duration = getCurrentTime() - startTime;

    CARB_LOG_INFO("File processing completed: total=%zu, success=%d, failed=%d, "
                  "duration=%dms", files.size(), successCount, failureCount, duration);

    if (!failedFiles.empty())
    {
        // Log summary of failures
        CARB_LOG_WARN("Failed files: count=%d, examples='%s%s'",
                      failureCount,
                      failedFiles[0].c_str(),
                      failedFiles.size() > 1 ? "..." : "");
    }
}

Benefits: Fixed log volume regardless of input size, summary information, detailed logs only for critical issues.

Configuration Guidelines#

Logging settings can be changed through the settings system, environment variables, or programmatically at runtime. Some logging settings, such as active log channel levels, will be automatically applied when the setting changes. Most other logging settings require an explicit refresh using carb::logging::configureLogging() and carb::logging::configureDefaultLogger(). Programmatic logging setting changes will be applied immediately.

Note that a system or component must never change logging settings that affect logging globally in the app or service. This should be left up to the app or service initialization code or its configuration. A component may only modify its own logging settings such as the logging levels for its own log channels. A single component may never make an assumption about logging settings on behalf of the entire app or service.
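
As a hedged sketch of a programmatic change (assuming the standard carb::settings::ISettings interface and the /observability/logs/emitLevel setting mentioned earlier), app or service initialization code might adjust the emit level like this:

// this modifies a global logging setting, so it belongs in app or service
// initialization code only - never in an individual component.
auto settings = carb::getCachedInterface<carb::settings::ISettings>();
if (settings != nullptr)
{
    settings->setString("/observability/logs/emitLevel", "info");
}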

Environment Variable Configuration#

# Set global log level
OMNI_KIT_log_level=INFO

Production vs Development Settings#

auto logging = carb::getCachedInterface<carb::logging::ILogging>();

if (logging != nullptr)
{
#ifdef CARB_DEBUG
    // More verbose logging in debug builds
    logging->setLevelThreshold(carb::logging::kLevelVerbose);
#else
    // Conservative logging in release builds
    logging->setLevelThreshold(carb::logging::kLevelWarn);
#endif
}

Debugging with Logs#

Correlation Techniques#

  • Use request IDs to trace operations across multiple components. Having a common ID emitted with log messages allows a way for multiple telemetry signals (ie: traces, logs, metrics) that are referring to the same operation to be connected together during post mortem data analysis.

  • Include thread IDs for multi-threaded debugging. Without thread IDs, it is nearly impossible to unravel a log in a multi-threaded process. The Carbonite logging system always includes the thread and process ID with each log message, however it is the responsibility of each log consumer to use those IDs when emitting a log message to a destination (ie: log file, console, GUI window, data collection backend, etc). Carbonite’s standard logger will only output the thread ID if the /log/includeThreadId boolean setting is enabled.

  • Log state transitions with timestamps for timing analysis. When doing a postmortem analysis of performance, it is important to have high resolution timestamps or an elapsed time value included in log messages. The Carbonite logging system will deliver two timestamps with each log message in its metadata. Both timestamps will have at least 100ns resolution. One timestamp (carb::logging::LogMessage::timestampNs) will be a monotonic elapsed time since an arbitrary epoch (usually system boot time), and the other (carb::logging::LogMessage::timestampNsSinceUnixEpoch) will be the wall clock relative to the Unix epoch (ie: midnight January 1, 1970). Each log consumer implementation should use at least one of these timestamps to report the time the log message was emitted at. Note that these values will not necessarily have any relation to the log message ingestion timestamp that most data backends will automatically add.

  • Use structured key=value format for easy parsing and filtering. Structured log data is much easier for tooling to parse and process automatically. Using structured data also makes it much simpler for developers to follow values in the code versus inlining values in a log message as a friendly, human readable sentence.
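
A minimal sketch of the request ID technique (the helper and field names are hypothetical): generate one ID when a request arrives and include it in every related message so the entire flow can later be filtered out of the log.

void handleRequest(const Request& request)
{
    std::string requestId = generateUniqueId(); // hypothetical helper

    CARB_LOG_INFO("Request received: requestId='%s', endpoint='%s'",
                  requestId.c_str(), request.endpoint);

    bool cacheHit = lookupCache(request);

    // every component that touches this request includes the same ID, so a
    // single filter on requestId reconstructs the full flow later.
    CARB_LOG_INFO("Cache lookup: requestId='%s', hit=%s",
                  requestId.c_str(), cacheHit ? "true" : "false");

    CARB_LOG_INFO("Request completed: requestId='%s', status='%s'",
                  requestId.c_str(), cacheHit ? "cached" : "computed");
}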

Log Analysis Tips#

  • Search for error messages first to identify failure points. The log messages that lead up to the error message may provide some further context information about what kind of operation was occurring. Some errors could have symptoms or related problems much earlier in the log as well.

  • Follow request IDs through the entire flow. This can give a better sense of context for what was specifically occurring around the processing of that request versus other requests that may have been handled simultaneously. This is also a case where having thread IDs in log messages or their metadata is especially important.

  • Look for timing patterns to identify performance bottlenecks. Log timestamps, especially the monotonic elapsed time timestamps can be useful for tracking durations of operations in a log. This could help identify operations that are taking an unusually or unexpectedly large amount of time.

  • Use log timestamps to correlate with external events. When trace information isn’t available for an operation, absolute timestamps can be used to correlate operations that occur around the same time index and could be leading to a problem. In these cases, it is important for log messages to be output with a timestamp relative to a fixed epoch (ie: the Unix epoch) and in a consistent time zone.

Building Debuggable Applications#

Instrumenting code with log messages that are helpful in debugging is almost always an iterative process. New log messages can be added or existing ones improved as new types of problems are encountered. Below are some additional guidelines that can help make log messages more useful in earlier debugging iterations:

  • Include enough context to understand the problem without source code. While it is possible to log each and every operation that occurs, that is often not feasible in practice. Doing so would be very likely to overwhelm a log collection backend or fill up storage quickly. Parsing such a large log would also be problematic in a lot of situations. Using higher quality but less verbose logging is often a better solution. If source code is available, a more detailed investigation of the issue could be done by following through the code and noting state changes reported by the log messages later. Being able to quickly triage an issue without needing to also look at the source code can help speed up debugging efforts.

  • Log decision points and why certain paths were taken. Log messages that are placed at strategic branching points in the code can help track down which path(s) were executed. The context information included in the log message can help figure out why an unexpected branch may have been taken in the code.

  • Provide clear and actionable error messages that explain what went wrong and why. These messages should include the values of any variables that may have contributed to code execution reaching an error condition. These would help figure out why the error may have occurred and make it easier to write a small unit test that attempts the same failing operation to investigate further.

  • Include recovery actions taken automatically by the system. When an error or failure occurs and is logged, it is often useful for a developer to see following log messages that give an indication of whether a fallback path of some kind was taken to try to recover from the failure. This can help in triaging and prioritizing a fix for an issue.
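
A short sketch combining the last two points (the names are illustrative) - log the decision point with the values that drove it, then log the recovery action that was taken automatically:

// decision point: record which path was taken and why.
if (gpuMemoryAvailable < textureBudget)
{
    CARB_LOG_INFO("Falling back to streamed textures: gpuMemoryAvailable=%zuMB, "
                  "textureBudget=%zuMB", gpuMemoryAvailable, textureBudget);

    // recovery action: note what the system did automatically so a developer
    // reading the log can see that a fallback path was engaged.
    enableTextureStreaming();
    CARB_LOG_INFO("Texture streaming enabled as an automatic fallback");
}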

Summary#

Good logging is about balance: providing enough information to debug issues while keeping volume manageable. Key principles:

  1. Be intentional: Every log message should serve a debugging purpose.

  2. Include context: IDs, parameters, timing, and outcomes.

  3. Manage volume: Use sampling, conditional logging, and appropriate levels.

  4. Avoid performance impact: Stay out of tight loops and critical paths.

  5. Think about correlation: Help trace problems across components.

  6. Learn from issues: Update logging based on what would have helped debug past problems.

Remember: The goal is to be able to troubleshoot production issues from logs alone, without requiring reproduction or live debugging access.