App Shutdown Overview

Kit-kernel can shut down in one of two ways: ‘normal’ or ‘fast’. The mode is selected with the /app/fastShutdown boolean setting. The two shutdown methods behave as follows:

The ‘fast’ shutdown method is chosen by setting /app/fastShutdown to true. This is the default value for the setting. This method intentionally avoids shutting down any extensions or plugins before explicitly terminating the process. Instead, it calls each Carbonite C++ plugin’s carbOnPluginQuickShutdown() callback function (if exported), or the callback registered with OMNI_MODULE_ON_MODULE_LAST_CHANCE_SHUTDOWN() in an ONI plugin, so that each plugin can do only the bare minimum of cleanup work before the process exits. Because the process explicitly terminates itself, none of the loaded modules (plugins, extensions, or third party modules) will have their terminator or atexit() functions run during process shutdown.
The ‘normal’ shutdown method is chosen by setting /app/fastShutdown to false. This method can take considerably longer, but it attempts to fully shut down and unload all plugins and extensions, reclaim as many of the process’s system resources as possible, and completely shut down the Carbonite framework for the process. It does not explicitly exit or terminate the process; instead it returns to the bootstrap loader that started the process (i.e., either the main() function in kit/kit.exe or a Python script that bootstrapped Kit). When it returns to main(), the process exits immediately after all plugins have been unloaded and the Carbonite framework has been shut down.

The default ‘fast’ shutdown method is by far the fastest, but only because it skips most of the shutdown work. This could result in data loss if a plugin or extension does not get a notification of the shutdown event. Relying on it can also leave the ‘normal’ shutdown code broken, since problems in that path go unnoticed unless it is also exercised regularly. With the ‘fast’ method, a call to shut down the app never returns.

In the ‘normal’ shutdown method, each plugin and extension will do a lot more work during app shutdown. Aside from other plugin or extension specific cleanup, each plugin and extension that needs to do some kind of clean up is expected to implement its carbOnPluginShutdown() (or register an unload handler for an ONI plugin) and must clean up at least the following:

Any and all running threads owned by the plugin or extension that is shutting down. If a thread is left running it will prevent the owning module from unloading from memory.
Unregister any callbacks, listeners, notifiers, etc. that could still reference code in the module being shut down. Some of these notifiers and callbacks can be run as an expected part of process shutdown.
Cancel any pending carb.tasking or omni.job tasks that have been scheduled but not yet completed. Any other similar tasking system (e.g., TBB) must also have its pending tasks cancelled.
Close or release any open files or devices. The process itself may not be exiting, so handles and file descriptors cannot simply be leaked during shutdown on the assumption that the process will go away.
Deallocate all memory previously allocated and owned by the plugin or extension that is shutting down.
Ensure that any ONI plugin is actually being unloaded. It is possible to register a ‘can unload’ callback using OMNI_MODULE_ON_MODULE_CAN_UNLOAD() that the framework calls to check whether an ONI plugin believes it can be fully unloaded. If this check fails, it may be retried later as part of unloading a dependency of another plugin. If one or more ONI plugins fail to report that they can be unloaded after all other plugins have been unloaded, the unload callback registered with OMNI_MODULE_ON_MODULE_UNLOAD() will never be called; the ‘last chance unload’ callback (registered with OMNI_MODULE_ON_MODULE_LAST_CHANCE_SHUTDOWN()) will be called instead.
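The cleanup list above can be sketched in plain Python. This is only an illustration under stated assumptions: `ExampleExtension`, its `on_shutdown()` method, and the `events` callback registry are hypothetical names rather than Kit or Carbonite APIs, and the standard library's `threading` and `concurrent.futures` modules stand in for carb.tasking or omni.job. A real plugin or extension would do the equivalent work in its own shutdown handler.

```python
import tempfile
import threading
from concurrent.futures import ThreadPoolExecutor


class ExampleExtension:
    """Hypothetical extension-like object; not a real Kit class."""

    def __init__(self, event_source):
        # event_source is a hypothetical callback registry (a plain list here).
        self._event_source = event_source
        self._stop = threading.Event()
        # Name the worker so it is easy to find in a debugger's thread list.
        self._worker = threading.Thread(target=self._run, name="example.ext worker")
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = []
        self._log = tempfile.TemporaryFile(mode="w+")
        self._event_source.append(self._on_event)
        self._worker.start()

    def _run(self):
        # Wake up periodically until shutdown asks the thread to exit.
        while not self._stop.wait(timeout=0.01):
            pass

    def _on_event(self, name):
        self._log.write(name + "\n")

    def submit(self, fn, *args):
        self._pending.append(self._pool.submit(fn, *args))

    def on_shutdown(self):
        # 1. Stop and join every thread this object owns.
        self._stop.set()
        self._worker.join()
        # 2. Unregister callbacks so nothing can call back into this object.
        self._event_source.remove(self._on_event)
        # 3. Cancel tasks that have not started, then wait for the rest.
        for future in self._pending:
            future.cancel()
        self._pool.shutdown(wait=True)
        self._pending.clear()
        # 4. Close files and devices rather than leaking the handles.
        self._log.close()


events = []
ext = ExampleExtension(events)
for callback in list(events):
    callback("started")          # simulate an event being delivered
ext.submit(sum, [1, 2, 3])
ext.on_shutdown()
```

The order follows the list above: threads are joined first so that nothing is still running while callbacks, tasks, and files are torn down.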

Note that during a ‘normal’ app shutdown, some systems are intentionally not unloaded or shut down:

Python bindings modules will not be unloaded. Python does not offer any mechanism to do this so bindings modules will never be notified of shutdown events.
USD and USD related libraries and plugins will not be unloaded. This is because USD is not intended to ever be shut down or reinitialized within a process.
Python does not unload or shut down even if it was loaded by a carb.scripting-python plugin. Python will however hold on to any objects that have not been destroyed or garbage collected. These objects will be destroyed during process shutdown when Python itself cleans up.
Some of the OmniClient libraries do not unload or shut down. This is due to their close integration with the USD libraries and plugins.

Because these components do not shut down, it is quite possible that some system resources such as allocated memory, loaded module space, and running worker threads will not be released on app shutdown. This means that if these components are loaded into the Kit process, shutting down Kit and the Carbonite framework will never return the process to the memory use, thread count, or module count that it had when the Carbonite framework was first started.

Tips For Debugging ‘Normal’ Shutdown Issues

A lot can go wrong during a ‘normal’ shutdown. For example, a plugin that doesn’t clean up all of its resources on shutdown could cause a crash or hang in another system when that system tries to shut down.

Some of the most common issues seen during ‘normal’ shutdown are:

A crash or hang while shutting down a plugin. This is often related to a dependency order between plugins or extensions that causes a resource to be released out of order or not released when it should have been. Dangling references (e.g., a registered callback/listener, a running thread, etc.) to the plugin that is shutting down could also be left behind and cause problems when another plugin or extension shuts down. This could also result in the plugin module not being unloaded from memory.
A hang on process shutdown. This can especially occur on Windows if NTDLL terminates a thread while it still holds a lock and that lock is then waited on during shutdown. Such a wait will never succeed and the process will appear to be hung.
A crash while trying to shut down Python during process shutdown. Python’s shutdown triggers a garbage collection (GC) pass, which can in turn cause a bindings module to try to access an interface or function in a plugin that is no longer present in the process.

Diagnosing each of these shutdown failures can be challenging. Regardless, it often helps to first ensure all plugins and extensions perform a full and complete shutdown and deallocation of resources. Gathering a list of potential suspects is a good starting point:

This involves putting a breakpoint in one of the latest possible spots in our code to check on existing system resources that may be owned by the plugin or extension that is shutting down. Depending on the app launch method (i.e., launched directly through kit/kit.exe versus launched through Python), the best spot for a breakpoint is either at the end of Kit-kernel’s main() function (for a direct launch), or at the end of carb::releaseFrameworkAndShutdown() in the omni.kit.app plugin.
Once the breakpoint is hit, take a look at the debugger’s threads and modules lists. Identify any threads that should not still be running and ensure they are shut down properly by the extension or plugin that owns them. Similarly, identify any modules that should have been unloaded but weren’t.
Address any extra threads first, since cleaning them up often clears up some module unload issues as well.

Some preventative measures that can help avoid these shutdown problems:

Name all threads owned by your plugin or extension. This can be done with carb::this_thread::setName() or carb::thread::setName(). This will allow the threads of your plugin or extension to be easily identified in a debugger.
Ensure each plugin’s carbOnPluginShutdown() (or any registered ONI plugin’s shutdown function) is shutting down all known threads, closing connections, saving files, cleaning up cached assets, deallocating memory, etc. It is each plugin and extension’s responsibility to be a good neighbor in the process and clean up everything that it has allocated. If an asset or resource allocated directly by a plugin or extension has no way to be cleaned up on demand, consider redesigning the system so that this is possible.
Any Python script included with an extension should ensure that all of the objects it created are cleaned up when the owning Python object is destroyed. Leaving objects to be garbage collected later can easily result in shutdown issues. Basically, if a Python script or Python object allocates or creates something, especially through a bindings module, that returned object must be destroyed with the object or script that allocated it.
If a bindings module is loaded as part of a plugin or extension’s functionality, it is that bindings module’s responsibility to know if and when it is safe to destroy an object it had previously returned. Since bindings modules are never unloaded, Python’s shutdown at process exit can cause any pending Python objects to be destroyed after the Carbonite framework has shut down, which in turn can lead to calls being made into an unloaded module.
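The Python-side advice above amounts to explicit ownership: whatever creates an object through a bindings module also destroys it at a known time. A minimal sketch follows, assuming a hypothetical `FakeBindings` class in place of a real bindings module and a hypothetical `ResourceOwner` helper; neither is part of Kit.

```python
class FakeBindings:
    """Hypothetical stand-in for a bindings module that hands out handles."""

    def __init__(self):
        self.live = set()
        self._next = 0

    def create_resource(self):
        self._next += 1
        self.live.add(self._next)
        return self._next

    def destroy_resource(self, handle):
        self.live.remove(handle)


class ResourceOwner:
    """Owns every handle it creates and releases them deterministically."""

    def __init__(self, bindings):
        self._bindings = bindings
        self._handles = []

    def create(self):
        handle = self._bindings.create_resource()
        self._handles.append(handle)
        return handle

    def destroy(self):
        # Release in reverse creation order; safe to call more than once.
        while self._handles:
            self._bindings.destroy_resource(self._handles.pop())

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False


bindings = FakeBindings()
with ResourceOwner(bindings) as owner:
    owner.create()
    owner.create()
    live_inside = len(bindings.live)   # two handles exist inside the block
live_after = len(bindings.live)        # none are left for the GC to find later
```

Calling `destroy()` from the owner's own shutdown path, rather than from `__del__`, keeps the destruction from depending on when a garbage collection pass happens to run.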

Some things to check for in a debugger late during the shutdown process (potential breakpoint locations are noted above) that can help to narrow down a shutdown problem:

Are any threads belonging to the plugin being shut down still running? Check the MSVC ‘Threads’ window or GDB’s info threads listing for this information. On Windows, the NTDLL library will explicitly terminate all threads of a process during shutdown, leaving each of them in an undefined state. If a thread holds a lock when NTDLL kills it, anything else that later attempts to wait on that lock will hang or fail. The only exception is OS level locking primitives such as critical sections, mutexes, semaphores, etc.; any wait on those objects will trivially succeed during process shutdown. Waits on std::mutex or any of our custom synchronization primitives, however, will hang indefinitely during shutdown. Ensuring that all threads are in a known good state or have exited before process shutdown should help clear up these types of issues.
Are any plugins still unexpectedly loaded? Most of the time these modules remain pinned in the process because they still have one or more threads running in the process. A warning message should be printed to the log if a plugin was expected to unload but it did not for some reason. Ensuring that any running threads owned by a plugin have exited is a good way of avoiding this type of issue.
Is a Python object being cleaned up at the hang or crash point? This is probably the most common and problematic cause of shutdown issues. See below for more information.
On Linux, LD_DEBUG=libs will show when module finalizer functions are being called. This can help determine which modules are actually unloading and running their finalizers and which could not be unloaded for some reason. Other LD_DEBUG options can also help reveal more information about which modules are loaded, initialized, and being searched for symbols.
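For threads that were created from Python, the "still running?" check can also be scripted instead of read from a debugger. The sketch below uses only the standard `threading` module; `live_threads()` and the `demo.ext/` name prefix are hypothetical, and the check only sees threads started through Python's `threading` module, not native threads created by C++ plugins.

```python
import threading


def live_threads(prefix):
    """Return sorted names of live Python threads whose name starts with prefix."""
    return sorted(
        t.name
        for t in threading.enumerate()
        if t.is_alive() and t.name.startswith(prefix)
    )


stop = threading.Event()
worker = threading.Thread(target=stop.wait, name="demo.ext/worker")
worker.start()

before = live_threads("demo.ext/")   # the worker is still running here

stop.set()
worker.join()

after = live_threads("demo.ext/")    # nothing with this prefix is left
```

This only works if the threads were given recognizable names when they were created.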

Dangling Python Objects

Whenever a Python object is created that represents an object (e.g., a handle, resource, or wrapped pointer) exposed through a bindings module, and that object is leaked from a Python script or class, it will be cleaned up later by a garbage collection pass. These passes can happen at any time during normal execution. However, the C++ object that the Python object wraps through the bindings typically needs functionality from the related C++ plugin in order to be destroyed properly. Unfortunately, during a ‘normal’ shutdown these related plugins are unloaded, so the destruction code that the bindings module refers to is no longer present. This leads to a crash when Python tries to clean up all outstanding objects during its own shutdown at process exit time. Running a GC pass just before shutdown can sometimes fix these types of issues, but not in all cases, since the Python object could still be referenced somewhere.

When these types of issues are found, there are a few steps that could be taken to track down the object and ensure it is either cleaned up or the object destruction is short circuited. Some steps that could help narrow these down are:

Attempt to track down which Python bindings object is being leaked. The crash stack at the point where object destruction fails often reveals the object type, since it will usually include a destruction function in the bindings module that hints at which object type is in use.
After identifying the object type, add some logging on the Python side to figure out where the objects are being created and where they should be getting destroyed. If at all possible, these leaked objects should be cleaned up properly on the Python side.
If proper destruction of the Python objects is not possible, the wrapped C++ object destruction should be short circuited on process shutdown. Basically, if the related plugin has been unloaded, intentionally leak the wrapped C++ object. The plugin unload can be detected in one of a few ways:
- Have an interface unload callback for the dependent interface(s). These can be registered with the framework through carb::Framework::addReleaseHook(). If the required interface is no longer present, the object destruction should be skipped.
- Try to acquire the required interface before object destruction. This can be done using carb::Framework::tryAcquireExistingInterface(). If the required interface cannot be acquired, simply skip the object destruction and leak it.
Unfortunately these checks need to be at the bindings module level since the presence of the related plugin cannot be guaranteed during shutdown.
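The "try to acquire, otherwise leak" logic can be modeled in a few lines. The sketch below is a plain-Python simulation under stated assumptions: `FakeFramework`, `FakeResourceInterface`, `WrappedResource`, and the `"example.resource"` interface name are all hypothetical. In a real bindings module the same check would be written in C++ around carb::Framework::tryAcquireExistingInterface().

```python
class FakeFramework:
    """Hypothetical stand-in for the plugin framework; not the Carbonite API."""

    def __init__(self):
        self._interfaces = {}

    def register(self, name, interface):
        self._interfaces[name] = interface

    def unload(self, name):
        self._interfaces.pop(name, None)

    def try_acquire_existing(self, name):
        # Returns None when the interface is absent; never loads anything.
        return self._interfaces.get(name)


class FakeResourceInterface:
    """Hypothetical plugin interface that knows how to destroy handles."""

    def __init__(self):
        self.destroyed = []

    def destroy(self, handle):
        self.destroyed.append(handle)


class WrappedResource:
    """Models the wrapper object a bindings module would hand to Python."""

    def __init__(self, framework, handle):
        self._framework = framework
        self._handle = handle

    def release(self):
        """Destroy the handle if its interface still exists; return True if destroyed."""
        if self._handle is None:
            return False
        iface = self._framework.try_acquire_existing("example.resource")
        handle, self._handle = self._handle, None
        if iface is None:
            # Plugin already unloaded: intentionally leak rather than crash.
            return False
        iface.destroy(handle)
        return True


framework = FakeFramework()
iface = FakeResourceInterface()
framework.register("example.resource", iface)

first = WrappedResource(framework, 1)
second = WrappedResource(framework, 2)

destroyed_before_unload = first.release()    # interface present: destroyed
framework.unload("example.resource")
destroyed_after_unload = second.release()    # interface gone: leaked on purpose
```

The wrapper clears its handle in both cases, so a later garbage collection pass has nothing left to destroy.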