Fibers vs. Threads

Overview

The carb.tasking system is a fiber-based tasking system. A fiber is like a thread in that it has its own stack, but unlike a thread in that the operating system never schedules a fiber. Instead, the carb.tasking worker threads run fibers as they become available.

In this document, we will use the following terminology:

  • Thread - The smallest sequence of programmed instructions that can be managed independently by the Operating System scheduler. A thread represents memory for an execution stack, the set of CPU registers required to execute instructions, and any associated kernel structures to manage the thread. Threads may be independently scheduled to run by the Operating System, or preempted by the Operating System in order to run other tasks.

  • Fiber - The smallest sequence of programmed instructions that can be managed in user-mode external to the Operating System scheduler. A fiber represents memory for an execution stack and the set of CPU registers required to execute instructions. Fibers may only be executed explicitly by threads and will never be scheduled by the Operating System scheduler. Running a fiber is cooperative, that is, a thread must explicitly select a fiber to run and the fiber must communicate to the thread running it that it is finished or needs to be suspended.

  • Job - The smallest unit of execution (or work) that cannot be suspended or resumed, though it may wait. Can be compared to a function or a subroutine that is called and then completed.

  • Co-routine - The smallest unit of execution (or work) that may be suspended or resumed. Can be thought of as a function or subroutine that can be paused at points before it returns.

  • Task - The concept that encompasses both Job and Co-routine. The smallest unit of execution (or work) that may be provided to a system designed to execute work concurrently.

  • Concurrency - An interleaving of tasks such that a single thread may do several things in quick succession, perhaps without completing any one task. Also multi-tasking.

  • Parallelism - Utilizing multiple threads running on different CPUs to perform work at the same time. Also multi-processing.

The carb.tasking system is architected to achieve parallelism through co-routines. As evident in the definitions above, co-routines are a superset of jobs: every job is a co-routine, but not vice versa. Consequently, a system that can execute co-routines can also execute any simple job.

Job Systems

While Carbonite does have a simple thread pool (IThreadPool) for executing simple jobs, the carb.tasking system seeks to provide the ability to utilize co-routines.

Other available task systems such as Intel’s Threaded Building Blocks (TBB) and Grand Central Dispatch (GCD or libdispatch) share many concepts with carb.tasking but do not allow for co-routines. In some cases, handling dependent jobs requires blocking until every dependency completes, as there is no way to context-switch between jobs, or requires nesting jobs within other jobs [1].

Consider the example where a worker thread is running a job Foo which has a dependency on job Bar and needs to wait. If Bar has not yet started, the thread can (recursively) run Bar immediately. If Bar is already running on another thread, the worker thread must wait for it. If the worker thread blocks while waiting, the system loses throughput while the thread sits idle. If instead the worker thread recursively executes another unrelated job Bat, it cannot give Bat its full stack space, and it must wait for Bat to finish before it can return to waiting on Bar and finally resume Foo. Co-routines do not have this issue, as they can be paused and even handed off to another thread.

Achieving co-routine-like behavior in these systems requires implementing a state machine with wait states, splitting a job into multiple smaller jobs, or transitioning to a subsequent job.
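
As an illustration, the following sketch shows the state-machine approach in a hypothetical job system. The enqueue() function and the download/parse functions are assumptions for the example, not an actual API:

    #include <atomic>
    #include <functional>
    #include <memory>

    // Hypothetical host job system: runs queued jobs on worker threads.
    void enqueue(std::function<void()> job);

    // Without co-routines, a job that must wait mid-way is typically written
    // as a state machine: it records where it left off and re-enqueues itself.
    struct ParseAfterDownload : public std::enable_shared_from_this<ParseAfterDownload>
    {
        enum class State { eStart, eWaitForDownload, eParse };
        State state = State::eStart;
        std::atomic_bool downloadFinished{ false }; // set by the download callback

        void step()
        {
            std::shared_ptr<ParseAfterDownload> self = shared_from_this();
            switch (state)
            {
                case State::eStart:
                    beginAsyncDownload(); // hypothetical; sets downloadFinished later
                    state = State::eWaitForDownload;
                    enqueue([self] { self->step(); }); // come back to poll the wait state
                    break;
                case State::eWaitForDownload:
                    if (!downloadFinished)
                    {
                        enqueue([self] { self->step(); }); // still waiting; try again
                        break;
                    }
                    state = State::eParse;
                    enqueue([self] { self->step(); });
                    break;
                case State::eParse:
                    parse(); // hypothetical: the work that depended on the download
                    break;
            }
        }

        void beginAsyncDownload();
        void parse();
    };

    // Usage: enqueue(std::bind(&ParseAfterDownload::step, std::make_shared<ParseAfterDownload>()));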

Co-routines

As co-routines are a superset of jobs (that is, a job is also a co-routine, but not vice versa), supporting them allows for the execution of simple jobs as well as more complex co-routines.

The use of co-routines has been well-demonstrated even within Omniverse, one of the largest examples being Python’s asyncio library. C++ is also adding support for co-routines in C++20. Many other popular languages such as Go, Rust, Lua and JavaScript have support for co-routines. However, many of these co-routines exist to provide concurrency, not necessarily parallelism. carb.tasking seeks to provide both.

“Stackful” vs. “Stackless”

No discussion of co-routines would be complete without covering “stackful” and “stackless” co-routines. Though this is generally an implementation concept, it can have repercussions for consumers as well.

In a C/C++ program, most programmers are familiar with the concept of a stack. The stack is a region of memory assigned to a thread and is used to store variables local to a function and return addresses. On aarch64 and x86-64 hardware, the CPU has a Stack Pointer register which points to an address within this memory block. When a function is called, the current Program Counter (execution address) is pushed onto the stack: it is written at the current Stack Pointer address and the Stack Pointer is automatically decremented to the next slot. The Program Counter then jumps to the starting address of the function. The called function is free to further decrement the Stack Pointer to reserve space for local variables. When the function returns to the caller, the reverse happens: the Stack Pointer is incremented to free the space for the local variables, and then the return address is popped from the stack into the Program Counter, which resumes execution at the point where the function was called.
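
In simplified terms (a real compiler may pass these values in registers and elide the stack entirely), the comments in the following sketch map a call site to the mechanics described above:

    int add(int a, int b)
    {
        int sum = a + b; // the compiler may reserve space for `sum` by further
                         // decrementing the Stack Pointer on entry
        return sum;      // `ret`: the reserved space is released and the saved
                         // return address is popped into the Program Counter
    }

    int caller()
    {
        // `call add`: the return address (the next instruction in caller) is
        // pushed onto the stack, and the Program Counter jumps to `add`.
        return add(1, 2);
    }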

Now imagine that the function called is akin to a Python await directive (or Lua’s yield). The current execution needs to be paused, but the stack belongs to the thread. In a “stackful” implementation, the entire stack is swapped out for a different stack. In a “stackless” implementation, the current stack contents are copied to a different location, or live in segmented allocations on the heap.

“Stackful” implementations effectively work the way the machine already works. Each co-routine requires its own stack, and these stacks can be swapped in and out of a thread and “returned” from in order to start or resume execution. Incidentally, this is how threads work at the Operating System level. However, requiring a stack per co-routine can be costly in terms of memory, depending on the stack space needs of the co-routine. Fibers are a means of providing this per-co-routine stack [3].

“Stackless” implementations require a bit more trickery and typically require support within the language, as is the case with the C++20 co-routine implementation. These implementations can be more memory-efficient, but can also have drawbacks in terms of performance (such as heap usage). Since a stackless implementation either copies memory to/from the stack or sets up guard regions around calls to ensure that enough “stack” memory is available, it can cause difficulties with interop (Go 1.3 abandoned its segmented-stack approach for similar reasons [2]).

Both “stackful” and “stackless” methods of implementing co-routines have pros and cons [2][3][4].

Given the difficulties of “stackless” co-routine implementations and their need for compiler support, our C++ support focuses on a “stackful” implementation using fibers. Fibers are effectively a thread stack that is cooperatively executed in user mode.
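
As a concrete illustration of the “stackful” idea, the raw Win32 fiber API below shows a stack and register set being cooperatively switched in and out of a thread entirely in user mode. This is only a minimal sketch of the primitive involved, not how carb.tasking itself is implemented:

    #include <windows.h>
    #include <cstdio>

    static LPVOID g_mainFiber; // the fiber representing the original thread

    static VOID CALLBACK fiberProc(LPVOID)
    {
        std::printf("running on the fiber's own stack\n");
        // A fiber must not return from its start routine; cooperatively hand
        // control back to the thread's original context instead.
        SwitchToFiber(g_mainFiber);
    }

    int main()
    {
        // Turn the current thread into a fiber so it has a context to switch back to.
        g_mainFiber = ConvertThreadToFiber(nullptr);

        // Creating a fiber is essentially just allocating a stack; no new
        // schedulable entity is made known to the Operating System.
        LPVOID fiber = CreateFiber(64 * 1024, fiberProc, nullptr);

        SwitchToFiber(fiber); // cooperative, user-mode "context switch"
        std::printf("back on the main fiber\n");

        DeleteFiber(fiber);
        return 0;
    }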

Using Threads as Parallelized Co-Routines?

Theoretically speaking, threads could be used (or thought of) as co-routines: using synchronization primitives, threads can be suspended and resumed in the manner co-routines require. However, there are several downsides to this idea.

Let’s assume that we want to multiplex a very large number M of ready co-routines (as threads) on a relatively small number N of CPUs. If we immediately start a thread for each of the M co-routines, we pay a significant startup cost for each thread, and we overwhelm the Operating System scheduler with many more threads than it can reasonably run. Assuming that every co-routine has equal priority, the Operating System will likely context-switch between the threads to ensure that they all make roughly equal progress; these context switches incur a heavy cost as well.

Instead, let’s start up only N threads, the same number of CPUs that we have available. There is still a heavy cost to starting threads, but at least we have not overloaded the Operating System scheduler. When one of these co-routine threads needs to suspend (wait), we leave the CPU that was running it idle unless we can tell our system to start up another thread. Let’s assume that we can do that with a special function that tells our system that it can start another co-routine. What happens now when the new co-routine is running but the original co-routine’s wait condition resolves and it is resumed? Unless we keep it suspended somehow, we again run the risk of over-committing our CPU resources and overwhelming the scheduler.

Using Fibers as Parallelized Co-Routines

While fibers are like threads in that they possess a stack, they must be explicitly started, suspended and resumed by a user-mode scheduler. Creating a fiber is merely allocating some memory for the stack; no system calls need be involved. When a thread is created, the operating system must be told about it, which means system calls that transition into the kernel. In other words, it is much more costly in terms of time to create a thread. Fibers can also be reused immediately after a task has completed; threads cannot be reused in the same manner since they require kernel tracking.

Assume, like above, that we want to multiplex a very large number M of ready co-routines (as fibers) on a relatively small number N of CPUs. We can simply start the same number N of worker threads as we have CPUs. Our job now becomes trying to keep those worker threads as busy as possible. Each thread can look at a global list of co-routines (fibers) that are ready to be executed, take the highest priority one and run it until it either suspends or completes. We only ever have N threads running on N CPUs, so we do not risk overburdening the Operating System scheduler. Instead, our application makes progress with the most throughput possible.
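
A greatly simplified sketch of such a worker loop might look like the following. All types and functions here are hypothetical; the real carb.tasking scheduler is considerably more sophisticated, with priorities and per-group queues:

    #include <condition_variable>
    #include <deque>
    #include <mutex>

    struct Fiber; // an allocated stack plus saved registers (see above)
    enum class RunResult { eCompleted, eSuspended };
    RunResult runFiberUntilSuspendOrComplete(Fiber* f); // hypothetical
    void recycleFiber(Fiber* f); // hypothetical: return the stack to a pool

    std::mutex g_mutex;
    std::condition_variable g_cv;
    std::deque<Fiber*> g_ready; // fibers ready to start or resume, kept highest-priority-first
                                // (code that pushes to g_ready calls g_cv.notify_one())

    // Each of the N worker threads (one per CPU) runs this loop.
    void workerLoop()
    {
        for (;;)
        {
            Fiber* fiber;
            {
                std::unique_lock<std::mutex> lock(g_mutex);
                // Block (a system call) only when there is no work at all.
                g_cv.wait(lock, [] { return !g_ready.empty(); });
                fiber = g_ready.front();
                g_ready.pop_front();
            }
            // Run the co-routine until it completes or cooperatively suspends.
            // If it suspends, whatever it is waiting on pushes it back onto
            // g_ready when it becomes runnable again (possibly on another thread).
            if (runFiberUntilSuspendOrComplete(fiber) == RunResult::eCompleted)
                recycleFiber(fiber);
        }
    }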

This method has several advantages over strictly using threads as co-routines:

  • Thread startup and shutdown time can be elided.

  • It is impossible to overwhelm the Operating System scheduler.

  • When a co-routine needs to yield, we immediately switch to other work for our application.

  • Co-routines cannot preempt each other, which would incur unnecessary context-switch costs.

  • System calls and kernel transitions happen only when a thread runs out of work and needs to wait.

There are a few things to be aware of though:

  • Thread-specific data and thread_local variables do not work as expected since a co-routine can switch threads.

  • Synchronization primitives that are tied to a thread (such as a mutex or shared_mutex) may result in undefined behavior since a co-routine can switch threads.
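
For example, the following sketch illustrates both hazards; fiberAwareWait() is a hypothetical stand-in for any fiber-aware wait:

    #include <mutex>

    void fiberAwareWait(); // hypothetical: suspends this task; it may be
                           // resumed later on a DIFFERENT worker thread

    thread_local int t_scratch; // thread-specific, NOT task-specific
    std::mutex g_mutex;

    void taskFunction()
    {
        t_scratch = 42;
        g_mutex.lock();

        fiberAwareWait();

        // t_scratch may no longer be 42; we could now be reading a different
        // thread's copy of the variable.
        // Unlocking here is undefined behavior if the task resumed on a
        // thread other than the one that called lock().
        g_mutex.unlock();
    }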

Any parallelized co-routine system requires some means of telling the scheduler to run something else. In the thread case, this means telling the system to start an additional thread (though this can overwhelm the Operating System); in the fiber case, this means telling the current thread to switch to a different task.

History

Game development has often pushed the envelope in terms of performance. NVIDIA projects (including Omniverse) tend to be game-industry-adjacent and often hold to the same tenets in terms of performance. The original architects of carb.tasking came from the game industry and had used parallelized, concurrent co-routine-based systems successfully.

Other architects have used non-co-routine task systems and experienced difficulty promoting co-routine-like behavior with them.

In March 2015, a lead programmer from Naughty Dog gave a talk on using fibers to port their PS3 engine to the PS4 for use with The Last of Us: Remastered [1]. In this talk, the engineer describes their PS3 engine as largely single-threaded, using the PS3’s SPUs for specific tasks. The PS4 has 8 CPU cores and more memory, and their target was 60 FPS. Creating the framework of the fiber-based task system was very quick because tasks could be moved to it over time. Within three months they were able to fully convert the game and render processes to the fiber-based task system, improving from sub-30 FPS to 60 FPS.

This talk greatly encouraged the carb.tasking architects to develop a task system utilizing fibers with a single simple Counter struct for synchronization, similar to the Naughty Dog engine, which also achieved almost all synchronization with a similar Counter.

The Omniverse project has increasingly adopted co-routines in Python, and their benefits are well known. Ideally, these same benefits could be realized in C++ as well. Over the following years, carb.tasking has matured to the point where co-routines are not only supported, they are encouraged. Further changes to carb.tasking have expanded the types of synchronization primitives available and made every task a Promise that provides a Future based on the return value of the task function, promoting generator-like capabilities.

It is very easy to go from a job-based task system to a co-routine-based task system: a job is effectively a co-routine that never suspends. Going the other way, from a co-routine-based task system to a job-based task system, presents significant difficulties. Co-routines provide a means to perform asynchronous operations that are free to suspend and resume at any point; while one is suspended, the thread executing it is free to start or resume another co-routine. Without co-routines, a job may still wait, but it continues to block the thread it is assigned to until the wait completes.

Timings

A common benchmark for co-routine systems is the skynet benchmark. This benchmark starts 10 tasks, each of which spawns 10 more tasks, and so on until 1,000,000 tasks are reached. In lieu of co-routines, this can theoretically be accomplished with threads, though most operating systems balk at the idea of 1,000,000 threads.
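
For reference, the shape of the benchmark looks roughly like the following, expressed here with std::launch::async so that each task gets its own thread. As the results below show, this formulation may exhaust resources rather than complete at full scale:

    #include <cstdint>
    #include <future>
    #include <iostream>
    #include <vector>

    // Each task either returns its ordinal (a leaf) or spawns 10 children and
    // sums their results. skynet(0, 1000000, 10) creates 1,000,000 leaf tasks
    // and should produce 499999500000.
    int64_t skynet(int64_t num, int64_t size, int64_t div)
    {
        if (size == 1)
            return num;
        std::vector<std::future<int64_t>> children;
        children.reserve(size_t(div));
        for (int64_t i = 0; i < div; ++i)
            children.push_back(
                std::async(std::launch::async, skynet, num + i * (size / div), size / div, div));
        int64_t sum = 0;
        for (auto& child : children)
            sum += child.get();
        return sum;
    }

    int main()
    {
        std::cout << skynet(0, 1000000, 10) << std::endl;
    }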

carb.tasking benchmarks are reported in unit tests: “Tasking Timing”, “Async Times” and “Skynet test”.

Benchmarks on an AMD Threadripper PRO 5975WX 3.6 GHz running Windows 11 21H2

Skynet test results:

  • Using Future objects returned from addTask: 0.0163576 sec

  • Using TaskGroup objects gathering multiple tasks: 0.0176851 sec

  • Using Counter objects gathering multiple tasks: 0.03792142 sec

  • Using applyRange: 0.00622772 sec

  • Using std::async: Did not complete due to a maximum of 768 concurrent tasks (hang).

  • Using std::thread: Did not complete in 15 minutes. Spawned over 600,000 threads and used over 22 GB of RAM, making the system unresponsive.

  • Using TBB task_group*: Runs recursively, not asynchronously. If a minimal 1 ns sleep is introduced to encourage asynchronicity: 8.98 sec.

carb.tasking fiber context switch time: ~40 ns

The following table lists startup times for various asynchronous methods. This is effectively the time difference between a block of code that initiates the asynchronous call and the asynchronous call starting.

Startup times (nanoseconds)

    Test                   Worst     Best   Average
    std::async           176,151    3,408     8,426
    std::thread          152,370   52,650    78,007
    IThreadPool           47,819      452     2,179
    addTask               23,443      160     1,493
    TBB task_group::run  110,136*    280*    1,742*

The following table lists result times for various asynchronous methods. For a thread, this is the time to join a thread when it has completed. For std::async and carb.tasking it is the time to read the value from the future.

Result times (nanoseconds)

    Test                   Worst     Best   Average
    std::async            22,871      549     2,797
    std::thread          112,305   19,927    30,997
    IThreadPool           23,944      280     4,928
    addTask               11,095       80       296
    TBB task_group::run  285,189*     65*    1,136*

* TBB can try to run tasks synchronously. These are asynchronous times.

Benchmarks on an Intel i9-7900X @ 3.30GHz running Ubuntu 22.04.2 LTS

Skynet test results:

  • Using Future objects returned from addTask: 0.0334191 sec

  • Using TaskGroup objects gathering multiple tasks: 0.0482351 sec

  • Using Counter objects gathering multiple tasks: 0.06405198 sec

  • Using applyRange: 0.0132458 sec

  • Using std::async: Did not complete after 10+ minutes with ~20 GB of RAM used; the system remained responsive.

  • Using std::thread: Remarkably, completed in 25.5 seconds.

  • Using TBB task_group*: Runs recursively, not asynchronously. If a minimal 1 ns sleep is introduced to encourage asynchronicity: 0.308597 sec.

carb.tasking fiber context switch time: ~16 ns

The following table lists startup times for various asynchronous methods. This is effectively the time difference between a block of code that initiates the asynchronous call and the asynchronous call starting.

Startup times (nanoseconds)

    Test                   Worst     Best   Average
    std::async            58,186   15,171    18,668
    std::thread           22,888   14,710    16,311
    IThreadPool           10,022      422     1,136
    addTask               18,181      346       893
    TBB task_group::run   99,648*    833*    3,661*

The following table lists result times for various asynchronous methods. For a thread, this is the time to join a thread when it has completed. For std::async and carb.tasking it is the time to read the value from the future.

Result times (nanoseconds)

    Test                   Worst     Best   Average
    std::async            33,918    8,815     9,998
    std::thread            8,426    6,546     6,975
    IThreadPool            7,202      978     2,168
    addTask                7,179       74       467
    TBB task_group::run   18,300*  1,510*    2,446*

* TBB can try to run tasks synchronously. These are asynchronous times.

Frequently Asked Questions

Why not use C++20 co-routines?

Carbonite’s minimum spec is C++14 and it currently builds with C++17. Initial reviews of C++20 co-routines have also expressed discontent with their design. It should also be pointed out that C++20 co-routines achieve multi-tasking, but still require a thread scheduler to achieve multi-processing, a major goal of carb.tasking.

How about using stackless co-routines?

See the “Stackful” vs. “Stackless” discussion above.

What other languages use co-routines?

As mentioned above: C++20, Python, Go, Rust, Lua and JavaScript, to name a few. A full list is available on the Wikipedia page.

Can I use C++ standard synchronization primitives?

To a certain degree, yes. If a carb.tasking-specific wait (i.e. using one of the carb.tasking synchronization primitives) occurs while a standard mutex is held, undefined behavior can occur: the task may resume on a different thread, and unlocking a mutex from a thread other than the one that locked it is undefined behavior. But there are many cases where standard (i.e. non-fiber-aware) synchronization primitives work just fine.

However, be aware that waiting (in a non-fiber-aware way) for a significant period of time will reduce system throughput. If the system is not busy, this is probably fine, but if the system is fully saturated with work, waiting in this way effectively blocks one of the N worker threads in the M-on-N example above. In general, if there is any real possibility of waiting, it is better to use a carb.tasking-provided synchronization primitive so that the system can perform other work while waiting.
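
As a rule of thumb, the following sketch shows the distinction. The fiber-aware primitive at the end is hypothetical and stands in for whatever carb.tasking provides:

    #include <mutex>
    #include <vector>

    std::mutex g_shortLock; // standard, non-fiber-aware
    std::vector<int> g_results;

    void taskFunction(int value)
    {
        // Fine: a short, bounded critical section with no fiber-aware wait
        // while the lock is held.
        {
            std::lock_guard<std::mutex> guard(g_shortLock);
            g_results.push_back(value);
        }

        // Risky: any long, non-fiber-aware wait here would tie up one of the
        // N worker threads. Prefer a fiber-aware primitive so the worker
        // thread can run other tasks while this task waits, e.g.:
        // g_fiberAwareEvent.wait(); // hypothetical fiber-aware primitive
    }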

Can I perform I/O in co-routines?

carb.tasking is centered around CPU-bound tasks. Waiting on I/O is not currently supported within any of the fiber-aware means provided by carb.tasking. Waiting on I/O should therefore be considered similarly to the previous question.

That said, carb.tasking does provide means for a co-routine to suspend itself. The I/O wait could then be accomplished in another thread (or std::async) and upon completion could then resume the task.
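
A sketch of that pattern follows; currentTask(), suspendCurrentTask() and resumeTask() are hypothetical stand-ins for the suspend/resume facilities, not actual carb.tasking function names:

    #include <cstddef>
    #include <thread>

    // Hypothetical stand-ins for the tasking system's suspend/resume facilities.
    struct TaskHandle;
    TaskHandle* currentTask();         // identifies the running task
    void suspendCurrentTask();         // cooperatively yields this task's fiber
    void resumeTask(TaskHandle* task); // marks the task ready to run again

    void readFileBlocking(const char* path, void* buffer, std::size_t size);

    void coroutineThatNeedsIo(void* buffer, std::size_t size)
    {
        TaskHandle* self = currentTask();

        // Hand the blocking read to a plain thread so no worker thread blocks...
        std::thread([=] {
            readFileBlocking("/tmp/example.bin", buffer, size);
            resumeTask(self); // ...and mark this task runnable once the read is done.
        }).detach();

        // A real implementation must tolerate resumeTask() arriving before
        // the suspend below actually takes effect.
        suspendCurrentTask(); // the worker thread is free to run other tasks
        // Execution continues here, possibly on a different worker thread.
    }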

For example, carb.datasource-file.plugin uses Overlapped I/O on Windows and Asynchronous I/O on Linux if a synchronous request is made from a carb.tasking task.

Can I run out of fibers?

Like any resource, the system has a finite amount. For something like threads, the limit is based on the operating system, hardware, available memory, etc. For carb.tasking fibers, the limit is given by kMaxFibers; however, carb.tasking can be configured with a lower limit specified to changeParameters.

Tasks (or co-routines) are only assigned a fiber when necessary, such as when they are selected to start. If the system fiber pool is exhausted, a message is logged and the system waits until a fiber becomes available. Since the scheduler always chooses resumable co-routines first, this essentially means that no tasks can be resumed. It is possible for too many co-routines to cause a carb.tasking deadlock if the fiber pool is exhausted, though the limit is high enough that this should be very uncommon.

How do I debug carb.tasking?

See the Debugging page.

How should I understand the scheduling of fibers?

In general, tasks are executed in first-in-first-out order, though given the multi-processing nature of carb.tasking this can vary slightly. Priority is used to select the task: the highest-priority tasks are always executed first, and resuming tasks are always executed before starting new tasks.

If a task will wait indefinitely on another task (i.e. a Future representing a task or the TaskContext for a task, not a TaskGroup) and the dependent task is ready to run, the current task will attempt to execute the task directly instead of waiting for it to be scheduled. This helps to speed completion along and avoid priority inversion.

The system also prioritizes “parallel-for” work (i.e. applyRange) over everything else as it must be completed synchronously.

To promote lower contention and greater throughput on machines with large numbers of CPUs, the worker threads are partitioned into smaller thread groups. When a new task is given to carb.tasking, it is assigned to one of these thread groups at random. When a thread group runs out of work to do, it can steal tasks from other thread groups.

What is the overhead of using fibers?

Unlike threads used as co-routines, fibers can be reused indefinitely. The carb.tasking system allocates fibers on demand and retains them until shutdown, allowing future tasks to use them.

The fiber context switch time is recorded above and is very fast (~40 ns on Windows, ~16 ns on Linux), much faster than entering the Operating System scheduler and switching threads.

How will this mix with standard threads, or another model like TBB?

It depends completely on how many CPU resources are available, how much work carb.tasking is given to do, how many worker threads carb.tasking has, etc. Ideally an application is designed with one system for performing CPU-intensive work, like carb.tasking, and every job request is passed to carb.tasking. Having multiple systems with worker threads equal to all CPU resources available can end up over-committing the CPU resources and burdening the Operating System scheduler. Conversely, having a balance of CPU threads (say 80% for carb.tasking and 20% for TBB) runs the risk of only getting 80% of the machine’s theoretical maximum throughput at times where carb.tasking is fully loaded and TBB has no work to do.

Conclusion

carb.tasking is designed to provide a feature that other tasking libraries do not: co-routines. Any method of providing co-routines in a parallel environment requires some means of notifying the co-routine scheduler when tasks are waiting or ready to run; without such notification, that information is only available within the kernel.

However, carb.tasking is quite competitive in terms of executing even simple non-co-routine tasks.

Footnotes

[1] Parallelizing the Naughty Dog Engine Using Fibers / Christian Gyrling / March 2015

[2] Fibers under the magnifying glass / Gor Nishanov / November 20, 2018

[3] Response to “Fibers under the magnifying glass” / Nat Goodspeed & Oliver Kowalke / January 6, 2019

[4] Response to response to “Fibers under the magnifying glass” / Gor Nishanov / March 8, 2019