Weighted Fair Task Queue #4960
thejohnfreeman
This document describes the design of a cooperative task scheduler with priority scheduling modeled after the weighted fair queueing (WFQ) network scheduling algorithm. I call it Weighted Fair Task Queue (WFTQ).
Motivation
rippled has a task scheduler called JobQueue. As a low-level component, it needs to be one of the first to come out of rippled and move to libxrpl as part of our modularization effort. This transfer gives us an opportunity to rethink its design and make improvements.

I believe the design of WFTQ is better than JobQueue in a few ways:

- JobQueue is tightly coupled to its notion of JobType/JobTypes, which are named after operations in rippled. WFTQ is suitable for applications other than rippled that want a priority task scheduler.
- JobTypes are how JobQueue assigns priorities. With WFTQ, the set of priorities can be determined at runtime, and dynamically expanded or shrunk.
- JobQueue will not simultaneously run more than a fixed number of each JobType, even if there are idle threads. WFTQ does not leave threads idle while tasks are waiting.
- JobQueue always runs the highest priority task next, which can lead to starvation if callers are not careful. Many job types have no limit on simultaneous execution. JobQueue requires callers to share the responsibility of prioritization by making sure they do not oversubscribe the queue with these JobTypes. With WFTQ, starved tasks climb in priority until they reach the front of the queue.

Beyond these advantages, the transition is a good time to add scheduler support for thread-local globals and to prepare for std::execution.

Lastly, JobQueue has a few warts that should be removed:

- We measure wait time and execution time in two places:
  - JobQueue::processTask(...) and LoadEvent::start()
  - JobQueue::processTask(...) and LoadEvent::stop()
- Workers is the thread pool underlying JobQueue. It supports changing the thread pool size and pausing and waking threads, but these features are never used. In fact, threads are only "paused" when the thread pool is being stopped, and paused threads are never woken.

Design
WFQ¹ is an algorithm for scheduling packets that must share a wire for transmission. Packets are grouped into distinct sequences called flows. Each flow represents traffic from a common source, e.g. an application.
When a flow submits a packet to the scheduler we say that the packet arrives at the scheduler. Packets from different flows arrive at the scheduler in an unpredictable order. It is the scheduler's job to decide how to order these packets for transmission. With WFQ, each flow is assigned a weight representing its priority. Flows with heavier weight have higher priority. The scheduler chooses a transmission order such that, given perpetually non-empty flows, each flow's bandwidth consumption relative to the total available bandwidth is proportional to its weight relative to the total weight of all flows.
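In symbols (the notation here is mine, not taken from the WFQ literature): if flow i has weight w_i and the total available bandwidth is R, then over any interval in which every flow stays non-empty, flow i receives bandwidth

$$
r_i = \frac{w_i}{\sum_j w_j} \, R
$$

For example, two flows with weights 2 and 1 receive two thirds and one third of the wire, respectively.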
WFQ works by assigning to each packet a virtual finish time that factors in the virtual finish time of the previous packet in its flow, the arrival time of the packet, the flow's weight, and the size of the packet, which is directly proportional to the bandwidth it will consume. After a packet finishes transmission, the next packet chosen is the one with the earliest finish time.
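A common way to write this down (again, notation mine): if the k-th packet of flow i has size L_i^k and arrives when the virtual clock reads V(a_i^k), its virtual finish time is

$$
F_i^k = \max\!\left(F_i^{k-1},\; V\!\left(a_i^k\right)\right) + \frac{L_i^k}{w_i}
$$

and the scheduler always transmits the queued packet with the smallest F.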
A cooperative task scheduler faces a slightly different problem. First, the resource in contention is not bandwidth, but execution. Both resources are rates: each packet (or task) in a flow monopolizes the resource for some amount of time, and that time is the packet's (or task's) consumption, corresponding to its size. For packets, the resource is bandwidth, measured in bytes per second, so consumption corresponds to bytes, which is the same way we measure packet size. For tasks, the resource is execution, measured in cycles per second, so consumption corresponds to cycles, which should be how we think of task size.
Second, tasks are functions, and a function's size (in cycles) cannot be measured until after it has consumed the resource (execution). In other words, unlike packets, tasks cannot be measured a priori. While WFQ can make its scheduling decisions by projecting forward in time, a cooperative task scheduler, where "cooperative" means it does not have the power to pre-empt tasks, does not have that luxury.
The intuition underlying WFTQ is that, at each moment it needs to select the next task for execution, it looks backward in time, within a limited duration, and tries to correct for the greatest disparity between actual and expected (weighted) resource consumption per flow. I will explain.
Within a flow, tasks are sorted in first-in, first-out (FIFO) order. The set of non-empty flows is sorted in this order:
The recent consumption of a flow is the sum of execution time consumed by that flow's tasks within a recent duration of time called the window. For example, a flow with 3 tasks that have all been executing for the last 10 seconds has a recent consumption over a 5 second window of 3 * 5 = 15 seconds. Each flow has a weight. The weighted recent consumption (WRC) of a flow is its recent consumption divided by its weight.
In a perfectly fair execution, all never-empty flows have an equal WRC. The flow with the least WRC is the most starved and highest priority.
The minimum WRC is zero. It is possible, and may be quite common, for flows to have zero WRC. It happens whenever a flow with infrequent tasks queues one after not executing any for an entire window.
Ties are first broken by taking the flow with the greatest weighted wait time, which is the wait time of its first task multiplied by its weight. Ties are then broken by choosing the flow with the greatest weight. Remaining ties are broken indiscriminately.
For each measure, "near misses" can be treated as ties by considering all flows within some epsilon of the first to be tied.
Details of the implementation are given in the next section. To minimize scheduler overhead, WFTQ does not pay the cost of keeping the flows sorted at all times: it only needs to compare flows when a thread becomes free and multiple tasks are queued. In fact, it never needs a full sort at all; it only needs to know which flow has the highest priority.
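To make the selection concrete, here is a minimal C++ sketch of the ordering and of the "find the minimum, don't sort" selection. All names (Flow, higherPriority, nextFlow) are hypothetical, invented for illustration, and the sketch assumes each flow already tracks its recent consumption within the window and the wait time of its first task:

```cpp
#include <algorithm>  // std::min_element
#include <chrono>
#include <cmath>      // std::abs
#include <vector>

using Clock = std::chrono::steady_clock;
using Duration = Clock::duration;

struct Flow
{
    double weight = 1.0;
    Duration recentConsumption{};  // execution time consumed within the window
    Duration firstTaskWait{};      // how long the oldest queued task has waited

    // Weighted recent consumption: recent consumption divided by weight.
    double wrc() const
    {
        return std::chrono::duration<double>(recentConsumption).count() / weight;
    }

    // Weighted wait time: wait time of the first task multiplied by weight.
    double weightedWait() const
    {
        return std::chrono::duration<double>(firstTaskWait).count() * weight;
    }
};

// True if flow a should run before flow b: least WRC first, ties broken by
// greatest weighted wait time, then by greatest weight. Differences within
// epsilon are (loosely) treated as ties.
bool
higherPriority(Flow const& a, Flow const& b, double epsilon = 1e-6)
{
    if (std::abs(a.wrc() - b.wrc()) > epsilon)
        return a.wrc() < b.wrc();
    if (std::abs(a.weightedWait() - b.weightedWait()) > epsilon)
        return a.weightedWait() > b.weightedWait();
    return a.weight > b.weight;
}

// When a thread becomes free, the scheduler does not sort the flows; it only
// scans the non-empty flows for the single highest-priority one.
Flow*
nextFlow(std::vector<Flow*> const& nonEmptyFlows)
{
    if (nonEmptyFlows.empty())
        return nullptr;
    return *std::min_element(
        nonEmptyFlows.begin(),
        nonEmptyFlows.end(),
        [](Flow const* a, Flow const* b) { return higherPriority(*a, *b); });
}
```

The key point is the std::min_element call: choosing the next flow is a single linear scan over the non-empty flows, not a sort.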
Implementation
I'm still converting this section from outline to prose. I will update this comment when it is ready. I wanted to get the discussion going first.
Footnotes
1. The following paragraphs try to concisely explain WFQ, but if you have trouble following, I have found the lesson from the University of Washington to be very good. WFQ is covered at the end of that video, but the rest of the video provides the context necessary to understand it.