CVE-2022-49394

iolatency needs to track the number of inflight IOs per cgroup. As this tracking can be expensive, it is disabled when no cgroup has iolatency configured for the device. To ensure that the inflight counters stay balanced, iolatency_set_limit() freezes the request_queue while manipulating the enabled counter, which ensures that no IO is in flight and thus all counters are zero.

Unfortunately, iolatency_set_limit() isn't the only place where the enabled counter is manipulated. iolatency_pd_offline() can also decrement the counter and trigger disabling. As this disabling happens without freezing the request_queue, it can easily happen while some IOs are in flight and thus leak the counts.
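The imbalance can be illustrated with a toy model. This is only a sketch, not kernel code: the class ToyIolatency, its methods, and leaked_count() are hypothetical names, and it assumes that both the issue and completion paths skip accounting while tracking is disabled. The "frozen" toggle drains outstanding IOs itself, whereas a real freeze waits for them to finish.

```python
class ToyIolatency:
    """Toy model of inflight accounting gated by an 'enabled' flag."""

    def __init__(self):
        self.enabled = False
        self.inflight = 0   # IOs the accounting believes are in flight
        self.actual = 0     # IOs really outstanding on the device

    def issue(self):
        self.actual += 1
        if self.enabled:            # accounting is skipped while disabled
            self.inflight += 1

    def complete(self):
        self.actual -= 1
        if self.enabled:            # ... and so is the matching decrement
            self.inflight -= 1

    def set_enabled_unsafely(self, value):
        # Models the offline path: flip the flag with IOs still outstanding.
        self.enabled = value

    def set_enabled_frozen(self, value):
        # Models the set-limit path: drain every IO before flipping the flag.
        while self.actual:
            self.complete()
        self.enabled = value


def leaked_count(toggle):
    """Issue 3 IOs, disable tracking via `toggle`, then complete everything."""
    lat = ToyIolatency()
    lat.enabled = True
    for _ in range(3):
        lat.issue()
    toggle(lat, False)
    while lat.actual:
        lat.complete()
    return lat.inflight


print(leaked_count(ToyIolatency.set_enabled_unsafely))  # 3: counts leaked
print(leaked_count(ToyIolatency.set_enabled_frozen))    # 0: balanced
```

In the unsafe case the three completions arrive after tracking was switched off, so the counter never returns to zero.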

The leak can be easily demonstrated by turning on iolatency on one empty cgroup while IOs are in flight in other cgroups and then removing the cgroup. Note that iolatency must not be enabled anywhere else in the system, so that removing the cgroup disables iolatency for the whole device.

The following keeps flipping on and off iolatency on sda:

  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done

and there's concurrent fio generating direct rand reads:

  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k

while monitoring with the following drgn script:

  import time

  # Walk every blkcg and print any iolatency group with a nonzero inflight count.
  while True:
      for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
          for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
              blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
              pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
              if pd.value_() == 0:
                  continue
              iolat = container_of(pd, 'struct iolatency_grp', 'pd')
              inflight = iolat.rq_wait.inflight.counter.value_()
              if inflight:
                  print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                        f'{cgroup_path(css.cgroup).decode("utf-8")}')
      time.sleep(1)

The monitoring output looks like the following:

  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice

If a cgroup with a stuck inflight count ends up getting throttled, the throttled IOs will never be issued, as there is no completion event to wake them up, leading to an indefinite hang.

This patch fixes the bug by unifying enable handling into a work item that is automatically kicked off from iolatency_set_min_lat_nsec(), which is called from both the iolatency_set_limit() and iolatency_pd_offline() paths. Punting to a work item is necessary because iolatency_pd_offline() is called under spinlocks, while freezing a request_queue requires a sleepable context.
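The deferral pattern described above can be sketched in the same toy style. This is an illustration under stated assumptions, not the patch itself: ToyIolatencyQueue, set_min_lat(), and run_pending_work() are hypothetical names, a deque stands in for the kernel workqueue, and setting outstanding to zero stands in for a freeze that waits for in-flight IO to finish.

```python
from collections import deque


class ToyIolatencyQueue:
    def __init__(self):
        self.enabled = False   # whether inflight tracking is on
        self.enable_cnt = 0    # cgroups with a nonzero latency target
        self.outstanding = 0   # IOs in flight on the device
        self.freezes = 0       # how many times the queue was frozen
        self.work = deque()    # stand-in for a kernel workqueue

    def set_min_lat(self, old, new):
        """Shared by the set-limit and offline paths; never sleeps."""
        if not old and new:
            self.enable_cnt += 1
            if self.enable_cnt == 1:
                self.work.append(self._enable_work)
        elif old and not new:
            self.enable_cnt -= 1
            if self.enable_cnt == 0:
                self.work.append(self._enable_work)

    def _enable_work(self):
        """Runs later, where sleeping (and so freezing) is allowed."""
        want = self.enable_cnt > 0
        if want == self.enabled:
            return
        self.freezes += 1
        self.outstanding = 0   # freeze: returns once no IO is in flight
        self.enabled = want

    def run_pending_work(self):
        while self.work:
            self.work.popleft()()


q = ToyIolatencyQueue()
q.set_min_lat(0, 100000)       # first target: enable work queued
q.run_pending_work()
q.set_min_lat(0, 50000)        # second cgroup: count changes, no freeze
q.run_pending_work()
q.outstanding = 4
q.set_min_lat(50000, 0)
q.set_min_lat(100000, 0)       # last target cleared, e.g. on offline
still_enabled = q.enabled      # flag is untouched until the work runs
q.run_pending_work()
print(still_enabled, q.enabled, q.freezes, q.outstanding)  # True False 2 0
```

In this model the flag only changes inside the work function, after the queue has drained, and the queue is frozen only when the device-wide state actually flips.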

This also simplifies the code, reducing the line count (comments aside), and avoids the unnecessary freezes that previously happened whenever a cgroup's latency target was newly set or cleared.

Affected packages

Debian:11 / linux

Package

Name: linux
Purl: pkg:deb/debian/linux?arch=source

Affected ranges

Type: ECOSYSTEM
Events:
Introduced: 0 (unknown introduced version; all previous versions are affected)
Fixed: 5.10.127-1

Affected versions

5.*

5.10.46-4

5.10.46-5

5.10.70-1~bpo10+1

5.10.70-1

5.10.84-1

5.10.92-1~bpo10+1

5.10.92-1

5.10.92-2

5.10.103-1~bpo10+1

5.10.103-1

5.10.106-1

5.10.113-1

5.10.120-1~bpo10+1

5.10.120-1

Ecosystem specific

{
    "urgency": "not yet assigned"
}

Debian:12 / linux

Package

Name: linux
Purl: pkg:deb/debian/linux?arch=source

Affected ranges

Type: ECOSYSTEM
Events:
Introduced: 0 (unknown introduced version; all previous versions are affected)
Fixed: 5.18.5-1

Ecosystem specific

{
    "urgency": "not yet assigned"
}

Debian:13 / linux

Package

Name: linux
Purl: pkg:deb/debian/linux?arch=source

Affected ranges

Type: ECOSYSTEM
Events:
Introduced: 0 (unknown introduced version; all previous versions are affected)
Fixed: 5.18.5-1

Ecosystem specific

{
    "urgency": "not yet assigned"
}