In the Linux kernel, the following vulnerability has been resolved:
blk-iolatency: Fix inflight count imbalances and IO hangs on offline
iolatency needs to track the number of inflight IOs per cgroup. As this tracking can be expensive, it is disabled when no cgroup has iolatency configured for the device. To ensure that the inflight counters stay balanced, iolatency_set_limit() freezes the request_queue while manipulating the enabled counter, which ensures that no IO is in flight and thus all counters are zero.
Unfortunately, iolatency_set_limit() isn't the only place where the enabled counter is manipulated. iolatency_pd_offline() can also decrement the counter and trigger disabling. As this disabling happens without freezing the queue, it can easily happen while some IOs are in flight and thus leak the counts.
This can be easily demonstrated by turning on iolatency on an empty cgroup while IOs are in flight in other cgroups and then removing the cgroup. Note that iolatency must not be enabled anywhere else in the system, so that removing the cgroup disables iolatency for the whole device.
The following keeps flipping iolatency on and off on sda:
  echo +io > /sys/fs/cgroup/cgroup.subtree_control
  while true; do
      mkdir -p /sys/fs/cgroup/test
      echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
      sleep 1
      rmdir /sys/fs/cgroup/test
      sleep 1
  done
and there's concurrent fio generating direct rand reads:
  fio --name test --filename=/dev/sda --direct=1 --rw=randread \
      --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k
while monitoring with the following drgn script:
  import time
  from drgn import container_of
  from drgn.helpers.linux import *

  while True:
      # Walk every cgroup's block-cgroup state, then each per-device blkg.
      for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
          for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
              blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
              pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
              if pd.value_() == 0:
                  continue
              iolat = container_of(pd, 'struct iolatency_grp', 'pd')
              inflight = iolat.rq_wait.inflight.counter.value_()
              # Report only groups with a non-zero inflight count.
              if inflight:
                  print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                        f'{cgroup_path(css.cgroup).decode("utf-8")}')
      time.sleep(1)
The monitoring output looks like the following:
  inflight=1 sda /user.slice
  inflight=1 sda /user.slice
  ...
  inflight=14 sda /user.slice
  inflight=13 sda /user.slice
  inflight=17 sda /user.slice
  inflight=15 sda /user.slice
  inflight=18 sda /user.slice
  inflight=17 sda /user.slice
  inflight=20 sda /user.slice
  inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
  inflight=19 sda /user.slice
  inflight=19 sda /user.slice
If a cgroup with a stuck inflight count ends up getting throttled, the throttled IOs will never get issued, as there is no completion event to wake them up, leading to an indefinite hang.
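The leak described above can be illustrated with a toy model. This is only a sketch, not kernel code: ToyCgroup, ToyDevice and both helper functions are hypothetical names, and the model assumes that both submission and completion skip the inflight accounting whenever tracking is disabled.

```python
class ToyCgroup:
    """Toy stand-in for one cgroup's per-device iolatency state."""

    def __init__(self):
        self.inflight = 0


class ToyDevice:
    """Toy stand-in for a device whose inflight tracking can be toggled."""

    def __init__(self):
        self.enabled = False
        self.group = ToyCgroup()

    def submit(self):
        # Assumption: a submission is only counted while tracking is enabled.
        if self.enabled:
            self.group.inflight += 1

    def complete(self):
        # Assumption: a completion is only accounted while tracking is enabled.
        if self.enabled:
            self.group.inflight -= 1


def disable_without_freeze(n_ios):
    """Disable tracking while n_ios are in flight; return the leftover count."""
    dev = ToyDevice()
    dev.enabled = True
    for _ in range(n_ios):
        dev.submit()
    dev.enabled = False      # disabled while IOs are still in flight
    for _ in range(n_ios):
        dev.complete()       # these completions are no longer accounted
    return dev.group.inflight


def disable_with_freeze(n_ios):
    """Drain all IOs (the effect of a freeze) before disabling tracking."""
    dev = ToyDevice()
    dev.enabled = True
    for _ in range(n_ios):
        dev.submit()
    for _ in range(n_ios):
        dev.complete()       # drained first, so every IO is accounted
    dev.enabled = False
    return dev.group.inflight
```

In this model, disable_without_freeze(19) leaves the counter stuck at 19 even though every IO has completed, while disable_with_freeze(19) returns to 0.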
This patch fixes the bug by unifying enable handling into a work item which is automatically kicked off from iolatency_set_min_lat_nsec(), which is called from both the iolatency_set_limit() and iolatency_pd_offline() paths. Punting to a work item is necessary as iolatency_pd_offline() is called under spinlocks while freezing a request_queue requires a sleepable context.
This also simplifies the code, reducing LOC (not counting comments), and avoids the unnecessary freezes which were happening whenever a cgroup's latency target was newly set or cleared.
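The deferral pattern can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel implementation: ToyIolatency and its members are hypothetical names, a Python thread fed by a queue stands in for the work item, and the sketch assumes that work is queued only when the count of configured groups moves between zero and non-zero.

```python
import queue
import threading


class ToyIolatency:
    """Toy model: callers only count and queue; a worker does the slow part."""

    def __init__(self):
        self.enable_cnt = 0    # number of groups with a latency target
        self.enabled = False   # whether inflight tracking is on
        self.freezes = 0       # how many times the worker "froze" the queue
        self._work = queue.Queue()
        threading.Thread(target=self._enable_worker, daemon=True).start()

    def set_min_lat(self, old_target, new_target):
        # Safe to call from a context that must not sleep: it only updates
        # a counter and queues work, and never performs the freeze itself.
        if not old_target and new_target:
            self.enable_cnt += 1
            if self.enable_cnt == 1:
                self._work.put(None)
        elif old_target and not new_target:
            self.enable_cnt -= 1
            if self.enable_cnt == 0:
                self._work.put(None)

    def _enable_worker(self):
        # Runs in a context that may sleep, so freezing is allowed here.
        while True:
            self._work.get()
            self.freezes += 1  # stand-in for freezing and unfreezing the queue
            self.enabled = self.enable_cnt > 0
            self._work.task_done()

    def flush(self):
        # Wait until all queued work has run.
        self._work.join()
```

A short usage example: after dev = ToyIolatency(), calling dev.set_min_lat(0, 100000) followed by dev.flush() turns tracking on with a single freeze. Setting a target on a second group does not queue further work in this sketch, and only clearing the last target triggers the second freeze that turns tracking off.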