CVE-2024-53169

Source
https://nvd.nist.gov/vuln/detail/CVE-2024-53169
Import Source
https://storage.googleapis.com/osv-test-cve-osv-conversion/osv-output/CVE-2024-53169.json
JSON Data
https://api.test.osv.dev/v1/vulns/CVE-2024-53169
Related
Published
2024-12-27T14:15:24Z
Modified
2025-01-08T09:58:02.235479Z
Summary
[none]
Details

In the Linux kernel, the following vulnerability has been resolved:

nvme-fabrics: fix kernel crash while shutting down controller

The nvme keep-alive operation, which executes at a periodic interval, could potentially sneak in while a fabric controller is being shut down. This may lead to a race between the fabric controller's admin queue destroy code path (invoked while shutting down the controller) and the hw/hctx queue dispatcher called from the nvme keep-alive async request queuing operation. This race could lead to the kernel crash shown below:

Call Trace:
  autoremove_wake_function+0x0/0xbc (unreliable)
  __blk_mq_sched_dispatch_requests+0x114/0x24c
  blk_mq_sched_dispatch_requests+0x44/0x84
  blk_mq_run_hw_queue+0x140/0x220
  nvme_keep_alive_work+0xc8/0x19c [nvme_core]
  process_one_work+0x200/0x4e0
  worker_thread+0x340/0x504
  kthread+0x138/0x140
  start_kernel_thread+0x14/0x18

While shutting down the fabric controller, if an nvme keep-alive request sneaks in, it would be flushed off. The nvme_keep_alive_end_io function is then invoked to handle the end of the keep-alive operation; it decrements the admin->q_usage_counter, and assuming this was the last/only request in the admin queue, the counter drops to zero. If that happens, the blk-mq destroy queue operation (blk_mq_destroy_queue()), which could potentially be running simultaneously on another CPU (as this is the controller shutdown code path), would make forward progress and delete the admin queue. From that point onward we are not supposed to access the admin queue resources. However, the issue here is that the nvme keep-alive thread running the hw/hctx queue dispatch operation hasn't yet finished its work, so it could still access the admin queue resources even though the admin queue has already been deleted, and that causes the above crash.
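
At its core, this is a usage-counter versus teardown race: the dispatch path drops the admin queue's usage counter before it has actually finished touching the queue, while the shutdown path only waits for that counter to reach zero before deleting the queue. The following is a minimal, self-contained user-space sketch of that pattern; it is not kernel code, and the names fake_queue, dispatcher, destroyer and usage_counter are invented for illustration.

/*
 * Minimal user-space sketch of the race described above.  All names here
 * (fake_queue, dispatcher, destroyer, usage_counter) are invented for
 * illustration; no kernel or nvme APIs are used.
 *
 * The dispatcher drops the queue's usage counter *before* it has finished
 * touching the queue.  The destroyer only waits for the counter to reach
 * zero, so it may tear the queue down while the dispatcher is still using
 * it, which is the same window the keep-alive work hits on controller
 * shutdown.
 */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct fake_queue {
    atomic_int usage_counter;   /* analogous to q_usage_counter            */
    atomic_bool destroyed;      /* set once the "queue" has been deleted   */
};

static struct fake_queue admin_q = { 1, false };

static void *dispatcher(void *arg)
{
    (void)arg;

    /* End-of-request handling: drop the usage counter first ... */
    atomic_fetch_sub(&admin_q.usage_counter, 1);

    /* ... but keep working on the queue afterwards (the bug window). */
    for (int i = 0; i < 1000; i++) {
        if (atomic_load(&admin_q.destroyed)) {
            printf("race hit: dispatcher touched a destroyed queue\n");
            return NULL;
        }
    }
    printf("race not hit this run\n");
    return NULL;
}

static void *destroyer(void *arg)
{
    (void)arg;

    /* Shutdown path: wait for the counter to drain, then delete the queue. */
    while (atomic_load(&admin_q.usage_counter) != 0)
        ;
    atomic_store(&admin_q.destroyed, true);
    return NULL;
}

int main(void)
{
    pthread_t d, k;

    pthread_create(&d, NULL, dispatcher, NULL);
    pthread_create(&k, NULL, destroyer, NULL);
    pthread_join(d, NULL);
    pthread_join(k, NULL);
    return 0;
}

Whether the "race hit" message appears depends on scheduling, which mirrors why the kernel crash is only seen occasionally, when a keep-alive request happens to race with controller shutdown.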

The above kernel crash is a regression caused by changes implemented in commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()"). Ideally we should stop keep-alive before destroying the admin queue and freeing the admin tagset, so that it cannot sneak in during the shutdown operation. However, that commit removed the keep-alive stop operation from the beginning of the controller shutdown code path and moved it under nvme_uninit_ctrl(), which executes very late in the shutdown code path, after the admin queue has been destroyed and its tagset removed. This change created the possibility of keep-alive sneaking in and interfering with the shutdown operation, causing the observed kernel crash.

To fix the observed crash, we decided to move nvme_stop_keep_alive() from nvme_uninit_ctrl() to nvme_remove_admin_tag_set(). This change ensures that we don't make forward progress and delete the admin queue until the keep-alive operation is finished (if it's in-flight) or cancelled, which contains the race condition explained above and hence avoids the crash.

Moving nvme_stop_keep_alive() to nvme_remove_admin_tag_set(), instead of adding it back to the beginning of the controller shutdown code path in nvme_stop_ctrl() (as was the case before commit a54a93d0e359 ("nvme: move stopping keep-alive into nvme_uninit_ctrl()")), helps save one call site of nvme_stop_keep_alive().
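
For illustration, the reordered teardown could look roughly like the sketch below. This is a paraphrase of the fix as described above, not the verbatim upstream patch; the field names ctrl->admin_q, ctrl->fabrics_q and ctrl->admin_tagset follow common nvme core conventions and should be read as assumptions.

/*
 * Sketch of the fixed teardown ordering (not the verbatim upstream patch).
 * The keep-alive work is stopped *before* the admin queue is destroyed and
 * the admin tag set is freed, so it can no longer dispatch against a queue
 * that is going away.
 */
void nvme_remove_admin_tag_set(struct nvme_ctrl *ctrl)
{
	/*
	 * We are about to destroy the admin queue and free its tag set, so
	 * keep-alive must not be running or able to requeue itself anymore.
	 */
	nvme_stop_keep_alive(ctrl);

	blk_mq_destroy_queue(ctrl->admin_q);
	if (ctrl->ops->flags & NVME_F_FABRICS)
		blk_mq_destroy_queue(ctrl->fabrics_q);
	blk_mq_free_tag_set(ctrl->admin_tagset);
}

With keep-alive stopped here, blk_mq_destroy_queue() can no longer race with a keep-alive dispatch, and this placement is what keeps the number of nvme_stop_keep_alive() call sites down, as noted above.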

References

Affected packages

Debian:13 / linux

Package

Name
linux
Purl
pkg:deb/debian/linux?arch=source

Affected ranges

Type
ECOSYSTEM
Events
Introduced
0 (unknown introduced version; all previous versions are affected)
Fixed
6.12.3-1

Affected versions

6.*

6.1.27-1
6.1.37-1
6.1.38-1
6.1.38-2~bpo11+1
6.1.38-2
6.1.38-3
6.1.38-4~bpo11+1
6.1.38-4
6.1.52-1
6.1.55-1~bpo11+1
6.1.55-1
6.1.64-1
6.1.66-1
6.1.67-1
6.1.69-1~bpo11+1
6.1.69-1
6.1.76-1~bpo11+1
6.1.76-1
6.1.82-1
6.1.85-1
6.1.90-1~bpo11+1
6.1.90-1
6.1.94-1~bpo11+1
6.1.94-1
6.1.98-1
6.1.99-1
6.1.106-1
6.1.106-2
6.1.106-3
6.1.112-1
6.1.115-1
6.1.119-1
6.1.123-1
6.3.1-1~exp1
6.3.2-1~exp1
6.3.4-1~exp1
6.3.5-1~exp1
6.3.7-1~bpo12+1
6.3.7-1
6.3.11-1
6.4~rc6-1~exp1
6.4~rc7-1~exp1
6.4.1-1~exp1
6.4.4-1~bpo12+1
6.4.4-1
6.4.4-2
6.4.4-3~bpo12+1
6.4.4-3
6.4.11-1
6.4.13-1
6.5~rc4-1~exp1
6.5~rc6-1~exp1
6.5~rc7-1~exp1
6.5.1-1~exp1
6.5.3-1~bpo12+1
6.5.3-1
6.5.6-1
6.5.8-1
6.5.10-1~bpo12+1
6.5.10-1
6.5.13-1
6.6.3-1~exp1
6.6.4-1~exp1
6.6.7-1~exp1
6.6.8-1
6.6.9-1
6.6.11-1
6.6.13-1~bpo12+1
6.6.13-1
6.6.15-1
6.6.15-2
6.7-1~exp1
6.7.1-1~exp1
6.7.4-1~exp1
6.7.7-1
6.7.9-1
6.7.9-2
6.7.12-1~bpo12+1
6.7.12-1
6.8.9-1
6.8.11-1
6.8.12-1~bpo12+1
6.8.12-1
6.9.2-1~exp1
6.9.7-1~bpo12+1
6.9.7-1
6.9.8-1
6.9.9-1
6.9.10-1~bpo12+1
6.9.10-1
6.9.11-1
6.9.12-1
6.10-1~exp1
6.10.1-1~exp1
6.10.3-1
6.10.4-1
6.10.6-1~bpo12+1
6.10.6-1
6.10.7-1
6.10.9-1
6.10.11-1~bpo12+1
6.10.11-1
6.10.12-1
6.11~rc4-1~exp1
6.11~rc5-1~exp1
6.11-1~exp1
6.11.2-1
6.11.4-1
6.11.5-1~bpo12+1
6.11.5-1
6.11.6-1
6.11.7-1
6.11.9-1
6.11.10-1~bpo12+1
6.11.10-1
6.12~rc6-1~exp1

Ecosystem specific

{
    "urgency": "not yet assigned"
}