In the Linux kernel, the following vulnerability has been resolved:
ice: protect XDP configuration with a mutex
The main threat to data consistency in ice_xdp() is a possible asynchronous PF reset. It can be triggered by a user or by the TX timeout handler.
XDP setup and PF reset code access the same resources in the following sections:
* ice_vsi_close() in ice_prepare_for_reset() - already rtnl-locked
* ice_vsi_rebuild() for the PF VSI - not protected
* ice_vsi_open() - already rtnl-locked
With an unfortunate timing, such accesses can result in a crash such as the one below:
[ +1.999878] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 14
[ +2.002992] ice 0000:b1:00.0: Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring 18
[Mar15 18:17] ice 0000:b1:00.0 ens801f0np0: NETDEV WATCHDOG: CPU: 38: transmit queue 14 timed out 80692736 ms
[ +0.000093] ice 0000:b1:00.0 ens801f0np0: tx_timeout: VSI_num: 6, Q 14, NTC: 0x0, HW_HEAD: 0x0, NTU: 0x0, INT: 0x4000001
[ +0.000012] ice 0000:b1:00.0 ens801f0np0: tx_timeout recovery level 1, txqueue 14
[ +0.394718] ice 0000:b1:00.0: PTP reset successful
[ +0.006184] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ +0.000045] #PF: supervisor read access in kernel mode
[ +0.000023] #PF: error_code(0x0000) - not-present page
[ +0.000023] PGD 0 P4D 0
[ +0.000018] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ +0.000023] CPU: 38 PID: 7540 Comm: kworker/38:1 Not tainted 6.8.0-rc7 #1
[ +0.000031] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
[ +0.000036] Workqueue: ice ice_service_task [ice]
[ +0.000183] RIP: 0010:ice_clean_tx_ring+0xa/0xd0 [ice]
[...]
[ +0.000013] Call Trace:
[ +0.000016] <TASK>
[ +0.000014] ? __die+0x1f/0x70
[ +0.000029] ? page_fault_oops+0x171/0x4f0
[ +0.000029] ? schedule+0x3b/0xd0
[ +0.000027] ? exc_page_fault+0x7b/0x180
[ +0.000022] ? asm_exc_page_fault+0x22/0x30
[ +0.000031] ? ice_clean_tx_ring+0xa/0xd0 [ice]
[ +0.000194] ice_free_tx_ring+0xe/0x60 [ice]
[ +0.000186] ice_destroy_xdp_rings+0x157/0x310 [ice]
[ +0.000151] ice_vsi_decfg+0x53/0xe0 [ice]
[ +0.000180] ice_vsi_rebuild+0x239/0x540 [ice]
[ +0.000186] ice_vsi_rebuild_by_type+0x76/0x180 [ice]
[ +0.000145] ice_rebuild+0x18c/0x840 [ice]
[ +0.000145] ? delay_tsc+0x4a/0xc0
[ +0.000022] ? delay_tsc+0x92/0xc0
[ +0.000020] ice_do_reset+0x140/0x180 [ice]
[ +0.000886] ice_service_task+0x404/0x1030 [ice]
[ +0.000824] process_one_work+0x171/0x340
[ +0.000685] worker_thread+0x277/0x3a0
[ +0.000675] ? preempt_count_add+0x6a/0xa0
[ +0.000677] ? _raw_spin_lock_irqsave+0x23/0x50
[ +0.000679] ? __pfx_worker_thread+0x10/0x10
[ +0.000653] kthread+0xf0/0x120
[ +0.000635] ? __pfx_kthread+0x10/0x10
[ +0.000616] ret_from_fork+0x2d/0x50
[ +0.000612] ? __pfx_kthread+0x10/0x10
[ +0.000604] ret_from_fork_asm+0x1b/0x30
[ +0.000604] </TASK>
The previous way of handling this, returning -EBUSY, is not viable, particularly when destroying an AF_XDP socket, because the kernel proceeds with removal anyway.
There is plenty of code between those calls and there is no need to create a large critical section that covers all of them, same as there is no need to protect ice_vsi_rebuild() with rtnl_lock().
Add xdp_state_lock mutex to protect ice_vsi_rebuild() and ice_xdp().
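The locking idea can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the driver code: struct vsi_model, model_xdp_setup() and model_vsi_rebuild() are hypothetical names, and a pthread mutex stands in for the kernel mutex. It shows both paths taking the same lock before touching the shared ring state, so a rebuild never walks rings that a concurrent XDP setup is tearing down.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the VSI; not the driver's structure. */
struct vsi_model {
    pthread_mutex_t xdp_state_lock; /* models the mutex added by the fix */
    int *xdp_rings;                 /* shared state touched by both paths */
    size_t num_xdp_rings;
};

/* Backing storage for the modelled rings (assumption: four rings). */
static int ring_storage[4];

void vsi_model_init(struct vsi_model *vsi)
{
    pthread_mutex_init(&vsi->xdp_state_lock, NULL);
    vsi->xdp_rings = NULL;
    vsi->num_xdp_rings = 0;
}

/* Models the XDP setup path: attach or detach rings under the lock. */
void model_xdp_setup(struct vsi_model *vsi, bool attach)
{
    pthread_mutex_lock(&vsi->xdp_state_lock);
    if (attach) {
        vsi->xdp_rings = ring_storage;
        vsi->num_xdp_rings = 4;
    } else {
        vsi->num_xdp_rings = 0;
        vsi->xdp_rings = NULL;
    }
    pthread_mutex_unlock(&vsi->xdp_state_lock);
}

/* Models the rebuild path: cleans every ring and returns how many it saw. */
size_t model_vsi_rebuild(struct vsi_model *vsi)
{
    size_t cleaned = 0;

    pthread_mutex_lock(&vsi->xdp_state_lock);
    for (size_t i = 0; i < vsi->num_xdp_rings; i++) {
        vsi->xdp_rings[i] = 0; /* safe: pointer and count change only under the lock */
        cleaned++;
    }
    pthread_mutex_unlock(&vsi->xdp_state_lock);
    return cleaned;
}
```

Without the lock, the rebuild loop could read a non-zero ring count and then dereference a ring pointer that the setup path had already cleared, which is the kind of NULL dereference shown in the trace above.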
Leaving unprotected sections in between would result in two states that have to be considered:
1. when the VSI is closed, but not yet rebuilt
2. when the VSI is already rebuilt, but not yet open
The latter case is actually already handled through the !netif_running() case; we just need to adjust flag checking a little. The former one is not as trivial, because a lot of hardware interaction happens between ice_vsi_close() and ice_vsi_rebuild(), which can make adding/deleting rings exit with an error. Luckily, VSI rebuild is pending and can apply the new configuration for us in a managed fashion.
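The handling of the first state can be sketched as a small model. Everything here is an assumption made for illustration: struct vsi_state, the rebuild_pending field, RINGS_PER_PROG and both functions are hypothetical and do not mirror the driver's code. Locking is left out to keep the state logic visible. The point is that while a rebuild is pending, the setup path only records the requested program and leaves ring configuration to the rebuild.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the VSI state relevant to XDP setup. */
struct vsi_state {
    bool rebuild_pending; /* closed by reset, rebuild has not run yet */
    const void *xdp_prog; /* recorded program, NULL if none */
    int xdp_rings;        /* number of configured XDP rings */
};

enum { RINGS_PER_PROG = 4 }; /* arbitrary ring count for the model */

/* Models XDP program setup; returns 0 on success. */
int model_xdp_setup_prog(struct vsi_state *vsi, const void *prog)
{
    vsi->xdp_prog = prog; /* always remember what was requested */

    if (vsi->rebuild_pending)
        return 0; /* do not touch rings now; the rebuild will apply this */

    vsi->xdp_rings = prog ? RINGS_PER_PROG : 0;
    return 0;
}

/* Models the rebuild: applies whatever configuration was recorded. */
void model_rebuild(struct vsi_state *vsi)
{
    vsi->xdp_rings = vsi->xdp_prog ? RINGS_PER_PROG : 0;
    vsi->rebuild_pending = false;
}
```

In this model a setup request that arrives between close and rebuild succeeds immediately instead of failing with an error, and the rings appear only once model_rebuild() has run.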
Therefore, add an additional VSI state flag ICE_VSI_REBUILD_PENDING to indicate that ice_x ---truncated---