In the Linux kernel, the following vulnerability has been resolved:
virtiofs: use pages instead of pointer for kernel direct IO
When trying to insert a 10MB kernel module kept in a virtio-fs with cache disabled, the following warning was reported:
  ------------[ cut here ]------------
  WARNING: CPU: 1 PID: 404 at mm/page_alloc.c:4551 ......
  Modules linked in:
  CPU: 1 PID: 404 Comm: insmod Not tainted 6.9.0-rc5+ #123
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
  RIP: 0010:__alloc_pages+0x2bf/0x380
  ......
  Call Trace:
   <TASK>
   ? __warn+0x8e/0x150
   ? __alloc_pages+0x2bf/0x380
   __kmalloc_large_node+0x86/0x160
   __kmalloc+0x33c/0x480
   virtio_fs_enqueue_req+0x240/0x6d0
   virtio_fs_wake_pending_and_unlock+0x7f/0x190
   queue_request_and_unlock+0x55/0x60
   fuse_simple_request+0x152/0x2b0
   fuse_direct_io+0x5d2/0x8c0
   fuse_file_read_iter+0x121/0x160
   __kernel_read+0x151/0x2d0
   kernel_read+0x45/0x50
   kernel_read_file+0x1a9/0x2a0
   init_module_from_file+0x6a/0xe0
   idempotent_init_module+0x175/0x230
   __x64_sys_finit_module+0x5d/0xb0
   x64_sys_call+0x1c3/0x9e0
   do_syscall_64+0x3d/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   ......
   </TASK>
  ---[ end trace 0000000000000000 ]---
The warning is triggered as follows:
1) The finit_module() syscall handles the module insertion and first invokes kernel_read_file() to read the content of the module.
2) kernel_read_file() allocates a 10MB buffer with vmalloc() and passes it to kernel_read(). kernel_read() constructs a kvec iter with iov_iter_kvec() and passes it to fuse_file_read_iter().
3) virtio-fs disables the cache, so fuse_file_read_iter() invokes fuse_direct_io(). Currently, the maximal read size for a kvec iter is limited only by fc->max_read. For virtio-fs, max_read is UINT_MAX, so fuse_direct_io() doesn't split the 10MB buffer. It saves the address and the size of the 10MB buffer in out_args[0] of a fuse request and passes the fuse request to virtio_fs_wake_pending_and_unlock().
4) virtio_fs_wake_pending_and_unlock() uses virtio_fs_enqueue_req() to queue the request. Because virtiofs needs a DMA-able address, virtio_fs_enqueue_req() uses kmalloc() to allocate a bounce buffer for all fuse args, copies these args into the bounce buffer, and passes the physical address of the bounce buffer to virtiofsd. The total length of these fuse args for the passed fuse request is about 10MB, so copy_args_to_argbuf() invokes kmalloc() with a 10MB size parameter, which triggers the warning in __alloc_pages():
    if (WARN_ON_ONCE_GFP(order > MAX_PAGE_ORDER, gfp))
        return NULL;
5) virtio_fs_enqueue_req() will retry the memory allocation in a kworker, but it won't help, because kmalloc() will always return NULL due to the abnormal size, and finit_module() will hang forever.
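The order check in step 4 can be illustrated with a small userspace sketch. It is not kernel code: the sketch_* names are hypothetical, and it assumes 4 KiB pages and a MAX_PAGE_ORDER of 10 (a common default; the real value depends on architecture and configuration). Under those assumptions, a 10MB request needs an order-12 allocation, which exceeds the limit no matter how often it is retried.

```c
#include <assert.h>
#include <stddef.h>

/* Assumptions for illustration only: 4 KiB pages, MAX_PAGE_ORDER of 10. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_MAX_PAGE_ORDER 10

/* Smallest order such that (1 << order) pages cover `size` bytes. */
static unsigned int sketch_get_order(size_t size)
{
    size_t pages;
    unsigned int order = 0;

    assert(size > 0);
    /* Round the byte count up to whole pages. */
    pages = (size + ((size_t)1 << SKETCH_PAGE_SHIFT) - 1) >> SKETCH_PAGE_SHIFT;
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Nonzero if a physically contiguous allocation of `size` bytes is too large. */
static int sketch_exceeds_max_order(size_t size)
{
    return sketch_get_order(size) > SKETCH_MAX_PAGE_ORDER;
}
```

With these assumptions, sketch_get_order() of 10MB is 12 (2560 pages round up to 4096), while a 4MB request is exactly order 10 and would still pass the check.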
A feasible solution is to limit the value of max_read for virtio-fs, so the length passed to kmalloc() will be limited. However, it would affect the maximal read size for normal reads. A virtio-fs write initiated from the kernel has a similar problem, but currently there is no way to limit fc->max_write in the kernel.
So instead of limiting both the values of max_read and max_write in the kernel, introduce use_pages_for_kvec_io in fuse_conn and set it to true in virtiofs. When use_pages_for_kvec_io is enabled, fuse will use pages instead of a pointer to pass the KVEC_IO data.
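To make "pages instead of a pointer" concrete, the following userspace sketch splits a virtually contiguous buffer into per-page segments. It is only a model of the idea, not the fuse implementation: struct sketch_seg and sketch_split_into_pages() are hypothetical names, a 4 KiB page size is assumed, and the page lookup the kernel would perform for each segment is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration only: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* One page-sized (or smaller) piece of the original buffer. */
struct sketch_seg {
    uintptr_t page_base; /* page-aligned address that contains the piece */
    size_t offset;       /* offset of the piece within that page */
    size_t len;          /* number of bytes of the piece in that page */
};

/*
 * Split [addr, addr + len) into per-page segments.
 * Returns the number of segments written, at most max_segs.
 */
static size_t sketch_split_into_pages(uintptr_t addr, size_t len,
                                      struct sketch_seg *segs, size_t max_segs)
{
    size_t n = 0;

    while (len > 0 && n < max_segs) {
        size_t off = addr & (SKETCH_PAGE_SIZE - 1);
        size_t chunk = SKETCH_PAGE_SIZE - off;

        if (chunk > len)
            chunk = len;
        segs[n].page_base = addr - off;
        segs[n].offset = off;
        segs[n].len = chunk;
        n++;
        addr += chunk;
        len -= chunk;
    }
    return n;
}
```

Because each segment fits within a single page, no large physically contiguous bounce buffer is needed for the data, which is the property the fix relies on.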
After switching to pages for KVEC_IO data, these pages will be used for DMA through virtio-fs. If these pages are backed by vmalloc(), {flush|invalidate}_kernel_vmap_range() are necessary to flush or invalidate the cache before the DMA operation. So add two new fields in fuse_args_pages to record the base address of the vmalloc area and the condition indicating whether invalidation is needed. Perform the flush in fuse_get_user_pages() for write operations and the invalidation in fuse_release_user_pages() for read operations.
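The rule described above can be summarized as a small decision function. This is a userspace model under stated assumptions, not the patch itself: the enum and function names are hypothetical, and the two boolean inputs stand in for "the buffer is vmalloc-backed" and "the request is a write".

```c
#include <assert.h>

/* Hypothetical labels for the cache maintenance step a request needs. */
enum sketch_cache_op {
    SKETCH_CACHE_NONE,                  /* not vmalloc-backed: nothing to do */
    SKETCH_CACHE_FLUSH_ON_GET,          /* write: flush when pages are collected */
    SKETCH_CACHE_INVALIDATE_ON_RELEASE  /* read: invalidate when pages are released */
};

/* Choose the cache operation from the buffer type and the I/O direction. */
static enum sketch_cache_op sketch_cache_op_for(int is_vmalloc, int is_write)
{
    assert(is_vmalloc == 0 || is_vmalloc == 1);
    assert(is_write == 0 || is_write == 1);

    if (!is_vmalloc)
        return SKETCH_CACHE_NONE;
    return is_write ? SKETCH_CACHE_FLUSH_ON_GET
                    : SKETCH_CACHE_INVALIDATE_ON_RELEASE;
}
```

In the kernel, the flush and invalidate cases would correspond to flush_kernel_vmap_range() and invalidate_kernel_vmap_range() respectively; the sketch only captures which of the two applies.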
It may seem necessary to introduce another fie ---truncated---