In the Linux kernel, the following vulnerability has been resolved:
userfaultfd: fix checks for huge PMDs
Patch series "userfaultfd: fix races around pmd_trans_huge() check", v2.
The pmd_trans_huge() code in mfill_atomic() is wrong in three different ways depending on kernel version:
I decided to write two separate fixes for these (one fix for bugs 1+2, one fix for bug 3), so that the first fix can be backported to kernels affected by bugs 1+2.
This patch (of 2):
This fixes two issues.
I discovered that the following race can occur:
  mfill_atomic                          other thread
  ============                          ============
                                        <zap PMD>
  pmdp_get_lockless() [reads none pmd]
  <bail if transhuge>
  <if none:>
                                        <pagefault creates transhuge zeropage>
    __pte_alloc [no-op]
                                        <zap PMD>
  <bail if pmd_trans_huge(*dst_pmd)>
  BUG_ON(pmd_none(*dst_pmd))
I have experimentally verified this in a kernel with extra mdelay() calls; the BUG_ON(pmd_none(*dst_pmd)) triggers.
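The interleaving above can be replayed deterministically in a small userspace model. This is only a sketch of the check order as described in this message, not kernel code: every toy_* name is hypothetical, the PMD is reduced to a three-state enum, and the "other thread" is modelled as two callbacks invoked at the two race windows.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy PMD states; model names only, not kernel types. */
enum toy_pmd { TOY_PMD_NONE, TOY_PMD_TABLE, TOY_PMD_TRANS_HUGE };

enum toy_outcome { TOY_BAIL, TOY_PROCEED, TOY_HIT_BUG_ON };

/* Hypothetical hook standing in for "the other thread runs here". */
typedef void (*toy_hook)(enum toy_pmd *pmd);

static inline void toy_fault_in_huge_zeropage(enum toy_pmd *pmd)
{
    *pmd = TOY_PMD_TRANS_HUGE;
}

static inline void toy_zap(enum toy_pmd *pmd)
{
    *pmd = TOY_PMD_NONE;
}

static inline void toy_no_interference(enum toy_pmd *pmd)
{
    (void)pmd;
}

/* Models __pte_alloc(): installs a page table only if the PMD is still none. */
static inline void toy_pte_alloc(enum toy_pmd *pmd)
{
    if (*pmd == TOY_PMD_NONE)
        *pmd = TOY_PMD_TABLE;
}

/* Replays the pre-fix check order with one hook per race window. */
static inline enum toy_outcome toy_old_order(enum toy_pmd *pmd,
                                             toy_hook before_alloc,
                                             toy_hook after_alloc)
{
    enum toy_pmd snapshot = *pmd;          /* pmdp_get_lockless() */

    if (snapshot == TOY_PMD_TRANS_HUGE)    /* <bail if transhuge> */
        return TOY_BAIL;
    if (snapshot == TOY_PMD_NONE) {        /* <if none:> */
        before_alloc(pmd);                 /* first race window */
        toy_pte_alloc(pmd);                /* no-op if the PMD was populated */
        after_alloc(pmd);                  /* second race window */
    }
    if (*pmd == TOY_PMD_TRANS_HUGE)        /* <bail if pmd_trans_huge(*dst_pmd)> */
        return TOY_BAIL;
    if (*pmd == TOY_PMD_NONE)              /* BUG_ON(pmd_none(*dst_pmd)) would fire */
        return TOY_HIT_BUG_ON;
    return TOY_PROCEED;
}
```

Starting from a none PMD, passing toy_fault_in_huge_zeropage as the first hook and toy_zap as the second reproduces the diagram: the allocation becomes a no-op, the PMD is none again by the time of the final checks, and the model reports TOY_HIT_BUG_ON. With no interference the same call returns TOY_PROCEED.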
On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow pte_offset_map[_lock]() to fail"), this can't lead to anything worse than a BUG_ON(), since the page table access helpers are actually designed to deal with page tables concurrently disappearing; but on older kernels (<=6.4), I think we could probably theoretically race past the two BUG_ON() checks and end up treating a hugepage as a page table.
The second issue is that, as Qi Zheng pointed out, there are other types of huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs (in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a PMD that contains a migration entry (which just requires winning a single, fairly wide race), it will pass the PMD to pte_offset_map_lock(), which assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will crash or maybe worse if there is no "struct page" for the address bits in the migration entry PMD - I think at least on X86 there usually is no corresponding "struct page" thanks to the PTE inversion mitigation, amd64 looks different).
If that didn't crash, the kernel would next try to write a PTE into what it wrongly thinks is a page table.
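The gap in the second issue can be illustrated with a toy classification of PMD contents. This is a userspace sketch, not kernel code: the toy_* names are hypothetical, and the "strict" predicate is an assumption about the general shape of a safer check (accept only a PMD that points to a page table), not a quotation of the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy classification of what a populated PMD can hold; model names only. */
enum toy_pmd_kind {
    TOY_PMD_PAGE_TABLE,
    TOY_PMD_TRANS_HUGE,
    TOY_PMD_DEVMAP,
    TOY_PMD_MIGRATION,
    TOY_PMD_KIND_COUNT
};

typedef bool (*toy_check)(enum toy_pmd_kind kind);

/* Narrow check: rejects only transparent huge PMDs, like a bare
 * pmd_trans_huge() test. */
static inline bool toy_narrow_check_passes(enum toy_pmd_kind kind)
{
    return kind != TOY_PMD_TRANS_HUGE;
}

/* Assumed stricter check: proceed only if the PMD points to a page table. */
static inline bool toy_strict_check_passes(enum toy_pmd_kind kind)
{
    return kind == TOY_PMD_PAGE_TABLE;
}

/* Counts the non-page-table kinds that a check lets through, i.e. the
 * cases that would later be treated as a page table by mistake. */
static inline int toy_count_misclassified(toy_check check)
{
    int count = 0;

    for (int k = 0; k < TOY_PMD_KIND_COUNT; k++) {
        enum toy_pmd_kind kind = (enum toy_pmd_kind)k;

        if (kind != TOY_PMD_PAGE_TABLE && check(kind))
            count++;
    }
    return count;
}
```

In this model the narrow check lets two kinds through (devmap and migration), which matches the two categories named above, while the assumed strict check lets none through.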
As part of fixing these issues, get rid of the check for pmd_trans_huge() before __pte_alloc() - that's redundant, we're going to have to check for that after the __pte_alloc() anyway.
Backport note: pmdp_get_lockless() is pmd_read_atomic() in older kernels.