In the Linux kernel, the following vulnerability has been resolved:
mm/MADV_COLLAPSE: catch !none !huge !bad pmd lookups
In commit 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE") we made the following change to find_pmd_or_thp_or_none():
-       if (!pmd_present(pmde))
-               return SCAN_PMD_NULL;
+       if (pmd_none(pmde))
+               return SCAN_PMD_NONE;
This was for use by MADV_COLLAPSE file/shmem codepaths, where MADV_COLLAPSE might identify a pte-mapped hugepage, only to have khugepaged race in, free the pte table, and clear the pmd. Such codepaths include:
A) If we find a suitably-aligned compound page of order HPAGE_PMD_ORDER already in the pagecache.
B) In retract_page_tables(), if we fail to grab mmap_lock for the target mm/address.
In these cases, collapse_pte_mapped_thp() really does expect a none (not just !present) pmd, and we want to suitably identify that case separately from the case where no pmd is found, or it's a bad-pmd (of course, many things could happen once we drop mmap_lock, and the pmd could plausibly undergo multiple transitions due to intervening fault, split, etc). Regardless, the code is prepared to install a huge-pmd only when the existing pmd entry is either a genuine pte-table-mapping-pmd, or the none-pmd.
However, the commit introduces a logical hole; namely, that we've allowed !none- && !huge- && !bad-pmds to be classified as genuine pte-table-mapping-pmds. One example that could leak through is a swap entry. The pmd values aren't checked again before use in pte_offset_map_lock(), which is expecting nothing less than a genuine pte-table-mapping-pmd.
We want to put back the !pmd_present() check (below the pmd_none() check), but need to be careful to deal with subtleties in pmd transitions and treatments by various arch.
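The effect of the check order can be sketched in a small userspace model. This is only an illustration: the toy_* names, the bit layout, and the enum are hypothetical stand-ins, not the kernel's pmd_t, its arch-specific pmd_*() helpers, or its SCAN_* codes. It shows a non-present entry (such as a swap entry) being classified as a pte-table-mapping pmd when the !present check is missing, and being rejected once that check sits below the none check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy pmd value. The bit layout is invented for illustration only. */
typedef uint64_t toy_pmd_t;

#define TOY_PRESENT (1ULL << 0) /* entry is present */
#define TOY_HUGE    (1ULL << 1) /* present entry maps a huge page */
#define TOY_BAD     (1ULL << 2) /* present entry is malformed */
#define TOY_SWAP    (1ULL << 3) /* non-present entry, e.g. a swap entry */

/* Hypothetical result codes, loosely mirroring the SCAN_* names above. */
enum toy_scan { TOY_SUCCEED, TOY_PMD_NULL, TOY_PMD_NONE, TOY_PMD_MAPPED };

bool toy_none(toy_pmd_t p)    { return p == 0; }
bool toy_present(toy_pmd_t p) { return (p & TOY_PRESENT) != 0; }
bool toy_huge(toy_pmd_t p)    { return (p & TOY_HUGE) != 0; }
bool toy_bad(toy_pmd_t p)     { return (p & TOY_BAD) != 0; }

/* Check order with the hole: nothing rejects a !none, !present entry. */
enum toy_scan toy_classify_with_hole(toy_pmd_t p)
{
	if (toy_none(p))
		return TOY_PMD_NONE;
	if (toy_huge(p))
		return TOY_PMD_MAPPED;
	if (toy_bad(p))
		return TOY_PMD_NULL;
	return TOY_SUCCEED; /* a swap entry falls through to here */
}

/* Check order with the !present test restored below the none test. */
enum toy_scan toy_classify_fixed(toy_pmd_t p)
{
	if (toy_none(p))
		return TOY_PMD_NONE;
	if (!toy_present(p))
		return TOY_PMD_NULL;
	if (toy_huge(p))
		return TOY_PMD_MAPPED;
	if (toy_bad(p))
		return TOY_PMD_NULL;
	return TOY_SUCCEED;
}

/* Small self-check; call it from main() to exercise the model. */
void toy_self_check(void)
{
	assert(toy_classify_with_hole(TOY_SWAP) == TOY_SUCCEED);
	assert(toy_classify_fixed(TOY_SWAP) == TOY_PMD_NULL);
	assert(toy_classify_fixed(0) == TOY_PMD_NONE);
	assert(toy_classify_fixed(TOY_PRESENT) == TOY_SUCCEED);
}
```

The model deliberately ignores the transitional states discussed next, where a present-looking entry may be in the middle of a split.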
The issue is that __split_huge_pmd_locked() temporarily clears the present bit (or otherwise marks the entry as invalid), but pmd_present() and pmd_trans_huge() still need to return true while the pmd is in this transitory state. For example, x86's pmd_present() also checks the _PAGE_PSE bit, riscv's version also checks the _PAGE_LEAF bit, and arm64 also checks a PMD_PRESENT_INVALID bit.
Covering all 4 cases for x86 (all checks done on the same pmd value):
1) pmd_present() && pmd_trans_huge()
   All we actually know here is that the PSE bit is set. Either:
   a) We aren't racing with __split_huge_page(), and PRESENT or PROTNONE is set. => huge-pmd
   b) We are currently racing with __split_huge_page(). The danger here is that we proceed as-if we have a huge-pmd, but really we are looking at a pte-mapping-pmd. So, what is the risk of this danger?
      The only relevant path is:
      madvise_collapse() -> collapse_pte_mapped_thp()
      Where we might just incorrectly report back "success", when really the memory isn't pmd-backed. This is fine, since split could happen immediately after (actually) successful madvise_collapse(). So, it should be safe to just assume huge-pmd here.
2) pmd_present() && !pmd_trans_huge()
   Either:
   a) PSE not set and either PRESENT or PROTNONE is. => pte-table-mapping pmd (or PROTNONE)
   b) devmap. This routine can be called immediately after unlocking/locking mmap_lock -- or called with no locks held (see khugepaged_scan_mm_slot()), so previous VMA checks have since been invalidated.
3) !pmd_present() && pmd_trans_huge()
   Not possible.
4) !pmd_present() && !pmd_trans_huge()
   Neither PRESENT nor PROTNONE set => not present
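The four x86 cases can be checked by brute force in a toy model. The sketch below assumes that x86's pmd_present() tests PRESENT, PROTNONE, or PSE, and that pmd_trans_huge() requires PSE without the devmap bit; this is one reading of the description above and has not been checked against the kernel source. The toy_* names and bit positions are hypothetical and do not match the real _PAGE_* values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy flag bits. Positions are illustrative, not the real x86 layout. */
#define TOY_PAGE_PRESENT  (1u << 0)
#define TOY_PAGE_PROTNONE (1u << 1)
#define TOY_PAGE_PSE      (1u << 2)
#define TOY_PAGE_DEVMAP   (1u << 3)

/* Assumed model: PSE alone keeps the entry "present" during a split. */
bool toy_x86_pmd_present(uint32_t f)
{
	return (f & (TOY_PAGE_PRESENT | TOY_PAGE_PROTNONE | TOY_PAGE_PSE)) != 0;
}

/* Assumed model: huge means PSE set and devmap clear. */
bool toy_x86_pmd_trans_huge(uint32_t f)
{
	return (f & (TOY_PAGE_PSE | TOY_PAGE_DEVMAP)) == TOY_PAGE_PSE;
}

/* Map a flag combination to case 1..4 from the list above. */
int toy_case_of(uint32_t f)
{
	bool present = toy_x86_pmd_present(f);
	bool huge = toy_x86_pmd_trans_huge(f);

	if (present && huge)
		return 1;
	if (present && !huge)
		return 2;
	if (!present && huge)
		return 3;
	return 4;
}

/* Count how many of the 16 flag combinations land in a given case. */
int toy_count_case(int which)
{
	int n = 0;

	for (uint32_t f = 0; f < 16; f++)
		if (toy_case_of(f) == which)
			n++;
	return n;
}

/* Small self-check; call it from main() to exercise the model. */
void toy_x86_self_check(void)
{
	/* Case 3 never occurs: trans_huge needs PSE, and PSE implies present. */
	assert(toy_count_case(3) == 0);
	/* Mid-split state (PRESENT cleared, PSE kept) still looks like case 1. */
	assert(toy_case_of(TOY_PAGE_PSE) == 1);
	/* A devmap entry is present but not trans-huge: case 2b. */
	assert(toy_case_of(TOY_PAGE_PSE | TOY_PAGE_DEVMAP) == 2);
}
```

Under these assumptions, case (3) is unreachable because both predicates share the PSE bit. As the text notes, this need not hold on other architectures.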
I've checked all archs that implement pmd_trans_huge() (arm64, riscv, powerpc, loongarch, x86, mips, s390) and this logic roughly translates (though devmap treatment is unique to x86 and powerpc, and (3) doesn't necessarily hold in general -- but that doesn't matter since !pmd_present() always takes the failure path).
Also, add a comment above find_pmd_or_thp_or_none() ---truncated---