Users can crash the vLLM engine serving multimodal models by passing multimodal embedding inputs with correct ndim but incorrect shape (e.g. hidden dimension is wrong), regardless of whether the model is intended to support such inputs (as defined in the Supported Models page).
The issue has existed ever since we added support for image embedding inputs, i.e. #6613 (released in v0.5.5).
Using image embeddings as an example:
The engine crashes at one of two points: at inputs_embeds (mismatched shape) or in get_input_embeddings (validation fails).

This happens because we only validate the ndim of the tensor, but not its full shape, in the input processor (via MultiModalDataParser).
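The gap between an ndim-only check and a full-shape check can be sketched as follows. This is a minimal illustration, not vLLM's actual code: the function names `check_ndim_only` and `check_full_shape`, the `HIDDEN_SIZE` value, and the use of plain tuples in place of tensor shapes are all assumptions made for the example.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def check_ndim_only(shape: Shape, expected_ndim: int = 2) -> None:
    """Hypothetical ndim-only check: accepts any shape with the right rank."""
    if len(shape) != expected_ndim:
        raise ValueError(f"expected {expected_ndim} dimensions, got {len(shape)}")


def check_full_shape(shape: Shape, hidden_size: int) -> None:
    """Hypothetical stricter check: also validates the hidden dimension."""
    check_ndim_only(shape, expected_ndim=2)
    if shape[-1] != hidden_size:
        raise ValueError(
            f"expected hidden dimension {hidden_size}, got {shape[-1]}"
        )


HIDDEN_SIZE = 4096      # assumed model hidden size, for illustration only
bad_shape = (16, 1024)  # correct ndim, wrong hidden dimension

# The ndim-only check lets the malformed embedding through.
check_ndim_only(bad_shape)

# The full-shape check rejects it before it can reach the model.
try:
    check_full_shape(bad_shape, HIDDEN_SIZE)
except ValueError as exc:
    print(f"rejected: {exc}")
```

Rejecting the input at validation time turns a malformed request into an error for that request alone, instead of a failure deeper in the engine.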
Set --limit-mm-per-prompt to 0 for all non-text modalities to ban multimodal inputs, which includes multimodal embedding inputs. However, the model would then accept only text, defeating the purpose of using a multimodal model.
{
"nvd_published_at": "2025-11-21T02:15:43Z",
"cwe_ids": [
"CWE-129"
],
"github_reviewed_at": "2025-11-20T21:23:29Z",
"severity": "HIGH",
"github_reviewed": true
}