Conversation

@mertunsall
Contributor

@mertunsall mertunsall commented Nov 28, 2025

Problem

_apply_matches() had O(n²) complexity: a 5000-token multimodal prompt took 2.6s in processing alone.

Cause

When no match was found, the loop incremented start_idx, but _find_matches() used prev_end_idx (unchanged), re-scanning the entire prompt on each iteration.

Fix

Exit loop when no matches remain instead of incrementing uselessly.
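The failure mode can be illustrated with a simplified sketch (hypothetical code: `find_match`, `start_idx`, and `prev_end_idx` follow the description above, not the actual vLLM implementation):

```python
def find_match(tokens, target, begin):
    """Linear scan for `target` in tokens[begin:]; returns -1 if absent."""
    for i in range(begin, len(tokens)):
        if tokens[i] == target:
            return i
    return -1

def apply_matches_buggy(tokens, target):
    # Pre-fix: when no match is found, start_idx is incremented, but the
    # scan still begins at prev_end_idx, so the tail of the prompt is
    # re-scanned once per remaining token -> O(n^2) overall.
    start_idx = prev_end_idx = 0
    matches = []
    while start_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            start_idx += 1  # advances nothing that find_match reads
            continue
        matches.append(idx)
        prev_end_idx = idx + 1
        start_idx = prev_end_idx
    return matches

def apply_matches_fixed(tokens, target):
    # Post-fix: once find_match fails, no later call can succeed either,
    # so the loop simply exits -> O(n) overall.
    prev_end_idx = 0
    matches = []
    while prev_end_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            break
        matches.append(idx)
        prev_end_idx = idx + 1
    return matches
```

On a prompt with no match, the buggy version performs one full scan per token, while the fixed version performs a single scan and exits.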

Impact

  • InternVL: 2.6s → <10ms for 5k tokens
  • Models using token-based matching (Qwen3-VL) unaffected

Test Plan

python -m pytest tests/multimodal/test_processing.py::TestApplyMatchesPerformance -v

Test Result

Passed

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides an excellent fix for a critical O(n²) performance issue in multimodal prompt processing. The problem was correctly identified as an inefficient loop that repeatedly scanned the prompt when no matches were found. The fix is simple, elegant, and effective, replacing the unnecessary loop increment with a break statement. The accompanying tests are thorough, including specific performance tests to validate the fix and regression tests to ensure no existing functionality is broken. This is a high-quality contribution that significantly improves performance.

@mertunsall mertunsall force-pushed the fix-multimodal-on2-perf branch from 2f835ce to 147d97f Compare November 28, 2025 10:20
_apply_matches() had O(n²) complexity when processing multimodal prompts.
When no match was found, the loop incremented start_idx but _find_matches()
used prev_end_idx (unchanged), re-scanning the entire prompt each iteration.

Fix: Exit loop when no matches remain instead of incrementing uselessly.

Impact: InternVL 5k token prompt: 2.6s → <10ms for processing.
Signed-off-by: mertunsall <[email protected]>
Signed-off-by: mertunsall <[email protected]>
@mertunsall mertunsall force-pushed the fix-multimodal-on2-perf branch from 147d97f to 513557d Compare November 28, 2025 10:22
@mertunsall
Contributor Author

Fixes #25327

@mertunsall
Contributor Author

After serving the model

vllm serve OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview --uvicorn-log-level=info --host 0.0.0.0 --port 8000 --max-model-len 32768 --trust-remote-code --limit-mm-per-prompt.image 1 --tensor-parallel-size 1

this benchmarking script reports significantly lower latencies after this PR.

#!/usr/bin/env python3
"""
Script to query VLLM OpenAI-compatible server and measure response time.
"""

import time
import requests
import json
from typing import Dict, Any, Optional
import random
import base64

def encode_image_to_base64(image_path: str) -> str:
    """
    Encode an image file to base64 string.
    
    Args:
        image_path: Path to the image file
        
    Returns:
        Base64 encoded string of the image
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def query_vllm_server(
    prompt: str = "Explain quantum computing in simple terms.",
    model: str = "OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview",
    base_url: str = "http://localhost:8000",
    max_tokens: int = 100,
    temperature: float = 0.7,
    image_path: Optional[str] = "PUT AN IMAGE PATH HERE"
) -> Dict[str, Any]:
    """
    Query the VLLM server and return response with timing info.
    
    Args:
        prompt: The input prompt
        model: Model name
        base_url: Server base URL
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        image_path: Optional path to image file to include in the request
        
    Returns:
        Dictionary with response text, timing info, and metadata
    """

    # Prepare the message content
    #image_path = None
    if image_path:
        # Encode image to base64
        base64_image = encode_image_to_base64(image_path)
        
        # Create multimodal content
        content = [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_image}"
                }
            }
        ]
    else:
        # Text-only content
        content = prompt
    
    # Prepare the request payload
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": content}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False
    }
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer aitorloveskebabs"
    }
    
    url = f"{base_url}/v1/chat/completions"
    
    print(f"🚀 Querying VLLM server at {url}")
    #print(f"📝 Prompt: {prompt}")
    #print(f"🤖 Model: {model}")
    print("-" * 60)
    
    # Time the request
    start_time = time.time()
    
    try:
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        end_time = time.time()
        
        response.raise_for_status()
        
        # Parse response
        response_data = response.json()
        
        # Extract the generated text
        generated_text = response_data["choices"][0]["message"]["content"]
        
        # Calculate timing metrics
        total_time = end_time - start_time
        
        # Get token usage if available
        usage = response_data.get("usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        total_tokens = usage.get("total_tokens", 0)
        
        # Calculate tokens per second
        tokens_per_second = completion_tokens / total_time if total_time > 0 else 0
        
        result = {
            "success": True,
            "generated_text": generated_text,
            "timing": {
                "total_time_seconds": round(total_time, 3),
                "tokens_per_second": round(tokens_per_second, 2)
            },
            "token_usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": total_tokens
            },
            "model": model
        }
        
        return result
        
    except requests.exceptions.RequestException as e:
        end_time = time.time()
        total_time = end_time - start_time
        
        return {
            "success": False,
            "error": str(e),
            "timing": {
                "total_time_seconds": round(total_time, 3)
            }
        }

def main():
    """Main function to run benchmark with varying input/output sizes."""
    
    print("=" * 60)
    print("🔥 VLLM Server Benchmark")
    print("=" * 60)
    
    # Results table
    results = []
    
    # Base prompt (repeat to get different character counts)
    base_prompt = "Write a detailed analysis of artificial intelligence and its impact on society, economy, technology, and human relationships. Discuss the benefits, challenges, and future implications."
    
    # Test parameters
    char_counts = list(range(10000, 80000, 20000))  # 10k to 70k chars in 20k increments
    max_tokens_list = [1] + list(range(10, 200, 40))  # 1, then 10 to 170 in 40 increments
    
    test_num = 0
    total_tests = len(char_counts) * len(max_tokens_list)
    
    for idx1, char_count in enumerate(char_counts):
        for idx2, max_tokens in enumerate(max_tokens_list):
            test_num += 1
            
            # Create prompt with target character count and unique prefix for cache busting
            unique_prefix = f"[TEST_{test_num}_{idx1}_{idx2}_{random.randint(1, 1000000)}] "
            repeat_count = max(1, (char_count - len(unique_prefix)) // len(base_prompt))
            test_prompt = unique_prefix + base_prompt * repeat_count
            
            print(f"[{test_num}/{total_tests}] Chars: {len(test_prompt)}, Max tokens: {max_tokens}")
            
            result = query_vllm_server(
                prompt=test_prompt,
                max_tokens=max_tokens,
                temperature=0.7
            )
            
            if result["success"]:
                results.append({
                    'test_id': test_num,
                    'prompt_chars': len(test_prompt),
                    'input_tokens': result['token_usage']['prompt_tokens'],
                    'output_tokens': result['token_usage']['completion_tokens'],
                    'max_tokens': max_tokens,
                    'latency_seconds': result['timing']['total_time_seconds'],
                    'tokens_per_second': result['timing']['tokens_per_second']
                })
                print(f"  ✅ {result['token_usage']['prompt_tokens']} → {result['token_usage']['completion_tokens']} tokens, {result['timing']['total_time_seconds']:.3f}s")
            else:
                results.append({
                    'test_id': test_num,
                    'prompt_chars': len(test_prompt),
                    'input_tokens': 0,
                    'output_tokens': 0,
                    'max_tokens': max_tokens,
                    'latency_seconds': result['timing']['total_time_seconds'],
                    'tokens_per_second': 0,
                    'error': result.get('error', 'Unknown')
                })
                print(f"  ❌ Error: {result.get('error', 'Unknown')}")
            
            time.sleep(0.1)  # Brief pause
    
    # Print results table
    print("\n" + "=" * 80)
    print("📊 BENCHMARK RESULTS")
    print("=" * 80)
    print(f"{'Test':<4} {'Chars':<6} {'InTok':<6} {'OutTok':<7} {'MaxTok':<7} {'Latency':<8} {'Tok/s':<8}")
    print("-" * 80)
    
    for r in results:
        if 'error' not in r:
            print(f"{r['test_id']:<4} {r['prompt_chars']:<6} {r['input_tokens']:<6} {r['output_tokens']:<7} {r['max_tokens']:<7} {r['latency_seconds']:<8.3f} {r['tokens_per_second']:<8.1f}")
        else:
            print(f"{r['test_id']:<4} {r['prompt_chars']:<6} {'ERR':<6} {'ERR':<7} {r['max_tokens']:<7} {r['latency_seconds']:<8.3f} {'ERR':<8}")
    
    # Save to CSV
    import csv
    filename = f"vllm_benchmark_{int(time.time())}.csv"
    with open(filename, 'w', newline='') as f:
        if results:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    
    print(f"\n💾 Results saved to: {filename}")
    successful = [r for r in results if 'error' not in r]
    print(f"✅ Successful: {len(successful)}/{len(results)} tests")

if __name__ == "__main__":
    main()

@mertunsall
Contributor Author

@DarkLight1337 I edited the PR to get rid of start_idx as well!

@DarkLight1337 DarkLight1337 changed the title [Bugfix] Fix O(n²) multimodal prompt processin [Bugfix] Fix O(n²) multimodal prompt processing Nov 28, 2025
@DarkLight1337
Member

main branch

no_match_exits_quickly 0.373644601553678
long_prompt_with_match 0.00014125369489192963

#29668

no_match_exits_quickly 0.3878521788865328
long_prompt_with_match 0.00017918087542057037

#29667

no_match_exits_quickly 0.0003753677010536194
long_prompt_with_match 0.00015436485409736633

Yep, still works. It makes sense since my PR only tackles the case when MM data is passed in, while yours applies to text-only inputs as well.
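The shape of these comparisons can be reproduced with a toy micro-benchmark (a hypothetical sketch using simplified stand-in functions, not the actual benchmark or vLLM code; the 2000-token prompt is smaller than the 5k case above to keep the demo quick):

```python
import time

def find_match(tokens, target, begin):
    # Linear scan for `target` in tokens[begin:]; -1 if absent.
    for i in range(begin, len(tokens)):
        if tokens[i] == target:
            return i
    return -1

def scan_buggy(tokens, target):
    # Pre-fix pattern: increment a cursor that find_match never reads,
    # so every failed iteration re-scans the prompt -> O(n^2).
    start_idx = prev_end_idx = 0
    while start_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            start_idx += 1
            continue
        prev_end_idx = idx + 1
        start_idx = prev_end_idx
    return prev_end_idx

def scan_fixed(tokens, target):
    # Post-fix pattern: stop on the first failed scan -> O(n).
    prev_end_idx = 0
    while prev_end_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            break
        prev_end_idx = idx + 1
    return prev_end_idx

tokens = list(range(2000))  # 2k-token prompt, target absent (worst case)
for fn in (scan_buggy, scan_fixed):
    t0 = time.perf_counter()
    fn(tokens, -1)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.6f}s")
```

The no-match case shows the large gap, while prompts containing matches are fast in both variants, matching the `no_match_exits_quickly` and `long_prompt_with_match` numbers above.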

@mertunsall
Contributor Author

Should be good to merge!

@DarkLight1337
Member

Probably you have to fix some merge conflicts after #29668 is merged (which should happen quite soon)

@mergify

mergify bot commented Nov 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mertunsall.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 28, 2025
@mergify mergify bot removed the needs-rebase label Nov 28, 2025
@mertunsall
Contributor Author

Hi @DarkLight1337 @Isotr0py - fixed the merge conflicts, would be very happy if we can merge!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
@mertunsall
Contributor Author

What's the reason the checks don't pass?

@ywang96 ywang96 enabled auto-merge (squash) November 28, 2025 21:57
@ywang96 ywang96 changed the title [Bugfix] Fix O(n²) multimodal prompt processing [Bugfix] Fix O(n²) multimodal string prompt processing Nov 28, 2025
@mertunsall
Contributor Author

should we update? @ywang96

@ywang96
Member

ywang96 commented Nov 28, 2025

should we update? @ywang96

Should be fine - the CI's still running and we don't need the PR to be completely aligned with main to be merged

@DarkLight1337
Member

Known failure on main

@vllm-bot vllm-bot merged commit c625d7b into vllm-project:main Nov 29, 2025
44 of 47 checks passed