Conversation

@mertunsall
Contributor

@mertunsall mertunsall commented Nov 28, 2025

Problem

_apply_matches() had O(n²) complexity: a 5000-token multimodal prompt took 2.6s in processing alone.

Cause

When no match was found, the loop incremented start_idx, but _find_matches() used prev_end_idx (unchanged), re-scanning the entire prompt on each iteration.

Fix

Exit loop when no matches remain instead of incrementing uselessly.
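The failure mode can be illustrated with a simplified sketch (hypothetical code: `find_match`, `start_idx`, and `prev_end_idx` follow the description above, not the actual vLLM implementation):

```python
def find_match(tokens, target, begin):
    """Linear scan for `target` in tokens[begin:]; returns -1 if absent."""
    for i in range(begin, len(tokens)):
        if tokens[i] == target:
            return i
    return -1

def apply_matches_buggy(tokens, target):
    # Pre-fix: when no match is found, start_idx is incremented, but the
    # scan still begins at prev_end_idx, so the tail of the prompt is
    # re-scanned once per remaining token -> O(n^2) overall.
    start_idx = prev_end_idx = 0
    matches = []
    while start_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            start_idx += 1  # advances nothing that find_match reads
            continue
        matches.append(idx)
        prev_end_idx = idx + 1
        start_idx = prev_end_idx
    return matches

def apply_matches_fixed(tokens, target):
    # Post-fix: once find_match fails, no later call can succeed either,
    # so the loop simply exits -> O(n) overall.
    prev_end_idx = 0
    matches = []
    while prev_end_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            break
        matches.append(idx)
        prev_end_idx = idx + 1
    return matches
```

On a prompt with no match, the buggy version performs one full scan per token, while the fixed version performs a single scan and exits.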

Impact

  • InternVL: 2.6s → <10ms for 5k tokens
  • Models using token-based matching (Qwen3-VL) unaffected

Test Plan

python -m pytest tests/multimodal/test_processing.py::TestApplyMatchesPerformance -v

Test Result

Passed

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which executes a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 28, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request provides an excellent fix for a critical O(n²) performance issue in multimodal prompt processing. The problem was correctly identified as an inefficient loop that repeatedly scanned the prompt when no matches were found. The fix is simple, elegant, and effective, replacing the unnecessary loop increment with a break statement. The accompanying tests are thorough, including specific performance tests to validate the fix and regression tests to ensure no existing functionality is broken. This is a high-quality contribution that significantly improves performance.

@mertunsall mertunsall force-pushed the fix-multimodal-on2-perf branch from 2f835ce to 147d97f Compare November 28, 2025 10:20
_apply_matches() had O(n²) complexity when processing multimodal prompts.
When no match was found, the loop incremented start_idx but _find_matches()
used prev_end_idx (unchanged), re-scanning the entire prompt each iteration.

Fix: Exit loop when no matches remain instead of incrementing uselessly.

Impact: InternVL 5k token prompt: 2.6s → <10ms for processing.
Signed-off-by: mertunsall <[email protected]>
Signed-off-by: mertunsall <[email protected]>
@mertunsall mertunsall force-pushed the fix-multimodal-on2-perf branch from 147d97f to 513557d Compare November 28, 2025 10:22
@mertunsall
Contributor Author

Fixes #25327

@mertunsall
Contributor Author

After serving the model

vllm serve OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview --uvicorn-log-level=info --host 0.0.0.0 --port 8000 --max-model-len 32768 --trust-remote-code --limit-mm-per-prompt.image 1 --tensor-parallel-size 1

this benchmarking script reports significantly lower latencies after this PR.

#!/usr/bin/env python3
"""
Script to query VLLM OpenAI-compatible server and measure response time.
"""

import time
import requests
import json
from typing import Dict, Any, Optional
import random
import base64

def encode_image_to_base64(image_path: str) -> str:
    """
    Encode an image file to base64 string.
    
    Args:
        image_path: Path to the image file
        
    Returns:
        Base64 encoded string of the image
    """
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def query_vllm_server(
    prompt: str = "Explain quantum computing in simple terms.",
    model: str = "OpenGVLab/InternVL3_5-GPT-OSS-20B-A4B-Preview",
    base_url: str = "http://localhost:8000",
    max_tokens: int = 100,
    temperature: float = 0.7,
    image_path: Optional[str] = "PUT AN IMAGE PATH HERE"
) -> Dict[str, Any]:
    """
    Query the VLLM server and return response with timing info.
    
    Args:
        prompt: The input prompt
        model: Model name
        base_url: Server base URL
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature
        image_path: Optional path to image file to include in the request
        
    Returns:
        Dictionary with response text, timing info, and metadata
    """

    # Prepare the message content
    #image_path = None
    if image_path:
        # Encode image to base64
        base64_image = encode_image_to_base64(image_path)
        
        # Create multimodal content
        content = [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{base64_image}"
                }
            }
        ]
    else:
        # Text-only content
        content = prompt
    
    # Prepare the request payload
    payload = {
        "model": model,
        "messages": [
            {"role": "user", "content": content}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False
    }
    
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer aitorloveskebabs"
    }
    
    url = f"{base_url}/v1/chat/completions"
    
    print(f"🚀 Querying VLLM server at {url}")
    #print(f"📝 Prompt: {prompt}")
    #print(f"🤖 Model: {model}")
    print("-" * 60)
    
    # Time the request
    start_time = time.time()
    
    try:
        response = requests.post(url, json=payload, headers=headers, timeout=60)
        end_time = time.time()
        
        response.raise_for_status()
        
        # Parse response
        response_data = response.json()
        
        # Extract the generated text
        generated_text = response_data["choices"][0]["message"]["content"]
        
        # Calculate timing metrics
        total_time = end_time - start_time
        
        # Get token usage if available
        usage = response_data.get("usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        total_tokens = usage.get("total_tokens", 0)
        
        # Calculate tokens per second
        tokens_per_second = completion_tokens / total_time if total_time > 0 else 0
        
        result = {
            "success": True,
            "generated_text": generated_text,
            "timing": {
                "total_time_seconds": round(total_time, 3),
                "tokens_per_second": round(tokens_per_second, 2)
            },
            "token_usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": total_tokens
            },
            "model": model
        }
        
        return result
        
    except requests.exceptions.RequestException as e:
        end_time = time.time()
        total_time = end_time - start_time
        
        return {
            "success": False,
            "error": str(e),
            "timing": {
                "total_time_seconds": round(total_time, 3)
            }
        }

def main():
    """Main function to run benchmark with varying input/output sizes."""
    
    print("=" * 60)
    print("🔥 VLLM Server Benchmark")
    print("=" * 60)
    
    # Results table
    results = []
    
    # Base prompt (repeat to get different character counts)
    base_prompt = "Write a detailed analysis of artificial intelligence and its impact on society, economy, technology, and human relationships. Discuss the benefits, challenges, and future implications."
    
    # Test parameters
    char_counts = list(range(10000, 80000, 20000))  # 10k to 70k chars in 20k increments
    max_tokens_list = [1] + list(range(10, 200, 40))  # 1, then 10 to 170 in 40 increments
    
    test_num = 0
    total_tests = len(char_counts) * len(max_tokens_list)
    
    for idx1, char_count in enumerate(char_counts):
        for idx2, max_tokens in enumerate(max_tokens_list):
            test_num += 1
            
            # Create prompt with target character count and unique prefix for cache busting
            unique_prefix = f"[TEST_{test_num}_{idx1}_{idx2}_{random.randint(1, 1000000)}] "
            repeat_count = max(1, (char_count - len(unique_prefix)) // len(base_prompt))
            test_prompt = unique_prefix + base_prompt * repeat_count
            
            print(f"[{test_num}/{total_tests}] Chars: {len(test_prompt)}, Max tokens: {max_tokens}")
            
            result = query_vllm_server(
                prompt=test_prompt,
                max_tokens=max_tokens,
                temperature=0.7
            )
            
            if result["success"]:
                results.append({
                    'test_id': test_num,
                    'prompt_chars': len(test_prompt),
                    'input_tokens': result['token_usage']['prompt_tokens'],
                    'output_tokens': result['token_usage']['completion_tokens'],
                    'max_tokens': max_tokens,
                    'latency_seconds': result['timing']['total_time_seconds'],
                    'tokens_per_second': result['timing']['tokens_per_second']
                })
                print(f"  ✅ {result['token_usage']['prompt_tokens']} → {result['token_usage']['completion_tokens']} tokens, {result['timing']['total_time_seconds']:.3f}s")
            else:
                results.append({
                    'test_id': test_num,
                    'prompt_chars': len(test_prompt),
                    'input_tokens': 0,
                    'output_tokens': 0,
                    'max_tokens': max_tokens,
                    'latency_seconds': result['timing']['total_time_seconds'],
                    'tokens_per_second': 0,
                    'error': result.get('error', 'Unknown')
                })
                print(f"  ❌ Error: {result.get('error', 'Unknown')}")
            
            time.sleep(0.1)  # Brief pause
    
    # Print results table
    print("\n" + "=" * 80)
    print("📊 BENCHMARK RESULTS")
    print("=" * 80)
    print(f"{'Test':<4} {'Chars':<6} {'InTok':<6} {'OutTok':<7} {'MaxTok':<7} {'Latency':<8} {'Tok/s':<8}")
    print("-" * 80)
    
    for r in results:
        if 'error' not in r:
            print(f"{r['test_id']:<4} {r['prompt_chars']:<6} {r['input_tokens']:<6} {r['output_tokens']:<7} {r['max_tokens']:<7} {r['latency_seconds']:<8.3f} {r['tokens_per_second']:<8.1f}")
        else:
            print(f"{r['test_id']:<4} {r['prompt_chars']:<6} {'ERR':<6} {'ERR':<7} {r['max_tokens']:<7} {r['latency_seconds']:<8.3f} {'ERR':<8}")
    
    # Save to CSV
    import csv
    filename = f"vllm_benchmark_{int(time.time())}.csv"
    with open(filename, 'w', newline='') as f:
        if results:
            writer = csv.DictWriter(f, fieldnames=results[0].keys())
            writer.writeheader()
            writer.writerows(results)
    
    print(f"\n💾 Results saved to: {filename}")
    successful = [r for r in results if 'error' not in r]
    print(f"✅ Successful: {len(successful)}/{len(results)} tests")

if __name__ == "__main__":
    main()

@mertunsall
Contributor Author

@DarkLight1337 I edited the PR to get rid of start_idx as well!

@DarkLight1337 DarkLight1337 changed the title [Bugfix] Fix O(n²) multimodal prompt processin [Bugfix] Fix O(n²) multimodal prompt processing Nov 28, 2025
@DarkLight1337
Member

main branch

no_match_exits_quickly 0.373644601553678
long_prompt_with_match 0.00014125369489192963

#29668

no_match_exits_quickly 0.3878521788865328
long_prompt_with_match 0.00017918087542057037

#29667

no_match_exits_quickly 0.0003753677010536194
long_prompt_with_match 0.00015436485409736633

Yep, still works. It makes sense since my PR only tackles the case when MM data is passed in, while yours applies to text-only inputs as well.
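The shape of these comparisons can be reproduced with a toy micro-benchmark (a hypothetical sketch using simplified stand-in functions, not the actual benchmark or vLLM code; the 2000-token prompt is smaller than the 5k case above to keep the demo quick):

```python
import time

def find_match(tokens, target, begin):
    # Linear scan for `target` in tokens[begin:]; -1 if absent.
    for i in range(begin, len(tokens)):
        if tokens[i] == target:
            return i
    return -1

def scan_buggy(tokens, target):
    # Pre-fix pattern: increment a cursor that find_match never reads,
    # so every failed iteration re-scans the prompt -> O(n^2).
    start_idx = prev_end_idx = 0
    while start_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            start_idx += 1
            continue
        prev_end_idx = idx + 1
        start_idx = prev_end_idx
    return prev_end_idx

def scan_fixed(tokens, target):
    # Post-fix pattern: stop on the first failed scan -> O(n).
    prev_end_idx = 0
    while prev_end_idx < len(tokens):
        idx = find_match(tokens, target, prev_end_idx)
        if idx == -1:
            break
        prev_end_idx = idx + 1
    return prev_end_idx

tokens = list(range(2000))  # 2k-token prompt, target absent (worst case)
for fn in (scan_buggy, scan_fixed):
    t0 = time.perf_counter()
    fn(tokens, -1)
    print(f"{fn.__name__}: {time.perf_counter() - t0:.6f}s")
```

The no-match case shows the large gap, while prompts containing matches are fast in both variants, matching the `no_match_exits_quickly` and `long_prompt_with_match` numbers above.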

@mertunsall
Contributor Author

Should be good to merge!

@DarkLight1337
Member

Probably you have to fix some merge conflicts after #29668 is merged (which should happen quite soon)

@mergify

mergify bot commented Nov 28, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mertunsall.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 28, 2025
@mergify mergify bot removed the needs-rebase label Nov 28, 2025
@mertunsall
Contributor Author

Hi @DarkLight1337 @Isotr0py - fixed the merge conflicts, would be very happy if we can merge!

@ywang96 ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 28, 2025
@mertunsall
Contributor Author

What's the reason the checks don't pass?

@ywang96 ywang96 enabled auto-merge (squash) November 28, 2025 21:57
@ywang96 ywang96 changed the title [Bugfix] Fix O(n²) multimodal prompt processing [Bugfix] Fix O(n²) multimodal string prompt processing Nov 28, 2025
@mertunsall
Contributor Author

should we update? @ywang96

@ywang96
Member

ywang96 commented Nov 28, 2025

should we update? @ywang96

Should be fine - the CI's still running and we don't need the PR to be completely aligned with main to be merged

@DarkLight1337
Member

Known failure on main

@vllm-bot vllm-bot merged commit c625d7b into vllm-project:main Nov 29, 2025
44 of 47 checks passed