Conversation

@leonardmq (Collaborator) commented Nov 14, 2025

What does this PR do?

Issue:

  • rate limits are complex and depend on the user's account tier with the provider; it would be very hard to model that accurately across all providers
  • we currently do not rate limit / throttle consistently (users could run the same model in an eval and in RAG at the same time, for example)
  • we limit concurrency based on the lowest-common-denominator rate limit, which can be 10-50x lower than what some users' accounts support

Changes:

  • add a settings page for rate limits, which lists all models and lets users set a custom rate limit for each one (or for an entire provider), covering LLMs, embedding models, rerankers, and extractors
  • instantiate a global rate limiter that provides a semaphore for each model
  • in each adapter, guard the underlying LiteLLM call with the semaphore, queuing calls according to the model's rate limit (see the sketch below)
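Roughly, the guard works like this (a minimal sketch of the pattern; the actual implementation in libs/core/kiln_ai/utils/model_rate_limiter.py adds YAML persistence, provider-wide fallbacks, and a shared singleton):

import asyncio
from contextlib import asynccontextmanager

class ModelRateLimiter:
    def __init__(self, limits: dict[tuple[str, str], int]):
        # limits maps (provider, model) -> max concurrent requests
        self._limits = limits
        self._semaphores: dict[tuple[str, str], asyncio.Semaphore] = {}

    @asynccontextmanager
    async def limit(self, provider: str, model: str):
        key = (provider, model)
        max_concurrent = self._limits.get(key)
        if max_concurrent is None:
            # no configured limit: the call proceeds immediately
            yield
            return
        # reuse one semaphore per model so all callers share the same cap
        sem = self._semaphores.setdefault(key, asyncio.Semaphore(max_concurrent))
        async with sem:
            yield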

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

Summary by CodeRabbit

  • New Features

    • Added a "Rate Limits" Settings page to view and manage per-provider and per-model concurrency limits.
    • New API endpoints to read and update rate limits (used by the UI).
  • Behavior Changes

    • Runtime enforcement of configured rate limits across embeddings, text-generation, and reranker requests.
  • Tests

    • Expanded test coverage for the rate limits API, persistence, enforcement, updates, and UI flows.
    • Updated tool-listing tests to ensure archived tools are excluded.

@coderabbitai bot (Contributor) commented Nov 14, 2025

Walkthrough

Adds a ModelRateLimiter (with YAML persistence and semaphore-based concurrency), exposes GET/POST /api/rate_limits, integrates rate limiting into LiteLLM adapters/embeddings/rerankers, adds frontend schema and a Settings UI for rate limits, and includes unit and API tests.

Changes

Cohort / File(s) Change Summary
Server wiring & API
app/desktop/desktop_server.py, app/desktop/studio_server/rate_limits_api.py
Register connect_rate_limits(app) during app creation; new module exposes GET /api/rate_limits and POST /api/rate_limits backed by ModelRateLimiter and RateLimits, mapping failures to HTTP 500.
Backend tests & server tests
app/desktop/studio_server/test_rate_limits_api.py, app/desktop/studio_server/test_tool_api.py
New comprehensive tests for rate-limits API (read/update/persistence/validation/roundtrips); updated tool tests to assert archived RAG/Kiln Task tools are excluded.
Frontend API schema & settings entry
app/web_ui/src/lib/api_schema.d.ts, app/web_ui/src/routes/(app)/settings/+page.svelte
Add /api/rate_limits operations and RateLimits schema to API types; add "Rate Limits" item to settings grid linking to the new settings page.
Settings UI page
app/web_ui/src/routes/(app)/settings/rate_limits/+page.svelte
New Svelte page: loads models and current limits, groups by provider, shows provider and per-model controls, validates inputs, tracks unsaved changes, posts updates to /api/rate_limits.
Rate limiter core
libs/core/kiln_ai/utils/model_rate_limiter.py, libs/core/kiln_ai/utils/test_model_rate_limiter.py
New RateLimits pydantic model and ModelRateLimiter singleton with YAML persistence, semaphore-based concurrency, get/set/update/reload APIs, and extensive unit tests for behavior and concurrency.
Adapter integrations
libs/core/kiln_ai/adapters/.../litellm_adapter.py, libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py, libs/core/kiln_ai/adapters/extractors/litellm_extractor.py, libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py
Add optional rate_limiter constructor parameter (defaults to ModelRateLimiter.shared()); wrap external async calls (acompletion, aembedding, arerank) in async with self.rate_limiter.limit(provider, model) contexts.
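On the adapter side, the integration described in the last row is a thin wrapper around each external call; a hedged sketch (class and attribute names are illustrative, only litellm.acompletion and ModelRateLimiter.shared() are taken from this PR):

import litellm

from kiln_ai.utils.model_rate_limiter import ModelRateLimiter

class AdapterSketch:
    def __init__(self, provider: str, model: str, rate_limiter: ModelRateLimiter | None = None):
        self.provider = provider
        self.model = model
        # default to the shared limiter so limits hold across the whole app
        self.rate_limiter = rate_limiter or ModelRateLimiter.shared()

    async def acompletion(self, **kwargs):
        # queue behind other in-flight calls for this provider/model
        async with self.rate_limiter.limit(self.provider, self.model):
            return await litellm.acompletion(model=self.model, **kwargs)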

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant UI as Frontend (Settings)
    participant BE as Backend API
    participant RL as ModelRateLimiter
    participant AD as Adapter

    UI->>BE: GET /api/rate_limits
    BE->>RL: load_rate_limits()
    RL-->>BE: RateLimits
    BE-->>UI: 200 RateLimits

    UI->>BE: POST /api/rate_limits (new limits)
    BE->>RL: shared().update_rate_limits(...)
    RL-->>BE: persisted
    BE-->>UI: 200 RateLimits

    Note over BE,AD: Runtime model request flow
    BE->>AD: handle request -> call adapter
    AD->>RL: async with limit(provider, model)
    RL->>RL: acquire semaphore (or no-op if unlimited)
    AD->>AD: perform external async model call
    RL-->>AD: release semaphore
    AD-->>BE: result
    BE-->>UI: response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas to focus on:

  • libs/core/kiln_ai/utils/model_rate_limiter.py — semaphore lifecycle, persistence, concurrency and race conditions.
  • Adapter changes — ensure correct scoping of async with, error paths release semaphores.
  • Frontend rate_limits/+page.svelte — data normalization, validation, race conditions on save/load.
  • app/desktop/studio_server/test_rate_limits_api.py — filesystem isolation and test determinism.

Possibly related PRs

Suggested labels

codex

Suggested reviewers

  • sfierro
  • scosman
  • chiang-daniel

Poem

🐇 I counted hops and callers, one by one,
Semaphores and YAML keep the race well-run.
Per-provider carrots, per-model little treats,
I guard the queues while code and frontend meet.
Hop, save, and limit — now concurrency is neat.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 64.49%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the PR title 'feat: customizable per-model rate limiting' clearly summarizes the main change: adding customizable per-model rate limiting across the application.
  • Description check (✅ Passed): the description covers the 'What does this PR do?' section with detailed issue and changes, and both checklist items are marked as complete, with tests run and new tests added.
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48e6e75 and 4791c58.

📒 Files selected for processing (2)
  • app/desktop/studio_server/test_rate_limits_api.py (1 hunks)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py
  • app/desktop/studio_server/test_rate_limits_api.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: Build, Typecheck, and Test Python (3.13)
  • GitHub Check: Build, Typecheck, and Test Python (3.12)
  • GitHub Check: Build, Typecheck, and Test Python (3.10)
  • GitHub Check: Build, Typecheck, and Test Python (3.11)
  • GitHub Check: Generate Coverage Report
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (macos-latest)


@leonardmq marked this pull request as draft on November 14, 2025 at 10:20
@github-actions bot commented Nov 14, 2025

📊 Coverage Report

Overall Coverage: 92%

Diff: origin/main...HEAD

  • app/desktop/desktop_server.py (100%)
  • app/desktop/studio_server/rate_limits_api.py (100%)
  • libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (100%)
  • libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (100%)
  • libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (100%)
  • libs/core/kiln_ai/utils/model_rate_limiter.py (100%)

Summary

  • Total: 150 lines
  • Missing: 0 lines
  • Coverage: 100%

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
libs/core/kiln_ai/utils/model_rate_limiter.py (1)

38-49: Consider logging or narrowing the exception in _load_rate_limits

Catching bare Exception and returning {} can silently mask YAML parse errors or permission issues. Consider at least logging the exception (or narrowing to OSError | yaml.YAMLError) so unexpected failures are discoverable while still treating them as “no limits configured”.
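A sketch of the suggested narrowing, assuming a loader of roughly this shape (the real _load_rate_limits may differ):

import logging
from pathlib import Path

import yaml

logger = logging.getLogger(__name__)

def _load_rate_limits(path: Path) -> dict:
    try:
        with open(path) as f:
            return yaml.safe_load(f) or {}
    except (OSError, yaml.YAMLError):
        # surface parse/permission problems in the logs, while still
        # treating them as "no limits configured"
        logger.exception("Failed to load rate limits from %s", path)
        return {}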

libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (1)

18-32: Defensively normalize provider name for rate limiting

Here you pass self.reranker_config.model_provider_name directly into the rate limiter. Given other code wraps this field in ModelProviderName(...) when looking up providers, there’s a risk it can be either a raw string or an enum, while the rate-limits YAML uses plain provider-name strings.

To avoid mismatches and mirror the embedding adapter, consider:

-        async with self.rate_limiter.limit(
-            self.reranker_config.model_provider_name,
-            self.reranker_config.model_name,
-        ):
-            response = await litellm.arerank(**rerank_kwargs)
+        provider_name = getattr(
+            self.reranker_config.model_provider_name,
+            "value",
+            self.reranker_config.model_provider_name,
+        )
+
+        async with self.rate_limiter.limit(
+            provider_name,
+            self.reranker_config.model_name,
+        ):
+            response = await litellm.arerank(**rerank_kwargs)

This keeps the API tolerant of both enum and string values while aligning with the string keys persisted in rate_limits.yaml.

Also applies to: 58-63

app/web_ui/src/lib/api_schema.d.ts (1)

8557-8637: Rate-limits operations are consistent; consider stronger typing via backend schema

read_rate_limits_api_rate_limits_get returns a generic Record<string, unknown>, while update_rate_limits_api_rate_limits_post accepts a nested Record<string, number>. That’s functionally fine and matches a “settings blob” style, but it loses structure on the read side, so TS callers won’t get much type safety.

If you later want stricter typing (e.g., explicit provider/model keys), it will need to be modeled in the FastAPI/Pydantic schema (e.g., in rate_limits_api.py) and then regenerated into this file rather than edited here. For now, the signatures are coherent and safe to use as-is.
Based on learnings
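For illustration, a hypothetical backend model of that shape (mirroring the RateLimits pydantic model named in the walkthrough), which would then be regenerated into the TypeScript types:

from pydantic import BaseModel, Field

class RateLimits(BaseModel):
    # provider -> provider-wide concurrency cap
    provider_limits: dict[str, int] = Field(default_factory=dict)
    # provider -> model -> per-model concurrency cap
    model_limits: dict[str, dict[str, int]] = Field(default_factory=dict)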

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a6c997c and 3dc4aef.

📒 Files selected for processing (12)
  • app/desktop/desktop_server.py (2 hunks)
  • app/desktop/studio_server/rate_limits_api.py (1 hunks)
  • app/desktop/studio_server/test_rate_limits_api.py (1 hunks)
  • app/web_ui/src/lib/api_schema.d.ts (4 hunks)
  • app/web_ui/src/routes/(app)/settings/+page.svelte (1 hunks)
  • app/web_ui/src/routes/(app)/settings/rate_limits/+page.svelte (1 hunks)
  • libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3 hunks)
  • libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (5 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (3 hunks)
  • libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (2 hunks)
  • libs/core/kiln_ai/utils/model_rate_limiter.py (1 hunks)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py (1 hunks)
🧰 Additional context used
🧠 Learnings (6)
📓 Common learnings
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 546
File: app/web_ui/src/routes/(app)/docs/library/[project_id]/upload_file_dialog.svelte:107-120
Timestamp: 2025-09-10T08:32:18.688Z
Learning: leonardmq prefers to work within the constraints of their SDK codegen for API calls, even when typing is awkward (like casting FormData to match expected types), rather than using alternative approaches like native fetch that would cause compiler errors with their generated types.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 413
File: libs/core/kiln_ai/datamodel/chunk.py:138-160
Timestamp: 2025-08-22T11:17:56.862Z
Learning: leonardmq prefers to avoid redundant validation checks when upstream systems already guarantee preconditions are met. He trusts the attachment system to ensure paths are properly formatted and prefers letting the attachment method (resolve_path) handle any edge cases rather than adding defensive precondition checks.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 390
File: libs/core/kiln_ai/adapters/provider_tools.py:494-499
Timestamp: 2025-07-23T08:58:45.769Z
Learning: leonardmq prefers to keep tightly coupled implementation details (like API versions) hardcoded when they are not user-facing and could break other modules if changed. The shared Config class is reserved for user-facing customizable values, not internal implementation details.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 341
File: libs/server/kiln_server/document_api.py:44-51
Timestamp: 2025-06-18T08:22:58.510Z
Learning: leonardmq prefers to defer fixing blocking I/O in async handlers when: the operation is very fast (milliseconds), user-triggered rather than automated, has no concurrency concerns, and would require additional testing to fix properly. He acknowledges such issues as valid but makes pragmatic decisions about timing the fixes.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 487
File: libs/core/kiln_ai/datamodel/rag.py:33-35
Timestamp: 2025-09-04T06:45:44.212Z
Learning: leonardmq requires vector_store_config_id to be a mandatory field in RagConfig (similar to extractor_config_id, chunker_config_id, embedding_config_id) for consistency. He prefers fixing dependent code that breaks due to missing required fields rather than making fields optional to accommodate incomplete data.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 654
File: app/web_ui/src/routes/(app)/docs/rag_configs/[project_id]/create_rag_config/edit_rag_config_form.svelte:141-152
Timestamp: 2025-09-25T06:38:14.854Z
Learning: leonardmq prefers simple onMount initialization patterns over reactive statements when possible, and is cautious about maintaining internal state for idempotency in Svelte components. He values simplicity and safety in component lifecycle management.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 402
File: libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py:0-0
Timestamp: 2025-07-14T03:43:07.283Z
Learning: leonardmq prefers to keep defensive validation checks even when they're technically redundant, viewing them as useful "quick sanity checks" that provide additional safety nets. He values defensive programming over strict DRY (Don't Repeat Yourself) principles when the redundant code serves as a safeguard.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 546
File: app/web_ui/src/lib/api_schema.d.ts:0-0
Timestamp: 2025-09-10T08:27:47.227Z
Learning: leonardmq correctly identifies auto-generated files and prefers not to manually edit them. When suggesting changes to generated TypeScript API schema files, the fix should be made in the source OpenAPI specification in the backend, not in the generated TypeScript file.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 487
File: libs/core/kiln_ai/adapters/vector_store/lancedb_adapter.py:136-150
Timestamp: 2025-09-04T06:34:12.318Z
Learning: leonardmq prefers brute force re-insertion of all chunks when partial chunks exist in the LanceDB vector store, rather than selectively inserting only missing chunks. His reasoning is that documents typically have only dozens of chunks (making overwriting cheap), and partial insertion likely indicates data corruption or conflicts that should be resolved by re-inserting all chunks to ensure consistency.
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 487
File: libs/core/kiln_ai/adapters/vector_store/lancedb_adapter.py:98-103
Timestamp: 2025-09-04T07:08:04.248Z
Learning: leonardmq prefers to let TableNotFoundError bubble up in delete_nodes_by_document_id operations rather than catching it, as this error indicates operations are being done in the wrong order on the caller side and should be surfaced as a programming error rather than handled gracefully.
📚 Learning: 2025-08-08T16:13:26.526Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 418
File: libs/core/kiln_ai/adapters/ml_model_list.py:1875-1890
Timestamp: 2025-08-08T16:13:26.526Z
Learning: In libs/core/kiln_ai/adapters/ml_model_list.py (Python), do not blanket-add r1_thinking/optional_r1_thinking parsers for R1-style models. Parser usage is provider-specific and must be based on observed responses in tests. For PR Kiln-AI/Kiln#418, deepseek_r1_0528_distill_qwen3_8b providers were validated without needing a parser.

Applied to files:

  • libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py
  • libs/core/kiln_ai/adapters/extractors/litellm_extractor.py
📚 Learning: 2025-10-29T06:59:55.635Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 736
File: app/web_ui/src/lib/api_schema.d.ts:8143-8180
Timestamp: 2025-10-29T06:59:55.635Z
Learning: app/web_ui/src/lib/api_schema.d.ts is generated by openapi-typescript; do not propose manual edits. Schema changes should be made in the FastAPI backend (e.g., app/desktop/studio_server/data_gen_api.py or libs/server/kiln_server/*), then re-generate the TS types.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
📚 Learning: 2025-09-04T06:27:58.852Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 487
File: app/web_ui/src/lib/api_schema.d.ts:3315-3320
Timestamp: 2025-09-04T06:27:58.852Z
Learning: The file app/web_ui/src/lib/api_schema.d.ts is auto-generated by openapi-typescript from the backend API schema and should never be manually edited. Changes to API types should be made in the backend, which will then be reflected in the regenerated TypeScript definitions.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
📚 Learning: 2025-09-10T08:30:56.973Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 546
File: app/web_ui/src/lib/api_schema.d.ts:4173-4218
Timestamp: 2025-09-10T08:30:56.973Z
Learning: leonardmq correctly identified that app/web_ui/src/lib/api_schema.d.ts is auto-generated and should not be manually edited. The proper fix for type inconsistencies between TypeScript API schema and backend models is to update the backend Python models and regenerate the TypeScript definitions.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
📚 Learning: 2025-10-29T11:19:31.059Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 759
File: app/web_ui/src/lib/api_schema.d.ts:2485-2490
Timestamp: 2025-10-29T11:19:31.059Z
Learning: In libs/server/kiln_server/document_api.py and the generated app/web_ui/src/lib/api_schema.d.ts, CreateVectorStoreConfigRequest.properties is now a union of three distinct public schemas: LanceDBConfigFTSPropertiesPublic, LanceDBConfigVectorPropertiesPublic, and LanceDBConfigHybridPropertiesPublic. These are intentionally split to allow independent evolution; do not suggest merging them.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
🧬 Code graph analysis (9)
app/desktop/desktop_server.py (2)
app/desktop/studio_server/test_rate_limits_api.py (1)
  • app (20-23)
app/desktop/studio_server/rate_limits_api.py (1)
  • connect_rate_limits (46-109)
libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3)
libs/core/kiln_ai/utils/model_rate_limiter.py (3)
  • ModelRateLimiter (11-132)
  • get_global_rate_limiter (139-152)
  • limit (76-94)
libs/core/kiln_ai/datamodel/embedding.py (1)
  • EmbeddingConfig (18-40)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)
  • litellm_model_id (474-487)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)
libs/core/kiln_ai/utils/model_rate_limiter.py (3)
  • ModelRateLimiter (11-132)
  • get_global_rate_limiter (139-152)
  • limit (76-94)
libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (3)
libs/core/kiln_ai/utils/model_rate_limiter.py (3)
  • ModelRateLimiter (11-132)
  • get_global_rate_limiter (139-152)
  • limit (76-94)
libs/core/kiln_ai/adapters/rerankers/base_reranker.py (1)
  • BaseReranker (28-43)
libs/core/kiln_ai/adapters/provider_tools.py (1)
  • LiteLlmCoreConfig (420-423)
libs/core/kiln_ai/utils/test_model_rate_limiter.py (2)
libs/core/kiln_ai/utils/config.py (2)
  • Config (31-288)
  • settings_dir (232-236)
libs/core/kiln_ai/utils/model_rate_limiter.py (8)
  • ModelRateLimiter (11-132)
  • reset_global_rate_limiter (155-162)
  • get_limit (96-102)
  • set_limit (104-127)
  • reload (129-132)
  • limit (76-94)
  • _get_semaphore (55-73)
  • _get_semaphore_key (51-53)
app/desktop/studio_server/test_rate_limits_api.py (3)
libs/core/kiln_ai/utils/config.py (1)
  • settings_dir (232-236)
app/desktop/studio_server/rate_limits_api.py (1)
  • connect_rate_limits (46-109)
libs/core/kiln_ai/utils/test_model_rate_limiter.py (1)
  • temp_home (24-26)
libs/core/kiln_ai/utils/model_rate_limiter.py (1)
libs/core/kiln_ai/utils/config.py (2)
  • Config (31-288)
  • settings_dir (232-236)
libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (1)
libs/core/kiln_ai/utils/model_rate_limiter.py (3)
  • ModelRateLimiter (11-132)
  • get_global_rate_limiter (139-152)
  • limit (76-94)
app/desktop/studio_server/rate_limits_api.py (1)
libs/core/kiln_ai/utils/config.py (2)
  • Config (31-288)
  • settings_dir (232-236)
🪛 GitHub Actions: Build and Test
libs/core/kiln_ai/utils/test_model_rate_limiter.py

[error] 59-59: AssertionError in test_init_with_no_file: expected None but got 5 from limiter.get_limit('openai', 'gpt_5'). This suggests ModelRateLimiter.get_limit() returns a default limit when no file exists; needs fix. (pytest run failed: exit code 1)

🪛 GitHub Actions: Debug Detector
libs/core/kiln_ai/adapters/extractors/litellm_extractor.py

[error] 210-219: Developer content found. Step 'Checking for TODO or FIXME' failed the pipeline and exited with code 1.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Generate Coverage Report
  • GitHub Check: Build Desktop Apps (macos-15-intel)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-latest)
  • GitHub Check: Build Desktop Apps (windows-latest)
🔇 Additional comments (21)
app/web_ui/src/routes/(app)/settings/+page.svelte (1)

62-69: Settings tile wiring looks consistent

The new “Rate Limits” settings item correctly points to /settings/rate_limits and the description matches the new page’s purpose. No issues here.

app/desktop/desktop_server.py (1)

24-27: Rate-limits API wiring is in the right place

Hooking connect_rate_limits(app) alongside the other connect_* calls (before webhost) integrates the new endpoints cleanly without affecting existing routing. Looks good.

Also applies to: 58-67

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)

91-103: The review comment is incorrect and should be disregarded.

The current code already works correctly. ModelProviderName inherits from both str and Enum, which means enum members behave identically to strings in dict operations. When rate_limiter.limit(model_provider_name, ...) is called with a ModelProviderName enum, the _get_semaphore() method's dict lookup self._rate_limits.get(provider, {}) succeeds because:

  • Enum members have the same hash as their string value: hash(ModelProviderName.openai) == hash("openai")
  • Enum members compare equal to strings: ModelProviderName.openai == "openai" returns True
  • Python's dict uses hash and equality for key matching, so the enum member finds the string key from the YAML config

The .value normalization is unnecessary and adds confusion. The code is already robust due to Python's design of str-inheriting enums, specifically to enable this use case.
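A self-contained demonstration of this behavior (the one-member enum stands in for the real ModelProviderName; here, as in Kiln, the member's name and value match):

from enum import Enum

class ModelProviderName(str, Enum):
    openai = "openai"

limits = {"openai": 5}  # plain string keys, as loaded from YAML
assert hash(ModelProviderName.openai) == hash("openai")
assert ModelProviderName.openai == "openai"
assert limits[ModelProviderName.openai] == 5  # enum member finds the string key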

Likely an incorrect or invalid review comment.

libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (4)

43-43: LGTM!

The import of ModelRateLimiter and get_global_rate_limiter is appropriate for integrating rate limiting into the adapter.


70-76: LGTM!

The constructor signature change is consistent with the rate limiting pattern used in other adapters. The optional rate_limiter parameter allows for dependency injection while maintaining backward compatibility.


83-85: LGTM!

The rate limiter initialization correctly defaults to the global singleton, ensuring all adapters share the same rate limiter and enforce limits consistently across the application.


294-297: LGTM!

The rate limiter context manager correctly wraps the litellm.acompletion call, ensuring concurrent requests are throttled according to the configured per-model limits.

libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (4)

27-27: LGTM!

The import of ModelRateLimiter and get_global_rate_limiter is appropriate for integrating rate limiting into the extractor.


101-101: LGTM!

The constructor signature change is consistent with the rate limiting pattern used in other adapters.


121-123: LGTM!

The rate limiter initialization correctly defaults to the global singleton, ensuring consistent rate limiting across the application.


382-386: LGTM!

The rate limiter context manager correctly wraps the litellm.acompletion call for non-PDF extraction paths.

app/desktop/studio_server/test_rate_limits_api.py (4)

1-29: LGTM!

The test fixtures are well-structured. The temp_home fixture correctly patches the home directory for isolated testing, and the app/client fixtures follow standard FastAPI testing patterns.


31-92: LGTM!

The test coverage for rate limits CRUD operations is comprehensive, covering:

  • Reading empty rate limits
  • Reading existing rate limits from file
  • Creating new rate limits
  • Overwriting existing rate limits
  • Clearing rate limits with an empty payload

The tests correctly validate both API responses and filesystem persistence.
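A hedged sketch of the roundtrip style these tests use, with an assumed client fixture and payload shape (the actual tests assert more than this):

def test_rate_limits_roundtrip(client):
    payload = {
        "provider_limits": {"openai": 20},
        "model_limits": {"openai": {"gpt-4o": 5}},
    }
    # POST persists the limits; GET should read the same structure back
    assert client.post("/api/rate_limits", json=payload).status_code == 200
    assert client.get("/api/rate_limits").json() == payload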


94-191: LGTM!

The model listing tests are thorough, validating:

  • Response structure (normal_models, embedding_models, reranker_models)
  • Model field structure (model_name, friendly_name, provider_name, model_id)
  • Presence of known models
  • Grouping by provider

This provides good confidence that the /api/models/all endpoint returns the expected data shape.


193-225: LGTM!

The roundtrip and nested structure tests validate end-to-end persistence and retrieval, ensuring that complex rate limit configurations are correctly stored and retrieved.

app/desktop/studio_server/rate_limits_api.py (3)

13-24: LGTM!

The Pydantic models ModelInfo and AllModelsResponse provide clear structure for the API responses and enable automatic validation.


46-62: LGTM!

The rate limits endpoints correctly handle loading and saving rate limits with appropriate error handling via HTTPException.


64-109: LGTM!

The /api/models/all endpoint correctly aggregates models from all three built-in model lists, flattening provider information into separate ModelInfo entries. The error handling is appropriate.

app/web_ui/src/lib/api_schema.d.ts (3)

1312-1346: New /api/rate_limits and /api/models/all paths are wired correctly

The path entries correctly reference the corresponding operations and follow the same pattern as existing endpoints (GET + POST for settings-like resources, GET-only for listings). No changes needed here.
Based on learnings


2125-2133: AllModelsResponse schema cleanly structures model lists

The split into normal_models, embedding_models, and reranker_models, all reusing ModelInfo, is consistent and should make the UI consumption straightforward. Looks good.
Based on learnings


4356-4366: ModelInfo shape looks appropriate; confirm model_id nullability

The ModelInfo fields (model_name, friendly_name, provider_name, nullable model_id) match how a UI would present models. Allowing model_id to be null seems intended for providers without a stable ID, but it’s worth confirming this aligns with the backend datamodel and any assumptions in the new settings UI.
Based on learnings

@leonardmq (Collaborator, Author) commented:

Something is wrong in the tests - we should probably remove the limiter in tests, or mock / reset it on every test somehow

@leonardmq marked this pull request as ready for review on November 18, 2025 at 11:56
@scosman (Collaborator) commented Nov 18, 2025

Late to ask: but how much work to make this "Customize per model settings"?

Add in JSON mode, etc.

@leonardmq (Collaborator, Author) replied:

Late to ask: but how much work to make this "Customize per model settings"?

Add in JSON mode, etc.

@scosman

You mean putting it alongside the JSON Mode dropdown and such, so users can pick different rates as part of the RunConfig properties, on a case-by-case basis?

In this PR, we track the total number of in-flight requests across the entire app, because for cases like RAG, a single user action such as adding a doc can trigger multiple independent job runners from many different configs concurrently, all of which contribute to the global number of requests. A user could be running an eval + synth data + RAG at the same time, and other such combos.

If we had the rate limit on a per-run-config basis, it would probably be hard to share the semaphore across everything that is going on. We would probably end up with competing long-running tasks running concurrently with distinct semaphores. Unless maybe we wrap that under the current semaphore, like: (1) acquire the provider-wide/model-wide semaphore -> (2) then acquire the run config's semaphore, so it stays under both - but that seems error-prone / dodgy.

We can move away from this later if we figure out something better. The current PR is an easy, not-too-risky way of giving users a way out if they get stuck: lower the limit for people hitting rate limits, and crank it up to the max for people with higher account tiers.

Maybe a hybrid could look like this:

  • show a little dialog somewhere (e.g. a button in the RAG Progress dialog) where users can edit the current rate limit for whichever model or provider is involved
  • that we could do easily: it replaces the current semaphore with a new one of the new size; in-flight jobs complete, and new ones get enqueued on the newly instantiated semaphore (see the sketch below)
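A sketch of that semaphore-swap idea (names are illustrative, not the actual Kiln code):

import asyncio

class SwappableLimiter:
    def __init__(self, max_concurrent: int):
        self._sem = asyncio.Semaphore(max_concurrent)

    def resize(self, new_max: int) -> None:
        # asyncio.Semaphore cannot be resized, so replace it: tasks that
        # already acquired the old semaphore finish under the old cap,
        # while new calls queue on the new one
        self._sem = asyncio.Semaphore(new_max)

    async def run(self, coro_factory):
        sem = self._sem  # capture the current semaphore for this call
        async with sem:
            return await coro_factory()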

@leonardmq (Collaborator, Author) commented:

@coderabbitai review

@coderabbitai bot (Contributor) commented Nov 19, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
app/desktop/studio_server/rate_limits_api.py (1)

9-36: Consider logging exceptions before returning HTTP 500

Both handlers wrap ModelRateLimiter calls in a broad except Exception and immediately raise HTTPException(500, ...), but the defined logger isn’t used. For diagnosability, consider logging the exception (e.g. logger.exception("Failed to read/update rate limits")) before raising, so filesystem or parsing issues show up in logs while keeping the API surface unchanged.
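A minimal sketch of the suggested logging, assuming handlers of roughly the shape this review describes (the actual routes in rate_limits_api.py may differ):

import logging

from fastapi import FastAPI, HTTPException

from kiln_ai.utils.model_rate_limiter import ModelRateLimiter

logger = logging.getLogger(__name__)
app = FastAPI()

@app.get("/api/rate_limits")
def read_rate_limits():
    try:
        return ModelRateLimiter.shared().load_rate_limits()
    except Exception:
        # leave a trail in the logs before mapping to HTTP 500
        logger.exception("Failed to read rate limits")
        raise HTTPException(status_code=500, detail="Failed to read rate limits")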

app/web_ui/src/routes/(app)/settings/rate_limits/+page.svelte (1)

324-345: Tighten POST body typing for /api/rate_limits

The request body is typed as components["schemas"]["RateLimits"] in the API schema, which is:

{
    provider_limits?: Record<string, number>;
    model_limits?: Record<string, Record<string, number>>;
}

But the code casts to Record<string, Record<string, number>>, which is structurally incorrect. This silently bypasses type checking and risks accidental mismatches with the actual API contract.

Use the generated RateLimits type instead:

import type { components } from "$lib/api_schema"
// ...
const { error: api_error } = await client.POST("/api/rate_limits", {
  body: rate_limits as components["schemas"]["RateLimits"],
})

This aligns TypeScript with the backend contract and prevents shape drift.

libs/core/kiln_ai/utils/model_rate_limiter.py (2)

54-75: Clarify default_provider_limit semantics and consider a small testing hook for the singleton

Two small points here:

  1. __init__’s docstring only mentions rate_limits, but not default_provider_limit. It would help future readers to explicitly document that this is the fallback per‑provider cap used when there is no explicit limit, and that it feeds _default_provider_limits.

  2. shared() creates a process‑global singleton that holds on to semaphores and reads/writes a real file in settings_dir. This is fine for production, but it makes tests and short‑lived scripts a bit awkward (state leakage between tests, reliance on the real user settings directory). Consider exposing a minimal, clearly‑named helper for tests, e.g.:

@classmethod
def _reset_shared_for_tests(cls) -> None:
    cls._shared_instance = None

or similar, so tests don’t have to monkeypatch private attributes.

Also applies to: 76-90


91-103: Normalize provider keys to str and handle unknown providers gracefully in defaults

Two related subtleties around provider keys and defaults:

  1. _initialize_default_provider_limits seeds the dict with ModelProviderName.ollama as a key, but elsewhere the type is annotated as Dict[str, int] and lookups are done with provider: str. If ModelProviderName isn’t strictly interchangeable with str at the dict level, the special‑case for Ollama (limit 1) might never be hit. Normalizing to str is cheap and removes the ambiguity:
-        limits: Dict[str, int] = defaultdict(lambda: default_provider_limit)
-        limits[ModelProviderName.ollama] = 1
+        limits: Dict[str, int] = defaultdict(lambda: default_provider_limit)
+        limits[str(ModelProviderName.ollama)] = 1
  2. In get_model_default_max_concurrent_requests, ModelProviderName(provider) will raise ValueError if provider isn’t a valid enum value (e.g. any future/3rd‑party provider or a typo). That turns a “no explicit limit configured” case into a hard failure instead of falling back to safe defaults. You can keep the strong typing for known providers but guard the unknown case:
-        model_provider = built_in_models_from_provider(
-            ModelProviderName(provider), model
-        )
+        try:
+            provider_name = ModelProviderName(provider)
+        except ValueError:
+            # Unknown provider; fall back to the generic default.
+            return self._default_provider_limits[provider]
+
+        model_provider = built_in_models_from_provider(provider_name, model)

The rest of the function can stay as‑is, still using self._default_provider_limits[provider] for the minimum.

Also applies to: 310-334

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3dc4aef and fa05fb9.

📒 Files selected for processing (11)
  • app/desktop/studio_server/rate_limits_api.py (1 hunks)
  • app/desktop/studio_server/test_rate_limits_api.py (1 hunks)
  • app/desktop/studio_server/test_tool_api.py (1 hunks)
  • app/web_ui/src/lib/api_schema.d.ts (3 hunks)
  • app/web_ui/src/routes/(app)/settings/rate_limits/+page.svelte (1 hunks)
  • libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (3 hunks)
  • libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (5 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (3 hunks)
  • libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (2 hunks)
  • libs/core/kiln_ai/utils/model_rate_limiter.py (1 hunks)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py
  • app/desktop/studio_server/test_rate_limits_api.py
🔇 Additional comments (11)
app/desktop/studio_server/test_tool_api.py (1)

3566-3668: Archived RAG and kiln-task coverage looks solid

The new test cleanly exercises the archived-vs-active filtering for both RAG configs and kiln task tools and matches the endpoint’s intended behavior; no changes needed.

libs/core/kiln_ai/adapters/embedding/litellm_embedding_adapter.py (1)

91-104: Rate limiter integration in embedding adapter looks correct

Constructor wiring and the async with self.rate_limiter.limit(...) wrapper around litellm.aembedding are consistent with other adapters and correctly key limits by provider/model; no changes needed.

Also applies to: 142-171

libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)

69-85: Model rate limiting around acompletion is wired correctly

The adapter now cleanly accepts an optional rate_limiter and defaults to the shared instance; the acompletion_checking_response wrapper correctly scopes the LiteLLM call under limit(provider, model). This matches the other adapter integrations.

Also applies to: 291-297

libs/core/kiln_ai/adapters/rerankers/litellm_reranker_adapter.py (1)

21-32: Reranker rate limiter integration is consistent and safe

The optional rate_limiter parameter and the async with self.rate_limiter.limit(...) around litellm.arerank mirror the other adapters and correctly scope concurrency per provider/model; nothing further to change here.

Also applies to: 58-63

libs/core/kiln_ai/adapters/extractors/litellm_extractor.py (1)

94-123: Extractor now uses shared rate limiter correctly

The extractor’s constructor, per-page _extract_single_pdf_page, and non-PDF _extract path all consistently wrap litellm.acompletion in the model rate limiter, layered on top of the existing max_parallel_requests_for_model batching. The previous debug prints are gone and behavior looks good.

Also applies to: 188-212, 350-378

app/web_ui/src/lib/api_schema.d.ts (4)

1312-1341: New /api/rate_limits path is consistent with existing API patterns

GET/POST shape, parameters typed as never, and operation IDs match the established naming convention (e.g., /api/settings). No issues from the type-layer; assuming these came from the backend OpenAPI, this looks good.

As per prior learnings, any future adjustments should be made in the FastAPI schema and re‑generated here, not edited manually. [Based on learnings]


4809-4833: RateLimits schema matches the described provider/model semantics

The provider_limits and nested model_limits dictionaries cleanly represent provider‑wide vs per‑model concurrency limits, and both being optional aligns with allowing partial configs. No inconsistencies with the PR’s described behavior from the type perspective.

Reminder: if the Python RateLimits model changes (fields, defaults, docstrings), regenerate this file from OpenAPI rather than editing it directly. [Based on learnings]


8557-8576: read_rate_limits_api_rate_limits_get operation correctly returns RateLimits

No params, no request body, and a 200 response typed as RateLimits is exactly what the UI needs for loading the config. This is consistent with the /api/rate_limits path entry.


8577-8609: update_rate_limits_api_rate_limits_post wiring looks correct

Request body and 200 response both use the RateLimits schema, and 422 reuses the standard HTTPValidationError, matching patterns used elsewhere (e.g., settings endpoints). Types are coherent with the new path and schema.

libs/core/kiln_ai/utils/model_rate_limiter.py (2)

18-33: RateLimits schema matches the intended config shape

The Pydantic model cleanly captures provider-wide and per-model limits with sensible defaults and descriptions. No issues from a correctness or design standpoint.


116-171: YAML persistence and reload behavior look robust

The load/save/update/reload path is defensive in a good way: invalid or unreadable YAML results in empty RateLimits with clear logging, writes are guarded by a lock, and update_rate_limits/reload both clear semaphores so new calls respect updated config. This is a solid foundation for the API/UI wired on top.

Also applies to: 300-309
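That reload behavior, sketched under assumed internals (the self._lock, self._rate_limits, and self._semaphores attributes referenced elsewhere in this review):

def reload(self) -> None:
    # re-read limits from disk and drop cached semaphores so that
    # subsequent calls pick up the updated configuration
    with self._lock:
        self._rate_limits = self._load_rate_limits()
        self._semaphores.clear()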

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
libs/core/kiln_ai/utils/test_model_rate_limiter.py (1)

1-19: Incomplete fix: temp_home fixture still doesn't follow the reviewed recommendation

The past review identified that Config.settings_dir() uses Path.home() and provided a specific fix (import Path from pathlib and use patch.object(Path, "home", return_value=tmp_path)). The current implementation uses string-based patching patch("pathlib.Path.home", ...) which may not reliably patch Path.home() where it's used in the Config module.

Apply the recommended diff from the previous review:

 import asyncio
 import os
+from pathlib import Path
 from unittest.mock import patch

 import pytest
 import yaml

 from kiln_ai.datamodel.datamodel_enums import ModelProviderName
 from kiln_ai.utils.config import Config
 from kiln_ai.utils.model_rate_limiter import ModelRateLimiter, RateLimits


 @pytest.fixture
 def temp_home(tmp_path):
-    with (
-        patch.object(os.path, "expanduser", return_value=str(tmp_path)),
-        patch("pathlib.Path.home", return_value=tmp_path),
-    ):
+    with patch.object(Path, "home", return_value=tmp_path):
         yield tmp_path

This ensures Config.settings_dir() resolves into the isolated temporary directory, preventing test cross-contamination and pipeline failures.

libs/core/kiln_ai/utils/model_rate_limiter.py (3)

198-208: Docstring mismatch already flagged in past review.

The docstrings for _get_model_limit (line 203) and _get_provider_limit (line 214) claim these methods return None for "unlimited" concurrency, but the actual behavior is that None means "no explicit override" and the system falls back to defaults (still finite). This was already identified in a previous review comment.

Also applies to: 210-216


155-184: Potential race condition when creating semaphores in _get_semaphore.

The pattern if key not in self._semaphores followed by self._semaphores[key] = asyncio.Semaphore(...) (lines 167-168, 174-175, 180-183) is not protected by a lock. If this method is called concurrently from multiple async tasks or threads for the same provider/model combination, multiple semaphores could be created and the last one overwrites previous ones. Tasks that already acquired a reference to an earlier semaphore would be using a different semaphore instance, breaking the global rate limit enforcement.

While asyncio tasks in a single event loop cooperatively yield, the PR objectives mention "concurrent job runners," suggesting potential multi-threaded usage. Even in single-threaded async, the lack of synchronization is a latent bug that could surface if execution context changes.

Consider one of these approaches:

  1. Use an asyncio.Lock to protect semaphore creation (requires method to be async)
  2. Use the existing threading.Lock if all callers are guaranteed to be in the same thread
  3. Accept the tradeoff and document that semaphore creation is eventually consistent

If you choose option 2, apply this diff:

     def _get_semaphore(self, provider: str, model: str) -> asyncio.Semaphore:
         """
         Get or create a semaphore for the given provider/model.
 
         Checks model-specific limit first, then falls back to provider-wide limit.
         If using provider-wide limit, all models from that provider share the same semaphore.
         If no limit is configured, uses a default value or the model's max_parallel_requests
         defined in the model list.
         """
         model_limit = self._rate_limits.model_limits.get(provider, {}).get(model)
         if model_limit is not None and model_limit > 0:
             key = self._get_semaphore_key(provider, model)
-            if key not in self._semaphores:
-                self._semaphores[key] = asyncio.Semaphore(model_limit)
+            with self._lock:
+                if key not in self._semaphores:
+                    self._semaphores[key] = asyncio.Semaphore(model_limit)
             return self._semaphores[key]
 
         provider_limit = self._rate_limits.provider_limits.get(provider)
         if provider_limit is not None and provider_limit > 0:
             key = self._get_semaphore_key(provider, None)
-            if key not in self._semaphores:
-                self._semaphores[key] = asyncio.Semaphore(provider_limit)
+            with self._lock:
+                if key not in self._semaphores:
+                    self._semaphores[key] = asyncio.Semaphore(provider_limit)
             return self._semaphores[key]
 
         # if no limit is configured, set a default of 10
         key = self._get_semaphore_key(provider, model)
-        if key not in self._semaphores:
-            self._semaphores[key] = asyncio.Semaphore(
-                self._get_model_default_max_concurrent_requests(provider, model)
-            )
+        with self._lock:
+            if key not in self._semaphores:
+                self._semaphores[key] = asyncio.Semaphore(
+                    self._get_model_default_max_concurrent_requests(provider, model)
+                )
         return self._semaphores[key]

73-86: The singleton pattern in shared() is not thread-safe.

If multiple threads call shared() simultaneously before _shared_instance is set, they could all check cls._shared_instance is None, see True, and each create a new instance. The last assignment wins, but multiple instances will have been created, breaking the singleton guarantee and potentially causing inconsistent rate limiting across the application.

Apply this diff to make shared() thread-safe:

+    _shared_lock = threading.Lock()
+
     @classmethod
     def shared(cls) -> "ModelRateLimiter":
         """
         Get the shared singleton instance of the rate limiter.
 
         This ensures all adapters share the same rate limiter and semaphores,
         properly enforcing global rate limits across the application.
 
         Returns:
             ModelRateLimiter: The shared rate limiter instance
         """
-        if cls._shared_instance is None:
-            cls._shared_instance = cls()
+        if cls._shared_instance is None:
+            with cls._shared_lock:
+                if cls._shared_instance is None:
+                    cls._shared_instance = cls()
         return cls._shared_instance
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fa05fb9 and 0f72c9a.

📒 Files selected for processing (4)
  • app/desktop/studio_server/test_rate_limits_api.py (1 hunks)
  • app/web_ui/src/lib/api_schema.d.ts (3 hunks)
  • libs/core/kiln_ai/utils/model_rate_limiter.py (1 hunks)
  • libs/core/kiln_ai/utils/test_model_rate_limiter.py (1 hunks)
📚 Learning: 2025-08-08T16:13:26.526Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 418
File: libs/core/kiln_ai/adapters/ml_model_list.py:1875-1890
Timestamp: 2025-08-08T16:13:26.526Z
Learning: In libs/core/kiln_ai/adapters/ml_model_list.py (Python), do not blanket-add r1_thinking/optional_r1_thinking parsers for R1-style models. Parser usage is provider-specific and must be based on observed responses in tests. For PR Kiln-AI/Kiln#418, deepseek_r1_0528_distill_qwen3_8b providers were validated without needing a parser.

Applied to files:

  • libs/core/kiln_ai/utils/test_model_rate_limiter.py
  • libs/core/kiln_ai/utils/model_rate_limiter.py
📚 Learning: 2025-10-29T06:59:55.635Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 736
File: app/web_ui/src/lib/api_schema.d.ts:8143-8180
Timestamp: 2025-10-29T06:59:55.635Z
Learning: app/web_ui/src/lib/api_schema.d.ts is generated by openapi-typescript; do not propose manual edits. Schema changes should be made in the FastAPI backend (e.g., app/desktop/studio_server/data_gen_api.py or libs/server/kiln_server/*), then re-generate the TS types.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
📚 Learning: 2025-09-04T06:27:58.852Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 487
File: app/web_ui/src/lib/api_schema.d.ts:3315-3320
Timestamp: 2025-09-04T06:27:58.852Z
Learning: The file app/web_ui/src/lib/api_schema.d.ts is auto-generated by openapi-typescript from the backend API schema and should never be manually edited. Changes to API types should be made in the backend, which will then be reflected in the regenerated TypeScript definitions.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
📚 Learning: 2025-10-29T11:19:31.059Z
Learnt from: leonardmq
Repo: Kiln-AI/Kiln PR: 759
File: app/web_ui/src/lib/api_schema.d.ts:2485-2490
Timestamp: 2025-10-29T11:19:31.059Z
Learning: In libs/server/kiln_server/document_api.py and the generated app/web_ui/src/lib/api_schema.d.ts, CreateVectorStoreConfigRequest.properties is now a union of three distinct public schemas: LanceDBConfigFTSPropertiesPublic, LanceDBConfigVectorPropertiesPublic, and LanceDBConfigHybridPropertiesPublic. These are intentionally split to allow independent evolution; do not suggest merging them.

Applied to files:

  • app/web_ui/src/lib/api_schema.d.ts
🧬 Code graph analysis (3)
libs/core/kiln_ai/utils/test_model_rate_limiter.py (2)
libs/core/kiln_ai/utils/config.py (2)
  • Config (31-288)
  • settings_dir (232-236)
libs/core/kiln_ai/utils/model_rate_limiter.py (11)
  • ModelRateLimiter (36-276)
  • RateLimits (18-33)
  • rate_limits_path (99-104)
  • _get_model_limit (198-208)
  • _set_model_limit (218-239)
  • reload (241-249)
  • limit (187-196)
  • _get_semaphore (155-184)
  • _get_semaphore_key (143-153)
  • _get_provider_limit (210-216)
  • _get_model_default_max_concurrent_requests (251-276)
libs/core/kiln_ai/utils/model_rate_limiter.py (3)
libs/core/kiln_ai/utils/config.py (1)
  • settings_dir (232-236)
app/desktop/studio_server/rate_limits_api.py (1)
  • update_rate_limits (19-36)
libs/core/kiln_ai/adapters/ml_model_list.py (1)
  • built_in_models_from_provider (5320-5328)
app/desktop/studio_server/test_rate_limits_api.py (3)
libs/core/kiln_ai/utils/config.py (2)
  • Config (31-288)
  • settings_dir (232-236)
libs/core/kiln_ai/utils/model_rate_limiter.py (6)
  • ModelRateLimiter (36-276)
  • rate_limits_path (99-104)
  • shared (74-86)
  • _get_provider_limit (210-216)
  • _get_model_limit (198-208)
  • update_rate_limits (132-141)
app/desktop/studio_server/rate_limits_api.py (2)
  • connect_rate_limits (9-36)
  • update_rate_limits (19-36)
🪛 GitHub Actions: Check API Bindings
app/web_ui/src/lib/api_schema.d.ts

[error] 4821-4821: OpenAPI schema is not current. Run generate_schema.sh to update. Differences indicate provider-related description change in RateLimits: 'Max concurrent requests per provider ... unless overridden by a model-specific limit'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Generate Coverage Report
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-15-intel)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (macos-latest)
  • GitHub Check: Build, Typecheck, and Test Python (3.13)
  • GitHub Check: Build, Typecheck, and Test Python (3.12)
  • GitHub Check: Build, Typecheck, and Test Python (3.11)
  • GitHub Check: Build, Typecheck, and Test Python (3.10)
🔇 Additional comments (3)
app/web_ui/src/lib/api_schema.d.ts (2)

1312-1341: New /api/rate_limits path wiring matches existing REST patterns

The new path entry correctly follows the existing conventions: no params, GET/POST mapped to read_rate_limits_api_rate_limits_get / update_rate_limits_api_rate_limits_post, and response slots reserved as expected. No issues from a typings/layout perspective.


8562-8614: Rate limits operations typings are consistent with schema and conventions

read_rate_limits_api_rate_limits_get correctly returns RateLimits with no parameters or body; update_rate_limits_api_rate_limits_post accepts and returns RateLimits and includes a 422 HTTPValidationError branch, matching surrounding API patterns. Once the OpenAPI/TS regeneration issue is fixed, these look ready.

Based on learnings
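
For illustration, a client round-trip against these endpoints could look like the following — the base URL and the payload keys are assumptions for the sketch, not taken from the PR (the authoritative RateLimits fields live in the backend schema):

import httpx

# Hypothetical values: port and payload shape are illustrative only.
BASE = "http://localhost:8757"
payload = {"providers": {"openai": 25}, "models": {"openai/gpt-4o": 5}}

resp = httpx.post(f"{BASE}/api/rate_limits", json=payload)
resp.raise_for_status()
# Per the typings above, POST returns RateLimits; a GET should read the
# same persisted value back.
assert httpx.get(f"{BASE}/api/rate_limits").json() == resp.json()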

app/desktop/studio_server/test_rate_limits_api.py (1)

32-314: LGTM! Comprehensive test coverage.

The test suite thoroughly covers:

  • Empty and existing rate limits
  • Updates, overwrites, and clears
  • Roundtrip persistence
  • Nested structures and mixed provider/model limits
  • Validation and error handling
  • Integration with the shared limiter

