
Conversation

@gwarmstrong (Collaborator) commented Nov 27, 2025

Summary by CodeRabbit

  • New Features

    • Datasets now automatically apply required installation steps before data preparation and evaluation pipelines.
    • Installation commands are properly merged when both dataset-specific and user-provided requirements exist.
  • Chores

    • Updated datasets library version requirement.


@gwarmstrong (Collaborator, Author) commented:

@coderabbitai review

coderabbitai bot (Contributor) commented Dec 2, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot (Contributor) commented Dec 2, 2025

Walkthrough

This PR introduces dataset-specific installation command support. A new utility function retrieves installation commands defined by datasets, aggregates them with user-provided commands, and ensures they are executed before pipeline tasks. The datasets package version constraint is updated in requirements.

Changes

Cohort / File(s) and summary:

  • Dataset installation command definition (nemo_skills/dataset/livecodebench/__init__.py):
    Added INSTALLATION_COMMAND constant with value "pip install 'datasets<4'" for pre-preparation setup.
  • Installation command utilities (nemo_skills/dataset/utils.py):
    Added get_dataset_installation_command(dataset_name: str) -> str | None to dynamically retrieve dataset-specific installation commands.
  • Pipeline command aggregation (nemo_skills/pipeline/prepare_data.py, nemo_skills/pipeline/utils/eval.py):
    Parse dataset names from context, collect per-dataset installation commands, aggregate them with " && " separators, and log the results; the aggregated command is passed to the underlying execution.
  • Evaluation pipeline integration (nemo_skills/pipeline/eval.py):
    Merge dataset-specific installation commands with user-provided commands and propagate the final merged command to all task creations (main, judge, and summarize tasks).
  • Shell execution safety (nemo_skills/pipeline/utils/exp.py):
    Quote installation commands with shlex.quote and execute via bash -c to properly handle special characters.
  • Dependency version update (requirements/main.txt):
    Changed the datasets version constraint from <4 to >=4.0.0.
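
For reference, the opt-in surface is small: a dataset module just defines one module-level constant, which the utility then looks up. A minimal sketch of the livecodebench declaration, using the value quoted above (the rest of the module is omitted):

# nemo_skills/dataset/livecodebench/__init__.py (excerpt)
# Datasets opt into pre-preparation setup by defining this constant.
INSTALLATION_COMMAND = "pip install 'datasets<4'"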

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Pipeline Entry<br/>(prepare_data/eval)
    participant Utils as Dataset Utils
    participant Dataset as Dataset Module<br/>(e.g., livecodebench)
    participant Aggregator as Command<br/>Aggregator
    participant Executor as Task<br/>Executor

    Pipeline->>Utils: get_dataset_installation_command(name)
    Utils->>Dataset: import & access INSTALLATION_COMMAND
    Dataset-->>Utils: installation_command (if defined)
    Utils-->>Pipeline: installation_command or None

    Pipeline->>Aggregator: collect per-dataset commands
    Aggregator->>Aggregator: join with " && "
    Note over Aggregator: aggregated_dataset_installation_command

    Pipeline->>Pipeline: merge with user_installation_command
    Note over Pipeline: final_installation_command

    Pipeline->>Executor: pass final_installation_command
    Executor->>Executor: quote with shlex.quote()
    Executor->>Executor: execute via bash -c
    Executor-->>Pipeline: success/failure

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Return type signature change: prepare_eval_commands() now returns a 3-tuple instead of 2-tuple; verify all call sites correctly unpack the new third element (aggregated_dataset_installation_command).
  • Command aggregation logic: Verify correct handling of None values, empty lists, and the && joining across prepare_data.py and utils/eval.py (a sketch of the expected pattern follows this list).
  • Bash quoting in exp.py: Ensure shlex.quote() correctly handles edge cases and special characters in installation commands.
  • Deduplication and logging: Confirm duplicate commands are properly deduped and all log statements provide clear diagnostics.
  • Requirements version change: Validate that datasets >=4.0.0 is compatible with the existing codebase and the livecodebench module's intent (datasets<4).
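
On the aggregation bullet, the None/empty handling typically reduces to a filter-then-join. A minimal sketch of the expected pattern (the helper name is hypothetical; the import is the function this PR adds):

from nemo_skills.dataset.utils import get_dataset_installation_command

def aggregate_installation_commands(dataset_names: list[str]) -> str | None:
    """Hypothetical helper: join per-dataset install commands with ' && ',
    skipping datasets that define no INSTALLATION_COMMAND (None) and
    returning None when nothing remains."""
    commands = [get_dataset_installation_command(name) for name in dataset_names]
    present = [cmd for cmd in commands if cmd]  # drops None and empty strings
    return " && ".join(present) if present else None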

Poem

🐰 A rabbit hops through dataset lands,
Collecting scripts with nimble hands,
Commands aggregated, merged with care,
Bash quotes them safely through the air,
Pre-installation magic—prepare!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately summarizes the main change: adding dataset-specific installation command functionality throughout the pipeline.




coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/dataset/livecodebench/__init__.py (1)

21-23: livecodebench INSTALLATION_COMMAND clashes with global datasets>=4.0.0

This dataset-level command (pip install 'datasets<4') conflicts with the project-wide datasets>=4.0.0 requirement in requirements/main.txt. In a shared env this will either fail pip resolution or leave you running code built for datasets>=4 on an older <4 version.

Please align this command with the global requirement (or segregate environments) so that running livecodebench’s prepare/eval doesn’t destabilize datasets for the rest of the pipelines. See also the comment in requirements/main.txt.

🧹 Nitpick comments (4)
nemo_skills/pipeline/utils/exp.py (1)

380-390: Quoting fix in install_packages_wrap looks correct; env UUID export is redundant

Using shlex.quote(installation_command) and executing via bash -c {quoted_command} correctly treats the full installation command as a single -c script and avoids shell parsing issues for <, >, &&, etc. This is a solid improvement over inlining the raw command.

If you want to trim dead code, setup_env currently exports NEMO_SKILLS_JOB_UUID but that variable is never read (the UUID is only embedded into lock_file). You can either drop the export or refactor the lock logic to actually derive the path from the env var to make the comment accurate.
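
As a sanity check on the quoting behavior, here is a minimal sketch of the pattern (the wrapper name is hypothetical; the real install_packages_wrap also handles locking and env setup):

import shlex

def wrap_install(installation_command: str, main_command: str) -> str:
    """Quote the install step so the outer shell does not parse <, >, or &&
    inside it, then chain into the main command on success."""
    return f"bash -c {shlex.quote(installation_command)} && {main_command}"

# wrap_install("pip install 'datasets<4'", "python prepare.py") yields a single
# quoted bash -c argument; the <4 is never seen as a redirection by the outer shell.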

nemo_skills/dataset/utils.py (1)

241-256: Helper is straightforward; consider documenting name expectations

get_dataset_installation_command is simple and robust (graceful on missing modules/attrs). The only subtle assumption is that dataset_name matches the Python module path under nemo_skills.dataset (e.g., livecodebench or code.livecodebench), so datasets coming from extra_datasets or with different CLI aliases won’t participate in this mechanism.

If you expect non-core or aliased datasets to define INSTALLATION_COMMAND in the future, it’s worth documenting this expectation in the docstring (and possibly threading data_dir/extra_datasets_type through later) so callers know which names will work.
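
One way to spell that out in the docstring; the body below is a sketch consistent with the behavior described (graceful on missing modules and attributes), not necessarily the PR's exact code:

import importlib

def get_dataset_installation_command(dataset_name: str) -> str | None:
    """Return the dataset's INSTALLATION_COMMAND, or None if it defines none.

    Note: dataset_name must match the module path under nemo_skills.dataset
    (e.g. "livecodebench" or "code.livecodebench"); datasets coming from
    extra_datasets or referenced by CLI aliases will not be found here.
    """
    try:
        module = importlib.import_module(f"nemo_skills.dataset.{dataset_name}")
    except ModuleNotFoundError:
        return None
    return getattr(module, "INSTALLATION_COMMAND", None)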

nemo_skills/pipeline/prepare_data.py (1)

21-21: Dataset installation aggregation works; heuristics are a bit loose but safe

The integration is sound:

  • prepare_data now discovers per-dataset INSTALLATION_COMMANDs via get_dataset_installation_command(...),
  • aggregates them into a single " && "-joined string,
  • and passes that as installation_command into _run_cmd, so the new exp-level wrapper will run them once per job before the main prepare command.

A couple of minor points you may want to refine later:

  • The heuristic "any ctx.args entry that does not start with - is a dataset name" will also see values of flags (e.g., the test in --split test) and Hydra-style extras; these produce a harmless ModuleNotFoundError → None, but add a few wasted import attempts.
  • If the same dataset is specified multiple times, its INSTALLATION_COMMAND will be repeated in the aggregated string; you could mirror the deduplication you already do in prepare_eval_commands (a minimal sketch follows below).

Neither is functionally blocking given the current defensive implementation.

Also applies to: 94-112, 169-190
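
The dedup suggested in the second bullet can stay order-preserving with a one-liner; a hypothetical sketch, not the PR's code:

def dedupe_preserving_order(commands: list[str]) -> list[str]:
    """Drop repeated install commands while keeping first-seen order
    (dicts preserve insertion order in Python 3.7+)."""
    return list(dict.fromkeys(commands))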

nemo_skills/pipeline/utils/eval.py (1)

24-25: Per-benchmark installation command aggregation is consistent with eval’s usage

This wiring looks correct:

  • dataset_installation_commands is built while enumerating benchmarks/groups, using get_dataset_installation_command(benchmark) and deduping by command string.
  • You then aggregate once at the end (and in the num_jobs == 0 early return) into aggregated_dataset_installation_command joined with " && ", which matches how eval.py expects to combine and forward it.
  • Importing get_dataset_module and import_from_path from nemo_skills.dataset.utils also lines up with their definitions there, so call sites here stay consistent.

The only assumption worth keeping in mind is that benchmark_args.name matches the dataset module path under nemo_skills.dataset; benchmarks coming purely from extra_datasets won’t currently participate in this mechanism unless they’re also importable under that namespace, which is a reasonable first step.

Also applies to: 295-339, 370-377, 508-514

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c79641 and 232cd2b.

📒 Files selected for processing (7)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/utils.py (1 hunks)
  • nemo_skills/pipeline/eval.py (7 hunks)
  • nemo_skills/pipeline/prepare_data.py (3 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/pipeline/utils/exp.py (1 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (2)
nemo_skills/pipeline/prepare_data.py (1)
nemo_skills/dataset/utils.py (1)
  • get_dataset_installation_command (241-256)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/dataset/utils.py (1)
  • get_dataset_installation_command (241-256)
🔇 Additional comments (1)
nemo_skills/pipeline/eval.py (1)

474-497: Installation-command plumbing is consistent; no backward-compatibility issues found

The implementation is sound:

  • prepare_eval_commands returns exactly 3 values in all code paths: (benchmarks_dict, job_batches, aggregated_dataset_installation_command).
  • The single call site at line 474 correctly unpacks all 3 values.
  • The merged final_installation_command flows consistently to all eval-related tasks (main eval, NVEmbed judge, LLM judge, summarize, and group-score).

No additional call sites to update.
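
For completeness, the dataset/user merge described above can be expressed as a small helper. A sketch under two assumptions not confirmed by the diff: either side may be None, and dataset commands run first:

def merge_installation_commands(dataset_cmd: str | None, user_cmd: str | None) -> str | None:
    """Combine the aggregated dataset command with the user-provided one,
    returning None when neither is set."""
    parts = [cmd for cmd in (dataset_cmd, user_cmd) if cmd]
    return " && ".join(parts) if parts else None

# merge_installation_commands("pip install 'datasets<4'", "pip install mypkg")
# -> "pip install 'datasets<4' && pip install mypkg"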

requirements/main.txt (diff excerpt):

  bs4
  click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
- datasets<4 # lcb problem with datasets 4.0.0
+ datasets>=4.0.0
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check requirements/main.txt for datasets constraints
echo "=== datasets constraints in requirements/main.txt ==="
rg -n "datasets" requirements/main.txt

# Check all requirements files
echo -e "\n=== datasets constraints in all requirements files ==="
fd -e txt . requirements | while read -r f; do
  echo "--- $f ---"
  rg -n "datasets" "$f"
done

# Find INSTALLATION_COMMAND uses
echo -e "\n=== INSTALLATION_COMMAND uses ==="
rg -n "INSTALLATION_COMMAND" nemo_skills

# Check livecodebench __init__.py specifically
echo -e "\n=== livecodebench/__init__.py content ==="
cat -n nemo_skills/dataset/livecodebench/__init__.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1116


🏁 Script executed:

#!/bin/bash
# Get full livecodebench/__init__.py
echo "=== Full livecodebench/__init__.py ==="
cat nemo_skills/dataset/livecodebench/__init__.py

# Also check if there are any other datasets version constraints in other config files
echo -e "\n=== Search for datasets in setup files ==="
rg -i "datasets" setup.py pyproject.toml 2>/dev/null || true

Repository: NVIDIA-NeMo/Skills

Length of output: 1240


Reconcile conflicting datasets version constraints

requirements/main.txt pins datasets>=4.0.0 (line 17), while nemo_skills/dataset/livecodebench/__init__.py defines INSTALLATION_COMMAND = "pip install 'datasets<4'" (line 23). These version ranges are mutually exclusive. When prepare/eval jobs for livecodebench execute, the INSTALLATION_COMMAND will attempt to install datasets<4, creating an unsatisfiable constraint conflict with the main requirements or causing a downgrade that violates project expectations.

Align both constraints to the same supported range, or move the legacy requirement behind a separate environment/extra to prevent installation failures or incompatible datasets versions during livecodebench operations.

🤖 Prompt for AI Agents
In requirements/main.txt line 17 and
nemo_skills/dataset/livecodebench/__init__.py around line 23, the project pins
conflicting datasets versions (requirements: datasets>=4.0.0; livecodebench
INSTALLATION_COMMAND: pip install 'datasets<4'), causing impossible installs;
fix by making the constraints consistent: either update
nemo_skills/dataset/livecodebench/__init__.py to use a compatible range such as
"pip install 'datasets>=4.0.0,<5'" (or simply "pip install 'datasets>=4.0.0'")
OR move the legacy datasets<4 requirement into a separate extra/environment for
livecodebench and remove the hard-coded INSTALLATION_COMMAND so main
requirements remain authoritative; apply one of these changes and run dependency
resolution to confirm no conflicts.

