
Conversation

@gwarmstrong (Collaborator) commented Nov 27, 2025

Summary by CodeRabbit

  • New Features

    • Datasets now automatically apply required installation steps before data preparation and evaluation pipelines.
    • Installation commands are properly merged when both dataset-specific and user-provided requirements exist.
  • Chores

    • Updated datasets library version requirement.


@gwarmstrong (Collaborator, Author) commented:

@coderabbitai review

coderabbitai bot (Contributor) commented Dec 2, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai bot (Contributor) commented Dec 2, 2025

Walkthrough

This PR introduces dataset-specific installation command support. A new utility function retrieves installation commands defined by datasets, aggregates them with user-provided commands, and ensures they are executed before pipeline tasks. The datasets package version constraint is updated in requirements.

Changes

Cohort / File(s) and summary:

  • Dataset installation command definition (nemo_skills/dataset/livecodebench/__init__.py):
    Added INSTALLATION_COMMAND constant with value "pip install 'datasets<4'" for pre-preparation setup.
  • Installation command utilities (nemo_skills/dataset/utils.py):
    Added get_dataset_installation_command(dataset_name: str) -> str | None to dynamically retrieve dataset-specific installation commands.
  • Pipeline command aggregation (nemo_skills/pipeline/prepare_data.py, nemo_skills/pipeline/utils/eval.py):
    Parse dataset names from context, collect per-dataset installation commands, aggregate them with " && " separators, and log the results; the aggregated command is passed to the underlying execution.
  • Evaluation pipeline integration (nemo_skills/pipeline/eval.py):
    Merge dataset-specific installation commands with user-provided commands and propagate the final merged command to all task creations (main, judge, and summarize tasks).
  • Shell execution safety (nemo_skills/pipeline/utils/exp.py):
    Quote installation commands with shlex.quote and execute via bash -c to properly handle special characters.
  • Dependency version update (requirements/main.txt):
    Changed the datasets version constraint from <4 to >=4.0.0.
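
For reference, the opt-in surface is small: a dataset module just defines one module-level constant, which the utility then looks up. A minimal sketch of the livecodebench declaration, using the value quoted above (the rest of the module is omitted):

# nemo_skills/dataset/livecodebench/__init__.py (excerpt)
# Datasets opt into pre-preparation setup by defining this constant.
INSTALLATION_COMMAND = "pip install 'datasets<4'"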

Sequence Diagram(s)

sequenceDiagram
    participant Pipeline as Pipeline Entry<br/>(prepare_data/eval)
    participant Utils as Dataset Utils
    participant Dataset as Dataset Module<br/>(e.g., livecodebench)
    participant Aggregator as Command<br/>Aggregator
    participant Executor as Task<br/>Executor

    Pipeline->>Utils: get_dataset_installation_command(name)
    Utils->>Dataset: import & access INSTALLATION_COMMAND
    Dataset-->>Utils: installation_command (if defined)
    Utils-->>Pipeline: installation_command or None

    Pipeline->>Aggregator: collect per-dataset commands
    Aggregator->>Aggregator: join with " && "
    Note over Aggregator: aggregated_dataset_installation_command

    Pipeline->>Pipeline: merge with user_installation_command
    Note over Pipeline: final_installation_command

    Pipeline->>Executor: pass final_installation_command
    Executor->>Executor: quote with shlex.quote()
    Executor->>Executor: execute via bash -c
    Executor-->>Pipeline: success/failure

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Return type signature change: prepare_eval_commands() now returns a 3-tuple instead of 2-tuple; verify all call sites correctly unpack the new third element (aggregated_dataset_installation_command).
  • Command aggregation logic: Verify correct handling of None values, empty lists, and the && joining across prepare_data.py and utils/eval.py (a sketch of the expected pattern follows this list).
  • Bash quoting in exp.py: Ensure shlex.quote() correctly handles edge cases and special characters in installation commands.
  • Deduplication and logging: Confirm duplicate commands are properly deduped and all log statements provide clear diagnostics.
  • Requirements version change: Validate that datasets >=4.0.0 is compatible with the existing codebase and the livecodebench module's intent (datasets<4).
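
On the aggregation bullet, the None/empty handling typically reduces to a filter-then-join. A minimal sketch of the expected pattern (the helper name is hypothetical; the import is the function this PR adds):

from nemo_skills.dataset.utils import get_dataset_installation_command

def aggregate_installation_commands(dataset_names: list[str]) -> str | None:
    """Hypothetical helper: join per-dataset install commands with ' && ',
    skipping datasets that define no INSTALLATION_COMMAND (None) and
    returning None when nothing remains."""
    commands = [get_dataset_installation_command(name) for name in dataset_names]
    present = [cmd for cmd in commands if cmd]  # drops None and empty strings
    return " && ".join(present) if present else None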

Poem

🐰 A rabbit hops through dataset lands,
Collecting scripts with nimble hands,
Commands aggregated, merged with care,
Bash quotes them safely through the air,
Pre-installation magic—prepare!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 66.67%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately summarizes the main change: adding dataset-specific installation command functionality throughout the pipeline.




coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
nemo_skills/dataset/livecodebench/__init__.py (1)

21-23: livecodebench INSTALLATION_COMMAND clashes with global datasets>=4.0.0

This dataset-level command (pip install 'datasets<4') conflicts with the project-wide datasets>=4.0.0 requirement in requirements/main.txt. In a shared env this will either fail pip resolution or leave you running code built for datasets>=4 on an older <4 version.

Please align this command with the global requirement (or segregate environments) so that running livecodebench’s prepare/eval doesn’t destabilize datasets for the rest of the pipelines. See also the comment in requirements/main.txt.

🧹 Nitpick comments (4)
nemo_skills/pipeline/utils/exp.py (1)

380-390: Quoting fix in install_packages_wrap looks correct; env UUID export is redundant

Using shlex.quote(installation_command) and executing via bash -c {quoted_command} correctly treats the full installation command as a single -c script and avoids shell parsing issues for <, >, &&, etc. This is a solid improvement over inlining the raw command.

If you want to trim dead code, setup_env currently exports NEMO_SKILLS_JOB_UUID but that variable is never read (the UUID is only embedded into lock_file). You can either drop the export or refactor the lock logic to actually derive the path from the env var to make the comment accurate.
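
As a sanity check on the quoting behavior, here is a minimal sketch of the pattern (the wrapper name is hypothetical; the real install_packages_wrap also handles locking and env setup):

import shlex

def wrap_install(installation_command: str, main_command: str) -> str:
    """Quote the install step so the outer shell does not parse <, >, or &&
    inside it, then chain into the main command on success."""
    return f"bash -c {shlex.quote(installation_command)} && {main_command}"

# wrap_install("pip install 'datasets<4'", "python prepare.py") yields a single
# quoted bash -c argument; the <4 is never seen as a redirection by the outer shell.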

nemo_skills/dataset/utils.py (1)

241-256: Helper is straightforward; consider documenting name expectations

get_dataset_installation_command is simple and robust (graceful on missing modules/attrs). The only subtle assumption is that dataset_name matches the Python module path under nemo_skills.dataset (e.g., livecodebench or code.livecodebench), so datasets coming from extra_datasets or with different CLI aliases won’t participate in this mechanism.

If you expect non-core or aliased datasets to define INSTALLATION_COMMAND in the future, it’s worth documenting this expectation in the docstring (and possibly threading data_dir/extra_datasets_type through later) so callers know which names will work.
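
One way to spell that out in the docstring; the body below is a sketch consistent with the behavior described (graceful on missing modules and attributes), not necessarily the PR's exact code:

import importlib

def get_dataset_installation_command(dataset_name: str) -> str | None:
    """Return the dataset's INSTALLATION_COMMAND, or None if it defines none.

    Note: dataset_name must match the module path under nemo_skills.dataset
    (e.g. "livecodebench" or "code.livecodebench"); datasets coming from
    extra_datasets or referenced by CLI aliases will not be found here.
    """
    try:
        module = importlib.import_module(f"nemo_skills.dataset.{dataset_name}")
    except ModuleNotFoundError:
        return None
    return getattr(module, "INSTALLATION_COMMAND", None)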

nemo_skills/pipeline/prepare_data.py (1)

21-21: Dataset installation aggregation works; heuristics are a bit loose but safe

The integration is sound:

  • prepare_data now discovers per-dataset INSTALLATION_COMMANDs via get_dataset_installation_command(...),
  • aggregates them into a single " && "-joined string,
  • and passes that as installation_command into _run_cmd, so the new exp-level wrapper will run them once per job before the main prepare command.

A couple of minor points you may want to refine later:

  • The heuristic "any ctx.args entry that does not start with - is a dataset name" will also see values of flags (e.g., the test in --split test) and Hydra-style extras; these produce a harmless ModuleNotFoundError → None, but add a few wasted import attempts.
  • If the same dataset is specified multiple times, its INSTALLATION_COMMAND will be repeated in the aggregated string; you could mirror the deduplication you already do in prepare_eval_commands (a minimal sketch follows below).

Neither is functionally blocking given the current defensive implementation.

Also applies to: 94-112, 169-190
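
The dedup suggested in the second bullet can stay order-preserving with a one-liner; a hypothetical sketch, not the PR's code:

def dedupe_preserving_order(commands: list[str]) -> list[str]:
    """Drop repeated install commands while keeping first-seen order
    (dicts preserve insertion order in Python 3.7+)."""
    return list(dict.fromkeys(commands))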

nemo_skills/pipeline/utils/eval.py (1)

24-25: Per-benchmark installation command aggregation is consistent with eval’s usage

This wiring looks correct:

  • dataset_installation_commands is built while enumerating benchmarks/groups, using get_dataset_installation_command(benchmark) and deduping by command string.
  • You then aggregate once at the end (and in the num_jobs == 0 early return) into aggregated_dataset_installation_command joined with " && ", which matches how eval.py expects to combine and forward it.
  • Importing get_dataset_module and import_from_path from nemo_skills.dataset.utils also lines up with their definitions there, so call sites here stay consistent.

The only assumption worth keeping in mind is that benchmark_args.name matches the dataset module path under nemo_skills.dataset; benchmarks coming purely from extra_datasets won’t currently participate in this mechanism unless they’re also importable under that namespace, which is a reasonable first step.

Also applies to: 295-339, 370-377, 508-514

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7c79641 and 232cd2b.

📒 Files selected for processing (7)
  • nemo_skills/dataset/livecodebench/__init__.py (1 hunks)
  • nemo_skills/dataset/utils.py (1 hunks)
  • nemo_skills/pipeline/eval.py (7 hunks)
  • nemo_skills/pipeline/prepare_data.py (3 hunks)
  • nemo_skills/pipeline/utils/eval.py (5 hunks)
  • nemo_skills/pipeline/utils/exp.py (1 hunks)
  • requirements/main.txt (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-23T17:56:57.556Z
Learnt from: Glorf
Repo: NVIDIA-NeMo/Skills PR: 908
File: requirements/main.txt:16-16
Timestamp: 2025-11-23T17:56:57.556Z
Learning: faiss-cpu must be explicitly listed in requirements/main.txt for BFCLv4 memory evaluations (memory_kv, memory_vector, memory_rec_sum) as it is an optional dependency of sentence_transformers that is required for vector similarity search functionality in the memory backends.

Applied to files:

  • requirements/main.txt
🧬 Code graph analysis (2)
nemo_skills/pipeline/prepare_data.py (1)
nemo_skills/dataset/utils.py (1)
  • get_dataset_installation_command (241-256)
nemo_skills/pipeline/utils/eval.py (1)
nemo_skills/dataset/utils.py (1)
  • get_dataset_installation_command (241-256)
🔇 Additional comments (1)
nemo_skills/pipeline/eval.py (1)

474-497: Installation-command plumbing is consistent; no backward-compatibility issues found

The implementation is sound:

  • prepare_eval_commands returns exactly 3 values in all code paths: (benchmarks_dict, job_batches, aggregated_dataset_installation_command).
  • The single call site at line 474 correctly unpacks all 3 values.
  • The merged final_installation_command flows consistently to all eval-related tasks (main eval, NVEmbed judge, LLM judge, summarize, and group-score).

No additional call sites to update.
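
For completeness, the dataset/user merge described above can be expressed as a small helper. A sketch under two assumptions not confirmed by the diff: either side may be None, and dataset commands run first:

def merge_installation_commands(dataset_cmd: str | None, user_cmd: str | None) -> str | None:
    """Combine the aggregated dataset command with the user-provided one,
    returning None when neither is set."""
    parts = [cmd for cmd in (dataset_cmd, user_cmd) if cmd]
    return " && ".join(parts) if parts else None

# merge_installation_commands("pip install 'datasets<4'", "pip install mypkg")
# -> "pip install 'datasets<4' && pip install mypkg"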

requirements/main.txt (diff excerpt):

  bs4
  click < 8.2.0 # https://github.com/ai-dynamo/dynamo/issues/1039
- datasets<4 # lcb problem with datasets 4.0.0
+ datasets>=4.0.0
coderabbitai bot (Contributor) commented:

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check requirements/main.txt for datasets constraints
echo "=== datasets constraints in requirements/main.txt ==="
rg -n "datasets" requirements/main.txt

# Check all requirements files
echo -e "\n=== datasets constraints in all requirements files ==="
fd -e txt . requirements | while read -r f; do
  echo "--- $f ---"
  rg -n "datasets" "$f"
done

# Find INSTALLATION_COMMAND uses
echo -e "\n=== INSTALLATION_COMMAND uses ==="
rg -n "INSTALLATION_COMMAND" nemo_skills

# Check livecodebench __init__.py specifically
echo -e "\n=== livecodebench/__init__.py content ==="
cat -n nemo_skills/dataset/livecodebench/__init__.py

Repository: NVIDIA-NeMo/Skills

Length of output: 1116


🏁 Script executed:

#!/bin/bash
# Get full livecodebench/__init__.py
echo "=== Full livecodebench/__init__.py ==="
cat nemo_skills/dataset/livecodebench/__init__.py

# Also check if there are any other datasets version constraints in other config files
echo -e "\n=== Search for datasets in setup files ==="
rg -i "datasets" setup.py pyproject.toml 2>/dev/null || true

Repository: NVIDIA-NeMo/Skills

Length of output: 1240


Reconcile conflicting datasets version constraints

requirements/main.txt pins datasets>=4.0.0 (line 17), while nemo_skills/dataset/livecodebench/__init__.py defines INSTALLATION_COMMAND = "pip install 'datasets<4'" (line 23). These version ranges are mutually exclusive. When prepare/eval jobs for livecodebench execute, the INSTALLATION_COMMAND will attempt to install datasets<4, creating an unsatisfiable constraint conflict with the main requirements or causing a downgrade that violates project expectations.

Align both constraints to the same supported range, or move the legacy requirement behind a separate environment/extra to prevent installation failures or incompatible datasets versions during livecodebench operations.

🤖 Prompt for AI Agents
In requirements/main.txt line 17 and
nemo_skills/dataset/livecodebench/__init__.py around line 23, the project pins
conflicting datasets versions (requirements: datasets>=4.0.0; livecodebench
INSTALLATION_COMMAND: pip install 'datasets<4'), causing impossible installs;
fix by making the constraints consistent: either update
nemo_skills/dataset/livecodebench/__init__.py to use a compatible range such as
"pip install 'datasets>=4.0.0,<5'" (or simply "pip install 'datasets>=4.0.0'")
OR move the legacy datasets<4 requirement into a separate extra/environment for
livecodebench and remove the hard-coded INSTALLATION_COMMAND so main
requirements remain authoritative; apply one of these changes and run dependency
resolution to confirm no conflicts.

