Production-grade RL for industrial autonomy. Status: under active development. Train in JAX/Flax/Optax, export to ONNX, and run deterministically in C++20 with safety guards. Designed to plug into real robots and machines alongside classical control and sensor-fusion stacks.
Most RL repos stop at toy sims. RLTK is built for factory floors, yards, ports, mines, and agri:
- IO-agnostic contract: numeric observations in, numeric actions out.
- Deterministic C++ runtime: fixed tick, no heap after init, WCET budgets.
- Safety at runtime: action clamps, rate/jerk limits, novelty/timing monitors, instant fallback signals.
- Portable deployment: ROS2, industrial PCs, and—with proper optimization—PLC/embedded integration via C ABI.
- Auditability: ablation configs, seeds, metrics, artifacts, ONNX + metadata.json contract.
Training (Python/JAX) -> Export (ONNX + metadata) -> Runtime (C++20) -> Adapters (ROS2/PLC/embedded)
- Training stack: JAX/Flax/Optax + Hydra configs + W&B tracking.
- Export layer: frozen params -> ONNX, with strict op set and numerics parity tests.
- Runtime: lightweight C++20 inference loop with safety hooks and deterministic scheduling.
- Operational modes: Shadow / Residual / Primary / Cooperative.
- Integration: thin adapters feed obs/actions to ICTK/SFTK pipelines.
- Agents (JAX): modular policies, losses, buffers, model-based rollouts.
- Experiment engine: Hydra configs, grid/sweep runners, seeded determinism, W&B artifacts.
- Data & logging: episode stores, KPI reporters, exportable artifacts.
- Export & parity: ONNX export, Python↔C++ inference parity tests (tolerance configurable).
- Runtime safety: clamps, rate/jerk limits, novelty/timing monitors, fallback trip signals.
- Modes:
  - Shadow: observe only, log counterfactual actions.
  - Residual: add bounded residual on top of classical controller.
  - Primary: direct control within safety envelope.
  - Cooperative: blend with classical policy via scheduler.
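The four modes differ only in how the policy output is combined with the classical controller's command. The sketch below is one possible reading, written in Python for readability rather than taken from the RLTK runtime; `select_action`, `residual_limit`, and `blend_weight` are hypothetical names, and the fixed blend weight stands in for whatever the scheduler would supply.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    RESIDUAL = "residual"
    PRIMARY = "primary"
    COOPERATIVE = "cooperative"


def clamp(values, lo, hi):
    """Clamp each element to its own [lo, hi] bound."""
    return [min(max(v, l), h) for v, l, h in zip(values, lo, hi)]


def select_action(mode, classical, policy, act_min, act_max,
                  residual_limit=0.1, blend_weight=0.5):
    """Return (command, logged_policy_action) for one tick.

    residual_limit and blend_weight are hypothetical tuning parameters.
    """
    if mode is Mode.SHADOW:
        # The classical controller stays in charge; the policy output is
        # only logged as a counterfactual.
        return list(classical), list(policy)
    if mode is Mode.RESIDUAL:
        # The policy contributes a correction bounded to +/- residual_limit.
        bounded = [min(max(p, -residual_limit), residual_limit) for p in policy]
        command = [c + r for c, r in zip(classical, bounded)]
    elif mode is Mode.PRIMARY:
        # The policy drives directly; the clamp below is the safety envelope.
        command = list(policy)
    elif mode is Mode.COOPERATIVE:
        # A scheduler would choose blend_weight; here it is a fixed argument.
        command = [(1.0 - blend_weight) * c + blend_weight * p
                   for c, p in zip(classical, policy)]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return clamp(command, act_min, act_max), list(policy)
```

In every mode the policy output is returned alongside the command, so Shadow logs and the other modes can share one logging path.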
Legacy baseline (complete/archival):
- Tabular: Bandits, MDP planning, MC, TD, n-step.
- Function Approximation: NN-Q, NN-SARSA, NN n-step Q.
- DQN family: Vanilla, Double, Dueling, PER, Rainbow.
- Policy Gradients: REINFORCE, A2C, TRPO, PPO.
- Actor-Critic: TD3, SAC, REDQ.
- Model-Based RL: MBPO, PETS, PlaNet.
- World Models: Dreamer (V1/V2/V3).
- Latent/World-Model Methods: MuZero, SimPLe.
- Meta-RL: MAML, RL².
- Hierarchical RL: Options, Feudal variants.
- Offline/IL: BC, DAgger, CQL, BRAC.
- Multi-Agent: I-DQN, QMIX, MADDPG.
- Exploration: ICM, RND, NGU, Go-Explore.
- Inverse/Imitation: GAIL, AIRL, SQIL.
- Transformer RL: Decision Transformer, Trajectory Transformer, Gato-style.
- Robust Sim2Real: Domain Randomization, EPOpt, Adversarial RL.
- Real-Time RL: latency-bounded inference, scheduler integration.
- Multi-Task/Transfer: shared backbones, adapters, fine-tune heads.
- Safe/Risk-Sensitive: constrained RL, risk-aware PPO/SAC.
Phases 1–5 remain as archival references. Phase 6+ is the active, production-grade track.
A model is deployable only if it meets:
- Parity: Python vs C++ outputs agree to an L∞ error of ≤1e-5 for fixed test cases.
- Latency: p95 inference ≤2 ms on reference hardware.
- Safety: no observed clamp/rate violations in validation suites; novelty alarms within budget.
- Stability: zero runtime allocations after init; no missed ticks at target Hz.
- KPI uplift: beats classical baseline on scenario KPIs (e.g., autonomy ratio, lateral RMS, stop distance, disengagements/hour, throughput delta).
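The parity and latency gates above are simple enough to check mechanically. The following is a minimal sketch of such a check, assuming the Python and C++ outputs and the per-inference latencies have already been collected into lists; the function names and the nearest-rank percentile definition are assumptions, not part of RLTK.

```python
import math


def linf_error(reference, candidate):
    """Largest absolute element-wise difference between two output vectors."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


def percentile(samples, q):
    """Nearest-rank percentile (q in (0, 100]) of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def check_gates(py_outputs, cpp_outputs, latencies_ms,
                parity_tol=1e-5, p95_budget_ms=2.0):
    """Evaluate the parity and latency gates over paired test cases.

    py_outputs / cpp_outputs: lists of output vectors, one per fixed test case.
    latencies_ms: measured inference latencies in milliseconds.
    """
    worst = max(linf_error(p, c) for p, c in zip(py_outputs, cpp_outputs))
    p95 = percentile(latencies_ms, 95)
    return {
        "parity": worst <= parity_tol,
        "latency": p95 <= p95_budget_ms,
        "worst_linf": worst,
        "p95_ms": p95,
    }
```

Reporting the worst-case error and the measured p95 next to the pass/fail flags keeps the numbers that decided each gate available for audit.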
Artifacts per trained policy:
- model.onnx: fixed opset, static shapes preferred.
- metadata.json, with this minimal schema:
{
"abi_version": "rltk-v1",
"obs_dim": 24,
"act_dim": 4,
"obs_norm": {"mean": [...], "std": [...]},
"act_bounds": {"min": [...], "max": [...]},
"tick_hz": 100.0,
"preprocess": ["clip:obs:-10,10", "scale:act:-1,1"],
"training_commit": "abcdef1234",
"onnx_sha256": "…",
"kpi_ref": "reports/run_2025-08-28.json",
"license": "MIT"
}
Runtime guarantees:
- Fixed-step scheduler at tick_hz, bounded WCET.
- No heap allocs after init; pre-allocated workspaces.
- Safety hooks: clamp -> rate/jerk -> novelty -> fallback signal cascade.
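The safety cascade can be illustrated with a small filter that applies the stages in the stated order. This is a Python sketch of the logic only (the runtime itself is C++20); `SafetyCascade`, `max_step`, and `novelty_limit` are hypothetical names, the rate limit is expressed as a maximum change per tick, and the jerk stage is omitted for brevity since it would follow the same pattern.

```python
def clamp(x, lo, hi):
    """Clamp a scalar to [lo, hi]."""
    return min(max(x, lo), hi)


class SafetyCascade:
    """Clamp -> rate limit -> novelty check -> fallback flag (jerk stage omitted)."""

    def __init__(self, act_min, act_max, max_step, novelty_limit):
        self.act_min = list(act_min)
        self.act_max = list(act_max)
        self.max_step = max_step            # largest allowed change per tick
        self.novelty_limit = novelty_limit  # hypothetical alarm threshold
        # Start from zero, pulled inside the bounds if zero is not allowed.
        self.prev = [clamp(0.0, lo, hi) for lo, hi in zip(act_min, act_max)]

    def step(self, raw_action, novelty_score):
        """Return (safe_action, fallback) for one tick."""
        safe = []
        for a, lo, hi, p in zip(raw_action, self.act_min, self.act_max, self.prev):
            a = clamp(a, lo, hi)                                # stage 1: bounds
            a = clamp(a, p - self.max_step, p + self.max_step)  # stage 2: rate
            safe.append(a)
        self.prev = safe
        # Stages 3 and 4: a novelty score above the limit raises the fallback flag.
        fallback = novelty_score > self.novelty_limit
        return safe, fallback
```

Because the previous action always lies inside the bounds, limiting the step around it cannot push the output back outside them.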
PLC/embedded note: deployment is inference-only. With small nets and static shapes, ONNX or generated kernels can meet real-time budgets on industrial PCs and many modern controllers. Integration via C ABI.
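Since metadata.json is the contract between training and runtime, it is worth rejecting inconsistent files at load time. The sketch below checks the fields shown in the schema above for internal consistency; which keys are mandatory and which ABI versions are accepted are assumptions made for illustration, not documented RLTK behavior.

```python
import json

# Assumed set of mandatory keys and accepted ABI versions.
REQUIRED_KEYS = ("abi_version", "obs_dim", "act_dim",
                 "obs_norm", "act_bounds", "tick_hz")
SUPPORTED_ABI = {"rltk-v1"}


def validate_metadata(meta):
    """Return a list of problems; an empty list means the checks passed."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in meta]
    if problems:
        return problems
    if meta["abi_version"] not in SUPPORTED_ABI:
        problems.append(f"unsupported abi_version: {meta['abi_version']}")
    mean, std = meta["obs_norm"]["mean"], meta["obs_norm"]["std"]
    if len(mean) != meta["obs_dim"] or len(std) != meta["obs_dim"]:
        problems.append("obs_norm length does not match obs_dim")
    if any(s <= 0 for s in std):
        problems.append("obs_norm.std must be strictly positive")
    lo, hi = meta["act_bounds"]["min"], meta["act_bounds"]["max"]
    if len(lo) != meta["act_dim"] or len(hi) != meta["act_dim"]:
        problems.append("act_bounds length does not match act_dim")
    if any(a > b for a, b in zip(lo, hi)):
        problems.append("act_bounds.min exceeds act_bounds.max")
    if meta["tick_hz"] <= 0:
        problems.append("tick_hz must be positive")
    return problems


def load_metadata(text):
    """Parse metadata JSON text and raise ValueError if any check fails."""
    meta = json.loads(text)
    problems = validate_metadata(meta)
    if problems:
        raise ValueError("; ".join(problems))
    return meta
```

A runtime loader could apply equivalent checks during init, before any buffers are sized from obs_dim and act_dim.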
- Train: Python 3.10+, JAX/Flax/Optax, Hydra, W&B.
- Sim: Isaac Sim / Isaac Lab for industrial scenes.
- Export: ONNX.
- Runtime: C++20, ONNX Runtime, C ABI for integration.
- Adapters: ROS2 nodes; PLC/embedded bridges where applicable.
- Init: load ONNX, allocate all buffers, bind threads, warm up.
- Tick: acquire obs -> normalize -> infer -> clamp -> rate/jerk limit -> emit action.
- Monitors: deadline misses, novelty scores, out-of-range inputs.
- Fallback: on violation, raise signal for ICTK controller takeover.
- WCET harness: included tests run warm/cold cycles and write CSVs.
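The tick, monitor, and fallback steps above can be sketched as a single function. This is an illustrative Python version under several assumptions: `run_tick`, `infer`, and `limit` are hypothetical names, the clip range of ±10 is borrowed from the `clip:obs:-10,10` entry in the example metadata, and the deadline is taken to be one tick period.

```python
import time

OBS_CLIP = 10.0  # assumed from the "clip:obs:-10,10" preprocess entry


def run_tick(obs, meta, infer, limit, clock=time.perf_counter):
    """One control tick: normalize -> infer -> limit, with two monitors.

    infer: callable mapping a normalized observation list to a raw action list.
    limit: callable applying clamps and rate/jerk limits to a raw action.
    Returns (action, fallback); fallback is True if any monitor tripped.
    """
    start = clock()
    mean, std = meta["obs_norm"]["mean"], meta["obs_norm"]["std"]
    z = [(o - m) / s for o, m, s in zip(obs, mean, std)]
    # Monitor 1: normalized inputs far outside the expected range.
    out_of_range = any(abs(v) > OBS_CLIP for v in z)
    x = [min(max(v, -OBS_CLIP), OBS_CLIP) for v in z]
    action = limit(infer(x))
    # Monitor 2: the tick took longer than one period at tick_hz.
    deadline_missed = (clock() - start) > 1.0 / meta["tick_hz"]
    # Fallback signal: the caller hands control back to the classical controller.
    return action, (out_of_range or deadline_missed)
```

Passing the clock as an argument keeps the deadline monitor testable with a fake time source; a real harness would record the elapsed times rather than only compare them to the budget.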
- Mid Sept: Base runtime + export + parity + safety hooks ready.
- Sept 30: On-vehicle validation (golf cart) with Shadow/Residual modes.
Focused roadmap. PRs welcome for: determinism, export tooling, runtime safety, adapters, tests. Follow CONTRIBUTING.md.
MIT. See LICENSE.
Built to operate alongside ICTK and SFTK with thin ROS2/PLC adapters for end-to-end autonomy.