feat(docs): Kubeflow Trainer ROADMAP 2026 (#3242)
google-oss-prow[bot] merged 7 commits into kubeflow:master from
Conversation
Pull request overview
This PR adds a comprehensive 2026 roadmap for Kubeflow Trainer, organizing planned features and enhancements into logical topic areas. The roadmap builds upon the 2025 roadmap already present in the file, providing a clear vision for the project's future direction.
Changes:
- Added a new "2026" section to ROADMAP.md with organized categories of planned features
- Included links to tracking issues for most roadmap items
- Grouped items by themes: distributed scheduling, MPI/HPC, observability, data cache, LLM fine-tuning, new runtimes, UI, and integrations
> - Enhanced Multi-Node NVLink Support
> - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
>   multi-cluster job dispatching, topology-aware scheduling, and other features.
> - MPI and HPC on Kubernetes
This looks great! Flux supports Intel MPI and PMIx.
Most of the issues stated in #2751 are not relevant to Flux.
> - Distributed Data Cache
>   - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
>   - Integration with OptimizationJob
>   - Explore RDMA with AI Schedulers and Data Cache
RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.
We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.
We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.
We can also explore how our MPI support might be helpful in this context.
> - Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
>   frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
> - Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
> - Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
This one is specifically interesting to me! We are working on agentic, state machine orchestration, and I already ran a study to deploy Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!
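For reference, a Flux Operator MiniCluster is declared as a single CRD, which is what makes mapping it onto a Trainer runtime plausible. The sketch below follows the shape of the flux-operator examples; the API version, image, and field names are assumptions and should be checked against the installed CRD:

```yaml
# Hypothetical minimal MiniCluster spec from the Flux Operator -- the kind of
# object that could back a Kubeflow Trainer runtime for MPI workloads.
# apiVersion, image, and fields are illustrative; verify against your
# flux-operator version before use.
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: mpi-hello
spec:
  size: 4                      # number of pods in the Flux instance
  containers:
    - image: rockylinux:9      # placeholder image assumed for illustration
      command: mpirun -n 4 hostname
```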
> - Scheduling & Scalability
>   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
>   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
>   - Enhanced Multi-Node NVLink Support
What exactly does this mean? Expanding Trainer into a workload scheduler?
Not exactly. We've had several discussions with @Ronkahn21 about how to integrate TrainJob with the NVIDIA DRA Driver (e.g., ComputeDomain) to improve topology-aware placement for multi-node NVLink on GPUs like the GB200.
@Ronkahn21 will open an issue soon to track this work.
cc @klueska.
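For context, the NVIDIA DRA driver models an NVLink partition as a ComputeDomain resource that workload pods then claim. A rough sketch, based on the upstream dra-driver-gpu examples (field names and API version are assumptions and may differ between driver releases):

```yaml
# Hypothetical ComputeDomain that a TrainJob's pods could claim for
# multi-node NVLink placement. Fields follow the upstream dra-driver-gpu
# examples; verify against the driver version you deploy.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: trainjob-nvlink-domain
spec:
  numNodes: 4                      # nodes participating in the NVLink domain
  channel:
    resourceClaimTemplate:
      name: trainjob-nvlink-channel  # claim template referenced by the pods
```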
> - MPI and HPC on Kubernetes
>   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
>   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
>   - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
Please open a dedicated trainer issue. The mpi-operator and trainer should have separate mechanisms because they are not compatible with each other.
Good point, I will create it soon!
Any additional comments before we can move this forward?
Hey @andreyvelich, the roadmap looks great!
IIRC, back in V1 the Phase 3 goal 'Graduate operator to production grade' listed observability as a graduation criterion: issues for unified job metrics and a Grafana dashboard were filed, but both went stale and never carried over to V2. The controller currently has zero custom Prometheus metrics, no ServiceMonitor, and no dashboard yet.
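To make the missing ServiceMonitor concrete: once the controller exposes a metrics port, wiring it into Prometheus Operator only takes a small manifest along these lines. This is a sketch; the namespace, service labels, and port name are assumptions, not the actual Trainer manifests, and the custom metrics themselves would still need to be registered in the controller:

```yaml
# Hypothetical ServiceMonitor for the Trainer controller metrics endpoint.
# Names and labels are illustrative and must match the real controller
# Service for Prometheus Operator to scrape it.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubeflow-trainer-controller-manager
  namespace: kubeflow-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kubeflow-trainer
  endpoints:
    - port: metrics       # named port on the controller Service
      interval: 30s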
The roadmap looks good to me. There might also be a need to enhance the way Kubeflow enables controllers and users to mutate a TrainJob. This would act as an enabler for #3328 (mutating resources) or even for #3428, allowing mutation of the image that a TrainJob uses. That said, if we do decide to update the Kubeflow Trainer API for the above, we could just use the #3328 roadmap entry as an umbrella to cover this item as well.
The roadmap looks good to me. Perhaps #3428 (or something related to runtime lifecycle management) could be added as a standalone item?
@tenzen-y Any other comments you would like to make before we move this forward?
tenzen-y
left a comment
Thank you!
/lgtm
/approve
/hold
for others; feel free to unhold this PR.
/lgtm
/hold cancel
Rebased the PR to fix E2Es.
terrytangyuan
left a comment
This looks great. Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
* feat(docs): Kubeflow Trainer ROADMAP 2026
* Add Scalability feature
* Add issue for Multi-Node NVLink
* Update ROADMAP.md (Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>)
* Update ROADMAP.md (Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>)
* Add item for Runtime lifecycle management
* Add Observability items

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
I updated the 2026 ROADMAP for Kubeflow Trainer and tried to group the items by topic.
Please let me know what you think, and whether we should add more items 🚀
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @jaiakash @akshaychitneni @robert-bell @vsoch @Ronkahn21 @EkinKarabulut @omer-dayan @kaisoz @kannon92 @mimowo @Fiona-Waters @abhijeet-dhumal
@bigsur0 @shravan-achar @Krishna-kg732 @XploY04 @aniket2405 @johnugeorge @kuizhiqing @franciscojavierarceo @eero-t @kwohlfahrt @stivanov-intercom @nqvuong1998 @trivialfis
/hold