
feat(docs): Kubeflow Trainer ROADMAP 2026 #3242

Merged
google-oss-prow[bot] merged 7 commits into kubeflow:master from andreyvelich:roadmap-2026 on Apr 17, 2026

Conversation

@andreyvelich
Member

I updated the 2026 ROADMAP for Kubeflow Trainer and grouped the items into several topics.

Please let me know what you think, and whether we should add more items 🚀

cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @jaiakash @akshaychitneni @robert-bell @vsoch @Ronkahn21 @EkinKarabulut @omer-dayan @kaisoz @kannon92 @mimowo @Fiona-Waters @abhijeet-dhumal
@bigsur0 @shravan-achar @Krishna-kg732 @XploY04 @aniket2405 @johnugeorge @kuizhiqing @franciscojavierarceo @eero-t @kwohlfahrt @stivanov-intercom @nqvuong1998 @trivialfis

/hold

Contributor

Copilot AI left a comment


Pull request overview

This PR adds a comprehensive 2026 roadmap for Kubeflow Trainer, organizing planned features and enhancements into logical topic areas. The roadmap builds upon the 2025 roadmap already present in the file, providing a clear vision for the project's future direction.

Changes:

  • Added a new "2026" section to ROADMAP.md with organized categories of planned features
  • Included links to tracking issues for most roadmap items
  • Grouped items by themes: distributed scheduling, MPI/HPC, observability, data cache, LLM fine-tuning, new runtimes, UI, and integrations

Comment thread ROADMAP.md
- Enhanced Multi-Node NVLink Support
- First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
multi-cluster job dispatching, topology-aware scheduling, and other features.
- MPI and HPC on Kubernetes
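As a small illustration of the Kueue item above: per the linked Kueue docs, a TrainJob is submitted to a queue via a label. The queue name and runtime name below are hypothetical placeholders, not actual project manifests.

```yaml
# Hypothetical TrainJob submitted to a Kueue LocalQueue.
# "team-queue" and "torch-distributed" are illustrative names only.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-trainjob
  labels:
    kueue.x-k8s.io/queue-name: team-queue  # LocalQueue that admits this job
spec:
  runtimeRef:
    name: torch-distributed  # training runtime to execute
```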
Contributor


This looks great! Flux supports Intel MPI and PMIx.

Contributor


Most of the issues stated in #2751 are not relevant to Flux.

Comment thread ROADMAP.md
- Distributed Data Cache
- Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
- Integration with OptimizationJob
- Explore RDMA with AI Schedulers and Data Cache
Contributor


RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.

Member Author

@andreyvelich Feb 24, 2026


We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.

We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.

We can also explore how our MPI support might be helpful in this context.

Comment thread ROADMAP.md
- Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
- Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
- Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
Contributor


This one is especially interesting to me! We are working on agentic, state-machine orchestration, and I already ran a study deploying Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!

Comment thread ROADMAP.md Outdated
- Scheduling & Scalability
- Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
- KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
- Enhanced Multi-Node NVLink Support
Member


What exactly does this mean? Expanding Trainer into a workload scheduler?

Member Author


Not exactly. We've had several discussions with @Ronkahn21 about how to integrate TrainJob with the NVIDIA DRA Driver (e.g., ComputeDomain) to improve topology-aware placement for multi-node NVLink on GPUs like the GB200.
@Ronkahn21 will open an issue soon to track this work.

cc @klueska.

Comment thread ROADMAP.md Outdated
- MPI and HPC on Kubernetes
- Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
- IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
- PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
Member


Please open a dedicated trainer issue.
The mpi-operator and trainer should have separate mechanisms because they are not compatible with each other.

Member Author


Good point, I will create it soon!

Comment thread ROADMAP.md Outdated
Member Author

@andreyvelich andreyvelich left a comment


Hey Folks, does the ROADMAP look good to you all?
If yes, we can merge this version and iterate moving forward!
/hold cancel
cc @kubeflow/kubeflow-trainer-team

Comment thread ROADMAP.md Outdated
@andreyvelich
Member Author

Any additional comments before we can move this forward?
cc @tenzen-y @astefanutti @akshaychitneni @robert-bell @VassilisVassiliadis

@abhijeet-dhumal
Member

Hey @andreyvelich, the roadmap looks great!
I'm wondering whether it would be good to add two more items under Observability & Reliability:

IIRC, back in v1, Phase 3 ("Graduate operator to production grade") listed observability as a graduation criterion: issues for unified job metrics and a Grafana dashboard were filed, but both went stale and never carried over to v2. The controller currently has no custom Prometheus metrics, no ServiceMonitor, and no dashboard yet.
Peer projects like Kueue and JobSet both ship custom metrics and bundled dashboards as standard. I think these two issues would complete that story for v2. WDYT?
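As an illustration of what the missing ServiceMonitor piece could look like, here is a minimal Prometheus Operator ServiceMonitor sketch. All names, namespaces, labels, and the port are hypothetical assumptions, not the actual Kubeflow Trainer manifests.

```yaml
# Hypothetical ServiceMonitor for scraping the trainer controller manager.
# Every name and label below is an illustrative placeholder.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubeflow-trainer-controller
  namespace: kubeflow-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kubeflow-trainer
  endpoints:
    - port: metrics   # assumes the controller Service exposes a "metrics" port
      interval: 30s
```

A manifest like this, bundled alongside custom controller metrics and a Grafana dashboard, is what the Kueue and JobSet projects ship as standard.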

@VassilisVassiliadis
Contributor

The roadmap looks good to me.

There might also be a need to enhance the way Kubeflow lets controllers/users mutate a TrainJob. This would act as an enabler for #3328 (mutating resources) or even for #3428, allowing mutation of the image that a TrainJob uses. That said, if we decide to update the Kubeflow Trainer API for the above, we could use the #3328 roadmap entry as an umbrella to cover this item as well.

@robert-bell
Contributor

The roadmap looks good to me.

Perhaps #3428 (or something related to runtime lifecycle management) could be added as a standalone item?

@andreyvelich
Member Author

@tenzen-y Any other comments you would like to make before we move this forward?
Happy to refine items in the future too!

Member

@tenzen-y tenzen-y left a comment


Thank you!
/lgtm
/approve

/hold
for others; feel free to unhold this PR.

@abhijeet-dhumal
Member

/lgtm
Thank you! 🚀

@andreyvelich
Member Author

@jaiakash @XploY04 Can you please check why the GPU e2e is failing?

andreyvelich and others added 7 commits April 17, 2026 14:25
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
@andreyvelich
Member Author

/hold cancel

Rebased the PR to fix E2Es.
/cc @tenzen-y @astefanutti @abhijeet-dhumal

Member

@terrytangyuan terrytangyuan left a comment


This looks great. Thanks!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: needs approval from an approver in each of these files:
  • OWNERS [tenzen-y,terrytangyuan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow[bot] merged commit dc254b0 into kubeflow:master on Apr 17, 2026
29 checks passed
@google-oss-prow[bot] added this to the v2.3 milestone on Apr 17, 2026
@andreyvelich deleted the roadmap-2026 branch on April 17, 2026 at 19:12
Goku2099 pushed a commit to Goku2099/trainer that referenced this pull request Apr 25, 2026
* feat(docs): Kubeflow Trainer ROADMAP 2026

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add Scalability feature

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add issue for Multi-Node NVLink

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update ROADMAP.md

Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add item for Runtime lifecycle management

Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>

* Add Observability items

Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>

8 participants