feat(docs): Kubeflow Trainer ROADMAP 2026 (#3242)
google-oss-prow[bot] merged 7 commits into kubeflow:master from
Conversation
Pull request overview
This PR adds a comprehensive 2026 roadmap for Kubeflow Trainer, organizing planned features and enhancements into logical topic areas. The roadmap builds upon the 2025 roadmap already present in the file, providing a clear vision for the project's future direction.
Changes:
- Added a new "2026" section to ROADMAP.md with organized categories of planned features
- Included links to tracking issues for most roadmap items
- Grouped items by themes: distributed scheduling, MPI/HPC, observability, data cache, LLM fine-tuning, new runtimes, UI, and integrations
> - Enhanced Multi-Node NVLink Support
> - First-Class Integration with [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for
>   multi-cluster job dispatching, topology-aware scheduling, and other features.
> - MPI and HPC on Kubernetes
This looks great! Flux supports Intel MPI and PMIx.
Most of the issues stated in #2751 are not relevant to Flux.
> - Distributed Data Cache
>   - Tensor caching to accelerate GPU workloads: https://github.com/kubeflow/trainer/issues/3173
>   - Integration with OptimizationJob
>   - Explore RDMA with AI Schedulers and Data Cache
RDMA will work nicely with MPI, although I suspect you are thinking of some of the Google products for GPU.
We've been chatting with @EkinKarabulut and @akshaychitneni about how we can leverage the Data Cache feature for RDMA. With the appropriate topology placement, data can be transferred directly to GPU nodes using zero-copy.
We believe that this could be highly beneficial for advanced AI data centres using GPUs like the GB200.
We can also explore how our MPI support might be helpful in this context.
> - Implement registration mechanism in the Pipeline Framework to extend plugins and supported ML
>   frameworks in the Kubeflow Trainer: https://github.com/kubeflow/trainer/issues/2750
> - Kubeflow Trainer UI and TrainJob History Server: https://github.com/kubeflow/trainer/issues/2648
> - Integration with Kubeflow MCP Server: https://github.com/kubeflow/sdk/issues/238
This one is specifically interesting to me! We are working on agentic, state machine orchestration, and I already ran a study to deploy Flux MiniClusters in Kubernetes for MPI applications. It would be really easy to extend this to a Kubeflow Trainer spec!
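For reference, a Flux Operator MiniCluster is declared as a single CRD, which is what makes mapping it onto a Trainer runtime plausible. The sketch below follows the shape of the flux-operator examples; the API version, image, and field names are assumptions and should be checked against the installed CRD:

```yaml
# Hypothetical minimal MiniCluster spec from the Flux Operator -- the kind of
# object that could back a Kubeflow Trainer runtime for MPI workloads.
# apiVersion, image, and fields are illustrative; verify against your
# flux-operator version before use.
apiVersion: flux-framework.org/v1alpha2
kind: MiniCluster
metadata:
  name: mpi-hello
spec:
  size: 4                      # number of pods in the Flux instance
  containers:
    - image: rockylinux:9      # placeholder image assumed for illustration
      command: mpirun -n 4 hostname
```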
> - Scheduling & Scalability
>   - Workload-Aware Scheduling for TrainJobs: https://github.com/kubeflow/trainer/issues/3015
>   - KAI Scheduler Integrations: https://github.com/kubeflow/trainer/issues/2628
>   - Enhanced Multi-Node NVLink Support
What exactly does this mean? Expanding Trainer into a workload scheduler?
Not exactly. We've had several discussions with @Ronkahn21 about how to integrate TrainJob with the NVIDIA DRA Driver (e.g., ComputeDomain) to improve topology-aware placement for multi-node NVLink on GPUs like the GB200.
@Ronkahn21 will open an issue soon to track this work.
cc @klueska.
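For context, the NVIDIA DRA driver models an NVLink partition as a ComputeDomain resource that workload pods then claim. A rough sketch, based on the upstream dra-driver-gpu examples (field names and API version are assumptions and may differ between driver releases):

```yaml
# Hypothetical ComputeDomain that a TrainJob's pods could claim for
# multi-node NVLink placement. Fields follow the upstream dra-driver-gpu
# examples; verify against the driver version you deploy.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: trainjob-nvlink-domain
spec:
  numNodes: 4                      # nodes participating in the NVLink domain
  channel:
    resourceClaimTemplate:
      name: trainjob-nvlink-channel  # claim template referenced by the pods
```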
> - MPI and HPC on Kubernetes
>   - Flux Integration for MPI and HPC workloads: https://github.com/kubeflow/trainer/issues/2841
>   - IntelMPI Support: https://github.com/kubeflow/trainer/issues/1807
>   - PMIx Investigation with Flux or Slurm: https://github.com/kubeflow/mpi-operator/issues/12
Please open a dedicated trainer issue. The mpi-operator and trainer should have separate mechanisms because they are not compatible with each other.
Good point, I will create it soon!
Any additional comments before we can move this forward?
Hey @andreyvelich, the roadmap looks great!
IIRC, back in V1 the Phase 3 goal 'Graduate operator to production grade' listed observability as a graduation criterion: issues for unified job metrics and a Grafana dashboard were filed, but both went stale and never carried over to V2. The controller currently has zero custom Prometheus metrics, no ServiceMonitor, and no dashboard yet.
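To make the missing ServiceMonitor concrete: once the controller exposes a metrics port, wiring it into Prometheus Operator only takes a small manifest along these lines. This is a sketch; the namespace, service labels, and port name are assumptions, not the actual Trainer manifests, and the custom metrics themselves would still need to be registered in the controller:

```yaml
# Hypothetical ServiceMonitor for the Trainer controller metrics endpoint.
# Names and labels are illustrative and must match the real controller
# Service for Prometheus Operator to scrape it.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubeflow-trainer-controller-manager
  namespace: kubeflow-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kubeflow-trainer
  endpoints:
    - port: metrics       # named port on the controller Service
      interval: 30s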
The roadmap looks good to me. There might also be a need to enhance the way Kubeflow enables controllers and users to mutate a TrainJob. This would act as an enabler for #3328 (mutating resources) or even for #3428, allowing mutation of the image that a TrainJob uses. That said, if we do decide to update the Kubeflow Trainer API for the above, we could just use the #3328 roadmap entry as an umbrella to cover this item as well.
The roadmap looks good to me. Perhaps #3428 (or something related to runtime lifecycle management) could be added as a standalone item?
@tenzen-y Any other comments you would like to make before we move this forward?
tenzen-y
left a comment
Thank you!
/lgtm
/approve
/hold
for others; feel free to unhold this PR.
/lgtm
/hold cancel
Rebased the PR to fix E2Es.
terrytangyuan
left a comment
This looks great. Thanks!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: tenzen-y, terrytangyuan

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
* feat(docs): Kubeflow Trainer ROADMAP 2026
* Add Scalability feature
* Add issue for Multi-Node NVLink
* Update ROADMAP.md (Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>)
* Update ROADMAP.md (Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>)
* Add item for Runtime lifecycle management
* Add Observability items

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: andreyvelich <andrey.velichkevich@gmail.com>
Co-authored-by: Vassilis Vassiliadis <43679502+VassilisVassiliadis@users.noreply.github.com>
Co-authored-by: Vassilis Vassiliadis <vassilis.vassiliadis@ibm.com>
Signed-off-by: Sameer_yadav <159073326+Goku2099@users.noreply.github.com>
I updated the 2026 ROADMAP for Kubeflow Trainer and tried to group the items by topic.
Please let me know what you think, and whether we should add more items 🚀
cc @kubeflow/kubeflow-trainer-team @kubeflow/kubeflow-sdk-team @jaiakash @akshaychitneni @robert-bell @vsoch @Ronkahn21 @EkinKarabulut @omer-dayan @kaisoz @kannon92 @mimowo @Fiona-Waters @abhijeet-dhumal
@bigsur0 @shravan-achar @Krishna-kg732 @XploY04 @aniket2405 @johnugeorge @kuizhiqing @franciscojavierarceo @eero-t @kwohlfahrt @stivanov-intercom @nqvuong1998 @trivialfis
/hold