Feature: Add the ability to call a saved workflow from another #9093

Open
JPPhoto wants to merge 222 commits into invoke-ai:main from JPPhoto:call-saved-workflows

@JPPhoto commented Apr 29, 2026
Collaborator

Summary

This PR adds engine-native call_saved_workflow support. A parent workflow can select a saved workflow, bind values or connections to its exposed fields, suspend at the call node, execute the selected workflow as first-class child queue item(s), capture explicit named return values, and resume the parent workflow.

The implementation includes backend call-boundary execution, parent/child queue relationship metadata, access validation, dynamic input binding, bounded nested calls, strict child failure propagation, root-oriented retry behavior, child-row cancel/delete semantics, compatibility reporting, and named workflow returns via workflow_return_value, workflow_return, and workflow_return_get.

Related Issues / Discussions

This has been discussed many times on Discord.

QA Instructions

New tests in:

  • tests/app/invocations/test_call_saved_workflows.py
  • tests/app/services/test_workflow_call_batch.py
  • tests/app/services/test_workflow_call_batch_runtime.py
  • tests/app/services/test_workflow_call_compatibility.py
  • tests/app/services/test_workflow_call_runtime.py
  • tests/app/services/test_workflow_graph_builder.py
  • tests/app/services/workflow_call_test_utils.py

Manual return-value test plan (note that this PR includes a database migration!):

  1. Create a child workflow with integer, workflow_return_value, collect, and workflow_return.
  2. Connect integer.value to workflow_return_value.value.
  3. Set workflow_return_value.key to result.
  4. Connect workflow_return_value.value to collect.item.
  5. Connect collect.collection to workflow_return.values.
  6. Save the child workflow.
  7. Create a parent workflow with call_saved_workflow, workflow_return_get, and a downstream node such as add.
  8. Select the saved child workflow in call_saved_workflow.
  9. Connect call_saved_workflow.values to workflow_return_get.values.
  10. Set workflow_return_get.key to result.
  11. Connect workflow_return_get.value to add.a.
  12. Set add.b to a literal value.
  13. Run the parent workflow and confirm add.value equals the returned child value plus add.b.

For batch behavior, replace the child integer with integer_batch connected to integer.value, then return the integer.value through the same workflow_return_value path. The parent should receive result as a list of returned values.
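
The data flow in the manual plan can be modeled in a few lines of plain Python. This is only a toy model: the function names mirror the node names above, but the dict and list shapes (and the `aggregate_batch` helper) are assumptions made for illustration, not the PR's actual types.

```python
def workflow_return_value(key, value):
    # One named return value, as a workflow_return_value node might emit it.
    return {"key": key, "value": value}


def workflow_return(values):
    # Pack the collected named values into one mapping for the caller.
    return {item["key"]: item["value"] for item in values}


def workflow_return_get(values, key):
    # Parent side: look up one named value from the call node's output.
    return values[key]


def aggregate_batch(child_results):
    # Hypothetical batch case: one mapping per child run becomes key -> list.
    merged = {}
    for result in child_results:
        for key, value in result.items():
            merged.setdefault(key, []).append(value)
    return merged


# Single run: the child returns 5 under "result"; the parent adds b = 3.
single = workflow_return([workflow_return_value("result", 5)])
total = workflow_return_get(single, "result") + 3

# Batch run: three child runs fed by an assumed integer batch of 1, 2, 3.
batch = aggregate_batch(
    [workflow_return([workflow_return_value("result", n)]) for n in (1, 2, 3)]
)
```

Under these assumptions `total` is the child value plus `add.b`, and `batch["result"]` is a list with one entry per child run.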

Merge Plan

This branch includes a queue/session migration for workflow-call relationship metadata. Recheck migration ordering against other in-flight migrations before final merge.

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • ❗Changes to a redux slice have a corresponding migration
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

Copilot AI and others added 30 commits March 5, 2026 20:09
* Add per-user workflow isolation: migration 28, service updates, router ownership checks, is_public endpoint, schema regeneration, frontend UI

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

* feat: add shared workflow checkbox to Details panel, auto-tag, gate edit/delete, fix tests

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
…mode (invoke-ai#116)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
… mode (#120)

* Disable Save when editing another user's shared workflow in multiuser mode

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
…-board filter, archive

Co-authored-by: lstein <111189+lstein@users.noreply.github.com>
…image, etc.)

Previously, images in shared boards owned by another user could not be
dragged at all — the draggable setup was completely skipped in
GalleryImage.tsx when canWriteImages was false. This blocked ALL drop
targets including the viewer, reference image pane, and canvas.

Now images are always draggable. The board-move restriction is enforced
in the dnd target isValid functions instead:
- addImageToBoardDndTarget: rejects moves from shared boards the user
  doesn't own (unless admin or board is public)
- removeImageFromBoardDndTarget: same check

Other drop targets (viewer, reference images, canvas, comparison, etc.)
remain fully functional for shared board images.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…-allow-shared-boards' into copilot/enhancement-allow-shared-boards
Stacked on top of origin PR invoke-ai#9018 (shared/private workflows and boards) for multiuser workflow visibility semantics.
JPPhoto and others added 19 commits May 17, 2026 16:01
@Pfannkuchensack

Collaborator

Findings

F1 - HIGH - Unbounded generator resolution + full cartesian-product materialization, reachable from an authenticated read endpoint (DoS)

  • Path: invokeai/app/services/session_processor/workflow_call_batch.py:609, invokeai/app/services/session_queue/session_queue_common.py:585, triggered from invokeai/app/api/routers/workflows.py:55.
  • Evidence chain:
    1. GET /workflows/i/{workflow_id} calls get_workflow_call_compatibility(...) with maximum_children = max_queue_size and the default resolve_generator_items=True (invokeai/app/api/routers/workflows.py:55-61). Note list_workflows deliberately passes resolve_generator_items=False (invokeai/app/api/routers/workflows.py:215), but the single-item GET does not, so a real generator resolution happens on every workflow open.
    2. Generator items are resolved eagerly with no upper bound. _resolve_integer_generator returns [start + i*step for i in range(count)] where count = int(value.get("count", 10)) comes straight from the workflow JSON (invokeai/app/services/session_processor/workflow_call_batch.py:267-275). The same pattern exists for the float and uniform-distribution generators (:258, :289). No ge/le clamp anywhere.
    3. The capacity guard is if calc_session_count(batch) > maximum_children: (workflow_call_batch.py:609). calc_session_count executes data_product = list(product(*data)); return len(data_product) * batch.runs (session_queue_common.py:584-585) - it materializes the ENTIRE cartesian product into a list BEFORE comparing to the 10000 cap. The cap never protects against the materialization itself.
  • Triggering scenario: any user who can save a workflow crafts one whose batch groups are fed by integer generators with large count (e.g. a single count=10**8, or a few groups multiplying to >>10000). Merely fetching that workflow (the editor does this on open) makes the server build a list of 10^8+ ints and/or list(product(...)) of the full product, causing OOM/hang. This is an authenticated read with no execution required.
  • Fix direction: clamp generator count/maxPrompts with a hard le=, and compute product size arithmetically (multiply group lengths, short-circuit at the cap) instead of len(list(product(*data))).
  • To expose this issue, add a test that calls get_workflow_call_compatibility (or hits GET /workflows/i/{id}) for a workflow whose integer generator count is large and asserts it returns an "exceeds capacity" compatibility result without materializing the product (e.g. bounded time/allocation), rather than resolving all items.
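
The arithmetic part of the fix direction could look roughly like the sketch below. `session_count_exceeds` is a hypothetical helper, not a function in the PR; it assumes each batch group's length is known up front (for a generator, its `count`) so that nothing has to be materialized.

```python
from typing import Iterable


def session_count_exceeds(group_lengths: Iterable[int], runs: int, cap: int) -> bool:
    """Return True if prod(group_lengths) * runs would exceed cap.

    Multiplies group lengths one at a time and stops as soon as the running
    total passes the cap, so no cartesian product is ever built.
    """
    lengths = list(group_lengths)
    if any(n < 0 for n in lengths):
        raise ValueError("group lengths must be non-negative")
    if runs <= 0 or any(n == 0 for n in lengths):
        # An empty group (or zero runs) yields zero sessions.
        return False
    total = runs
    if total > cap:
        return True
    for n in lengths:
        total *= n
        if total > cap:
            return True  # short-circuit: remaining groups cannot shrink it
    return False


# A single huge generator is rejected without allocating 10**8 items.
huge = session_count_exceeds([10**8], runs=1, cap=10_000)
# Exactly at the cap is still allowed, matching the "> maximum_children" guard.
at_cap = session_count_exceeds([100, 100], runs=1, cap=10_000)
```

The comparison is deliberately strict (`>`), mirroring the guard quoted in the evidence chain; whether the real cap should be inclusive is left to the implementation.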

F2 - MEDIUM - Bulk cancel methods bypass the parent/child chain, leaving orphaned running children

  • Path: invokeai/app/services/session_queue/session_queue_sqlite.py:553 (cancel_by_batch_ids), :603 (cancel_by_destination), :733 (cancel_by_queue_id).
  • Evidence: cancel_queue_item (:509-510) and delete_queue_item (:515-517) were deliberately made chain-aware via _get_workflow_call_chain_item_ids. The bulk cancel methods were NOT. cancel_by_batch_ids issues a single UPDATE ... SET status='canceled' WHERE batch_id IN (...) AND status != 'in_progress' (:566-572) and never expands to the workflow-call chain. Child workflows are enqueued as separate sessions with their own batch_id (enqueue_workflow_call_child), so cancelling a parent's batch does not reach the child; a child that is in_progress (explicitly excluded by the status != 'in_progress' clause) keeps executing while its waiting parent is force-canceled.
  • Triggering scenario: a call_saved_workflow parent is waiting with a running child; the user cancels the parent's batch from the queue UI. Parent is canceled, child continues, parent never receives the child's completion -> orphaned execution and a canceled-parent/running-child inconsistency.
  • To expose this issue, add a test that enqueues a parent with an in-progress child, calls cancel_by_batch_ids on the parent's batch, and asserts the child is also canceled (or that the chain is handled consistently with cancel_queue_item).
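
A chain-aware bulk cancel could be sketched as below, assuming (as this finding does) that a child can carry a different batch_id than its parent. The `queue` table, its columns, and `cancel_by_batch_ids_chain_aware` are hypothetical stand-ins for illustration; the sketch also cancels an in_progress child directly in SQL, which the real service may need to route through the processor instead.

```python
import sqlite3


def cancel_by_batch_ids_chain_aware(conn, batch_ids):
    """Cancel every non-terminal item whose chain root is shared with a row
    in one of the given batches. Returns the number of rows updated."""
    if not batch_ids:
        return 0
    marks = ",".join("?" for _ in batch_ids)
    cur = conn.execute(
        f"""
        UPDATE queue SET status = 'canceled'
        WHERE status NOT IN ('completed', 'failed', 'canceled')
          AND root_item_id IN (
              SELECT root_item_id FROM queue WHERE batch_id IN ({marks})
          )
        """,
        list(batch_ids),
    )
    return cur.rowcount


# Hypothetical minimal schema: a root row points at itself via root_item_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE queue (item_id INTEGER PRIMARY KEY, batch_id TEXT, "
    "status TEXT, root_item_id INTEGER)"
)
conn.executemany(
    "INSERT INTO queue VALUES (?, ?, ?, ?)",
    [
        (1, "parent-batch", "waiting", 1),      # suspended parent
        (2, "child-batch", "in_progress", 1),   # its running child
        (3, "other-batch", "pending", 3),       # unrelated item
    ],
)
canceled = cancel_by_batch_ids_chain_aware(conn, ["parent-batch"])
statuses = dict(conn.execute("SELECT item_id, status FROM queue"))
```

In this toy setup, cancelling the parent's batch also reaches the child, while the unrelated item stays pending.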

F3 - MEDIUM - cancel_all_except_current / delete_all_except_current can tear down a suspended parent during the pending->in_progress handoff

  • Path: invokeai/app/services/session_queue/session_queue_sqlite.py:769-770 and :693-694.
  • Evidence: the exclusion set comes from _get_current_workflow_call_chain_item_ids (:395-399), which is built from get_current(queue_id) - and get_current returns only the single status='in_progress' row. The bulk status filter was widened to include waiting/pending. When a parent is waiting and its child has been enqueued as pending but not yet picked up by the processor, nothing is in_progress, so get_current returns None, the exclusion set is empty, and the UPDATE cancels/deletes both the waiting parent and the pending child. The "protect the current chain" guard only holds while a descendant is actively in_progress.
  • Triggering scenario: user clicks "cancel all except current" in the narrow window after a child is enqueued but before the worker marks it in_progress; the in-flight workflow-call chain is destroyed despite the intent to preserve the current item.
  • To expose this issue, add a test that puts a parent in waiting with a pending (not yet in_progress) child, calls cancel_all_except_current, and asserts the chain is preserved.
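
One way to close the handoff window is to derive the protected set from more than the single in_progress row. The sketch below works on plain dicts with an assumed `item_id` / `status` / `root_item_id` shape; `protected_chain_ids` is a hypothetical helper, not the PR's `_get_current_workflow_call_chain_item_ids`.

```python
def protected_chain_ids(items):
    """Return the item ids that 'except current' operations should spare."""
    active_roots = {i["root_item_id"] for i in items if i["status"] == "in_progress"}
    if not active_roots:
        # Handoff window: nothing is in_progress yet, but a waiting parent
        # with a pending child is still the live chain.
        waiting_roots = {
            i["root_item_id"] for i in items if i["status"] == "waiting"
        }
        pending_child_roots = {
            i["root_item_id"]
            for i in items
            if i["status"] == "pending" and i["item_id"] != i["root_item_id"]
        }
        active_roots = waiting_roots & pending_child_roots
    return {i["item_id"] for i in items if i["root_item_id"] in active_roots}


handoff = [
    {"item_id": 1, "status": "waiting", "root_item_id": 1},  # parent
    {"item_id": 2, "status": "pending", "root_item_id": 1},  # child, not picked up
    {"item_id": 3, "status": "pending", "root_item_id": 3},  # unrelated
]
spared = protected_chain_ids(handoff)
```

Here `spared` contains the waiting parent and its pending child, so a bulk cancel that excludes those ids would leave the chain intact and only touch the unrelated item.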

F4 - MEDIUM - Owner's own call_saved_workflow node selection is silently cleared when they make their public workflow private

  • Path: invokeai/app/api/sockets.py:420-426 (synthetic emit) + :179 (every user joins workflows:shared); consumer invokeai/frontend/web/src/services/events/setEventListeners.tsx:138-160.
  • Evidence chain:
    1. Every authenticated socket joins workflows:shared on connect (invokeai/app/api/sockets.py:179), including a workflow's owner.
    2. On a public->private transition, _handle_workflow_event emits a synthetic workflow_deleted to the entire workflows:shared room with no owner exclusion (invokeai/app/api/sockets.py:423-426).
    3. The frontend workflow_deleted handler resets workflow_id to '' on every open call_saved_workflow node whose value matches data.workflow_id (invokeai/frontend/web/src/services/events/setEventListeners.tsx:149-159).
  • Triggering scenario: User A has a public workflow W and an open graph with a call_saved_workflow node pointing at W. A toggles W to private. A's own editor receives workflow_deleted for W (still owned and callable by A) and silently clears the node's selection. Admins viewing such a node are affected the same way.
  • To expose this issue, add a frontend logic test around the event-to-field-reset selection (or backend test asserting the synthetic workflow_deleted is not delivered to the owner) verifying an owner-private toggle does not reset the owner's node referencing a still-existing, still-callable workflow.
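
The owner exclusion can be expressed as a small recipient filter. This is a sketch under the assumptions stated in the scenario above: the owner and admins keep access to a workflow that becomes private, and everyone else loses it. `users_losing_access` and the `connected_users` mapping are hypothetical names, not part of sockets.py.

```python
def users_losing_access(owner_id, connected_users):
    """connected_users maps user_id -> is_admin.

    Returns the ids that should be told the workflow is no longer available
    after a public -> private transition.
    """
    return {
        user_id
        for user_id, is_admin in connected_users.items()
        if user_id != owner_id and not is_admin
    }


connected = {"user-a": False, "user-b": False, "admin-1": True}
recipients = users_losing_access("user-a", connected)
```

With this filter only `user-b` would have a call_saved_workflow selection cleared; the owner's and the admin's nodes keep pointing at the still-callable workflow.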

F5 - MEDIUM (i18n) - Backend free-text compatibility message rendered untranslated; structured reason enum ignored

  • Path: invokeai/frontend/web/src/features/workflowLibrary/util/workflowCallCompatibility.ts:26, invokeai/frontend/web/src/features/nodes/components/flow/nodes/Invocation/fields/inputs/SavedWorkflowFieldInputComponent.tsx:161-164, invokeai/frontend/web/src/features/workflowLibrary/components/WorkflowLibrary/WorkflowListItem.tsx:137,168.
  • Evidence: the API now returns a structured WorkflowCallCompatibilityReason enum (ok, missing_workflow_return, multiple_workflow_return, unsupported_node, unsupported_batch_input, invalid_graph, invalid_inputs, unknown) plus a free-text English message. The frontend never consumes reason; it renders the server's message verbatim (<Text>{displayState.compatibilityMessage}</Text>). The translation keys (workflows.savedWorkflowUnsupportedDescription) are only used as a fallback when message is null, so whenever the backend supplies a message the untranslated English string wins. This violates the repo's localization mechanism (invokeai/frontend/web/public/locales/en.json).
  • Fix direction: map reason -> translation keys client-side and use message only as a debug fallback.
  • To expose this issue, add a vitest around getWorkflowCallCompatibilityState/workflowCallCompatibility.ts asserting the UI selects a translation key per reason discriminator rather than echoing the backend message.
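
The reason-to-key mapping is a plain lookup. It is shown in Python only to stay consistent with the other sketches in this thread; the real mapping would live in the TypeScript frontend. Every key below except `workflows.savedWorkflowUnsupportedDescription` is a hypothetical name invented for illustration, and only a few reasons are mapped.

```python
# Fallback key taken from the finding above; it already exists in the frontend.
FALLBACK_KEY = "workflows.savedWorkflowUnsupportedDescription"

# Hypothetical translation keys, one per structured reason.
REASON_TO_KEY = {
    "missing_workflow_return": "workflows.callCompatibility.missingWorkflowReturn",
    "multiple_workflow_return": "workflows.callCompatibility.multipleWorkflowReturn",
    "unsupported_node": "workflows.callCompatibility.unsupportedNode",
    "unsupported_batch_input": "workflows.callCompatibility.unsupportedBatchInput",
}


def compatibility_translation_key(reason):
    """Pick a translation key from the structured reason, never the message."""
    if reason == "ok":
        return None  # compatible: nothing to display
    return REASON_TO_KEY.get(reason, FALLBACK_KEY)


key_for_missing = compatibility_translation_key("missing_workflow_return")
key_for_unknown = compatibility_translation_key("unknown")
```

A test in this style asserts on the selected key per reason, which is what the suggested vitest would check, and leaves the backend message free to serve as a debug-only fallback.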

F6 - LOW - Queue-size accounting ignores waiting; nested-child enqueue has no capacity check

  • Path: invokeai/app/services/session_queue/session_queue_sqlite.py:150-164 (_get_current_queue_size counts only status='pending'), and enqueue_workflow_call_child performs no capacity check.
  • Evidence: _get_current_queue_size is used for the enqueue_batch and retry_items capacity guards but excludes waiting parents, so effective queue occupancy can exceed max_queue_size. Nested child enqueue bypasses the capacity check entirely (the only guard is the per-call maximum_children/remaining_queue_capacity computed in the runtime), so deep/fanned-out workflow-call trees can push the queue past max_queue_size.

F7 - LOW - Synthetic workflow_deleted payload violates the declared event schema

  • Path: invokeai/app/api/sockets.py:425.
  • Evidence: the synthetic emit sends only {"workflow_id": ...}, whereas the real WorkflowDeletedEvent (invokeai/app/services/events/events_common.py) and the generated WorkflowDeletedEvent in invokeai/frontend/web/src/services/api/schema.ts mark user_id/is_public/timestamp as present. The same workflow_deleted channel now carries two differently-shaped payloads; it "works" only because invokeai/frontend/web/src/services/events/types.ts hand-relaxes the type, and any consumer reading data.user_id on the synthetic event gets undefined. Contract drift between generated and hand-written types.

F8 - LOW - list_workflows adds an N+1 query and a best-effort total that can disagree with pages

  • Path: invokeai/app/api/routers/workflows.py:207-229.
  • Evidence: the listing endpoint now does a per-item workflow_records.get(...) inside the page loop (N+1 on a hot endpoint) and computes total=max(len(items), workflows.total - skipped_missing_workflows). When a row is concurrently deleted between get_many and the per-item get, total is adjusted only for the current page while pages comes from the underlying query, so the two can disagree. Cosmetic.

F9 - LOW - Capacity cap enforced in three places with three different messages

  • Path: invokeai/app/services/session_processor/workflow_call_batch.py:609, invokeai/app/services/session_processor/workflow_call_runtime.py:74, :88.
  • Evidence: the child-count cap is checked and raised independently in three locations with three distinct error strings and two separate reads of queue capacity. No single source of truth; the two capacity reads can disagree under concurrent enqueue, risking an inconsistent waiting parent. Maintenance hazard rather than a proven live bug.

Open Questions

  • Q1 - max_workflow_call_depth is hardcoded to 4 (invokeai/app/services/shared/graph.py:1829) with no per-install config and NO static cycle detection. A self-calling or A<->B workflow is permitted up to depth 4 (then raises), so recursion is bounded, but with batch fan-out a self-calling workflow reaches up to B^4 queued children, bounded overall only by max_queue_size. Confirm 4 is intended to be non-configurable and that bounded cycles are acceptable vs. up-front cycle rejection.
  • Q2 - record_waiting_workflow_call_child_completion (invokeai/app/services/shared/graph.py:~2151) aggregates batched child return values in completion order, not enqueue order, and the first-completing child fixes the allowed key set. If callers expect positional correspondence between batch inputs and aggregated outputs, results may be mis-ordered; a later child returning a strict subset of keys yields unequal-length lists silently. Confirm the ordering/equal-length contract.
  • Q3 - Router-level authorization in retry_items_by_id/delete_queue_item authorizes the passed item_id, but the SQL collapses to and acts on root_item_id (invokeai/app/api/routers/session_queue.py vs invokeai/app/services/session_queue/session_queue_sqlite.py). This is safe ONLY because enqueue_workflow_call_child copies the parent's user_id into the child. No divergent path exists in this diff, so it is defense-in-depth, not a live leak; recommend authorizing on root_item_id directly. Also confirm that a child-delete intentionally tears down the parent root (UX behavior change).
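
For Q2, a stable ordering and equal-length contract could be sketched as follows. The `(enqueue_index, returned_mapping)` pairs and the `aggregate_in_enqueue_order` helper are assumptions for illustration; the real aggregation lives in record_waiting_workflow_call_child_completion and may track order differently.

```python
def aggregate_in_enqueue_order(expected_keys, completions):
    """completions: list of (enqueue_index, returned_mapping) in arrival order.

    Returns key -> list of values ordered by enqueue index, and rejects any
    child whose key set differs from expected_keys so lists stay equal-length.
    """
    expected = set(expected_keys)
    merged = {key: [] for key in expected_keys}
    for index, values in sorted(completions, key=lambda pair: pair[0]):
        if set(values) != expected:
            raise ValueError(
                f"child {index} returned keys {sorted(values)}, "
                f"expected {sorted(expected)}"
            )
        for key in expected_keys:
            merged[key].append(values[key])
    return merged


# Children finish out of order: index 2 first, then 0, then 1.
arrivals = [(2, {"result": 30}), (0, {"result": 10}), (1, {"result": 20})]
ordered = aggregate_in_enqueue_order(["result"], arrivals)
```

Under this contract the aggregated list lines up positionally with the batch inputs, and a child returning a strict subset of keys fails loudly instead of producing unequal-length lists.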

@JPPhoto commented Jun 22, 2026

Collaborator Author

@Pfannkuchensack Thanks for the thorough review. I plan to address the following:

  • F1: Add bounded generator/capacity tests first, then prevent eager unbounded resolution and Cartesian-product materialization.
  • F3: Add pending-child handoff tests for both cancel/delete-all-except-current, then preserve the active waiting chain.
  • F4/F7: Test owner/admin and shared-user event handling, then replace the synthetic malformed deletion with a schema-correct access-removal contract.
  • F5: Add frontend tests proving compatibility reasons select localized strings, then stop rendering backend English messages directly.
  • Q2: Add out-of-order child completion tests, then aggregate batch returns in stable enqueue order.
  • Q3: Add authorization tests around child/root ownership, then authorize the root directly while retaining intentional full-chain deletion.

F2 will be tested before changing code. Children currently inherit the parent's batch ID and destination, so its stated failure scenario may not apply. I will add focused bulk batch/destination cancellation tests and only change behavior if they expose a gap.

For F6, I will test and consider an atomic child-enqueue capacity guard, but I do not plan to count suspended waiting parents as pending capacity.

I propose deferring F8 and F9 as low-impact optimization/maintenance work unless testing exposes a correctness issue. Q1 is intentional and documented: recursive calls remain allowed with a fixed runtime depth cap of 4.

JPPhoto added 5 commits June 22, 2026 11:39
Add the missing F2 regression test for destination-scoped bulk cancellation of workflow-call parent/child chains.

Labels

6.14.x, api, backend, docs, frontend, invocations, python, python-tests, services

Projects

Status: 6.14.x Theme: USER EXPERIENCE


4 participants