Request for kernel to return both state and results

### Required prerequisites

- [x] Search the [issue tracker](https://github.com/NVIDIA/cuda-quantum/issues) to check if your feature has already been mentioned or rejected in other issues.

### Describe the feature

This is my team's single most important request. I'm available to discuss and help with this!

After running a kernel, we'd like to be able to get the result (the return value from the kernel) and also carry the resulting state forward for the next kernel.

### Currently:
- If we call `cudaq.run(kernel, state, args)` on our kernel, we get the return value from the kernel, but not the resulting state.
- If we call `cudaq.get_state(kernel, state, args)` on our kernel, we get the resulting state, but not the return value.
- We'd really like both.

### What we'd like, specifically:
- Call a kernel, and get the return values.
- Call another kernel, using the resulting state from the previous kernel, and get the return values.
- ...repeat thousands of times

Note that we don't actually need to get the state back after each kernel call; if there's a way to just let it persist on the simulator for the next kernel (as it would on a physical fault-tolerant quantum computing (FTQC) machine), that's great.
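To make the requested semantics concrete, here is a minimal pure-Python toy model. It is not CUDA-Q code: `ToySession`, its gate methods, and the two kernels are hypothetical names invented for this sketch. It only shows the behaviour we're after, where each call returns the kernel's result while the state stays live for the next call.

```python
import math
import random


class ToySession:
    """Toy state-vector simulator whose state persists across kernel calls.

    Hypothetical illustration only; this is not a CUDA-Q API.
    """

    def __init__(self, num_qubits, seed=0):
        self.amps = [0j] * (1 << num_qubits)
        self.amps[0] = 1 + 0j  # start in the all-zeros basis state
        self.rng = random.Random(seed)

    def x(self, q):
        # Swap each pair of amplitudes that differ only in bit q.
        for i in range(len(self.amps)):
            if not (i >> q) & 1:
                j = i | (1 << q)
                self.amps[i], self.amps[j] = self.amps[j], self.amps[i]

    def h(self, q):
        s = 1 / math.sqrt(2)
        for i in range(len(self.amps)):
            if not (i >> q) & 1:
                j = i | (1 << q)
                a, b = self.amps[i], self.amps[j]
                self.amps[i], self.amps[j] = s * (a + b), s * (a - b)

    def mz(self, q):
        # Sample the outcome, then collapse and renormalize the state.
        p1 = sum(abs(a) ** 2 for i, a in enumerate(self.amps) if (i >> q) & 1)
        bit = 1 if self.rng.random() < p1 else 0
        norm = math.sqrt(p1 if bit else 1 - p1)
        self.amps = [a / norm if ((i >> q) & 1) == bit else 0j
                     for i, a in enumerate(self.amps)]
        return bit

    def run(self, kernel, *args):
        # Run a kernel against the live state and hand back its return value.
        return kernel(self, *args)


def compute_and_measure(sim):
    sim.h(0)
    return sim.mz(0)  # mid-circuit measurement result


def fixup(sim, bit):
    # Classical branch driven by the previous kernel's return value.
    if bit:
        sim.x(0)  # reset qubit 0 to |0>
    sim.x(1)
    return [sim.mz(0), sim.mz(1)]


session = ToySession(num_qubits=2)
bit = session.run(compute_and_measure)  # return value; state stays in the session
results = session.run(fixup, bit)       # continues from the collapsed state
```

Whatever the first measurement yields, the fixup resets qubit 0 and flips qubit 1, so `results` is `[0, 1]`. The point is that nothing is copied out or re-measured between the two calls.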

### Why this is needed:
- In FTQC, mid-circuit measurements are essential. For example (see diagram), we compute a QROM (hundreds of gates), and then to uncompute it we measure the ancilla qubits and do some phase fixup which depends on the measurement results. This may happen thousands of times in a single program.

<img width="1029" height="470" alt="Image" src="https://github.com/user-attachments/assets/353cac5c-6370-4b63-8959-7067f12e0f8a" />

### Our current workaround:
We have a solution, but it's not great. 

1. We call `state = cudaq.get_state(initial_state_kernel, num_qubits)` to create the state
2. Then we call `state = cudaq.get_state(ops_kernel, state, ops_list)` with a list of QPU ops (X gates and such) to evolve the state.
3. Repeat step 2 as needed with more ops.
4. When the ops list includes measurements, then immediately after them we call `results = cudaq.run(fetch_result_kernel, state, ops_list, shots_count=1)`, which re-measures the same qubits (this is wasteful but doesn't change the state) and returns the results.
5. Go to step 2.

- Step 2 is potentially wasteful if the state vector is copied between CPU and GPU on each call, because we never actually need to inspect it.
- Step 4 is definitely wasteful: we wouldn't need a second kernel call to re-measure if we could get the return values (which contain the measurement results) the first time the instructions are run.
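Step 4 relies on the fact that re-measuring an already-measured qubit gives the same outcome and leaves the state unchanged. The sketch below checks that property on a small pure-Python state vector. It is a toy model, not CUDA-Q code; `hadamard` and `measure` are hypothetical helpers written for this example.

```python
import math
import random


def hadamard(amps, q):
    """Return a new amplitude list with H applied to qubit q."""
    s = 1 / math.sqrt(2)
    out = list(amps)
    for i in range(len(amps)):
        if not (i >> q) & 1:
            j = i | (1 << q)
            out[i] = s * (amps[i] + amps[j])
            out[j] = s * (amps[i] - amps[j])
    return out


def measure(amps, q, rng):
    """Measure qubit q in the Z basis; return (bit, collapsed amplitudes)."""
    p1 = sum(abs(a) ** 2 for i, a in enumerate(amps) if (i >> q) & 1)
    bit = 1 if rng.random() < p1 else 0
    norm = math.sqrt(p1 if bit else 1 - p1)
    collapsed = [a / norm if ((i >> q) & 1) == bit else 0j
                 for i, a in enumerate(amps)]
    return bit, collapsed


rng = random.Random(1234)

# Two qubits, with qubit 0 put into an equal superposition.
state = hadamard([1 + 0j, 0j, 0j, 0j], 0)

first_bit, state = measure(state, 0, rng)    # the mid-circuit measurement
second_bit, after = measure(state, 0, rng)   # the redundant re-measurement

same_outcome = first_bit == second_bit
same_state = all(abs(a - b) < 1e-12 for a, b in zip(state, after))
```

Both `same_outcome` and `same_state` hold here, which is why the workaround is correct. It still costs an extra kernel launch per measurement round, and that is the overhead this request would remove.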

### The good part
From the user's point of view, our workaround makes mid-circuit measurements easy, allowing users to branch their code and issue more gates on the same live state, based on measurement results. It's just messy under the hood at the moment.

Request for kernel to return both state and results #4213
