Description
Extreme case where we end up nearly doubling the runtime: after the rewrite, `a + b` is recomputed for every row of `c` instead of only once.
```python
import numpy as np
import pytensor
import pytensor.tensor as pt
from pytensor.compile.mode import Mode

a = pt.tensor("a", shape=(2,))
b = pt.tensor("b", shape=(2,))
c = pt.tensor("c", shape=(100_000, 2))
out = (a + b) + c  # ideal associativity

fn1 = pytensor.function([a, b, c], out, mode="numba", trust_input=True)
with pytensor.config.change_flags(optimizer_verbose=True):
    fn2 = pytensor.function([a, b, c], out, mode=Mode(linker="numba", optimizer=None), trust_input=True)

a_test = np.ones(a.type.shape)
b_test = np.ones(b.type.shape)
c_test = np.ones(c.type.shape)
np.testing.assert_allclose(fn1(a_test, b_test, c_test), fn2(a_test, b_test, c_test))

%timeit fn1(a_test, b_test, c_test)  # 238 μs ± 14.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit fn2(a_test, b_test, c_test)  # 154 μs ± 8.95 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
This happens in a couple of places: `AddCanonizer` and `flatten_nested_add_mul`. We are careful not to do this in the regular Fusion rewrite. It may still make sense to canonicalize as a variadic add, but we may want to specialize it into subsets that reduce the number of flops.
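To make the flop-count argument concrete, here is a minimal NumPy sketch (not PyTensor's actual generated kernel) contrasting the left-associated evaluation with what a flattened variadic `add(a, b, c)` effectively computes elementwise; the function names are illustrative only:

```python
import numpy as np

a = np.ones(2)
b = np.ones(2)
c = np.ones((100_000, 2))

def left_associated(a, b, c):
    # a + b is computed once (2 additions), then broadcast against c
    # (200_000 additions): ~200_002 scalar additions total.
    ab = a + b
    return ab + c

def fused_variadic(a, b, c):
    # Emulates a fused elementwise kernel for add(a, b, c): every row
    # recomputes a + b, so ~400_000 scalar additions total,
    # nearly doubling the work.
    out = np.empty_like(c)
    for i in range(c.shape[0]):
        out[i] = a + b + c[i]
    return out

assert np.array_equal(left_associated(a, b, c), fused_variadic(a, b, c))
```

Both produce the same values; only the amount of redundant work differs, which matches the ~1.5x timing gap in the repro above.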