What about unrolling?
Unrolling the loop should have two benefits: it should allow even better scheduling, and it should reduce the loop management overhead per iteration. For example, unrolling by two gives:
        ldd     [%o0],%f2       ! load f2 for first itern
L513:
        fmuld   %f6,%f2,%f2
        ldd     [%o1],%f4
        faddd   %f2,%f4,%f2
        ldd     [%o0+8],%f8     ! from itern below
        std     %f2,[%o1]
        fmuld   %f6,%f8,%f8
        ldd     [%o1+8],%f4
        add     %o1,16,%o1
        faddd   %f8,%f4,%f8
        ldd     [%o0+16],%f2    ! move fwd from next itern
        add     %o0,16,%o0
        cmp     %o0,%o2
        blu     L513
        std     %f8,[%o1-8]
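A rough C-level view of what the unrolled assembly computes may help. This is only a sketch based on reading the code. It assumes %f6 holds a scalar multiplier, %o0 and %o1 walk two non-overlapping arrays of doubles, and %o2 marks the end of the first array. The function name axpy_unrolled2 and the cleanup step for an odd element count are illustrative additions; the assembly itself appears to assume an even count. The sketch shows the unrolling only. The assembly also moves the first load of the next pass to the bottom of the loop, which in C would be left to the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name, not from the original text.
   Computes y[i] = a * x[i] + y[i] for i in [0, n).
   Assumes x and y do not overlap. */
void axpy_unrolled2(size_t n, double a, const double *x, double *y)
{
    assert(n == 0 || (x != NULL && y != NULL));

    size_t i = 0;

    /* Main loop: two elements per trip, so the pointer update,
       compare and branch are paid once per pair of elements. */
    for (; i + 1 < n; i += 2) {
        double t0 = a * x[i]     + y[i];
        double t1 = a * x[i + 1] + y[i + 1];
        y[i]     = t0;
        y[i + 1] = t1;
    }

    /* Cleanup: at most one leftover element when n is odd. */
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}
```

The two independent temporaries t0 and t1 play the role of %f2 and %f8 in the assembly: because neither depends on the other, the multiply for one element can be overlapped with the add or store for the other.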