Software pipelining didn't help us much in this example. Have we reached the peak performance for this loop?
Try more aggressive scheduling:
        ldd    [%o0],%f8        ! load f8 for first iteration
L313:   ldd    [%o1],%f4
        fmuld  %f6,%f8,%f2      ! use f8 from previous iteration
        ldd    [%o0+8],%f8      ! load f8 for next iteration
        add    %o1,8,%o1
        add    %o0,8,%o0
        faddd  %f2,%f4,%f2      ! use result of fmuld and ldd
        cmp    %o0,%o2
        blu    L313
        std    %f2,[%o1-8]      ! delay slot: use result of faddd
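
For readers who find the schedule easier to follow at source level, here is a rough C sketch of what the loop above appears to compute. It assumes, from how the registers are used, that %o0 walks an input array x, %o1 walks an array y that is updated in place, and %f6 holds a scalar a; the function names and the parameter n are made up for illustration. The second function mirrors the schedule: the load of x is issued one iteration ahead of the multiply that uses it.

```c
#include <stddef.h>

/* Reference version: y[i] = a * x[i] + y[i], no reordering. */
void daxpy_plain(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Same computation with the load of x hoisted one iteration ahead,
 * mirroring the assembly schedule (names are hypothetical). */
void daxpy_pipelined(size_t n, double a, const double *x, double *y)
{
    if (n == 0)
        return;                 /* the assembly assumes at least one iteration */

    double xv = x[0];           /* prologue: load x for the first iteration */
    for (size_t i = 0; i < n; i++) {
        double yv = y[i];       /* ldd  [%o1],%f4 */
        double prod = a * xv;   /* fmuld: uses x loaded in the previous iteration */
        if (i + 1 < n)
            xv = x[i + 1];      /* ldd  [%o0+8],%f8: load x for the next iteration */
        y[i] = prod + yv;       /* faddd, then std in the branch delay slot */
    }
}
```

One difference is deliberate: the assembly loads [%o0+8] unconditionally, so on the last iteration it reads one element past the end of x. The C sketch guards that load to stay within bounds.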