What about unrolling?
Unrolling should have two benefits: better instruction scheduling, and less loop-management overhead per iteration. E.g., with the inner loop unrolled by a factor of 2:
        ldd     [%o0],%f2        ! load B[k,j] for 1st itern
L513:
        fmuld   %f6,%f2,%f2      ! f6 is r, ie A[i,k]
        ldd     [%o1],%f4        ! load C[i,j]
        faddd   %f2,%f4,%f2      ! compute C[i,j]+r*B[k,j]
        ldd     [%o0+8],%f8      ! from itern below
        std     %f2,[%o1]        ! store C[i,j]
        fmuld   %f6,%f8,%f8
        ldd     [%o1+8],%f4
        add     %o1,16,%o1
        faddd   %f8,%f4,%f8
        ldd     [%o0+16],%f2     ! move fwd from next itern
        add     %o0,16,%o0
        cmp     %o0,%o2
        blu     L513
        std     %f8,[%o1-8]
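
For reference, a rough C-level sketch of what a 2x unrolled loop like the one above computes. This is an illustration, not the source the compiler was actually given: the function name axpy_unrolled2, its parameters, and the cleanup step for an odd trip count are all assumptions (the listing shows no cleanup code). Here r stands for A[i,k], b for row k of B, and c for row i of C.

```c
#include <stddef.h>

/* Hypothetical helper: c[j] += r * b[j] for j in [0, n), unrolled by 2.
 * Assumes b and c do not overlap (hence restrict). */
void axpy_unrolled2(size_t n, double r,
                    const double *restrict b, double *restrict c)
{
    size_t j = 0;

    /* Main body: two independent multiply-adds per trip, so there is
     * one index update and one compare/branch per two elements. */
    for (; j + 1 < n; j += 2) {
        double t0 = c[j]     + r * b[j];
        double t1 = c[j + 1] + r * b[j + 1];
        c[j]     = t0;
        c[j + 1] = t1;
    }

    /* Cleanup (assumed): handle the last element when n is odd. */
    if (j < n)
        c[j] += r * b[j];
}
```

The two multiply-adds in the loop body do not depend on each other, which is what lets the scheduler interleave their loads, multiplies, and adds as in the listing; the pointer updates and the compare/branch are paid once per two elements instead of once per element.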