Idea: reorder the execution of the loop nest so that data isn't
evicted from the cache before it's needed again.
Blocking is a combination of two transformations: ``strip mining'' (splitting one loop into an outer loop over blocks and an inner loop within a block), followed by loop interchange. We start with:
    for (i = 0; i < 512; i++)
      for (k = 0; k < 512; k++) {
        r = A[i][k];
        for (j = 0; j < 512; j++)
          C[i][j] += r * B[k][j];
      }

Strip mine the k and j loops:
    for (i = 0; i < 512; i++)
      for (kk = 0; kk < 512; kk += BLKSZ)
        for (k = kk; k < min(kk+BLKSZ, 512); k++) {
          r = A[i][k];
          for (jj = 0; jj < 512; jj += BLKSZ)
            for (j = jj; j < min(jj+BLKSZ, 512); j++)
              C[i][j] += r * B[k][j];
        }
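
As a sanity check, the original and strip-mined loop nests can be compared directly. The following is only a sketch under stated assumptions, not part of the original notes: the block size BLKSZ = 48, the helper names (min_int, fill_inputs, matmul_original, matmul_strip_mined, count_mismatches), the separate output arrays, and the fill pattern are all hypothetical choices made for illustration. The block size is deliberately not a divisor of 512, so the min() bound is exercised on the last block.

```c
#define N 512      /* matrix dimension used in the loops above */
#define BLKSZ 48   /* hypothetical block size; does not divide N on purpose */

/* Static storage keeps the roughly 8 MB of matrices off the stack. */
static double A[N][N], B[N][N];
static double C_ref[N][N];    /* result of the original loop nest */
static double C_strip[N][N];  /* result of the strip-mined loop nest */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Fill A and B with small integer values (arbitrary pattern) and zero both outputs. */
void fill_inputs(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (double)((i + 2 * j) % 7);
            B[i][j] = (double)((3 * i + j) % 5);
            C_ref[i][j] = 0.0;
            C_strip[i][j] = 0.0;
        }
}

/* Original i-k-j loop nest. */
void matmul_original(void) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double r = A[i][k];
            for (int j = 0; j < N; j++)
                C_ref[i][j] += r * B[k][j];
        }
}

/* Same computation with the k and j loops strip mined (no interchange yet). */
void matmul_strip_mined(void) {
    for (int i = 0; i < N; i++)
        for (int kk = 0; kk < N; kk += BLKSZ)
            for (int k = kk; k < min_int(kk + BLKSZ, N); k++) {
                double r = A[i][k];
                for (int jj = 0; jj < N; jj += BLKSZ)
                    for (int j = jj; j < min_int(jj + BLKSZ, N); j++)
                        C_strip[i][j] += r * B[k][j];
            }
}

/* Count the entries where the two results differ. */
int count_mismatches(void) {
    int bad = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (C_ref[i][j] != C_strip[i][j])
                bad++;
    return bad;
}
```

Calling fill_inputs(), then matmul_original() and matmul_strip_mined(), and finally count_mismatches() should report zero mismatches. Strip mining on its own does not change the order in which iterations execute; it only introduces the block loops (kk, jj) that the interchange step can then move outward.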