The min operators in the inner loop bounds are likely to be a performance hit; if we choose a good blocking factor which divides the problem size exactly...
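To make the contrast concrete, here is a minimal sketch of the two loop shapes. The source does not show the loop nest, so a blocked matrix multiply is assumed as the example, and the names `matmul_blocked_min`, `matmul_blocked_exact`, `n`, and `bs` are hypothetical. Python is used only to show the structure; the cost of the `min` calls matters mainly in compiled code, where removing them gives the inner loops a fixed trip count.

```python
def matmul_blocked_min(a, b, n, bs):
    """Blocked n x n multiply; min() clamps the final, possibly partial, block."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Inner bounds need min() because bs may not divide n.
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            c[i][j] += aik * b[k][j]
    return c


def matmul_blocked_exact(a, b, n, bs):
    """Same blocking, assuming bs divides n exactly, so no min() is needed."""
    if n % bs != 0:
        raise ValueError("block size must divide the problem size exactly")
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # Every block is full, so each inner loop runs exactly bs times.
                for i in range(ii, ii + bs):
                    for k in range(kk, kk + bs):
                        aik = a[i][k]
                        for j in range(jj, jj + bs):
                            c[i][j] += aik * b[k][j]
    return c


# Usage: with n = 4 and bs = 2 the two versions agree.
a = [[i * 4 + j for j in range(4)] for i in range(4)]
b = [[(i + j) % 3 for j in range(4)] for i in range(4)]
assert matmul_blocked_min(a, b, 4, 2) == matmul_blocked_exact(a, b, 4, 2)
```

The exact version trades generality for simpler bounds: it rejects any size that the block size does not divide, whereas the `min` version handles any size.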