Going through a 2-D array seems trivial: there’s an inner loop and an outer loop. However, I only really understood some concepts from computer organization theory when I actually tried to access a large 2-D array for image processing. Consider the following scenarios for a 5000 × 5000 array of ints (about 100 MB with 4-byte ints):
int (*arr)[5000] = malloc(sizeof(int[5000][5000]));

// Loop 1 --
for (int i = 0; i < 5000; i++)
{
    for (int j = 0; j < 5000; j++)
    {
        arr[i][j] = /* assume some more complex operation */ 99;
    }
}

// Loop 2 --
for (int i = 0; i < 5000; i++)
{
    for (int j = 0; j < 5000; j++)
    {
        arr[j][i] = /* assume some more complex operation */ 99;
    }
}
Notice that the only difference between the loops is arr[i][j]
vs arr[j][i]
. Loop 1 traverses the array in “row-major” order and Loop 2 in “column-major” order. Which one is faster, and why? Coming from an electrical engineering background and not having paid much attention in any of my computer organization classes in college, I was blissfully unaware that this was even worth thinking about. However, if we test it out quickly:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int main()
{
    int dim = 10000;
    int (*y)[dim] = malloc(sizeof(int[dim][dim]));
    for (int i = 0; i < dim; i++)
    {
        for (int j = 0; j < dim; j++)
        {
            y[i][j] = ((7.91101f * 8113.0f * (double)i) / 133771.0f) + (797.0f * (double)j);
        }
    }
    free(y);
    return 0;
}
Swap i and j in the subscript for the column-major version, build both, run them with the time
command, and we see:
$ time ./test_column_major
real 0m0.308s
user 0m0.237s
sys 0m0.059s
$ time ./test_row_major
real 0m0.163s
user 0m0.101s
sys 0m0.051s
Why is this happening? Well, there’s a great video series that explains it - https://www.youtube.com/watch?v=JogSnkvENr0&list=PLxfrSxK7P38X7XfG4X8Y9cdOURvC7ObMF&index=67
In short, the cache design is why the two access patterns behave differently. On a cache miss, which happens when the processor asks for a piece of memory it hasn’t seen recently, the cache fetches not just that element but the entire cache line containing it (typically 64 bytes, i.e. 16 4-byte ints). Our first access, arr[0][0]
, triggers a cache miss and fetches that element along with a few of its neighbors ALONG THE SAME ROW! This is because the array is stored in memory row by row. Therefore, if we next access arr[0][1]
, no problem: it’s already in the cache. However, if we instead access arr[1][0]
, then (for a row size as large as ours) it is almost certainly NOT in the cache. Going along a column therefore triggers a cache miss on almost every access. Hence the roughly 2x slowdown.
A small caveat: depending on the alignment of arr[0][0]
, the element arr[0][1]
may actually land on the next cache line and thus trigger a miss, but memory allocators usually return suitably aligned blocks, so this costs at most one extra miss per line boundary.