Looping is one of the most common programming tasks. In this post we explore some tricks for sharpening this most basic of skills.

Skill of the Loop

April 6th 2020

One of the first things you learn when programming is how to loop over data structures. When I was learning programming, I would get hung up regularly over simple things like whether to start my loop at index 0 or 1. The famous adage that one of the only complicated things about programming is off-by-one errors rang true. As time has gone by I find myself learning more and more about what at first seemed so simple: iterating over data.

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.

This post assumes you have a basic grasp of hashes, the layout of data in memory, and type casting. If you are not familiar with these topics, I recommend doing some Google searches and perhaps reading the Wikipedia entries on the topics.

Looping Problems

To begin with, looping is about accessing memory addresses and jumping to a new instruction based on a conditional. Accessing memory can be a slow process, especially if the memory being accessed is not already loaded into the CPU cache - or even worse, if it has to be loaded from disk or over the network. A conditional jump to a new instruction is also time-intensive because it disrupts the sequential flow of instructions that a CPU's pipeline is best at processing.

So, to summarize our situation: the fastest code is code that does not involve any jumps to new instructions. If jumps must be performed, it is best if they are not conditional. And if we must have a conditional jump, it is best if it is performed as few times as possible. Finally, if we do have to run a conditional jump over and over (e.g. a loop), we want the data we are accessing repeatedly to be as close to the CPU's registers as we can get it, and we want to avoid pushing it out into the far reaches of our memory stores. One practical consequence: hoist conditionals out of your loops whenever you can, as in the sketch below.
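As a quick illustration of that last point, here is a small Javascript sketch (the function names and data are invented for the example). Both functions do the same work, but the first re-checks a flag on every single iteration while the second checks it exactly once and then runs a tight loop:

function sumSlow(values, onlyEvens) {
  let total = 0;
  for (let i = 0; i < values.length; i++) {
    if (onlyEvens) {                  // this flag is re-checked on every single iteration
      if (values[i] % 2 === 0) total += values[i];
    } else {
      total += values[i];
    }
  }
  return total;
}

function sumFast(values, onlyEvens) {
  let total = 0;
  if (onlyEvens) {                    // the flag is checked exactly once, outside the loop
    for (let i = 0; i < values.length; i++) {
      if (values[i] % 2 === 0) total += values[i];
    }
  } else {
    for (let i = 0; i < values.length; i++) total += values[i];
  }
  return total;
}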

Memory storage is laid out like this:

CPU registers (fastest) -> CPU cache (faster) -> RAM (fast) -> Hard Disk (slow) -> Network (slowest)

Note: modern CPUs have multiple levels of cache (L1, L2, L3), each one progressively larger and slower than the one before it. More expensive CPUs tend to have larger caches, which can noticeably speed up a system by reducing how often it has to fall back to the much slower RAM.

If our loop is only touching data in the CPU registers, we are good. If it is making network calls, we are going to be orders of magnitude slower. We must strike a balance, and there are a lot of tricks to do this.
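To get a feel for the gap, here is a rough sketch you can run in Node.js (not a rigorous benchmark, and the exact numbers will vary wildly by machine). It sums the same array twice, once walking it in order and once in a pre-shuffled random order. The arithmetic is identical; only the memory access pattern changes, so most of the difference you see is the cache at work:

const N = 1 << 24;                                  // ~16 million entries (~64 MB)
const data = new Int32Array(N);
for (let i = 0; i < N; i++) data[i] = i & 0xff;

// Pre-compute a shuffled visiting order so the shuffle itself is not timed.
const order = new Uint32Array(N);
for (let i = 0; i < N; i++) order[i] = i;
for (let i = N - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1));
  const tmp = order[i]; order[i] = order[j]; order[j] = tmp;
}

let t = Date.now(), sum = 0;
for (let i = 0; i < N; i++) sum += data[i];         // sequential: cache-friendly
console.log('sequential:', Date.now() - t, 'ms', sum);

t = Date.now(); sum = 0;
for (let i = 0; i < N; i++) sum += data[order[i]];  // random: constantly missing the cache
console.log('random:    ', Date.now() - t, 'ms', sum);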

Our biggest problem is that in almost all modern languages (especially scripting languages), we have no control over whether data is stored in CPU registers, CPU cache, or RAM. The movement of data between these three is optimized and hidden by the operating system, the compiler or interpreter, and the hardware itself. It used to be, in languages like C, that you could use the register keyword to ask the compiler to try to keep a value in a register. In short, you had a lot of control over these things. Not so much any more.

Older languages like BASIC, Fortran, C, and COBOL did not hide the raw side of these operations. One of the ways they exposed the underlying machine instructions was through keywords like the goto statement.

The goto statement was a natural "English" nod to the assembly instructions that tell the CPU to "jump" to a new address and continue executing there. There is a whole family of these instructions - JMP, JE, JNE, and so on - each mapping directly to machine code the CPU understands. An unconditional jump (JMP) is extremely fast because the target address is known ahead of time, so the CPU can fetch the instructions at the destination before it even gets there.

Conditional jumps are a lot more complicated. For a conditional jump, the CPU is given a target address and a condition to test (typically based on register values set by a preceding comparison); depending on the result, it either jumps to that address or falls through to the next instruction. Because this pattern is so common, modern CPUs and compilers employ numerous optimizations - branch prediction, speculative execution, and so on - so that these seemingly simple tasks do not leave the pipeline stalled while the condition is resolved.
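Branch prediction is the big one here, and you can sometimes see it even from Javascript. The following sketch (results vary a lot by engine and CPU, so treat it as an experiment rather than a guarantee) runs the exact same conditional loop over the same values twice - once in random order and once sorted. When the data is sorted, the branch becomes very predictable and the loop often runs measurably faster:

const COUNT = 10000000;
const unsorted = new Array(COUNT);
for (let i = 0; i < COUNT; i++) unsorted[i] = Math.floor(Math.random() * 256);
const sorted = [...unsorted].sort((a, b) => a - b);

function countLarge(arr) {
  let found = 0;
  for (let i = 0; i < arr.length; i++) {
    if (arr[i] >= 128) found++;   // the conditional jump the CPU is trying to predict
  }
  return found;
}

countLarge(unsorted);             // warm up the JIT so both timed runs use optimized code
countLarge(sorted);

let t = Date.now();
countLarge(unsorted);
console.log('unsorted:', Date.now() - t, 'ms');

t = Date.now();
countLarge(sorted);
console.log('sorted:  ', Date.now() - t, 'ms');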

Now then, the goto statement combined with an if statement is the simplest way to create a generic loop.

Consider the following pseudocode:

1 INT X = 0
10 X = X + 1
15 PRINT X
20 IF X != 10 GOTO 10

Not only is this code easy to read, it is extremely efficient for the compiler and CPU to deal with. The compiler can translate it into machine code almost directly: declare an integer X, increment it by 1, print something to stdout, then compare X to 10. If they do not match, jump back to the instruction at line 10. Basically we have a compare followed by a JNE (jump-if-not-equal).

The fastest way to perform this conditional jump is to store X and the constant 10 in two different CPU registers and keep them there for the duration of the loop. Done that way, the CPU never has to fetch either value from RAM or even from cache. Ignoring the cost of the PRINT, you could scale this loop into the hundreds of millions of iterations and the looping itself would still finish in a fraction of a second.

The goto statement has been all but abandoned, primarily because of the confusion it causes when reading other people's code. The developer community has instead moved to more modern looping syntax, which makes code easier to read but makes the loop a little harder for the language to optimize.

In Javascript, the goto code above can easily be rewritten as a while, for, or do/while loop. All three behave exactly the same way in practice (each prints 1 through 10):

for (var i = 1; i <= 10; i++) console.log(i);

var i = 0;
while (i !== 10) console.log(++i);

var i = 0;
do {
  console.log(++i);
} while (i !== 10)

An optimizing compiler or interpreter (for scripting languages) will be smart enough to convert all of these into the same underlying machine code, and the more mature the compiler or interpreter, the better it tends to be at this. A highly optimizing one may even be smart enough to keep i in a CPU register the whole time, because it can prove that i will always be a small integer. But these optimizations are incredibly complex, and plenty of newer or simpler language implementations cannot do this.

In dynamically typed scripting languages like Javascript, Python, or Ruby, we have the added complexity of runtime type checking and high-level data structures (like lists) that complicate things very quickly.

Consider something like the following, in Python:

  y = 0
  my_list = [1, 2, 3, 4, 5]
  for x in my_list:
    y = y + x

This may not look more complex at first, but the amount of machine code generated is considerable, and performance drops with it.

We have several major problems for the compiler in the above Python code:

  1. Is the type of y always an integer?
  2. Is the storage of my_list sequential in memory?
  3. How do we iterate over my_list if the values are not stored sequentially in memory?

For (1), it may seem like a massive convenience not to have to declare that the variable y is an integer. The consequence, however, is that the compiler must assume at every point in the code that y might not be an integer anymore. What if y suddenly becomes a string? That changes the addition quite a bit.
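The same issue is easy to see in Javascript, where + quietly switches from addition to concatenation depending on what the values happen to be at runtime:

let y = 0;
y = y + 5;      // 5      - numeric addition
y = y + "5";    // "55"   - y silently becomes a string and + now concatenates
y = y + 5;      // "555"  - and it stays a string from here on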

Consider if we changed the code just a little bit:

  y = 0
  my_list = [1, 2, 3, 4, 5]
  for x in my_list:
    y = add(y, x)
    # add() is some function defined heaven-knows-where, the compiler might not know ahead of time

In this case, we might think we are helping things by creating this add() function, but now it is even harder for the compiler to know what in the world add() might return. It might return an integer one moment and a goose the next. So we have to be safe. We must check.
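In Javascript terms, nothing stops add() from being defined like this (a made-up example, purely to show why the engine has to stay paranoid):

function add(a, b) {
  if (Math.random() < 0.5) return a + b;      // sometimes a number
  return { species: 'goose', honks: true };   // sometimes a goose
}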

As a result, dynamically typed languages tend to emit a ton of extra machine code for every single operation - on top of rarely keeping values in registers. Behind the scenes the language is often performing a type check for every single access to a variable. There are optimizations these languages make, but the point is that we cannot know what the compiler is thinking. So the Python loop above will check, every time y is accessed, whether it is an integer or not. And it will check x. And it cannot even be sure that add() is not updating my_list behind the scenes. So to be safe, it checks everything, every time. What would have been a handful of machine instructions in Basic with our goto statement and explicit types is now potentially hundreds of machine instructions.

The problem is made worse by the fact that Python cannot know the types of my_list's elements either. What if one of the elements in the list is a string? Suddenly the seemingly simple operation y = y + x involves two type checks! If the compiler cannot safely assume the types of every element, our simple code above might blow up into something like this by the time it is interpreted (pseudo-code again):

  y:* = 0
  my_list = [1, 2, 3, 4, 5]
  for x:* in my_list:
    if typeof y is Integer and typeof x is Integer
      y = y + x
    else if typeof y is Integer and typeof x is String
      y = String.concatenate(toString(y), x)
    else if typeof y is String and typeof x is Integer
      y = String.concatenate(y, toString(x))
    else if typeof y is String and typeof x is String
      y = String.concatenate(y, x)

What a mess! Type checking is slow, concatenation is slow, and the potential to allocate new memory locations to store strings is even slower. But what if x or y is an Array or an Object or Dictionary or.... heaven knows what? Then we need to add error handling. Our simple desire to avoid strict typing has now turned into a nightmare of hidden complexity for the compiler or interpreter of our language.

Please note I am not singling out Python or Javascript or Ruby or any other dynamically typed language. I am simply pointing out that knowing how this stuff works can help tremendously in understanding what is happening internally and why scripting languages can be so slow.

Scripting languages give us a lot of power in development speed at the cost of CPU speed. They give us the illusion of accomplishing more work more quickly, but they offload the real work onto the CPU. And in loops, especially, this can get us into massive trouble.

Optimizing Your Loops

With this background information, we have the knowledge we need to start optimizing some of our loops. While our hands might be tied in telling the compiler how to interpret our code, we can employ some tricks in most scripting languages to help improve performance of our loops.

First, use local variables and do not reuse them for multiple types. Some engines (like Javascript's JIT compilers) will do their best to inspect the surrounding code and try to prove, before generating machine code, which variables have a fixed type. For example:

for (let i = 0; i < 10; i++) console.log(i);

In this code, i is scoped to the loop thanks to the let keyword, and the only operation performed on it is ++. The Javascript engine can therefore be extremely confident that i will always be an integer, and it may be able to keep i in, or very close to, a CPU register.

But consider the following code:

window.i = 0;
for (; window.i < 10; window.i++) console.log(window.i);

In this code, the loop variable is declared with a much wider scope (as a global on window), and our call to console.log might change its value! The engine cannot be 100% sure that console.log itself has not been replaced - after all, you can redefine console.log in Javascript. What if it swaps window.i out for a string? The engine cannot rule it out, so it cannot optimize nearly as aggressively.
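To see why the engine has to be paranoid, note that nothing stops some other piece of code from doing this (a contrived example, but perfectly legal Javascript):

const originalLog = console.log;
console.log = function (...args) {
  window.i = 'surprise, a string';   // quietly sabotages the loop variable
  return originalLog.apply(console, args);
};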

Consider context, and try to keep your loops simple and tight.

Another optimization you can make is to use dictionaries to pre-store data and avoid repeatedly scanning and comparing.

Consider the following code:

const users = [{ first_name: 'bob', last_name: 'winkle' }, ...] // maybe 1000 users in here

...

function findUsers(queryString) {
    return users.filter(user => user.first_name === queryString || user.last_name === queryString)
}

In this code, every call to findUsers performs a string comparison against first_name and last_name... for every single user. That means a full scan of the array, most likely lots of loading from RAM, and a string comparison behind the strict equality operator === on every check.

It would be better to cache the first_name and last_name values ahead of time in a more efficient data structure. Memory is cheap these days, and if findUsers is going to be called more than once, we can reduce each call down to a single lookup!

const users = [{ first_name: 'bob', last_name: 'winkle' }, ...] // maybe 1000 values in here
const usersMap = {}; // create a dictionary

// This loop is only run once!
users.forEach(user => {
  usersMap[user.first_name] = usersMap[user.first_name] || []
  usersMap[user.first_name].push(user);

  usersMap[user.last_name] = usersMap[user.last_name] || []
  usersMap[user.last_name].push(user);
});

...

function findUsers(queryString) {
    return usersMap[queryString]; // super efficient map lookup! No loop.
}

This method of caching values ahead of time in a dictionary / hashmap is extremely efficient. Dictionary lookups are insanely fast, because they are almost always a core and heavily optimized feature of all scripting languages. Use them often.
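One quick usage note on the sketch above: the map version returns undefined when there is no match, so if you want to preserve the old empty-array behavior you can fall back to it explicitly:

function findUsers(queryString) {
  return usersMap[queryString] || [];   // still a single lookup, and an empty array when nothing matches
}

console.log(findUsers('bob'));      // every user whose first or last name is 'bob'
console.log(findUsers('nobody'));   // []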

Consider the following:


let users = [{ first_name: 'bob', last_name: 'winkle' }, ...] 

users = users.map(user => ({
  ...user,
  full_name: user.first_name + ' ' + user.last_name,
}));

Seems simple enough, right? Not so fast. The object spread creates an entirely new object for every user and copies every property into it, one by one. (The string contents themselves are not duplicated character by character - Javascript strings are immutable and shared by reference under the hood - but every property slot still has to be copied, a brand new object has to be allocated, and the old object becomes garbage for the collector to clean up.) Multiply that by a large array and we are doing a lot of allocating and copying just to add one field.

The following is much more efficient:


let users = [{ first_name: 'bob', last_name: 'winkle' }, ...] 

users = users.map(user => {
  user.full_name = user.first_name + ' ' + user.last_name;
  return user;
});

In fact, testing this myself, the second test is actually twice as fast:

let users = [];
  
for (var i = 0; i < 1000000; i++) {
  users.push({
    first_name: i.toString(),
    last_name: (1000000 - i).toString()
  });
}

let users1 = users.map(user => ({...user})); // make sure to make copies of the user objects for our test
let users2 = users.map(user => ({...user}));

// test 1 (new object creation)
let s = new Date().getTime();
users1 = users1.map(user => ({
  ...user,
  full_name: user.first_name + ' ' + user.last_name
}));

console.log(new Date().getTime() - s);

// test 2 (reusing the same object)
let s2 = new Date().getTime();
users2 = users2.map(user => {
  user.full_name = user.first_name + ' ' + user.last_name;
  return user;
});
console.log(new Date().getTime() - s2);

Running it, I get the following:

joshjung@Joshuas-MacBook-Pro-2 ~/dev$ node index.js 
716
363

This may not seem like a lot, but imagine we add more keys to the user object. Here each user gets five string values, three of them quite long, instead of just two short ones:

let users = [];
  
for (var i = 0; i < 1000000; i++) {
  users.push({
    first_name: i.toString(),
    last_name: (1000000 - i).toString(),
    long_string: 'lorem ipsum this is some super long string that needs to be copied in memory every single time the object is copied.' + i,
    long_string2: 'lorem ipsum this is some other super long string that needs to be copied in memory every single time the object is copied.' + i,
    long_string3: 'lorem ipsum this is some other, other super long string that needs to be copied in memory every single time the object is copied.' + i
  });
}

let users1 = users.map(user => ({...user})); // make sure to make copies of the user objects for our test
let users2 = users.map(user => ({...user}));

// test 1 (new object creation)
let s = new Date().getTime();
users1 = users1.map(user => ({
  ...user,
  full_name: user.first_name + ' ' + user.last_name
}));

console.log(new Date().getTime() - s);

// test 2 (reusing the same object)
let s2 = new Date().getTime();
users2 = users2.map(user => {
  user.full_name = user.first_name + ' ' + user.last_name;
  return user;
});
console.log(new Date().getTime() - s2);

Now my numbers look like this - test 1 is almost five times slower than test 2!

joshjung@Joshuas-MacBook-Pro-2 ~/dev$ node index.js 
1600
323

What is going on here? The reality is that in test 1, we are asking quite a bit of the CPU and memory of our system. It is pretty-looking code, but it tasks the entire system with a mammoth number of operations: allocating a fresh object for every user, iterating over every key to perform the spread, copying each property into the new object, generating garbage for the collector, and pumping data in and out of registers, cache, and RAM over and over - all so we can use some sexy modern operators. And because the code runs longer, we also give the operating system more opportunities to context-switch away from our process, which adds even more to the total run time.

Conclusion

In my experience, spending a little time learning about how looping works behind the scenes, especially in more modern scripting languages, can truly help not only your coding skill, but also your bottom line. When writing code day in and day out, loop after loop, little inefficiencies build up until the simplest of code can take minutes to run. Taking some time and care early on when writing your code can make your entire programming experience more enjoyable. You can build habits that pay off in the long run as you learn what questions to ask about the language you are using so that you do not accidentally create inefficiencies.


One could try to argue that this is "premature optimization". I completely disagree. If I know that wooden tires are slower than rubber ones, it is not premature optimization to use rubber ones from the beginning. Look at the examples above: they all contain roughly the same amount of code. In other words, if I write test 1 first and then optimize it into test 2 later, I have spent at least twice the time. If I know ahead of time that test 2 is the more efficient approach, why not get in the habit of writing it that way from the start? The more you know about the fundamentals, the more often you can choose the efficient way the first time, and the more enjoyable the programming experience becomes for everyone.