Getting Started with Event Sourcing

In this series we work through the need for event sourcing, how to implement it early in a project, and what pitfalls to avoid.

Introduction to Event Sourcing

August 31st 2020

Note: this post is primarily database agnostic, but the query syntax assumes a relational database like Postgres or MySQL. All code is in pseudo-Javascript and your implementation will vary.

When constructing the foundation of a building or something like the Hoover Dam, most people assume the architects will have thought ahead. The hope is that the foundation was not built for just the first few floors or layers of concrete, but was laid in a way that accounted for potential future problems. Unfortunately, I have often seen myself or my colleagues build pieces of software as if we assume it will not last long. There is an implicit assumption that we will fix things later as we encounter them. Sometimes this is appropriate, but some data cannot be reconstructed later. Some data is only available today, and should be collected now so that it is available later when we need it. This post focuses on the thought process for building a data model that requires very little extra code, but gives you a lot of flexibility in the future.

Typical Table Structure

Typically, when creating database tables, it is recommended that, at a minimum, you include the following fields in every row or document in your schema:

MyTable

  • id (int or UUID)
  • created_at (timestamp)
  • updated_at (timestamp)
  • ... other fields

Each of the timestamp fields should have at least millisecond precision, preferably microsecond if possible. (I have worked on a database that had only second precision and it caused massive debugging headaches later because we could not see which rows were inserted first.) When a row is created, the created_at property is set to the millisecond timestamp of the server's current time. When the row is updated, the updated_at is set, so there is at least a little bit of traceability on when rows are being modified. Many frameworks (like Django) provide the ability to manage these two timestamps for you, which is useful so that you do not have to remember to set or update them yourself.

With the fields created_at and updated_at we can ask things like "when was this row last updated?" or "what are all the rows that were updated in the last day?" or "was this row updated when I thought it was?"
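As a quick sketch of those questions as queries (assuming a Postgres-style interval syntax and the same hypothetical query helper used in the examples below):

// Which rows were updated in the last day?
const recentlyUpdated = query("SELECT * FROM MyTable WHERE updated_at > now() - interval '1 day'");

// When was this row last updated? Was it when I thought it was?
const lastUpdated = query("SELECT updated_at FROM MyTable WHERE id = 'some_id'")[0].updated_at;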

For most projects, these two fields are sufficient when combined with basic logging to debug a lot of scenarios. However, as your project grows, you will probably find that it is impossible to trace and revisit the past. And when you need to visit the past, the past does not exist in your database. It will be difficult to, for example, display a line chart of a value as it changes over time. If we have a field like account_balance, and we want to see all of the account_balances in the last week, how can we do it with only updated_at?

Perhaps a History Table?

As projects grow, it is nearly inevitable that stakeholders or clients will want to be able to visualize how things are changing over time. Normally, if someone needs to be able to see when a row was changed, they also want to know who changed it and what they changed. But this does not just help clients; it also helps us debug. After all, the more we know about changes to state in our system, the more easily we can track down problems in our code's ability to update that state.

Now then, we could create a secondary table, like so:

MyTableHistory

  • id
  • mytable_id (foreign key to MyTable)
  • created_at
  • created_by (foreign key to a User)
  • value_key (Text, the key that was updated like 'first_name' or 'last_name' or 'country')
  • old_value (Text or JSON)
  • new_value (Text or JSON)

Note: we do not need an updated_at on this table, because we are assuming that rows in this table will never change. In other words, each row itself is an update, so each row's created_at is when it was updated.

If we use this model, whenever we make an update to MyTable, we can simply create a row in MyTableHistory for every single value in our original table that changed. Excellent, we have a full history - and with our created_by field we can see who made the changes.
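A rough sketch of that write path might look like the following (query, insert, update, now, and currentUserId are hypothetical helpers, not any particular library):

// Update MyTable and record one history row per changed field.
function updateWithHistory(id, changes, currentUserId) {
  const existing = query("SELECT * FROM MyTable WHERE id = '" + id + "'")[0];

  Object.keys(changes).forEach(key => {
    if (existing[key] === changes[key]) return; // only record fields that actually changed

    insert("MyTableHistory", {
      mytable_id: id,
      created_at: now(),
      created_by: currentUserId,
      value_key: key,
      old_value: existing[key],
      new_value: changes[key]
    });
  });

  update("MyTable", id, changes);
}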

But a quick eye will see that our MyTableHistory includes duplicate values. After all, imagine a history with three values:

const MTH = query("SELECT * FROM MyTableHistory ORDER BY created_at ASC") // imagine this returns 3 rows for the same value_key

If we have no bugs in our code, we can expect that for any value_key: MTH[2].old_value is the same as MTH[1].new_value and MTH[1].old_value is the same as MTH[0].new_value. We could instead simply do:

MyTableHistory

  • id
  • mytable_id (foreign key to MyTable)
  • created_at
  • created_by (foreign key to a User)
  • value_key (Text)
  • value (Text or JSON)

If we want to find how much an integer value changed between two updates, we could do the following:

delta_from_mth1_to_mth2 = parseInt(MTH[2].value, 10) - parseInt(MTH[1].value, 10)

Great! For each element in our history, we only need to store the new value, because we trust that the previous MyTableHistory record has the old value in it. In the long run, it is always easier to minimize the number of variables and fields you have, since the less code you have, the fewer bugs you are likely to encounter. In this case, we have eliminated code without decreasing features.
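The write path sketched earlier shrinks along with the schema. Assuming the same hypothetical insert, update, and now helpers, it might now look like this:

// Update MyTable and record one history row per changed field, storing only the new value.
function updateWithHistory(id, changes, currentUserId) {
  Object.keys(changes).forEach(key => {
    insert("MyTableHistory", {
      mytable_id: id,
      created_at: now(),
      created_by: currentUserId,
      value_key: key,
      value: changes[key] // the previous history row already holds the old value
    });
  });

  update("MyTable", id, changes); // MyTable still duplicates the latest values, for now
}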

In this scenario, we have created a log of all the changes to MyTable, who made those changes, and when.

But at this point, we are left with more duplicated data. MyTable is now expected to have the same values in its fields that are also available in the latest MyTableHistory rows. After all, the most recent row in MyTableHistory for a given field will have the exact same value that is stored in that field on MyTable.

Removing Duplicate Data

At this point the history table is like a side data source, one that we will probably not use for our primary data. We are still treating MyTable as the primary "up-to-date" source of information. But as a result, we end up with some of our data duplicated, and duplicated data should generally be treated as an optimization only, because it means you have to make sure it is always synchronized. What if someone comes along and decides to update MyTable but bypasses inserting entries into MyTableHistory? We can see that the two could easily become out of sync.

Perhaps we could simply eliminate MyTable? Perhaps all we need is MyTableHistory. If we load all the rows in MyTableHistory for a given object, we can "build up" the current state by simply left-folding (more on this in a moment) the values together into the final object.

We can do a query and code like so, to get not only the history but also the current state of an object:

rows = query("SELECT * FROM MyTableHistory WHERE object_id = 'some_object' ORDER BY created_at ASC")

current_state = {}
rows.forEach(row => {
  const key = row.value_key;
  current_state[key] = row.value;
});

When this code finishes, current_state is a single object that contains the latest full state of the object. It might work like this:

const rows = [
    { value_key: 'first_name', value: 'Josh' },
    { value_key: 'last_name', value: 'Jung' },
    { value_key: 'age', value: 35 }
]

current_state = {}
rows.forEach(row => {
  const key = row.value_key;
  current_state[key] = row.value;
});

assert(current_state, {
    'first_name': 'Josh',
    'last_name': 'Jung',
    'age': 35
}); // True

What we are doing is called folding (a specific kind of reducing): combining the values in an Array into a single final object. A fancier way this could be written in Javascript, using the reduce method, is like so:

function leftFold(rows) {
    return rows.reduce((accumulator, row) => Object.assign(accumulator, { [row.value_key]: row.value }), {});
}

In short, this leftFold method above says to start with an empty object {} called the accumulator, and then for each row in rows, merge the accumulator with another object that has the value for that history row. (It is a left fold because Javascript's reduce walks the Array from left to right, oldest row to newest.) Since each history row changes only one value, and we merge them in order so that the most recent one has the highest priority, we end up with a single object that has the most recent value for each key.
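Applied to the example rows from above, it produces the same current_state as the forEach version:

const current_state = leftFold(rows);

assert(current_state, {
    'first_name': 'Josh',
    'last_name': 'Jung',
    'age': 35
}); // True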

While this is very little code, it looks complicated. What benefit does this give us?

We can now rebuild the entire state of MyTable at any point in time. For example, if we simply change our query to be like so:

rows = query("SELECT * FROM MyTableHistory WHERE object_id = 'some_object' AND created_at < '2020-01-01'::date ORDER BY created_at ASC")
state_on_dec_31st_2020 = rightFold(rows);

We now have the entire state of the object as it stood at the very end of 2020, stored in state_on_dec_31st_2020. And with very little code and only one table!
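If you find yourself doing this often, it is easy to wrap into a small helper. This is just an illustrative sketch: stateAsOf is a made-up name, and query and now are the same hypothetical helpers used throughout:

// Rebuild an object's state as of any point in time.
function stateAsOf(mytableId, timestamp) {
  const rows = query(
    "SELECT * FROM MyTableHistory " +
    "WHERE mytable_id = '" + mytableId + "' AND created_at < '" + timestamp + "' " +
    "ORDER BY created_at ASC"
  );
  return leftFold(rows);
}

// The current state is simply the state "as of now"
const current = stateAsOf('some_object', now());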

Conclusion and Next Steps

This process of storing every change to an object over time, and using those changes as the principal source of truth for the object, is called Event Sourcing. It is incredibly useful for auditing, debugging, and logging changes to your objects, and it requires very little code to implement up front while giving massive benefit down the line.

Please see the recommended links below for more information about this approach to storing and constructing data in your system. There is a wealth of knowledge out there on this, from simple services to incredible enterprise systems. The rest of this series will explore the deeper aspects of this approach to data storage, some of the downsides and how to avoid them, and end with a solid implementation that is scalable and will guarantee significant flexibility in your development for years to come.