Getting Started with Event Sourcing

In this series we work through why you might need event sourcing, how to implement it early in a project, and what pitfalls to avoid.

What fields should we save on our events?

September 9th 2020

Now that we have created a basic event sourcing model, you may ask: what other metadata should we save? After all, if we want a single record for each change to our model, we have the opportunity to store a lot of tracing information as well.

To make things more relatable to the "real world" (whatever that means), let's take the abstract example in the previous post and rename things. MyTable is now User. MyTableHistory is now UserEvent.

Here is our Event Sourcing table so far:

id (UUID)
user_id (foreign key to User who this change is for)
created_at (the time this event occurred, so no need for updated_at)
created_by (foreign key to a User who was logged in when this event occurred)
value_key (Text)
value (Text or JSON)
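The table above can be sketched as DDL. This is a minimal illustration using SQLite; the concrete column types (UUID and JSON stored as TEXT) are my assumptions, and a real schema would use your database's native UUID, timestamp, and JSON types.

```python
import sqlite3

# A sketch of the UserEvent table so far. SQLite is used purely for
# illustration; UUIDs and JSON are stored as TEXT here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UserEvent (
        id         TEXT PRIMARY KEY,  -- UUID
        user_id    TEXT NOT NULL,     -- FK to the User this change is for
        created_at TEXT NOT NULL,     -- when this event occurred
        created_by TEXT NOT NULL,     -- FK to the User who was logged in
        value_key  TEXT NOT NULL,     -- which field on User changed
        value      TEXT               -- the new value (text or JSON)
    )
""")
columns = [row[1] for row in conn.execute("PRAGMA table_info(UserEvent)")]
```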

Simplifying the Table

You might notice that we could simplify this a little further. Originally we wanted value_key to name the field on User that changed, and value to hold the new value for that field. Since we are already using a JSON value field to store our change, perhaps we can use a single field to store the delta and have that delta include the field name itself? Like so:

id
user_id
created_at
created_by
value_delta (Text or JSON)

Much simpler. At this point it may feel like we are regressing to something... too simple. But it will make sense; let's give it a chance.

In this way, we can imagine that if we retrieved the record it might look something like this:

{
  "id": "5ca761cb",
  "user_id": "06fbd0e0",
  "created_at": "2020-09-08T21:28:26.025Z",
  "created_by": "f5862b6e",
  "value_delta": {
    "first_name": "william"
  }
}

Very nice.

There is a lot of information packed into one small row! We know that a user with id f5862b6e was logged in and changed the first_name property of the User with id 06fbd0e0 at exactly 2020-09-08T21:28:26.025Z. Perhaps this means that Jerry in sales (f5862b6e) was logged into the portal and updated the customer "william"'s (06fbd0e0) first name. We have an audit log of who changed which information on which account at which time, with only five columns in a single table.
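Producing a row like the one above is a few lines of code. The helper below is a hypothetical sketch (the function name and its defaults are my own), showing how the id and timestamp can be generated at write time while the caller supplies only the interesting parts:

```python
import json
import uuid
from datetime import datetime, timezone

def build_event(user_id, created_by, value_delta):
    """Build one UserEvent row as a dict (hypothetical helper for illustration)."""
    return {
        "id": str(uuid.uuid4()),                            # generated per event
        "user_id": user_id,                                 # who the change is for
        "created_at": datetime.now(timezone.utc).isoformat(),
        "created_by": created_by,                           # who made the change
        "value_delta": json.dumps(value_delta),             # serialized delta
    }

event = build_event("06fbd0e0", "f5862b6e", {"first_name": "william"})
```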

Adding Source System

We have a problem now, though. How do we know whether Jerry was logged into the portal or the mobile application? This might be important information at some point, especially when debugging why the name was not capitalized like it should have been. Since every HTTP request includes the caller's User-Agent header, it is easy to work out which client made the call and store that on the event as well.
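Deriving client_source from the User-Agent might look like the sketch below. The specific substrings are assumptions for illustration; in practice you would match against the User-Agent strings your own clients actually send (or, better, have each client send an explicit identifier).

```python
def client_source_from_user_agent(user_agent: str) -> str:
    """Map an HTTP User-Agent header to a coarse client_source label.

    The substrings below are illustrative assumptions, not real values.
    """
    ua = user_agent.lower()
    if "mobileapp" in ua:      # hypothetical token sent by the mobile client
        return "Mobile"
    if "mozilla" in ua:        # browsers identify as Mozilla-compatible
        return "Portal"
    return "Unknown"

source = client_source_from_user_agent("Mozilla/5.0 (Windows NT 10.0)")
```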

id (UUID)
user_id (UUID)
created_at (Datetime)
created_by (UUID)
value_delta (JSON)
client_source (String)

Now our data might look like this:

{
  "id": "5ca761cb",
  "user_id": "06fbd0e0",
  "created_at": "2020-09-08T21:28:26.025Z",
  "created_by": "f5862b6e",
  "value_delta": {
    "first_name": "william"
  },
  "client_source": "Portal"
}

Now we know for sure that Jerry changed the customer's first name to "william" from the portal client. Jerry swears up and down that the portal gave him no warning to capitalize the name, and now we know which pipeline to fix so it does not happen again. Wonderful.

Versioning

One problem with an event-sourced system is that as the product grows, the schema of a User's values might change. We might add birth date or hair color or favorite fishing spot. Our current model is flexible enough that we can easily add new fields (or ignore old ones) simply by adding new keys to the value_delta JSON, but what if an old field needs to change? For example, perhaps we need to go back through all the events from before version 1.0.0 and capitalize the first letter of every first name. Maybe poor "william" is not the only one. It would be nice to know which version of our system inserted each event, so that we can simply loop through those records and run the cleanup.

id (UUID)
user_id (UUID)
created_at (Datetime)
created_by (UUID)
value_delta (JSON)
client_source (String)
api_major_version (Integer)
api_minor_version (Integer)
api_patch_version (Integer)

In this case, I am assuming you are using something similar to semver for your versioning, but obviously this could change.

Why did I create three fields? Generally, if you want to update things based on a version change, you will want to do so by searching for records that have, say, an api_major_version less than 1. We will discuss indexing more later, but suffice it to say the following SQL is much easier to digest and faster on the database:

SELECT id, value_delta FROM UserEvent WHERE api_major_version < 1

Now we can loop over these records and capitalize the first letter, which I hope would be trivial code in your favorite language. This is much easier than using regular expressions (slow on database tables) to parse a version string like 1.0.0 or 0.99.1: on millions of records, pulling each row, splitting a string, and converting each piece to an integer is far from ideal. You might even be able to use your database's SQL dialect to capitalize the records in place, but that is beyond this series.
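The cleanup loop can be sketched like this. SQLite and the sample rows are stand-ins for illustration; the point is the shape of the query plus a read-modify-write loop in application code.

```python
import json
import sqlite3

# Minimal stand-in table with just the columns the cleanup needs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UserEvent (
        id TEXT PRIMARY KEY,
        value_delta TEXT,
        api_major_version INTEGER
    )
""")
conn.execute("""INSERT INTO UserEvent VALUES
    ('5ca761cb', '{"first_name": "william"}', 0),
    ('aa11bb22', '{"first_name": "Jerry"}', 1)""")

# Pull only the pre-1.0 rows, fix first_name in application code,
# and write the cleaned delta back.
rows = conn.execute(
    "SELECT id, value_delta FROM UserEvent WHERE api_major_version < 1"
).fetchall()
for event_id, raw in rows:
    delta = json.loads(raw)
    name = delta.get("first_name")
    if name:
        delta["first_name"] = name[:1].upper() + name[1:]
        conn.execute(
            "UPDATE UserEvent SET value_delta = ? WHERE id = ?",
            (json.dumps(delta), event_id),
        )
conn.commit()

fixed = json.loads(conn.execute(
    "SELECT value_delta FROM UserEvent WHERE id = '5ca761cb'"
).fetchone()[0])
```

Only rows matching api_major_version < 1 are touched, so events written by newer code are left exactly as they were.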

Note that at this point you could also add fields for the client version, or whatever else your infrastructure may need. The point of this approach is to let the next engineer inject whatever relevant information they can into the row, so that if you have to debug something later, or update records after changes to your stack, it is easy to find those rows and heal your old data.

Grouping our Data

There is one downside to storing our value in JSON: it is harder to index and query changes based on a single field. For example, what if we want to get all the first_name changes in the database?

Since JSON can have a deep tree structure, even though databases can index JSON (e.g. Postgres), you generally want to handpick exactly which sub-fields within the JSON to index. This involves a little more work than a normal integer or string (varchar or text) column.
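Handpicking a sub-field usually means an expression index on the JSON path. Here is a sketch using SQLite's json_extract (available when SQLite is built with JSON support, as recent Python builds are); in Postgres the equivalent would be an index on an expression like ((value_delta->>'first_name')).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE UserEvent (id TEXT PRIMARY KEY, value_delta TEXT)")

# Index exactly one handpicked sub-field of the JSON, not the whole blob.
conn.execute("""
    CREATE INDEX idx_userevent_first_name
    ON UserEvent (json_extract(value_delta, '$.first_name'))
""")

conn.execute("""INSERT INTO UserEvent VALUES
    ('5ca761cb', '{"first_name": "william"}')""")

# Queries that use the same expression can be served by the index.
rows = conn.execute("""
    SELECT id FROM UserEvent
    WHERE json_extract(value_delta, '$.first_name') = 'william'
""").fetchall()
```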

Regardless, our goal is to do something like this:

SELECT id, value_delta FROM UserEvent WHERE type = 'first_name_updated'

Let us go ahead and add a few convenience fields to help us group our data:

id (UUID)
user_id (UUID)
created_at (Datetime)
created_by (UUID)
value_delta (JSON)
client_source (String)
api_major_version (Integer)
api_minor_version (Integer)
api_patch_version (Integer)
type (String)
sub_type (String)
event (String)

Now we can imagine our output data might look like this:

{
  "id": "5ca761cb",
  "user_id": "06fbd0e0",
  "created_at": "2020-09-08T21:28:26.025Z",
  "created_by": "f5862b6e",
  "value_delta": {
    "first_name": "william"
  },
  "client_source": "Portal",
  "api_major_version": 1,
  "api_minor_version": 0,
  "api_patch_version": 0,
  "type": "account",
  "sub_type": "first_name",
  "event": "change"
}

Now we have three grouping keys, type, sub_type, and event, which help us organize our data. All in a table with very few columns.

Want to retrieve all account information for "william"?

SELECT * FROM UserEvent WHERE user_id = '06fbd0e0' AND type = 'account'

Want to retrieve every change to the user, regardless of whether it was an account field or not?

SELECT * FROM UserEvent WHERE user_id = '06fbd0e0' AND event = 'change'

Want to see every interaction of Jerry with "william's" account in the last year that came from the Portal?

SELECT * FROM UserEvent WHERE user_id = '06fbd0e0' AND created_by = 'f5862b6e' AND created_at > '2019-09-09' AND client_source = 'Portal'

We now have massive flexibility and a tiny, maintainable table.
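The queries above can be run end to end against a toy copy of the table. SQLite and the single sample row are stand-ins for illustration; the grouping keys are what make each audit question a one-line WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE UserEvent (
        id TEXT PRIMARY KEY, user_id TEXT, created_by TEXT,
        created_at TEXT, client_source TEXT,
        type TEXT, sub_type TEXT, event TEXT
    )
""")
conn.execute("""INSERT INTO UserEvent VALUES
    ('5ca761cb', '06fbd0e0', 'f5862b6e', '2020-09-08T21:28:26.025Z',
     'Portal', 'account', 'first_name', 'change')""")

# All account events for one user.
account_rows = conn.execute(
    "SELECT id FROM UserEvent WHERE user_id = ? AND type = 'account'",
    ("06fbd0e0",),
).fetchall()

# Everything Jerry did to this account from the Portal in the last year.
jerry_rows = conn.execute(
    """SELECT id FROM UserEvent
       WHERE user_id = ? AND created_by = ?
         AND created_at > '2019-09-09' AND client_source = 'Portal'""",
    ("06fbd0e0", "f5862b6e"),
).fetchall()
```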

Conclusion

It should be clear at this point that we have a lot of flexibility on the type of metadata we can store in the table. Your mileage may vary on what other columns you would like to have, but adding these types of columns is cheap.

Add columns that:

  1. Help you clean up old data when code breaks
  2. Make it easy to debug or remove records that were inserted erroneously

In the next post we will explore indexing in more depth and how to get the most out of this table, particularly in an RDBMS.