Time Disorder

Don't order events by timestamp

03 March 2020

When presented with a series of events, many developers will first be tempted to sort them by time. This is dangerous because timestamps do not provide the strict ordering they've assumed.

Out-of-order events can lead to infrequent but significant bugs: consider "add to basket then checkout" vs "checkout then add to basket".

Instead of timestamps, developers should prefer simple counters and proper conflict detection. Timestamps may still be useful, but should be approached with caution due to the complexities outlined below.

Time resolution is not infinite

What happens when two events have the same timestamp?

When writing to a log file, two entries with the same timestamp are not a problem because the lines in the log file provide the real order of the data. However, import those entries into a database and the original line order is lost. Now, when sorted by timestamp, two entries with the same value may be returned in an undefined order.

The chances of a duplicate timestamp are affected by:

  1. The resolution of the hardware clock and time APIs - running your code on another platform may significantly increase your chance of duplicates.
  2. The resolution of your timestamps - do you store seconds since epoch (resolution 1 second)? nanoseconds? a date string with HH:MM (resolution 1 minute)?
  3. The frequency of events

Events in close proximity may record identical timestamps, causing them to appear shuffled. To make matters worse, this is most likely when a machine is under heavy load.
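As a rough illustration of the problem, here is a minimal Python sketch. The event names, the start time and the `seq` field are all made up for the example, and reversing the list is only a stand-in for a database returning rows in an arbitrary order.

```python
from datetime import datetime, timedelta

# Three events recorded in their true order, 300 ms apart (hypothetical data).
start = datetime(2020, 3, 3, 12, 0, 0, 100_000)
true_order = ["add to basket", "apply voucher", "checkout"]
events = [
    {"seq": i, "name": name, "at": start + timedelta(milliseconds=300 * i)}
    for i, name in enumerate(true_order)
]

# Storing at one-second resolution makes all three timestamps identical.
for event in events:
    event["at"] = event["at"].replace(microsecond=0)

# A database may return rows in any order; reversing simulates one such order.
unordered = list(reversed(events))

# Sorting by timestamp cannot recover the true order because every key is equal.
by_time = [e["name"] for e in sorted(unordered, key=lambda e: e["at"])]

# Sorting by the counter recorded alongside each event can.
by_seq = [e["name"] for e in sorted(unordered, key=lambda e: e["seq"])]

print(by_time)
print(by_seq)
```

With every timestamp equal, the sort simply preserves whatever order the rows arrived in, so "checkout" can appear before "add to basket".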

Clocks can go backwards

If, like me, you experience time in one direction, this is easy to forget. A clock is merely a device to measure time and as such requires calibration and adjustment.

Manual adjustments, like when a user naively changes timezone or corrects a slow clock, are the most likely cause of a jump backwards in time, but automatic changes can also be to blame.

If a developer generates timestamps or stores timezone data incorrectly, the automatic change from daylight saving time could jump events backwards by a whole hour. We have to be particularly careful in the UK, where GMT can happily masquerade as UTC for half the year.

Services like ntpd (Network Time Protocol Daemon) can also cause dramatic clock changes. Depending on configuration, a large drift in system time can cause ntpd to hop immediately to the correct time (possibly backwards). Devices like the Raspberry Pi are particularly vulnerable to this as they are frequently disconnected from a network and have no Real Time Clock.

There are clocks guaranteed to never run backwards, called 'monotonic' clocks, but a timestamp from a monotonic clock is often of little use between reboots, and useless to compare between machines. Generally, a monotonic clock is used to measure a time interval on a single machine.
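For example, Python exposes a monotonic clock as `time.monotonic()`. The sketch below uses it to measure an interval on one machine; the `timed` helper is a hypothetical name of my own, not a standard function.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds).

    time.monotonic() never goes backwards, so the elapsed value is not
    affected by adjustments to the system (wall) clock. Its reference
    point is undefined, so only differences between readings are meaningful.
    """
    started = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - started
    return result, elapsed


result, elapsed = timed(sum, range(1000))
print(result, elapsed)
```

The raw readings are deliberately not stored anywhere: they only make sense relative to each other, on the same machine.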

Intervals can stretch and shrink

Jumps in time can cause problems, so services like ntpd often prefer to slow down or speed up the system clock until it gradually approaches the correct time (this is called 'slew' correction).

Google uses a similar approach for leap seconds, 'smearing' an extra second over a 24-hour period, instead of bamboozling software with a 61-second minute.

Even if you could start a timer on multiple machines at a known instant in time and stop them at another instant, they would likely measure a subtly different elapsed time. The longer the interval, the more apparent manufacturing tolerances will be. As an example, Adafruit advises that its PCF8523-based RTC "may lose or gain a second or two per day".

Clocks are never in sync

A developer may be attracted to timestamps because they're easy to collect at multiple sites then insert into an ordered collection later. However, in addition to all of the above, they must now consider the disparity between multiple system clocks.

When you reply to a chat message on one machine, your reply can easily be recorded with an earlier timestamp than the original if the original was recorded on a different machine.
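The chat example can be simulated in a few lines of Python. This is only a sketch: the `Machine` class, the machine names and the skew values are invented for illustration, and "true time" is something no real system can observe.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    skew: float  # seconds this machine's clock is ahead (+) or behind (-)

    def timestamp(self, true_time: float) -> float:
        """Return what this machine's clock reads at the given true time."""
        return true_time + self.skew


alice = Machine("alice", skew=+2.0)  # clock runs 2 s fast
bob = Machine("bob", skew=-1.5)      # clock runs 1.5 s slow

# Alice sends the original at true time 100.0; Bob replies one second later.
original = {"text": "Lunch?", "at": alice.timestamp(100.0)}  # reads 102.0
reply = {"text": "Yes!", "at": bob.timestamp(101.0)}         # reads 99.5

# Sorting by recorded timestamp puts the reply before the original.
texts = [m["text"] for m in sorted([original, reply], key=lambda m: m["at"])]
print(texts)
```

A combined skew of only a few seconds is enough to reverse the conversation.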

Recommendations

Timestamps are complex. They're difficult to store and generate correctly, they're almost impossible to compare accurately across machines, and they cannot guarantee a strict causal ordering of events.

When you sort data by timestamp it almost always implies a causal relationship (e.g. implying a message happened before its reply, or a form GET happened before a POST). Because of this, techniques that provide a strict (or at least causal) ordering of events should be preferred.

Use a counter

The most fool-proof alternative to timestamps is an incremental counter stored on a single machine. If there is only one instance of the software, or clients always submit to a central server, this is often the best choice.

Most databases provide an auto increment or sequence type that can provide a suitable value.
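As one possible sketch, here is SQLite's `AUTOINCREMENT` used from Python's standard `sqlite3` module. The table and column names are my own assumptions, and both rows are deliberately given the same timestamp to show that the counter still orders them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    " seq INTEGER PRIMARY KEY AUTOINCREMENT,"  # assigned by the database
    " name TEXT NOT NULL,"
    " recorded_at TEXT NOT NULL)"              # kept for display, not ordering
)

# Both events share a timestamp, but each receives a distinct seq on insert.
for name in ["add to basket", "checkout"]:
    conn.execute(
        "INSERT INTO events (name, recorded_at) VALUES (?, ?)",
        (name, "2020-03-03T12:00:00"),
    )
conn.commit()

# Order by the counter, never by recorded_at.
rows = conn.execute("SELECT seq, name FROM events ORDER BY seq").fetchall()
print(rows)
```

The timestamp column is still worth keeping for humans to read; it just isn't the sort key.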

Consider distributed clocks

If you need to generate points in a sequence at multiple sites, then you may need a more complex series of counters like Lamport timestamps or a vector clock. Distributed clocks like these provide a partial causal ordering of events, and vector clocks also provide a means to detect conflicts (i.e. events that are seen as concurrent because they extend a shared point in history).
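To make this concrete, here is a minimal vector clock sketch in Python. It is one common formulation rather than a definitive implementation, and the function names (`tick`, `merge`, `compare`) and site names are my own choices.

```python
from typing import Dict

VectorClock = Dict[str, int]  # maps a site name to that site's counter


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of clock with node's counter incremented."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Combine two clocks by taking the larger counter for each site."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal' or 'concurrent' for a relative to b."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Both sites start from the same shared point in history.
shared = tick({}, "site_a")
edit_a = tick(shared, "site_a")             # site_a extends the shared point
edit_b = tick(merge({}, shared), "site_b")  # site_b extends it independently

print(compare(shared, edit_a))  # shared happened before edit_a
print(compare(edit_a, edit_b))  # neither saw the other: a conflict

# Once the conflict is resolved, the merged clock descends from both edits.
resolved = tick(merge(edit_a, edit_b), "site_a")
print(compare(edit_b, resolved))
```

The "concurrent" result is the useful part: it is the signal that two events extend the same history and need a decision.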

If your clients generate timestamps locally but the data is only integrated by a central server (not shared peer-to-peer), your logical clock can be relatively simple, requiring only two peers.
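One way this simpler case might look is a single server-side version counter that each client echoes back with its changes. This is my own illustrative sketch, not a prescribed design; `Server`, `Conflict` and `submit` are hypothetical names.

```python
class Conflict(Exception):
    """Raised when a change was based on a stale view of the server."""


class Server:
    def __init__(self) -> None:
        self.version = 0  # incremented once per accepted change
        self.log = []     # accepted changes as (version, change) pairs

    def submit(self, base_version: int, change: str) -> int:
        """Accept a change only if the client had seen the latest version."""
        if base_version != self.version:
            raise Conflict(
                f"change based on version {base_version}, "
                f"but server is at version {self.version}"
            )
        self.version += 1
        self.log.append((self.version, change))
        return self.version


server = Server()
seen = server.submit(0, "add to basket")  # accepted, server is now at 1

try:
    # Another client still thinks the server is at version 0.
    server.submit(0, "checkout")
except Conflict as err:
    print("conflict detected:", err)
    # How to resolve it is domain-specific; here the client simply
    # refreshes its view and resubmits against the current version.
    seen = server.submit(server.version, "checkout")

print(server.log)
```

The server never compares wall-clock times; it only checks whether the client had seen the latest version when it made its change.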

Handle conflicts

Distributed clocks will only help you detect concurrent events. Once detected, the problem of resolving conflicting events is often domain-specific. Using the appropriate clock or data structure should force you to handle these conflicts early on. Remember, the conflicts were always present with regular timestamps; they were just not being surfaced in your design.

Conflict detection and resolution can get as fancy and as complicated as you like, including employing tools like git to provide a full history. That said, it's so hard to imagine an architecture that started with simple timestamps and ended with git that I'm going to suggest you try a distributed clock or simple counter first.

When are timestamps appropriate?

I'm only suggesting timestamps are a bad way to order causally linked events. Timestamps are still useful for:

A frequent bugbear

I'm recording my arguments against ordering by timestamp here as a reference because it's a conversation I frequently have in architecture meetings. I hope this is a useful reference for you too, and if you have any relevant experience please do share it with me.