Our measurement of time is imperfect, but we perceive it as ordered and logical. So it's both tempting and dangerous to sort events by timestamp. Timestamps do not provide the strict order you might assume, and events out of sequence can cause significant bugs.
Time resolution is not infinite
What if two events have the same timestamp?
In a log file this is not a problem, because the lines of the file provide a strict order. Import the events into a database, however, and the original line order is lost - two rows with the same value will be in an undefined order.
The risk of a duplicate timestamp is affected by:
- The resolution of your hardware clock and time APIs - running your code on another platform may significantly increase your chance of duplicates.
- The resolution of your timestamps - do you store seconds since epoch (resolution 1 second)? nanoseconds? a date string with HH:MM (resolution 1 minute)?
- The frequency of events
You might record identical timestamps for events in close proximity, causing them to appear shuffled. To make matters worse, this is most likely when your machine is under heavy load.
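To get a feel for how resolution affects duplicates, here is a minimal Python sketch. The `count_collisions` helper and the sample events (eight events, 250 ms apart) are made up for illustration. It truncates each timestamp to a given resolution and counts the events that end up sharing a value with another event.

```python
from collections import Counter

def count_collisions(timestamps_ns, resolution_ns):
    """Count events whose truncated timestamp is shared with another event."""
    buckets = Counter(ts // resolution_ns for ts in timestamps_ns)
    return sum(n for n in buckets.values() if n > 1)

# Hypothetical events, 250 ms apart, expressed in nanoseconds.
events_ns = [i * 250_000_000 for i in range(8)]

at_ns = count_collisions(events_ns, 1)              # nanosecond resolution: 0
at_ms = count_collisions(events_ns, 1_000_000)      # millisecond resolution: 0
at_s = count_collisions(events_ns, 1_000_000_000)   # one-second resolution: 8
```

The same events that are perfectly distinct at millisecond resolution all collide once they are stored as whole seconds. On a real system you can ask Python what resolution it reports for the wall clock with `time.get_clock_info("time").resolution`, though the value varies by platform.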
Clocks can go backwards
A clock is merely a device to measure time and as such requires calibration and adjustment. Manual adjustments - like a user naively changing a timezone or correcting a slow clock - are the most likely cause of a jump backwards, but automatic changes can also be to blame.
The automatic change from daylight saving could jump an event backwards a whole hour if you handled timezones incorrectly. We have to be particularly careful in the UK, where GMT can happily masquerade as UTC for half the year.
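A small Python sketch of the daylight saving problem, assuming the UK change on 29 October 2023 (02:00 BST became 01:00 GMT). To keep it self-contained I use fixed-offset timezones as stand-ins for BST and GMT - real code would more likely use a timezone database such as `zoneinfo`. The two events are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for the two halves of the UK year (an assumption
# made to avoid depending on a timezone database).
BST = timezone(timedelta(hours=1), "BST")
GMT = timezone(timedelta(0), "GMT")

first = datetime(2023, 10, 29, 1, 30, tzinfo=BST)   # 00:30 UTC
second = datetime(2023, 10, 29, 1, 10, tzinfo=GMT)  # 01:10 UTC, 40 minutes later

# Dropping the timezone mimics storing local wall-clock time.
naive_first = first.replace(tzinfo=None)
naive_second = second.replace(tzinfo=None)

appears_backwards = naive_second < naive_first  # True: local time went backwards
correct_order = first < second                  # True: aware datetimes compare in UTC
```

Storing the offset (or converting to UTC before storing) avoids this particular jump, although it does nothing for the other problems described here.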
Services like ntpd (Network Time Protocol Daemon) can also cause dramatic clock changes. Depending on configuration, a large drift in system time can cause ntpd to hop immediately to the correct time (possibly backwards). Devices like the Raspberry Pi - that are frequently disconnected and have no real-time clock - are particularly vulnerable.
Monotonic clocks - guaranteed to never run backwards - do exist, but a timestamp from a monotonic clock is of little use between reboots, and useless to compare between machines. They are generally used to measure an interval on a single machine.
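In Python, for example, `time.monotonic()` exposes such a clock. The sketch below uses it to measure an interval - the `measure` helper is hypothetical, and the absolute values it reads are meaningless on their own; only the difference between two readings on the same machine is useful.

```python
import time

def measure(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn(*args)
    elapsed = time.monotonic() - start  # never negative, even if the wall clock is adjusted
    return result, elapsed

result, elapsed = measure(sum, range(1_000_000))
```

Had this used `time.time()` instead, a clock adjustment in the middle of the call could produce a negative or wildly inflated interval.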
Intervals can stretch and shrink
Jumps in time can cause problems, so services like ntpd often prefer to slow down or speed up the system clock until it gradually approaches the correct time (this is called 'slew' correction).
Google uses a similar approach for leap seconds, 'smearing' an extra second over a 24-hour period, instead of bamboozling software with a 61-second minute.
Even if you could start a timer on multiple machines at a known instant in time and stop them at another instant, they would likely measure a subtly different elapsed time. The longer the interval, the more apparent manufacturing tolerances will be. Adafruit advises that its PCF8523-based RTC "may lose or gain a second or two per day".
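As a rough worked example of what that figure means, the Python below converts "two seconds per day" into parts per million and shows how it accumulates. It assumes the drift is constant, which real oscillators are not (temperature alone will change it), so treat the numbers as an order of magnitude only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def drift_ppm(seconds_per_day: float) -> float:
    """Express a daily drift as parts per million."""
    return seconds_per_day / SECONDS_PER_DAY * 1_000_000

def accumulated_drift(seconds_per_day: float, days: float) -> float:
    """Total drift in seconds, assuming a constant rate (a simplification)."""
    return seconds_per_day * days

worst_case_ppm = drift_ppm(2)              # roughly 23 ppm
after_a_month = accumulated_drift(2, 30)   # 60 seconds
```

Two clocks drifting in opposite directions at this rate could disagree by two minutes after a month without correction.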
Clocks are never in sync
You may be attracted to timestamps because they're easy to collect at multiple sites then add to an ordered series later. However, in addition to all of the above, you must now consider the disparity between multiple system clocks.
Replying to a chat message on a different machine, you might easily record a timestamp before the original message.
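Here is a minimal Python simulation of that chat scenario. The `Machine` class, the names and the skew values are all invented for illustration - each machine simply reports the true time plus its own fixed error.

```python
from dataclasses import dataclass

@dataclass
class Machine:
    name: str
    skew: float  # seconds this machine's clock is ahead (+) or behind (-) of true time

    def stamp(self, true_time: float) -> float:
        """Return the timestamp this machine would record at the given true time."""
        return true_time + self.skew

alice = Machine("alice", skew=2.0)    # clock runs 2 seconds fast
bob = Machine("bob", skew=-1.5)       # clock runs 1.5 seconds slow

message = ("original", alice.stamp(100.0))  # sent at true time 100.0, recorded as 102.0
reply = ("reply", bob.stamp(101.0))         # sent one second later, recorded as 99.5

ordered = [name for name, ts in sorted([message, reply], key=lambda e: e[1])]
# ordered == ["reply", "original"]: the reply sorts before the message it answers
```

A few seconds of disagreement between two machines is enough to reverse cause and effect.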
When you sort data by timestamp, it implies a causal relationship - that, say, a message happened before its reply, or a credit happened before a debit. Therefore, techniques that provide a strict - or at least causal - ordering of events should be preferred.
Use a counter
The most foolproof alternative to timestamps is an incremental counter stored on a single machine. If there is only one instance of the software, or clients always submit to a central server, this is often the best choice.
Most databases provide an auto increment or sequence type that can provide a suitable value.
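As a sketch of what this can look like, the Python below uses the standard library's `sqlite3` module with an `AUTOINCREMENT` column. The table and column names are made up for illustration. All three events share an identical timestamp, but the counter still records the order in which they were inserted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        seq  INTEGER PRIMARY KEY AUTOINCREMENT,  -- strict order; values are never reused
        ts   INTEGER NOT NULL,                   -- kept for display, not for ordering
        body TEXT NOT NULL
    )
    """
)

same_ts = 1_700_000_000  # every event lands in the same second
for body in ["credit 10", "debit 5", "debit 5"]:
    conn.execute("INSERT INTO events (ts, body) VALUES (?, ?)", (same_ts, body))

# Order by the counter, never by the timestamp.
rows = conn.execute("SELECT seq, body FROM events ORDER BY seq").fetchall()
# rows == [(1, "credit 10"), (2, "debit 5"), (3, "debit 5")]
```

The timestamp column is still there for humans to read; it just plays no part in the ordering.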
Consider distributed clocks
If you need to generate points in a sequence at multiple sites, then you may need a more complex series of counters like Lamport timestamps or a vector clock. Distributed clocks like this provide a partial causal ordering of events, and vector clocks also provide a means to detect conflicts (i.e. events that are seen as concurrent because they both extend a shared point in history without either having seen the other).
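To make the idea concrete, here is a minimal Lamport clock in Python. The class and method names are my own. Each site keeps a counter, increments it for every local event, attaches it to outgoing messages, and on receipt jumps ahead of whatever value it was sent.

```python
class LamportClock:
    """A minimal Lamport logical clock for one site."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Record a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is itself an event; return the value to attach to the message."""
        return self.tick()

    def receive(self, remote_time: int) -> int:
        """Move past both our own time and the sender's."""
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
a.tick()                    # a: 1 (some local event)
sent = a.send()             # a: 2, attached to the message
b.tick()                    # b: 1 (unrelated local event)
received = b.receive(sent)  # b: max(1, 2) + 1 = 3
```

The receive always gets a larger value than the send that caused it, no matter what the wall clocks on `a` and `b` say. Note that a single Lamport value cannot tell you whether two events were concurrent - for that you need something like a vector clock.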
If your clients generate timestamps locally, but the data is only integrated by a central server (not shared peer-to-peer), your logical clock can be relatively simple, requiring only two peers.
Distributed clocks only help you detect concurrent events. Once detected, the problem of resolving conflicts is often domain-specific. Using the appropriate clock or data type will force you to handle these conflicts early. Remember, the conflicts were always present with timestamps - they were just not apparent.
Detecting and resolving conflicts can be as fancy and complex as you like - but, before you reach for a full version-control system like git, I suggest you try a distributed clock or simple counter first.
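For a sense of how little code conflict detection needs, here is a sketch of comparing two vector clocks in Python. I represent each clock as a plain dict mapping a site name to its counter (missing sites count as zero) - the representation, the `compare` function and the example sites are all assumptions for illustration.

```python
def compare(a: dict, b: dict) -> str:
    """Return how vector clock a relates to b: 'equal', 'before', 'after' or 'concurrent'."""
    sites = a.keys() | b.keys()
    a_le_b = all(a.get(s, 0) <= b.get(s, 0) for s in sites)
    b_le_a = all(b.get(s, 0) <= a.get(s, 0) for s in sites)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither has seen the other: a conflict to resolve

base = {"alice": 1}                 # shared point in history
alice_edit = {"alice": 2}           # alice extends base
bob_edit = {"alice": 1, "bob": 1}   # bob also extends base, without seeing alice's edit

relation_to_base = compare(base, alice_edit)   # "before"
relation = compare(alice_edit, bob_edit)       # "concurrent"
```

What you do with a `"concurrent"` result - merge, pick a winner, ask the user - is the domain-specific part.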
When are timestamps appropriate?
I'm only suggesting timestamps are a bad way to order causally linked events. Timestamps are still useful for:
- Communication with humans - Logical clocks don't mean a lot to us. Adding a timestamp as part of the presentation (but not ordering) of data is often a good idea as it lets us place entries in a wider context outside of a single application.
- Sampling - Data collected for statistical analysis is often collected ad-hoc from multiple sources and strictly ordering measurements in close proximity may not be important. Ask yourself: "If I shuffled a few events around would my conclusions still be sound?"