Creating Unique Numbers

Many software systems rely on some kind of unique numbers. Uniqueness is always a question in what universe this uniqueness is required. We do see the different kinds of universes in the case of the IP-addresses. In theory they are world wide unique. In practice we have mechanisms in place like NAT, that use certain dedicated IP-ranges for an intranet and map them to publicly available addresses for internet traffic outside the intranet. So we already have two kinds of universes… This is typical.

A common case are database IDs, that are often used as primary keys in databases. I do challenge this by the question, if there is a natural key already available in the data, which might make this db internal id unnecessary, but more often than not DB tables do have these ID columns as primary keys. They have to be unique within the DB table. Which can mean across several servers, because somewhat distributed databases are common.

Other examples are message ids of emails. They may be quite long, a short line of text is acceptable and they should be world wide unique. Combining the fully qualified publicly accessible hostname and a time stamp and a counter is usually good enough. They look like this 5AE.630049D.2EA050907@gmx.net or 5AE.630049D.2EA050907@ms-93424.gmx.net, where the part after the „@“ stands for the mail server and the part in front is a unique id for the message created by the mail server and looking slightly different depending on its software.

Often the length is not arbitrary and the UUIDs are a good compromise for this. They have 128 bits, some of which are used to specify the type of UUID. One type combines a hostname and timestamp like the message ids. But since in some contexts the generation of the UUID should not reveal the hostname and the time, some implementations prefer random UUIDs. It can very well be argued that with good random numbers duplicates of such random UUIDs are less likely than events that would bother us much more than having a duplicate. For randomly generated UUIDs six bits are used up for expressing the version and the fact that it is a random UUID, leaving 122 bits, which is a total of 2^{122} = 5316911983139663491615228241121378304 \approx 5.317\cdot 10^{36} different possible values. Generating billions of UUIDs for many years leaves the risk of creating duplicates acceptably low.

But the issue of the quality of the random generator and the issue of potential duplicates remain something that needs attention. So it is worth to consider the path of using the host and timestamp. Now the host can not be identified by an IP address or fully qualified domain and host name, because these tend to be either too long or not unique enough. The MAC address used to be a good possibility. But I would not be so sure about this any more. Most server systems are virtualized these days and the MAC address is configured by software, so duplicates can occur accidentally or deliberately. Using a time stamp by itself can be a problem too because sooner or later it will happen that two IDs are generated at the same time, within the given granularity. Machines have several processors, run several processes and several threads within each process.

So achieving the goal of a real globally unique UUID value remains a difficult question. Following a more local uniqueness within an application or application landscape might be more reasonable. The number of servers may be large and may vary, but it should be possible to assign numbers to virtual or real servers. In case there are multiple different processes on the same (virtual) machine to assign numbers to these as well. This can be used as a replacement for the host part of the UUID. If it does not use up all the bits, these can be filled up with random numbers.

Timestamps can be obtained easily and relatively reliably for a granularity of msec (Milliseconds). The UUID timestamp allows up to a granularity of 100 nsec, which is 10’000 sub divisions of the msec. A thread safe counter that may reset during program start or with its overflow can be used to count and its positive remainder modulo 10’000 can be used instead of the 100 nsec part in conjunction with the msec.

Often a uniqueness within an application or application landscape can be achieved by using some kind of unique counter. The best choice is often the sequence of a database, which is good in this task and well tested. It is not too hard to create such a functionality. Handling of multiple processes and threads needs to be addressed. For persistence, it can be an improvement to reserve blocks of 100 or 1000 numbers and persist less often. This will result in skipping some numbers when restarting, but otherwise work out well. The same idea can also be applied for a distributed unique number generator, where each instances gets ranges from some master generator and gets new ranges, when they are used up.

Such unique numbers or identifiers are needed quite often. It is usually best to use something that works reliably, like the DB sequence. But it can be developed with adequate care, if there is a need. Testing and especially automated testing is off course very important, but only sufficient if the whole implementation is conceptionally sound and robust.

Share Button

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert

*