Indexing of Database Tables I (Primary Keys)

Any useful databases with non trivial amounts of data have some indexing mechanism in place that helps finding data based on some key values, because a full table scane or something like that is too expensive and too slow to do it often.

The most typical situation is having a key and one or more values behind that. This applies to relational and NoSQL databases.
Typical database access layers for programming languages like ActiveRecord for Ruby or JPA/Hibernate for Java favor the pattern of having one index column called „id“ that is numeric and gets successive values assigned when new entries are created.

This is good, because the id is always unique and it can be handled and stored efficiently, especially when being used for relationships between different tables. It is off course a good idea to think about the size of integers to use for this, so that the number range is not exhausted. Depending on the DB product this can produced automatically by the DB when inserting if the column is declared as identity column or autoincrement or something like that (MySQL, MS-SQLServer) or it can be created by using a sequence (Oracle, PostgreSQL). It could be considered to use just one global sequence instead of one per table, so the id is unique across tables, which might have some advantages, but it kind of unusual.

Another common approach is to use UUIDs instead of such numeric ids. They have the advantage of being kind of globally unique. They have 128 bits and there are two approaches for obtaining these 128 bit numbers. Some bits are used to express version and type of UUID, but most of them can be used for differentiating from other UUIDs. They can be generated using random numbers or using some identity of the current host combined with time stamp, counter and possibly thread and process numbers and filling some digits with random numbers… When having around a billion (German „Milliarde“, 10^9) entries, the probability of having a collision with a single additional randomly generated UUID is less than 10^{-27}. So the risk of being struck by a lightning is higher than the risk of getting a collision by creating the same random UUID twice, if the random number generator is really good. It is quite possible to build big systems based on that and run them for a long time without problems by duplicate UUIDs. If that is good enough or not is a philosophical question, but I prefer the other way and finding a way to generate a UUID that is unique across hosts in the current and expected future system environment. Quite honestly, I do not see the benefit of using UUIDs as ID fields and primary keys in databases, because the numeric IDs can be handled more efficiently and mechanisms of generating them uniquely are reasonably easy to implement.

It is often forgotten that data actually might contain internal unique keys that can be used. Often a autoincrement numeric ID is used as a primary key and these semantic keys are declared as unique constraint or at least with a non unique index to allow searching. This has the advantage that changing these values is possible by accessing only one table and without breaking referential consistency. But for large tables I do recommend to consider using these natural keys instead of the ID as primary key for the sake of efficiency. Serious databases can also deal with combined keys consisting of several columns that are unique only in their combination.

An obvious stupidity that I am only mentioning because I have actually seen it, is the concatenation of all the fields forming the logical primary key to some column called „HANDLE“ and using that as a primary key. A 40 year old junior developer with the authority of a senior lead architect could impose such a thing on the database of a project that I have worked in in the past and he could degrade the whole software by this, because it just wasted half of the database space and a lot of performance for a product that should have performed really well to be useful to its customers.

I would also resist the temptation to use something like oid in PostgreSQL or the ROW_ID of Oracle as key. These values can change when moving data to other table spaces or restoring a backup or even upgrading to another version of the DBMS product. They might be good for temporary use, probably within a transaction or within a set of closely related transactions occurring within a few minutes or hours.

Additional indexes are an interesting issue, which I might deal with in another article in the future.

Share Button

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.


*