PDF and PDF/A formats

The PDF format has experienced a success story on its way from being a quasi-proprietary format that could only be dealt with using Adobe tools to a format that is specified, standardized and supported by open source tools and by tools from different vendors. It has become accepted that PDF is primarily a print format and that HTML is the better choice for web content. This was not clear 15 years ago, when people coming from print layout, who considered themselves trivially capable of adding web design to their portfolio, wanted to build whole web pages using PDF instead of HTML.

Now the format did change over time, and there are always PDF files that use specific features that do not work in certain PDF viewers.

But there are requirements for maintaining documents over a long period of time. Just consider long-term contracts that have a duration of 50 to 100 years. The associated documents usually need to be retained for that duration plus ten years. The issue of storing data for such a long time and being able to read it physically is a challenge in itself, but assuming that this issue is addressed and files can still be read in 110 years, the file format itself should still be readable as well.

Now companies disappear. A lot of them will in 100 years, probably even big ones like Adobe, Apple, Microsoft, Oracle and others. We do not know which companies will disappear, only that it is very likely that some companies that are big now will be among them. Proprietary software may be handed over to another vendor when a company shuts down, to pay the salaries of the former employees for a few more days, but it might eventually disappear as well. Open source software has a better chance of being available in 100 years, but that cannot be absolutely guaranteed either, unless special attention is given to that software over such a long time. And if software is not maintained, it is highly unlikely that it will be able to run on the platforms that are common in 100 years.

So it is good to create a stable and simplified standard for long-term archiving. Software for accessing such files can be written from scratch based on that specification, and it is more likely to remain available if a continuous need for it can be seen.

The idea is a format called PDF/A, where A stands for "archive", which is an option for storing PDF files over a very long period of time. Many features of PDF have been removed from PDF/A, which makes it more robust and easier to handle. It is also important not to rely on additional data sources, for example passwords for encrypted PDF files or externally installed fonts. Encryption with password protection is a bad thing because it is quite likely that the password is gone in 100 years. Fonts need to be embedded, because finding them in 100 years might not be trivial. This usually means that proprietary fonts have to be avoided, unless the licensing allows embedding the fonts in the PDF file and unlimited reading. Including JavaScript, video, audio or forms is also a bad idea. Video should be archived separately, and it has the same issues as PDF when it comes to long-term archiving.
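Two of these requirements, no encryption and embedded fonts, can at least be inspected programmatically. Here is a minimal sketch, assuming Apache PDFBox 2.x and a hypothetical file name contract.pdf; it only looks at these two properties and is of course no substitute for a full PDF/A validator.

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.font.PDFont;

import java.io.File;
import java.io.IOException;

public class PdfArchiveCheck {
    public static void main(String[] args) throws IOException {
        // hypothetical input file name, just for the example
        try (PDDocument document = PDDocument.load(new File("contract.pdf"))) {
            // PDF/A forbids encryption, because the password may be lost
            // long before the retention period ends
            System.out.println("encrypted: " + document.isEncrypted());
            // PDF/A requires all fonts to be embedded, so the document does not
            // depend on fonts that might no longer be installed anywhere in 100 years
            for (PDPage page : document.getPages()) {
                for (COSName fontName : page.getResources().getFontNames()) {
                    PDFont font = page.getResources().getFont(fontName);
                    System.out.println(font.getName() + " embedded: " + font.isEmbedded());
                }
            }
        }
    }
}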


Indexing of Database Tables I (Primary Keys)

Any useful database with non-trivial amounts of data has some indexing mechanism in place that helps find data based on some key values, because a full table scan or something like that is too expensive and too slow to do often.

The most typical situation is having a key and one or more values behind that. This applies to relational and NoSQL databases.
Typical database access layers for programming languages, like ActiveRecord for Ruby or JPA/Hibernate for Java, favor the pattern of having one index column called "id" that is numeric and gets successive values assigned when new entries are created.

This is good, because the id is always unique and it can be handled and stored efficiently, especially when being used for relationships between different tables. It is of course a good idea to think about the size of integers to use for this, so that the number range is not exhausted. Depending on the DB product, the id can be produced automatically by the DB when inserting, if the column is declared as an identity or autoincrement column (MySQL, MS SQL Server), or it can be created by using a sequence (Oracle, PostgreSQL). It could be considered to use just one global sequence instead of one per table, so that the id is unique across tables, which might have some advantages, but it is kind of unusual.
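As an illustration of this pattern with JPA/Hibernate, a hypothetical entity could map its numeric id to a database sequence; the commented-out alternative shows the identity/autoincrement variant. The entity, column and sequence names are made up for this example, and newer JPA versions use jakarta.persistence instead of javax.persistence.

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

// hypothetical entity, just to show the two common generation strategies
@Entity
public class Contract {

    // identity / autoincrement column (typical for MySQL, MS SQL Server):
    // @GeneratedValue(strategy = GenerationType.IDENTITY)

    // sequence based id (typical for Oracle, PostgreSQL);
    // long is used so the number range is not exhausted too early
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "contract_seq")
    @SequenceGenerator(name = "contract_seq", sequenceName = "contract_seq", allocationSize = 50)
    private long id;

    private String partner;

    public long getId() {
        return id;
    }
}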

Another common approach is to use UUIDs instead of such numeric ids. They have the advantage of being more or less globally unique. They have 128 bits, and there are two approaches for obtaining these 128-bit numbers. Some bits are used to express version and type of the UUID, but most of them can be used for differentiating from other UUIDs. They can be generated using random numbers, or using some identity of the current host combined with a time stamp, a counter and possibly thread and process numbers, and filling the remaining digits with random numbers… When having around a billion (German "Milliarde", 10^9) entries, the probability of a collision with a single additional randomly generated UUID is less than 10^{-27}: a version 4 UUID has 122 random bits, so the probability is about 10^9/2^122, which is roughly 2*10^{-28}. So the risk of being struck by lightning is higher than the risk of getting a collision by creating the same random UUID twice, if the random number generator is really good. It is quite possible to build big systems based on that and run them for a long time without problems caused by duplicate UUIDs. Whether that is good enough or not is a philosophical question, but I prefer the other way: finding a way to generate a UUID that is unique across hosts in the current and expected future system environment. Quite honestly, I do not see the benefit of using UUIDs as ID fields and primary keys in databases, because numeric IDs can be handled more efficiently and mechanisms for generating them uniquely are reasonably easy to implement.
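To make the two approaches concrete in Java: java.util.UUID.randomUUID() gives a purely random version 4 UUID, and a name based UUID built from host name, process id, timestamp and a counter is one possible way to get uniqueness across hosts, as long as that combination of values never repeats. This is only a sketch with made-up class and method names, and it assumes Java 9 or newer for ProcessHandle.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class UuidExamples {

    private static final AtomicLong COUNTER = new AtomicLong();

    // version 4: purely random, collision probability is negligible but not zero
    public static UUID randomId() {
        return UUID.randomUUID();
    }

    // sketch of the other approach: derive the UUID from host name, process id,
    // timestamp and a counter, so it is unique as long as that combination never repeats
    // (a name based version 3 UUID is used here as a simple stand-in)
    public static UUID hostBasedId() throws UnknownHostException {
        String name = InetAddress.getLocalHost().getHostName()
                + "/" + ProcessHandle.current().pid()
                + "/" + System.nanoTime()
                + "/" + COUNTER.incrementAndGet();
        return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
    }
}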

It is often forgotten that the data itself might contain internal unique keys that can be used. Often an autoincrement numeric ID is used as the primary key, and these semantic keys are declared as a unique constraint or at least get a non-unique index to allow searching. This has the advantage that changing these values is possible by accessing only one table and without breaking referential consistency. But for large tables I do recommend considering the use of these natural keys instead of the ID as primary key, for the sake of efficiency. Serious databases can also deal with combined keys consisting of several columns that are unique only in their combination.
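With JPA, such a combined natural key can be mapped with an embeddable key class and @EmbeddedId. The following sketch uses a hypothetical account table whose key is only unique in the combination of country code and account number; in a real project the classes would be public and live in separate files.

import java.io.Serializable;
import javax.persistence.Embeddable;
import javax.persistence.EmbeddedId;
import javax.persistence.Entity;

// hypothetical composite natural key: unique only in the combination of both columns
@Embeddable
class AccountKey implements Serializable {
    private String countryCode;
    private String accountNumber;

    // equals() and hashCode() over both fields are required for a key class
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof AccountKey)) {
            return false;
        }
        AccountKey key = (AccountKey) other;
        return countryCode.equals(key.countryCode) && accountNumber.equals(key.accountNumber);
    }

    @Override
    public int hashCode() {
        return countryCode.hashCode() * 31 + accountNumber.hashCode();
    }
}

@Entity
class Account {
    // the natural key serves as primary key, no separate numeric id column
    @EmbeddedId
    private AccountKey key;

    private String owner;
}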

An obvious stupidity that I am only mentioning because I have actually seen it is the concatenation of all the fields forming the logical primary key into some column called "HANDLE" and using that as the primary key. A 40-year-old junior developer with the authority of a senior lead architect imposed such a thing on the database of a project that I worked on in the past, and he degraded the whole software by it, because it simply wasted half of the database space and a lot of performance for a product that should have performed really well to be useful to its customers.

I would also resist the temptation to use something like the oid in PostgreSQL or the ROWID of Oracle as a key. These values can change when moving data to other tablespaces, restoring a backup or even upgrading to another version of the DBMS product. They might be good for temporary use, probably within a transaction or within a set of closely related transactions occurring within a few minutes or hours.

Additional indexes are an interesting issue, which I might deal with in another article in the future.
