PDF and PDF/A formats

The PDF format has been a success story. It went from a quasi-proprietary format that could only be handled with Adobe tools to a specified and standardized format that can be handled with open source tools and tools from different vendors. It is now widely accepted that PDF is primarily a print format and that HTML is the better choice for web content. That was not clear 15 years ago, when people coming from print layout, who considered themselves trivially capable of adding web to their portfolio, wanted to build whole web pages in PDF instead of HTML.

The format has changed over time, however, and there are always PDF files that use specific features that do not work in certain PDF viewers.

But there are requirements to maintain documents over a long period of time. Consider long-term contracts that run for 50 to 100 years. The associated documents usually need to be retained for that duration plus ten years. Merely storing data for such a long time and being able to read it physically is a challenge. But assuming that this issue is addressed and the files can still be read in 110 years, the file format has to remain readable as well.

Companies disappear, and over 100 years a lot of them will, probably even big ones like Adobe, Apple, Microsoft, Oracle and others. We do not know which companies will disappear, only that it is very likely that some companies that are big now will be gone. When a company shuts down, its proprietary software may be sold to another vendor, if only to pay the salaries of the former employees for a few more days. But that software might eventually disappear as well. Open source software has a better chance of being available in 100 years, but that cannot be absolutely guaranteed either, unless special attention is given to that software over such a long time. And if software is not maintained, it is highly unlikely that it will run on the platforms that are common in 100 years.

So it is good to create a stable and simplified standard for long-term archiving. Software for reading the format can then be written from scratch based on that specification. And such software is more likely to remain available if there is a continuous need for it.

The idea is a format called PDF/A, where A stands for "archive", which is an option for storing PDF files over a very long period of time. Many cool features of PDF have been removed from PDF/A, which makes it more robust and easier to use. It is also important not to rely on additional data sources, for example for passwords of encrypted PDF files or for fonts. Encryption with password protection is a bad thing because it is quite likely that the password will be gone in 100 years. Fonts need to be included, because finding them in 100 years might not be trivial. This usually means that proprietary fonts have to be avoided, unless the licensing allows including the fonts in the PDF file and reading them without limits. Including JavaScript, video, audio or forms is also a bad idea. Video should be archived separately, and it has the same long-term archiving issues as PDF.
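
To make the list of problematic features a bit more concrete, here is a minimal Python sketch that scans the raw bytes of a PDF file for name tokens hinting at encryption, JavaScript, video, audio or forms, and for the `pdfaid:part` entry that PDF/A files carry in their XMP metadata. The function name `archival_red_flags` and the token table are made up for this illustration. The check is only a crude heuristic: it does not look at font embedding, it can miss tokens hidden inside compressed streams, and it is no substitute for a real PDF/A validator.

```python
import re

# Name tokens that hint at features the text advises against.
# Illustrative selection only, not the full rule set of the PDF/A standard.
RED_FLAGS = {
    b"/Encrypt": "encryption",
    b"/JavaScript": "JavaScript",
    b"/Movie": "video",
    b"/Sound": "audio",
    b"/AcroForm": "form",
}

# PDF/A files declare their conformance part in the XMP metadata,
# either as an element (<pdfaid:part>1<...) or as an attribute (pdfaid:part="1").
PDFA_PART = re.compile(rb"pdfaid:part(?:>|=.)(\d)")


def archival_red_flags(data: bytes) -> list[str]:
    """Return labels for suspicious findings in the raw bytes of a PDF file."""
    findings = [label for token, label in RED_FLAGS.items() if token in data]
    if not PDFA_PART.search(data):
        findings.append("no PDF/A declaration")
    return findings
```

Calling `archival_red_flags(open("contract.pdf", "rb").read())`, with `contract.pdf` standing in for any file of your own, returns an empty list when none of these markers turn up. That only means nothing obvious was found, not that the file is valid PDF/A.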
