Since about 20 years we have been kept busy with the change to Unicode.
The good thing: We all know that Unicode and usually UTF-8 as representation is the way we should express textual data. The web is mostly UTF-8 today. But it has been a painful path and it still is sometimes.
Why is it so difficult?
The most important problem is that it is hard to tell how the content of a file is to be interpreted. We do have some hacks that often allow recognizing this:
The suffix is helpful for common and well defined file types, for example .jpg or .png. In other cases the content of the file is analyzed and something like the following is found in the beginning of the file:
From this it can be deduced that the file should be executed with ruby, more precisely with the ruby implementation that is found under /usr/bin/ruby. If it should be the ruby that comes first in the path, something like
could be used instead. When using MS-Windows, this works as well, when using cygwin and cygwin’s Ruby, but not with native Win32-Ruby or Win64-Ruby.
The next thing is quite annoying. Which encoding is used for the file? It can be a useful agreement to assume UTF-8 or ISO-8859-1 or ISO-8859-15, but as soon as one team member forgets to configure the editor appropriately, a mess can be expected, because files appear that mix UTF-8 and ISO-8859-1 or other encodings, leading to obscure errors that are often hard to find and hard to fix.
Maybe it was a mistake when C and Unix and libc were defined and developed to understand files just as byte sequences without any meta information about the content. This more or less defined how current operating systems like Linux and MS-Windows deal with it.
They did not really have a choice, although they did both provide facilities that could theoretically be used to carry such information, but since almost no software uses them and there are no well known and established standards, this is not helpful today. In the internet mime headers have proved to be useful for email and web pages and some other content. This allows the recipient of the communication to know how to interpret the content.
It would have been good to have such meta-information also for files, allowing files to be renamed to anything with any suffix without loosing the readability. But in the seventies, when Unix and C and libc where initially created, such requirements were much less obvious and it was part of the beauty to have a very simple concept of an I/O-stream universally applicable to devices, files, keyboard input and some other ways of I/O. Also MS-Windows has probably been developed in C and has inherited this flaw. It has been tried to keep MS-Windows runnable on FAT-file-systems, which made it hard to benefit from the feature of NTFS of having multiple streams in a file, so the second stream could be used for the meta information. But as a matter of fact suffixes are still used and text files are analyzed for guessing the encoding and magic bytes in the beginning of binary files are used to assume a certain type.
Off course some text formats like XML have ways of writing the encoding within the content. That requires iterating through several assumptions in order to read up to that encoding information, which is not as bad as it sounds, because usually only a few encodings have to be tried in order to find that out. It is a little bit annoying to deal with this when reading XML from a network connection and not from a file, which requires some smart caching mechanism.
This is most dangerous with UTF-8 and ISO-8859-x (x=1,2,3,….), which are easy to mix up. The lower 128 characters are the same and the vast majority of the content consists of these characters for many languages. So it is easy to combine two files with different encodings and not recognizing that until the file is already somewhat in use and has undergone several conversion attempts to „fix“ the problem. Eventually this can lead to byte sequences that are not allowed in the encoding. Since spoken languages are usually quite redundant, it usually is possible to really fix such mistakes, but it can become quite expensive for large amounts of text. For UTF-16 this is easier because files have to start with FFFE or FEFF (two bytes in hex-notation), so it is relatively reliable to tell that a file is utf-16 with a certain endianness. There is even such a magic sequence of three bytes to mark utf-8, but it is not known by many people, not supported by the majority of the software and not at all commonly used. Often we have to explicitly remove these three marker bytes in order to avoid confusing software processing the file.
In the MS-Windows-world things are even more annoying because the whole system is working with modern encodings, but this black CMD-windows is still using CP-850 or CP-437, which contain almost the same characters as ISO-8859-1, but in different positions. So an „ä“ might be displayed as a sigma-character, for example. This incompatibility within the same system does have its disadvantages. In theory there should be ways to fix this by changing some settings in the registry, but actually almost nobody has done that and messing with the registry is not exactly risk-less.
- What every programmer absolutely, positively needs to know about encodings and character sets to work with text
- Wikipedia: Comparison of Unicode encodings
- Wikipedia: Unicode
- Wikipedia: UTF-8
- Wikipedia: UTF-16
- Wikipedia: UTF-32
- Wikipedia: GB 18030
- Wikipedia: ISO 8859-1
- Wikipedia: ISO 8859-15
- Wikipedia: BOCU
- Wikipedia: SCSU
- Some thoughts about String equality
- The little obstacles of interoperability
- Java Properties Files and UTF-8
- Using non-ASCII-characters
- UTF-16 Strings in Java
- MS-Windows-Encodings with CMD: Bug or Feature?
- Introduction to Unicode and UTF-8
- UTF-8 everywhere
- UTF-8 and Unicode FAQ for Unix/Linux