Using non-ASCII-characters

Some of us still remember the times when it was recommended to avoid „special characters“ when writing on the computer. In Germany, some keyboards did not contain the „Umlaut“ characters, and we fell back to the ugly, but generally understandable way of replacing the German special characters like this: ä->ae, ö->oe, ü->ue, ß->sz or ß->ss. This was due to the lack of decent keyboards and decent entry methods, but also due to transmission methods that stripped the upper bit. It did happen in emails that they were „enhanced“ like this: ä->d, ö->v, ü->|,… So we had to know our way around and sometimes use ae oe ue ss. Similar issues applied to other languages like the Scandinavian languages, Spanish, French, Italian, Hungarian, Croatian, Slovenian, Slovak, Serbian, the Baltic languages, Esperanto,… in short, to all languages that are regularly written with the Latin alphabet but require some additional letters to be written properly.
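Such a transliteration is essentially a character-by-character replacement table. A minimal sketch of it, in Python (the function name and the restriction to lowercase-plus-capitalized forms are my own simplifications; real-world rules also deal with all-caps words and the ß->sz variant):

```python
# Assumed mapping: the classic German ASCII fallbacks described above.
TRANSLIT = {
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def asciify_german(text: str) -> str:
    """Replace German umlauts and ß by their ASCII fallbacks."""
    return "".join(TRANSLIT.get(c, c) for c in text)

print(asciify_german("Grüße aus Köln"))  # Gruesse aus Koeln
```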

When we wrote on paper, the requirement to write properly was more obvious, while email and other electronic communication via the internet of those days could be seen as something like short-wave radio: it worked globally, but with some degradation of quality compared to other media of the time. With TeX, for example, it was possible to write the German special letters (and others in a similar way) like this: ä->\"a, ö->\"o, ü->\"u, ß->\ss, and later even like this: ä->"a, ö->"o, ü->"u, ß->"s, which some people, including myself, even used for email and other electronic communication when the proper way was not possible. I wrote Emacs-Lisp software that could put my Emacs into a mode where the Umlaut keys actually produced these combinations when typed, and I even figured out how to tweak an xterm window to do the same, for the sake of using IRC, where Umlaut letters did not work at all and quick online typing was the way to go. There the Umlaut characters were produced automatically, because with ten-finger touch typing I typed words, not characters.

On the other hand, TeX could be configured to process Umlaut characters properly (more or less, up to the issue of hyphenation), and I wrote macros to do this and provided them to CTAN, the repository for TeX-related software on the internet, around 1994 or so. Later a better and more generic solution became part of standard TeX and superseded this, which was a good thing. So TeX users could type ä ö ü ß, and I strongly recommended (and still recommend) actually doing so. It works, it is more user-friendly, and in the end the computer should adapt to the humans, not vice versa.

The web could process Umlaut characters (and everything else) from day one. Transfer was not an issue: it could handle binary data like images, so no stripping of the high bit was happening, and the Umlaut characters just went through. For people who had problems finding them on the keyboard, transcriptions like these were created: ä->&auml;, ö->&ouml;, ü->&uuml;, ß->&szlig;. I used them, not knowing that they were not actually needed, but I relied on a Perl script to do the conversion, so it was possible to actually type the characters properly.
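What such a conversion script does can be sketched in a few lines. This is not the original Perl script, just a Python sketch of the same idea, using the standard library's table of named HTML entities:

```python
# codepoint2name maps Unicode code points to HTML entity names,
# e.g. 228 ("ä") -> "auml"; it is part of Python's standard library.
from html.entities import codepoint2name

def entitize(text: str) -> str:
    """Replace non-ASCII characters by HTML entities where possible."""
    out = []
    for c in text:
        if ord(c) < 128:
            out.append(c)  # plain ASCII passes through unchanged
        else:
            name = codepoint2name.get(ord(c))
            # Fall back to a numeric character reference if no name exists.
            out.append(f"&{name};" if name else f"&#{ord(c)};")
    return "".join(out)

print(entitize("äöüß"))  # &auml;&ouml;&uuml;&szlig;
```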

Some languages like Russian, Chinese, Greek, Japanese, Georgian, Thai, and Korean use a totally different alphabet, so they had to solve this earlier; others may know better how it was done in the early days. Probably it helped develop the technology. Even harder are languages like Arabic, Hebrew, and Farsi, which are written right to left. It is still ugly when editing a Wikipedia page: the switching between left-to-right and right-to-left occurs correctly, but magically and unexpectedly.

While ISO-8859-x seemed to solve the issue for most European languages, and ISO-8859-1 became the de-facto standard in many areas, this was eventually a dead end, because only Unicode provides a way to host all living languages, which is what we eventually wanted: even in quite closed environments, excluding some combinations of languages in the same document will at some point prove to be a mistake. This has its issues. The most painful one is that files and badly transmitted content do not carry clear information about their encoding. The same applies to strings in some programming languages; we need to know the encoding from the context. Now UTF-8 is becoming the predominant encoding, but in many areas ISO-8859-x or the weird cp1252 still prevail, and when we get the encoding or the implicit conversions wrong, the Umlaut characters (or whatever we have) get messed up. I would recommend working carefully, keeping IT systems, configurations, and programs well maintained and documented, and moving to UTF-8 whenever possible. Falling back to ae oe ue for content is sixties and seventies technology.
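What getting the encoding wrong looks like can be shown in a few lines: each UTF-8-encoded Umlaut occupies two bytes, so reading those bytes as ISO-8859-1 turns every Umlaut into two garbage characters, the classic mojibake:

```python
original = "Köln"
utf8_bytes = original.encode("utf-8")  # b'K\xc3\xb6ln', ö is two bytes

# Misinterpreting the UTF-8 bytes as ISO-8859-1 garbles the Umlaut.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # KÃ¶ln

# Decoding with the correct encoding restores the text.
assert utf8_bytes.decode("utf-8") == original
```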

I do still see an issue with names that are both technical and human-readable, like file names, names of attachments, or variable and function names in programming languages. While these do very often allow Umlaut characters, I would still prefer the ugly transcriptions in this area, because a mismatch of encoding there becomes more annoying and harder to handle than in running text, where we as human readers are a bit more tolerant than computers, so tolerant that we would even read the ugly replacements of the old days.

But for content, we should insist on writing it with our proper alphabet, and move away from the ISO-8859-x encodings to UTF-8.
