How to get rid of these HTML-entities in Files

It has been written here that HTML-entities (these ä etc) should be avoided with the exception of those that we need due to the HTML-syntax like <, >, & and maybe " and  . They were already mostly obsolete more than 20 years ago, but in those days we still did not automatically use UTF-8 or UTF-16, but often an 8-bit character encoding that could express only up to 256 characters, in reality around 200 due to control characters. At least these 200 could be used. That was enough for web pages in those days and texts in German, French, Russian, Greek, Hebrew, Arabic and many other language could well be written, as long as only one language or a few similar languages were used. For the rare occasions that required some characters that were not in this character set, it was an option to rely on these HTML-entities. Or for typing HTML-pages on an US-keyboard without any good tool support.

But now Unicode has been around for more than 25 years and more than 90% of the web pages use UTF-8.

Now some people think that these HTML-entities are kind of necessary or at least „safer“ and I see people still writing HTML-code with them in these days. Or tools by relatively well known companies, that produced such output not so long ago… It is a good thing to have some courage and to change something like this to readable and natural format. Or more generally to try out if a simpler or better solution works. Reasonable courage is good for this, too much of something good can go bad, as so often…

So, please teach your collegues not to use these ugly HTML-entities, where UTF-8-characters are the better option.

And here is a perl script that converts the HTML-entities with the exceptions mentioned above to UTF-8. In the project conversion-utils some more such scripts might be added. The script is a bit too long to be pasted inline in a code block, so it is better to find the current version on github.

Then you can do something like this:

git commit
for file in *.html ; do
echo $file
mv $file ${file}~entities~
html2utf8 < ${file}~entities~ > $file
echo /$file
done
git diff

to convert all files in a directory. I assume that you are using Linux or at least have bash like for example in cygwin.
There are other tools to do the same thing, I am sure. Just use anything that works for you to get away from this unreadable crap.

Share Button

&auml; &ouml; … in HTML

In the old days of the web, more than 20 years ago, we found a possibility to write German Umlaut letters and a lot of other letters and symbols using pure ASCII. These are called „entities„, btw.

Many people, including myself, started writing web pages using these transcriptions, in the assumption that they were required. Actually in the early days of the web there were some rumors that some browsers actually did not understand the character encodings that contained these letters properly, which was kind of plausible, because the late 80s and the 90s were the transition period where people discovered that computers are useful outside of the United States and at least for non-IT-guys it was or it should have been a natural requirement that computers understand their language or at least can process, store and transmit texts using the proper characters of the language. In case of German language this was not so terrible, because there were transcriptions for the special characters (ä->ae, ö->oe, ü->ue, ß->ss) that were somewhat ugly, but widely understandable to native German speakers. Other languages like Russian, Greek, Arabic or East-Asian languages were in more trouble, because they consist of „special characters“ only.

Anyway, this „&auml;“-transcription for web pages, which is actually superior to the „ae“, because the reader of the web page will read the correct „ä“, was part of the HTML-standard to support writing characters not on the keyboard. This was a useful feature in those days, but today we can find web pages that help use with the transliteration or just look up the word with the special characters in the web in order to write it correctly. Then we can as well copy it into our HTML-code, including all the special characters.

There could be some argument about UTF-8 vs. UTF-16 vs. ISO-8859-x as possible encodings of the web page. But in the area of the web this was never really an issue, because the web pages have headers that should be present and inform the browser about the encoding. Now I recommend to use UTF-8 as the default, because that includes all the potential characters that we might want to use sporadically. And then the Umlaut kann just directly be part of the HTML content. I converted all my web pages to use Umlaut-letters properly, where needed, without using entities in the mid 90s.

Some entities are still useful:

  • „&lt;“ for „<“, because „<“ is used as part of the HTML-syntax
  • „&amp;“ for „&“ for the same reason
  • „&gt;“ for „>“ for consistency with „<„
  • „&nbsp;“ for no-break-space, to emphasize it, since most editors make it hard to tell the difference between regular space and no-break-space.
Share Button