How to get rid of these HTML-entities in Files

It has been written here that HTML-entities (these ä etc) should be avoided with the exception of those that we need due to the HTML-syntax like <, >, & and maybe " and  . They were already mostly obsolete more than 20 years ago, but in those days we still did not automatically use UTF-8 or UTF-16, but often an 8-bit character encoding that could express only up to 256 characters, in reality around 200 due to control characters. At least these 200 could be used. That was enough for web pages in those days and texts in German, French, Russian, Greek, Hebrew, Arabic and many other language could well be written, as long as only one language or a few similar languages were used. For the rare occasions that required some characters that were not in this character set, it was an option to rely on these HTML-entities. Or for typing HTML-pages on an US-keyboard without any good tool support.

But now Unicode has been around for more than 25 years and more than 90% of the web pages use UTF-8.

Now some people think that these HTML-entities are kind of necessary or at least „safer“ and I see people still writing HTML-code with them in these days. Or tools by relatively well known companies, that produced such output not so long ago… It is a good thing to have some courage and to change something like this to readable and natural format. Or more generally to try out if a simpler or better solution works. Reasonable courage is good for this, too much of something good can go bad, as so often…

So, please teach your collegues not to use these ugly HTML-entities, where UTF-8-characters are the better option.

And here is a perl script that converts the HTML-entities with the exceptions mentioned above to UTF-8. In the project conversion-utils some more such scripts might be added. The script is a bit too long to be pasted inline in a code block, so it is better to find the current version on github.

Then you can do something like this:

git commit
for file in *.html ; do
echo $file
mv $file ${file}~entities~
html2utf8 < ${file}~entities~ > $file
echo /$file
done
git diff

to convert all files in a directory. I assume that you are using Linux or at least have bash like for example in cygwin.
There are other tools to do the same thing, I am sure. Just use anything that works for you to get away from this unreadable crap.

Share Button

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht. Erforderliche Felder sind mit * markiert.

*