Java Properties Files and UTF-8

Java uses a nice, pragmatic file format for simple configuration tasks and for the internationalization of applications. It is called a Java properties file or simply a “.properties file”. It contains simple key-value pairs. For most configuration tasks this is useful and easy to read and edit. Nested configurations can be expressed by simply using dots (“.”) as part of the key. This has been available since Java 1.0. For internationalization there is a simple way to create properties files with almost the same name, but with a language code just before the .properties suffix. The concept is called a “resource bundle”. Whenever a language-specific string is needed, the program just knows a unique key and performs a lookup.
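
To make the two uses concrete, here is a minimal sketch. It parses the properties text from in-memory strings rather than from files, and the keys (db.host, greeting) and the class name are made up for illustration. In a real application the bundles would normally live in files on the classpath and be loaded via ResourceBundle.getBundle.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class PropertiesBasics {

    // Parses properties text. The dotted keys are plain strings to Java;
    // the "nesting" is only a naming convention.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Builds a resource bundle from in-memory text, standing in for a file
    // such as messages_de.properties (hypothetical name) on the classpath.
    static ResourceBundle bundle(String text) throws IOException {
        return new PropertyResourceBundle(new StringReader(text));
    }

    public static void main(String[] args) throws IOException {
        Properties config = parse("db.host=localhost\ndb.port=5432\n");
        System.out.println(config.getProperty("db.host") + ":" + config.getProperty("db.port"));

        // Same key, one bundle per language; the program only knows the key.
        ResourceBundle english = bundle("greeting=Good morning\n");
        ResourceBundle german = bundle("greeting=Guten Morgen\n");
        System.out.println(english.getString("greeting"));
        System.out.println(german.getString("greeting"));
    }
}
```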

The unpleasant part is that these files are, in the style of the 1990s, encoded in ISO-8859-1, which covers only a few languages of western, central and northern Europe. For other languages, a workaround is to write \u followed by the four-digit hex code of each UTF-16 code unit, but this is neither readable nor easy to edit. Usually we want to use UTF-8 or in some cases real UTF-16, without this \u hack.
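
The following sketch shows the effect. It feeds the same line to Properties.load(InputStream) twice, once with the \u escape and once as raw UTF-8 bytes. The class and method names are made up for illustration. The lengths are printed instead of the strings so that the result does not depend on the console encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class EscapeDemo {

    // Loads the given file content through Properties.load(InputStream), which
    // decodes the bytes as ISO-8859-1 and expands the hex escapes.
    static String loadValue(byte[] fileBytes, String key) throws IOException {
        Properties props = new Properties();
        props.load(new ByteArrayInputStream(fileBytes));
        return props.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        // File content using the escape: pure ASCII on disk.
        byte[] escaped = "title=sch\\u00f6ner Zeichensalat\n".getBytes(StandardCharsets.US_ASCII);
        // File content saved as real UTF-8: the o-umlaut becomes two bytes.
        byte[] utf8 = "title=sch\u00f6ner Zeichensalat\n".getBytes(StandardCharsets.UTF_8);

        String good = loadValue(escaped, "title");
        String mangled = loadValue(utf8, "title");

        System.out.println(good.length());    // 20: the umlaut is one character
        System.out.println(mangled.length()); // 21: the umlaut was read as two characters
    }
}
```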

One way to deal with this is the native2ascii converter, which can convert UTF-8 or UTF-16 to the format of properties files. The problem can be addressed by keeping .uproperties files, which are UTF-8, and converting them to .properties files with native2ascii as part of the build process. It is still a hack, but properly done it should not hurt too much, apart from the work it takes to get this working. I would strongly recommend making sure that the converted and unconverted files never get mixed up. This is extremely important, because a mix-up is not easily detected in the case of UTF-8 with typical central European content, but it creates the ugly errors that we are used to seeing, like “sch�ner Zeichensalat” instead of “schöner Zeichensalat”. We only discover it when the files are already quite messed up, because at least in German the umlaut characters are only a small fraction of the text, but they are still annoying when messed up. So I would recommend a different suffix to make this clear.
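
Where the native2ascii tool is not at hand, the conversion step is small enough to sketch in Java. This is only an approximation of what native2ascii does for this use case, not a replacement for it. The class name, the method names and the file name messages_de are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UpropertiesConverter {

    // Replaces every non-ASCII UTF-16 code unit with a four-digit hex escape,
    // leaving ASCII (including comments and line breaks) untouched.
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Reads a UTF-8 source file and writes the escaped, pure-ASCII target file.
    static void convert(Path source, Path target) throws IOException {
        String text = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Files.write(target, escapeNonAscii(text).getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("i18n");
        Path source = dir.resolve("messages_de.uproperties"); // hypothetical file name
        Path target = dir.resolve("messages_de.properties");
        Files.write(source, "title=sch\u00f6ner Zeichensalat\n".getBytes(StandardCharsets.UTF_8));
        convert(source, target);
        System.out.print(new String(Files.readAllBytes(target), StandardCharsets.US_ASCII));
    }
}
```

Keeping the source and the target under different suffixes, as in main above, is what prevents the converted and unconverted files from being mixed up.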

The bad thing is that most JVM languages have been kind of “lazy” (which is a good thing, usually) and have used some of Java’s infrastructure for this, thus inheriting the problem from Java.

Another way to deal with this is to use XML files, which are UTF-8 by default and which can be declared to be UTF-16. With some development work, or by searching for existing implementations, there should be ways to do the internationalization this way.
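
One existing implementation ships with the JDK itself: java.util.Properties can read an XML format via loadFromXML. The sketch below covers only plain key lookup from such a file. Wiring it into resource bundles would take additional work that is not shown here. The class name, the method name and the key are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class XmlPropertiesDemo {

    // Parses the JDK's XML properties format; the encoding is taken from
    // the XML declaration, so UTF-8 content needs no escapes.
    static Properties loadXml(byte[] xmlBytes) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(new ByteArrayInputStream(xmlBytes));
        return props;
    }

    public static void main(String[] args) throws IOException {
        // In a real file the umlaut would simply be typed as is; the escape
        // here only keeps this Java source file pure ASCII.
        String xml =
              "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <entry key=\"title\">sch\u00f6ner Zeichensalat</entry>\n"
            + "</properties>\n";
        Properties props = loadXml(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(props.getProperty("title").length()); // 20: umlaut intact
    }
}
```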

Typically some process needs to be added, because translators are often non-IT people who use some tool that displays the texts in the original language and accepts the translation. For good translations, the translator should actually use the software to see the context, but this is another topic for the future. Possibly there needs to be some conversion from the data provided by the translator into XML, .uproperties, .properties or whatever is used. This conversion should be automated by scripts or even by the build process, and it should merge new translations properly with existing ones.

Anyway, Java 9 will be helpful with this issue. Finally, properties files that are used as resource bundles for internationalization can be UTF-8 in Java 9.
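
A minimal sketch of what this looks like, assuming it runs on Java 9 or later: the bundle is built from raw UTF-8 bytes and the umlaut survives without any \u escape. On Java 8 and earlier the same code would read the bytes as ISO-8859-1 and mangle the value. The class name, the method name and the key are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class Utf8BundleDemo {

    // Builds a bundle from UTF-8 bytes, standing in for a UTF-8 encoded
    // .properties file that would normally be found on the classpath.
    static ResourceBundle bundleFromUtf8(String content) throws IOException {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        return new PropertyResourceBundle(new ByteArrayInputStream(bytes));
    }

    public static void main(String[] args) throws IOException {
        ResourceBundle bundle = bundleFromUtf8("title=sch\u00f6ner Zeichensalat\n");
        // Assumption: Java 9 or later, where this stream is decoded as UTF-8.
        System.out.println(bundle.getString("title").equals("sch\u00f6ner Zeichensalat"));
    }
}
```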

6 Comments

Simon Martnelli says:

2018-02-03 at 16:49:39

Don’t forget about GNU gettext if it comes to translation:
https://www.gnu.org/software/gettext/

There you have a workflow and tools that are very helpful
http://gted.org/#Examples

bk1 says:

2018-02-05 at 14:00:07

Thank you for your comment. Yes, gettext was originally written for C and in C, but it has actually been ported to Scala and Java and many other languages. It is promising and interesting to explore how it can serve as an alternative to the resource bundle way using properties files, especially in projects that use several programming languages and several spoken languages. I will include the links in the links section of the article.

Pingback: Some thoughts about String equality | Karl Brodowsky's IT-Blog
Pingback: Unicode, UTF-8, UTF-16, ISO-8859-1: Why is it so difficult? | Karl Brodowsky's IT-Blog
Pingback: Unicode, UTF-8, UTF-16, ISO-8859-1: Warum ist das so schwierig? | Karl Brodowsky's IT-Blog
Pingback: Checked Exceptions in Java | Karl Brodowsky's IT-Blog

Links
