Of course, Strings today are Unicode in one way or another. In this article we assume that code points are the building blocks of Strings. In the Java world, for example, this means that one code point is represented by one Java character (a 16-bit `char`) for typical European languages, that is, for the Latin, Greek and Cyrillic alphabets including the extensions needed to support all languages that typically use them. For code points outside the Basic Multilingual Plane, which include some rarer characters of Asian languages, one code point consists of two Java characters. There are also Strings that are illegal from a Unicode perspective, because they contain Java characters that are meant to be combined into one code point but are not paired properly. So here we assume that Strings are sequences of bytes, of two-byte characters or of whatever the encoding uses, and that these sequences properly express a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.
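To make the difference between Java characters and code points concrete, here is a minimal sketch. The class and method names (`CodePointDemo`, `hasUnpairedSurrogate`) are made up for this illustration; the code only uses methods of `String` and `Character` from the standard library.

```java
class CodePointDemo {
    // Number of Java characters (UTF-16 code units) in s.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in s.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if s contains a surrogate that is not part of a valid
    // high/low pair, i.e. the String is not well-formed UTF-16.
    static boolean hasUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // skip the low surrogate of a valid pair
                } else {
                    return true;
                }
            } else if (Character.isLowSurrogate(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String european = "Z\u00FCrich";                          // ü as one code point
        String rareCjk = new String(Character.toChars(0x20000)); // CJK ideograph outside the BMP

        System.out.println(charCount(european) + " / " + codePointCount(european)); // 6 / 6
        System.out.println(charCount(rareCjk) + " / " + codePointCount(rareCjk));   // 2 / 1
        System.out.println(hasUnpairedSurrogate(rareCjk.substring(0, 1)));           // true
    }
}
```

Cutting a String in the middle of a surrogate pair, as `substring(0, 1)` does here, is one easy way to produce such an illegal String by accident.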
Now there are many ways to create Strings that look the same but are actually different. We are not talking about „0“ and „O“ or „1“, „l“ and „I“, which might look similar in some fonts but should not, because we depend on their being distinct, even visually distinct. Unfortunately we have the bad habit of using traditional typewriter fonts, which make these hard to distinguish, for source code, where the distinction is so crucial. But for today we just assume that we always look hard enough to solve this issue.
The classical example of what looks the same is whitespace. We have the ordinary space (U+0020) and the no-break space (U+00A0), which are meant to look exactly the same but to behave slightly differently. There are tons of ways to create exactly the same look with different combinations of whitespace. But this is kind of a special case, because whitespace often carries little semantic information and we want to disregard it to some extent when comparing strings. Typical examples are stripping leading and trailing whitespace from the string or from the lines contained within it, replacing tabs with the equivalent number of spaces, or even replacing any amount of adjacent whitespace within a line with a single space. Different whitespace code points might require different rules, so it is good to be careful not to put too much logic into this and better to rely on a library, so that at least exactly the same rules are applied in equivalent situations.
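As a rough sketch of such a rule set in Java: the helper below collapses every run of whitespace into a single ordinary space and drops leading and trailing whitespace. The name `normalizeWhitespace` is invented for this example, and the rule itself is a simplification: it also collapses line breaks and does not expand tabs to a column-dependent number of spaces. Note that `Character.isWhitespace` deliberately excludes the no-break space, so the sketch additionally checks `Character.isSpaceChar`.

```java
class WhitespaceDemo {
    // Collapse each run of whitespace (including no-break spaces) into one
    // ordinary space and remove leading and trailing whitespace.
    static String normalizeWhitespace(String s) {
        StringBuilder sb = new StringBuilder();
        boolean pendingSpace = false;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp) || Character.isSpaceChar(cp)) {
                // Remember the gap, but only if some text came before it.
                pendingSpace = sb.length() > 0;
            } else {
                if (pendingSpace) {
                    sb.append(' ');
                }
                pendingSpace = false;
                sb.appendCodePoint(cp);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String withSpace = "a b";
        String withNoBreakSpace = "a\u00A0b";
        System.out.println(withSpace.equals(withNoBreakSpace)); // false
        System.out.println(normalizeWhitespace(withSpace)
                .equals(normalizeWhitespace(withNoBreakSpace))); // true
    }
}
```

Whether a no-break space should really compare equal to an ordinary space depends on the application; the point is only that the decision is made in one place.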
Another example that we might already know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but it depends on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…
| Latin | Cyrillic | Greek | Meaning of Cyrillic letter | Meaning of Greek letter |
|---|---|---|---|---|
| A | А | Α | like Latin | like Latin |
| B | В | Β | like Latin V | Beta (like Latin V in Modern Greek) |
| C | С | | like Latin S | |
| E | Е | Ε | like Latin | Epsilon (like Latin E) |
| | Г | Γ | like Latin G | Gamma (like Latin G) |
| H | Н | Η | like Latin N | Eta (like Latin I in Modern Greek) |
| J | Ј | | Serbian Ј, like German J | |
| K | К | Κ | like Latin | Kappa (like Latin K) |
| M | М | Μ | like Latin | Mu (like Latin M) |
| N | | Ν | | Nu (like Latin N) |
| O | О | Ο | like Latin | Omicron (like Latin O) |
| P | Р | Ρ | like Latin R | Rho (like Latin R) |
| | П | Π | like Latin P | Pi (like Latin P) |
| T | Т | Τ | like Latin | Tau (like Latin T) |
| | Ф | Φ | like Latin F | Phi (like Latin F) |
| X | Х | Χ | like German CH | Chi (like German CH) |
| Y | У | Υ | like Latin U | Upsilon (like Latin U) |
| Z | | Ζ | | Zeta (like German Z) |
| I | І | Ι | Ukrainian І | Iota (like Latin I) |
In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.
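A short Java sketch shows that these look-alikes really are different code points and that the standard library can tell which script each one belongs to. The class and method names are invented for this illustration; `Character.UnicodeScript.of` is part of the standard library.

```java
class ScriptDemo {
    // Returns the Unicode script of each code point in s, separated by spaces.
    static String scripts(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(Character.UnicodeScript.of(cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String latinA = "A";
        String cyrillicA = "\u0410"; // looks like A
        String greekAlpha = "\u0391"; // looks like A

        System.out.println(latinA.equals(cyrillicA));   // false
        System.out.println(latinA.equals(greekAlpha));  // false
        System.out.println(scripts(latinA + cyrillicA + greekAlpha)); // LATIN CYRILLIC GREEK
    }
}
```

A word that mixes scripts in this way can be a hint that something is off, for example in user names or domain names, although legitimate mixed-script text exists as well.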
While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so-called combining characters, which should really be named „combining code points“, but here we go. We can put them after a letter and they combine with it to form a letter with a diacritical mark. A typical example is the letter „U“, which can be followed by the combining diaeresis (two dots, U+0308) to form an „Ü“ that looks the same as the „Ü“ consisting of the single code point U+00DC. It is meant to look the same, and it also has the same meaning, at least for most purposes. What we see is the glyph. We see the difference when we prefix each code point with a minus or a space: the two-code-point „Ü“ becomes „-U-̈“ or „ U ̈“, while the single-code-point „Ü“ becomes „-Ü“ or „ Ü“, as we would expect.
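The prefixing experiment described above can be reproduced with a few lines of Java. This is only a sketch; the names `prefixEachCodePoint` and `codePointList` are made up for it.

```java
class CombiningDemo {
    // Puts the given prefix in front of every code point of s.
    static String prefixEachCodePoint(String s, String prefix) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(prefix).appendCodePoint(cp));
        return sb.toString();
    }

    // Lists the code points of s in U+XXXX notation, separated by spaces.
    static String codePointList(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String twoCodePoints = "U\u0308"; // U followed by combining diaeresis
        String oneCodePoint = "\u00DC";   // precomposed Ü

        System.out.println(codePointList(twoCodePoints)); // U+0055 U+0308
        System.out.println(codePointList(oneCodePoint));  // U+00DC
        System.out.println(prefixEachCodePoint(twoCodePoints, "-"));
        System.out.println(prefixEachCodePoint(oneCodePoint, "-"));
    }
}
```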
Expressing a glyph with two code points in this way is not very well known and thus not very common, but we can already see it today when we look at Wikipedia articles. In some languages, where the pronunciation is ambiguous, it can be made clear by putting an accent mark on one vowel. The article Кириллица, for example, writes the term at the beginning of the article with an accent mark, like this: „Кири́ллица“. Since accent marks are unfortunately not used in normal writing with the Cyrillic alphabet, it comes in handy that the combining accent also works with Cyrillic letters. With minus signs between the code points it looks like this: „К-и-р-и-́-л-л-и-ц-а“, and with spaces like this: „К и р и ́ л л и ц а“. So the Strings that we encounter in our programs will contain these combining characters in the future. While we could prohibit them, it is better to embrace them, and this is actually not too hard if we use decent libraries. Java has the `Normalizer` class in its built-in library, which can convert to one or the other convention for expressing such glyphs and thus allows us to compare Strings in the way that we actually mean.
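A minimal sketch of such a comparison with `java.text.Normalizer` follows. The helper name `equalsNormalized` is invented for this example, and the choice of NFC (composed form) rather than NFD (decomposed form) is an assumption; either works as long as both sides use the same form.

```java
import java.text.Normalizer;

class NormalizeDemo {
    // Compares two Strings after converting both to normalization form NFC.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String twoCodePoints = "U\u0308"; // U followed by combining diaeresis
        String oneCodePoint = "\u00DC";   // precomposed Ü

        System.out.println(twoCodePoints.equals(oneCodePoint));          // false
        System.out.println(equalsNormalized(twoCodePoints, oneCodePoint)); // true

        // Cyrillic и with combining acute accent has no precomposed form,
        // so even NFC keeps it as two code points.
        String accented = Normalizer.normalize("\u0438\u0301", Normalizer.Form.NFC);
        System.out.println(accented.length()); // 2
    }
}
```

The last lines show that normalization makes equivalent Strings equal, but it does not guarantee one code point per visible letter.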
Unfortunately, issues like the semantic length of a string or semantic positions within it become even harder than they already were after moving from characters to code points. And we can be sure that Unicode still has more to offer to complicate things if we dig deeper. The typical answer that we get on most web sites that talk about these issues is something like: „The length of strings and positions within strings are surprisingly irrelevant to most programs.“
At the end of the day, jobs that were trivial in the past are now becoming a big deal, and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.
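For the length question, Java's `java.text.BreakIterator` offers one possible approach: its character instance iterates over what a user would perceive as characters, so a base letter and its combining marks count once. The sketch below is an illustration under that assumption, with the invented name `perceivedLength`; the exact boundaries can differ between Java versions for more exotic sequences.

```java
import java.text.BreakIterator;

class LengthDemo {
    // Counts user-perceived characters: a base letter together with
    // its combining marks is counted once.
    static int perceivedLength(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "Кириллица" with a combining acute accent after the second и
        String word = "\u041A\u0438\u0440\u0438\u0301\u043B\u043B\u0438\u0446\u0430";

        System.out.println(word.length());                        // 10 Java characters
        System.out.println(word.codePointCount(0, word.length())); // 10 code points
        System.out.println(perceivedLength(word));                 // 9 perceived letters
    }
}
```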
What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?
Links
- Unicode, UTF-8, UTF-16, ISO-8859-1: Why is it so difficult?
- GNU-Emacs und Unicode
- The little obstacles of interoperability
- Java Properties Files and UTF-8
- Using non-ASCII-characters
- UTF-16 Strings in Java
- Will Java, C, C++ and C# be the new Cobols?
- Unicode
- code points
- whitespace
- combining characters
- Glyph
- Greek Alphabet
- Cyrillic Alphabet
- Latin Alphabet