Phone Numbers and E-Mail Addresses

Most data that we deal with are strings, numbers or booleans, and combinations of these into classes and collections. Dates can be expressed as strings or numbers, but they have enough specific logic to be seen as a fourth group of data. All of these have interesting aspects, some of which have already been discussed in this blog.

Now phone numbers are, by a naïve approach, numbers or strings, but very soon we see that they have their own specific aspects. The same applies to email addresses, which can be represented as strings.

Often projects go by their own „simplified" specification of what an email address or a phone number is and how to parse, compare and render them. At the end of the day the simplification is harder to tame than the real solution, because it needs to be maintained and specified by the project team rather than being based on a proven library. And once in a while „edge cases" occur that cannot be ignored and that make the „home grown" library even more complex.

Behind phone numbers and email addresses there are well defined and established standards, and they are hard to understand thoroughly within the constrained time budget of a typical „business project", because that time should be allocated to enhancing the business logic and not to reinventing the basics. Unless there is a real need to do so, of course.

Just to give an idea: when phone numbers are parsed or provided by user input, they can start with a „+" sign or use some country specific logic to express which country they belong to. And then „+1", for example, does not stand for the United States alone, but also for Canada and some smaller countries that are in some way associated with the United States or Canada; further analysis of the number is required to know that. The prefix for international numbers is often „00", but in the United States it is „011", and there were and are some other variants that are still frequently used. Some people like to write something like „+49(0)431 77 88 99 11 1" instead of „+49 431 77 88 99 11 1". We can constrain the input to the variants we happen to think of and force the supplier of data to comply, but why bother? Why not accept legitimate formats, as long as they are correct and unambiguous?

Now for email addresses there is the famous one-page regular expression for recognizing correct email addresses, which even by itself is not totally complete. You can find it at the bottom of the article…

Of course it includes some rarely used variants of email addresses that were once used and have not been completely abolished officially, but it is hard to draw an exact border for this.

So the general recommendation is to find a good library for working with email addresses and phone numbers. Maybe the library can even, to some extent, eliminate input strings that formally comply with the format but are known to be incorrect, by knowing about numbering schemes worldwide or about email domains, or even by performing lookups.

Another strong recommendation is to store data like email addresses and phone numbers in a technical format, which in the case of phone numbers means always starting with a „+" followed by digits only. For input, any positioning of spaces is accepted; for output, the library knows how to format the number correctly. This allows selecting by phone number without dealing with complex formatting, by simply using the technical format in the query as well.

For Java (and thus for many JVM languages), C++ and JavaScript there is an excellent library from Google for dealing with phone numbers. For email addresses something like the email validator from Apache Commons Validator is a way to go.
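
To give a rough idea of what using such libraries looks like, here is a small sketch in Java with Google's libphonenumber and the email validator from Apache Commons Validator. The phone number and email address are made-up examples; this is just a sketch, not a complete solution.

import com.google.i18n.phonenumbers.NumberParseException;
import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;
import org.apache.commons.validator.routines.EmailValidator;

public class ContactDataExample {
    public static void main(String[] args) throws NumberParseException {
        PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
        // parse() accepts many legitimate input formats; the default region "DE"
        // is only used when the number does not start with "+"
        PhoneNumber number = phoneUtil.parse("+49 (0)431 77 88 99 11", "DE");
        System.out.println(phoneUtil.isValidNumber(number));
        // the technical format for storage: "+" followed by digits only
        System.out.println(phoneUtil.format(number, PhoneNumberFormat.E164));

        // purely formal check of an email address
        System.out.println(EmailValidator.getInstance().isValid("jane.doe@example.com"));
    }
}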

Keep in mind that for email addresses and phone numbers the ultimate way of verification is to send a link or a code to them that the user needs to enter. At the end of the day it is insufficient to rely on formal verification alone without this final step.

But still, issues remain: transforming data into a canonical technical format for storage, formatting data for display etc. And there is huge added value if we can reliably recognize formally incorrect entries early, when the user can still easily react to them, rather than waiting for an email/SMS/phone call to be processed, which may fail when the user is no longer on our „registration site". And when we process data that has already been verified by a third party, we still want to parse it to recognize obvious errors.

The concrete libraries may be outdated by the time you read this, or they may not be applicable to the language environment that you are using, but please make an effort to find something similar.

So, please use good libraries, which are likely to be found for the environment that you are using, and write yourself only what creates value for your project or organization. Unless your goal really is to write a better library. Better to invest the time into areas where there are still no good libraries around.

And as always, you may understand email addresses and phone numbers as an example of a more general idea.

Links

E-Mail Regex

Source: https://emailregex.com/:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? 
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)


Ranges of Dates and Times

In software we often deal with ranges of dates and times.

Let us look at it from the perspective of an end user.

When we say something like „from 2019-03-07 to 2019-03-10" we mean the set of all timestamps t such that

    \[\text{2019-03-07} \le t < \text{2019-03-11}\]

or more accurately:

    \[\text{2019-03-07T00:00:00}+TZ \le t < \text{2019-03-11T00:00:00}+TZ\]

It is important that we mean to include the whole 24-hour day of 2019-03-10. By the way, please try to get used to the ISO date format even when writing normal human readable texts; it just makes sense…

Now when we are not talking about dates, but about times or instants of time, the interpretation is different.
When we say something like „from 07:00 to 10:00" or „from 2020-03-10T07:00:00+TZ to 2020-04-11T09:00:00+TZ", we actually mean the set of all timestamps t such that

    \[givenDate\text{T07:00:00}+TZ \le t < givenDate\text{T10:00:00}+TZ\]

or

    \[\text{2020-03-10T07:00:00}+TZ \le t < \text{2020-04-11T09:00:00}+TZ,\]

respectively. It is important that we have to add one day in the case of date-only information (accuracy of one day) and that we do not in the case of finer grained date/time information. The question whether the upper bound is included or not is not so important in our everyday life, but in practice the most useful convention is not to include the upper bound. If you prefer to have all options, it is a better idea to employ an interval library, i.e. to find one or to write one. But for most cases it is enough to exclude the upper limit. This guarantees disjoint adjacent intervals, which is usually what we want. I have seen people write code that adds 23:59:59.999 to a date and compares with ≤ instead of <, but this is an ugly hack that needs a lot of boilerplate code and a lot of time to understand. Use the exclusive upper limit, because we have it.

Now the requirement is to add one day to the upper limit to get from the human readable form of date-only ranges to something computers can work with. It is a good thing to agree on where this transformation is made. And to do it in such a way that it even behaves correctly on those dates where daylight saving starts or ends, because adding one day might actually mean „23 hours" or „25 hours". If we need to be really accurate, sometimes leap seconds need to be added as well.

Another issue has come up here: local time is much harder than UTC. We need to work with local time in all kinds of user interfaces for humans, with very few exceptions like for pilots, who actually work with UTC. But local date and time is ambiguous for one hour every year and at least a bit special to handle on those two days where daylight saving starts and ends. Convert dates to UTC and work with that internally, and convert them to local date and time in all kinds of user interfaces where it makes sense, including documents that are printed or provided as PDFs, for example. When we work with dates without time, we need to add one day to the upper limit and then round it to the nearest some-date\text{T00:00:00}+TZ for our timezone TZ, or know when to add 23, 24 or 25 hours, respectively, which we do not want to know. We should use modern time libraries like the java.time.XXX classes in Java, for example.
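
As a small sketch of this with java.time: the user enters a date-only range, we add one day to the upper limit exactly once, convert at the start of the day in the user's timezone and work internally with a half-open interval of instants in UTC. The timezone „Europe/Berlin" is just an example.

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class DateRangeExample {
    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Berlin");
        LocalDate from = LocalDate.parse("2019-03-07");
        LocalDate to = LocalDate.parse("2019-03-10");

        // lower bound inclusive, upper bound exclusive: the "+1 day" happens here, exactly once
        Instant lower = from.atStartOfDay(zone).toInstant();
        Instant upper = to.plusDays(1).atStartOfDay(zone).toInstant();

        // atStartOfDay(zone) also handles the 23 and 25 hour days around daylight saving
        System.out.println(lower + " <= t < " + upper);
    }
}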

Working with date and time is hard. It is important to avoid making it harder than it needs to be. Here are some recommendations:

  • Try to use UTC for the internal use of the software as much as possible
  • Use local date or time or date and time in all kinds of user interfaces (with few exceptions)
  • Add one day to the upper limit and round it to the nearest midnight of local time exactly once in the stack
  • Exclude the upper limit in date ranges
  • Use ISO date formats even in the user interfaces, if possible



How to get rid of these HTML-entities in Files

It has been written here that HTML entities (these &auml; etc.) should be avoided, with the exception of those that we need due to the HTML syntax, like &lt;, &gt;, &amp; and maybe &quot; and &nbsp;. They were already mostly obsolete more than 20 years ago, but in those days we did not yet automatically use UTF-8 or UTF-16, but often an 8-bit character encoding that could express only up to 256 characters, in reality around 200 due to control characters. At least these 200 could be used. That was enough for web pages in those days, and texts in German, French, Russian, Greek, Hebrew, Arabic and many other languages could well be written, as long as only one language or a few similar languages were used. For the rare occasions that required characters that were not in this character set, it was an option to rely on these HTML entities. Or for typing HTML pages on a US keyboard without any good tool support.

But now Unicode has been around for more than 25 years and more than 90% of the web pages use UTF-8.

Now some people think that these HTML entities are somehow necessary or at least „safer", and I still see people writing HTML code with them these days. Or tools by relatively well known companies that produced such output not so long ago… It is a good thing to have some courage and to change something like this to a readable and natural format. Or, more generally, to try out whether a simpler or better solution works. Reasonable courage is good for this; too much of a good thing can go bad, as so often…

So, please teach your colleagues not to use these ugly HTML entities where UTF-8 characters are the better option.

And here is a Perl script that converts the HTML entities, with the exceptions mentioned above, to UTF-8. In the project conversion-utils some more such scripts might be added. The script is a bit too long to be pasted inline in a code block, so it is better to find the current version on GitHub.

Then you can do something like this:

git commit
for file in *.html ; do
    echo "$file"
    mv "$file" "${file}~entities~"
    html2utf8 < "${file}~entities~" > "$file"
    echo "/$file"
done
git diff

to convert all files in a directory. I assume that you are using Linux or at least have bash, for example in Cygwin.
There are other tools to do the same thing, I am sure. Just use anything that works for you to get away from this unreadable crap.


UUIDs revisited

UUIDs have proven useful in many circumstances.
We have basically two main variants:

  • The UUID is calculated as a combination of the Ethernet MAC address, the timestamp and a counter.
  • The UUID is calculated using a good random number generator.

While variant 1 provides good uniqueness, there are some issues with it. Today we mostly use virtualized servers, which means that the MAC address comes from a configuration file and is no longer guaranteed to be worldwide unique. And we give away some information with the UUID that we do not necessarily want to give away.

Variant 2 can be proven to have an acceptably low risk of collisions, but this is only true when using really good random number generators, which cannot always be guaranteed. Also it introduces an uncertainty in an area where we do not need it. We need to worry about this uniqueness, at least a little bit, which is unnecessary.

So the question is whether we can rethink variant 1.

Assume that our software runs on our own server farm. There may be a few hundred or thousand or even millions of virtual or physical servers. Now the organization does have a way to uniquely identify its servers. Of course we only need to consider the servers that are relevant for the application. Maybe an ID for the service instance instead of the server is even better. We may assume a numerical ID, or have a table to map IP addresses and the like to such an ID. Some thinking is still required on how to do this. We can fill the digits that we do not need with random numbers.

Putting this ID in place of the MAC address solves the issue of the configurable MAC address.

The next problem, that timestamps can be abused to find out something that should not be found out, could be resolved by running the timestamp part or even the ID part (including a random number) through a symmetric encryption or simply some bijective function that is kept secret.
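
A minimal sketch of such a customized generator in Java, assuming that a 64-bit service instance ID has been assigned by the organization; the bit layout and the omission of the encryption step are simplifications, not a standard.

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public final class NodeIdUuidGenerator {

    private final long nodeId;                           // unique per service instance, assigned by the organization
    private final AtomicLong counter = new AtomicLong(); // separates UUIDs created within the same millisecond

    public NodeIdUuidGenerator(long nodeId) {
        this.nodeId = nodeId;
    }

    public UUID next() {
        long millis = System.currentTimeMillis();
        long count = counter.incrementAndGet() & 0xFFFF;
        long mostSignificant = (millis << 16) | count;   // timestamp and counter in the upper 64 bits
        long leastSignificant = nodeId;                  // service instance ID in the lower 64 bits
        return new UUID(mostSignificant, leastSignificant);
    }
}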

In many circumstances there is nothing wrong with customizing the UUID-generation to some „local“ standard, if this is well understood and carefully implemented.


Unicode and C

It is a common practice in C to use arrays of char as strings. A 0 byte is used as the end marker.

The whole thing was created like that in the 1970s, and at that time it was kind of cool to get away with one language feature less and to express it in terms of others instead. And people did not think enough about the necessity to express more than ISO 646 IRV (commonly called ASCII) as string content.

This extended out of the box to 8-bit character sets like ISO 8859-1 or KOI-8, which are identical to ISO 646 in the lower 128 characters and contain an extension in the upper 128 characters. But fortunately we have moved ahead and now have Unicode with its encodings UTF-8, UTF-16 and UTF-32.

How can we deal with this in C?

UTF-8 just works out of the box, because the byte 0 is only used to encode the code point U+0000. So the null termination can be kept as it is and a lot of functionality remains valid. Some issues arise, because in UTF-8 things like finding the logical length of a string (not its memory consumption) or finding the nth code point (not the nth byte) require UTF-8 logic to be applied and either parsing the whole string from the beginning to the desired position or the usage of an indexing facility. So a lot of non-trivial string functionality of the standard library will just not be as easy as people thought it would be in the 1970s and subsequently will not work as needed. Libraries for better UTF-8 support in C can be found, leaving the „native" C strings with UTF-8 content only for usage in interfaces that require them. I have not yet explored such libraries, but it would be interesting to find out how powerful and useful they are.

At the time when Unicode came out, it seemed to be sufficient to have 16 bits per character instead of 8. Java was built on this assumption. C added a wchar_t to allow for this and just required it to be „long enough". So Linux uses 32 bits and MS-Windows 16 bits. This is not too bad, because programming in C for MS-Windows and for Linux is quite different anyway, unless we abstract the differences into a library, which would then also include a common string definition and string handling functionality. While the Linux wchar_t is sufficient, it really wastes a lot of memory, which is often undesirable if we go to the extra effort of programming in C in order to gain performance. The Windows wchar_t is „kind of sufficient", as are the Java strings, because we can really do a lot by assuming that Unicode is only 16 bits, or by using UTF-16 and ignoring its complexities, which are in principle the same as for UTF-8, but can be ignored with fewer disadvantages most of the time. The good news is that wchar_t is well supported by standard library functions.

Another way is to use char16_t and char32_t, which have a clear definition of their length, but much less library support.

Probably these facilities are sufficient for software whose string handling is relatively trivial. For more ambitious string handling in terms of functionality and performance, it will be necessary to find third party libraries or to write them.



Tip: how to make Thunderbird use ISO Date Format

After an upgrade, my Thunderbird started to use a weird US date format and forgot my setting for this. I did not find any setting either. But this seems to work fine:

Create a shell script

#!/bin/bash

export LC_TIME=sv_SE.utf8
export LC_DATE=sv_SE.utf8

exec /usr/bin/thunderbird

and make sure that this script gets called when Thunderbird is started, instead of calling /usr/bin/thunderbird directly.


Some thoughts about String equality

Of course strings today are in some way Unicode. In this article we assume code points as the building blocks of strings. In the Java world that means that one code point consists of one Java character for the typical European languages using Latin, Greek or Cyrillic alphabets, including extensions to support all languages typically written with these alphabets. But when moving to Asian languages, a code point can also consist of two Java characters, and there are strings that are illegal from the Unicode perspective, because they contain characters that should be combined in a way that cannot be combined properly. So here we assume that strings consist of sequences of bytes or two-byte characters or whatever encoding properly expresses a sequence of code points. There are many interesting issues when dealing with some Asian languages that we will not cover here today.

Now there are a lot of possibilities to create strings that look the same but are actually different. We are not talking about „0" and „O" or „1" and „l" and „I", which might look similar in some fonts but should not look similar, because we actually depend on their distinctness, even on their visual distinctness. Unfortunately we have the bad habit of using traditional typewriter fonts that make it hard to distinguish these, for source code, where it would be so crucial. But for today, we just assume that we always look hard enough to solve this issue.

The classical example of what looks the same is whitespace. We have the ordinary space “ “ and the no-break space “ „, which are meant to look exactly the same, but to expose slightly different behavior. There are tons of possibilities to create exactly the same look with different combinations of whitespace. But this is kind of a special case, because whitespace often carries little information in terms of semantics and we want to disregard it to some extent when comparing strings. Typical examples are stripping leading and trailing whitespace of the string or of the lines contained within it, and replacing tabulators with the number of spaces that would be equivalent. Or even replacing any amount of adjacent whitespace within a line by a single space. Again, handling of different whitespace code points might require different rules, so it is good to be careful not to put in too much logic, and it is better to rely on a library to at least apply exactly the same rules in equivalent situations.
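
A small sketch of such whitespace handling in Java: trimming and collapsing runs of whitespace before comparing. Note that \s in Java regular expressions only covers the ASCII whitespace characters by default; no-break spaces and other Unicode space characters would need an explicit character class or Pattern.UNICODE_CHARACTER_CLASS, depending on the rules we want.

public class WhitespaceCompare {

    // collapse any run of ASCII whitespace to a single space and trim the ends
    static String canonicalWhitespace(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String a = "  Hello \t  World  ";
        String b = "Hello World";
        System.out.println(canonicalWhitespace(a).equals(canonicalWhitespace(b))); // true
    }
}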

Another example that we might actually know is that certain characters look the same or almost the same in the Cyrillic, Greek and Latin alphabets. I try to give an idea of the meaning of the Greek and Cyrillic characters, but they depend on the language, the dialect and even the word, the word form or the actual occurrence of the letter in the word…

Latin | Cyrillic | Greek | meaning of Cyrillic letter | meaning of Greek letter
A     | А        | Α     | like Latin                 | like Latin
B     | В        | Β     | like Latin V               | Beta (like V in new Greek)
C     | С        |       | like Latin S               |
E     | Е        | Ε     | like Latin                 | Epsilon (like Latin E)
      | Г        | Γ     | like Latin G               | Gamma (like Latin G)
H     | Н        | Η     | like Latin N               | Eta (like Latin I in new Greek)
J     | Ј        |       | Serbian Ј, like German J   |
K     | К        | Κ     | like Latin                 | Kappa (like Latin K)
M     | М        | Μ     | like Latin                 | Mu (like Latin M)
N     |          | Ν     |                            | Nu (like Latin N)
O     | О        | Ο     | like Latin                 | Omikron (like Latin O)
P     | Р        | Ρ     | like Latin R               | Rho (like Latin R)
      | П        | Π     | like Latin P               | Pi (like Latin P)
T     | Т        | Τ     | like Latin                 | Tau (like Latin T)
      | Ф        | Φ     | like Latin F               | Phi (like Latin F)
X     | Х        | Χ     | like German CH             | Chi (like German CH)
Y     | У        | Υ     | like Latin U               | Upsilon (like Latin U)
Z     |          | Ζ     |                            | Zeta (like German Z)
I     | І        | Ι     | Ukrainian I                | Iota (like Latin I)

In this case we usually want the characters to look the same or at least very similar, because that is how to correctly display them, but we do want them to be different when comparing strings.

While these examples are kind of obvious, there is another one that we tend to ignore, but that will eventually catch us. There are so-called combining characters, which should actually be named „combining code points", but here we go. That means that we can put them after a letter and they will combine to form a letter with diacritical marks. A typical example is the letter „U" that can be combined with two dots “ ̈“ to form an „Ü", which looks the same as the „Ü" that is composed of one code point. It is meant to look the same, but it also has the same meaning, at least for most purposes. What we see is the glyph. We see the difference when we prefix each code point with a minus or a space: „Ü" -> „-U-̈" or “ U ̈“, while the single-code-point variant is transformed like this: „Ü" -> „-Ü" or “ Ü“, as we would expect.

While the way to express the glyph with two code points in such a way is not very well known and thus not very common, we actually see it already today when we look at Wikipedia articles. In some languages, where the pronunciation is ambiguous, it can be made clear by putting an accent mark on one vowel, as for example Кириллица, where an accent mark is put on the term at the beginning of the article like this: „Кири́ллица". Since accent marks are unfortunately not used in normal Cyrillic writing, it comes in handy that the combining accent also works with Cyrillic letters. When putting minus signs between the code points it looks like this: „К-и-р-и-́-л-л-и-ц-а", or with spaces like this: „К и р и ́ л л и ц а". So strings that we encounter in our programs will contain these combining characters in the future. While we can prohibit them, it is better to embrace this, and it is actually not too hard if we use decent libraries. Java has the Normalizer class in its built-in library, which can convert to one or the other convention of expressing such glyphs and then allows comparison in the way that we actually mean.
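
A small sketch with the Normalizer class: „Ü" as a single code point and „U" followed by the combining diaeresis are different strings for equals(), but compare as equal after normalizing both to the same form (NFC here).

import java.text.Normalizer;

public class NormalizedCompare {

    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "\u00DC";     // Ü as one code point
        String decomposed = "U\u0308";  // U followed by the combining diaeresis
        System.out.println(composed.equals(decomposed));            // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}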

Unfortunately issues like semantic lengths of strings or semantic positions become even harder than they already were after moving from characters to code points. And we can be sure that Unicode still has more to offer to complicate things if we dig deeper. The typical answer that we get on most web sites that talk about these issues is something like: „The length of strings and positions within strings are surprisingly irrelevant to most programs."

At the end of the day, jobs that were trivial in the past are now becoming a big deal, and we need to learn to think of comparison, length, position, regular expressions, sorting and all kinds of string functionality with bytes, characters, code points and glyphs in mind.

What can our current libraries already do for us, what are we missing in them, considering different programming languages, databases, text files and network transmission?



The little obstacles of interoperability


A lot of things in today’s IT landscape have been unified and interoperability is much better than 20 years ago.

Some examples:

  • Networking: Today networking is TCP/IP. Even the physical cables with RJ45/Ethernet and the wireless networks have been standardized and all kinds of devices can use the same networks. In the old days there were tons of mutually incompatible proprietary network technologies like BitNet (IBM), NetBios (MS), DecNet (DEC), IPX (Novell),….
  • Character Sets: Today we have Unicode and a few standard encodings. At least for the web and emails we have ways to provide meta information about the actual encoding. This area is not totally free of problems, but on a very good way. In earlier days we had to deal with different EBCDIC encodings or with character sets that only fully supported the English language and a few languages using the same alphabet without additional letters like „ä", „ö", „ü", „ø", „ñ",… So here we have seen great progress.
  • Numbers: For floating point numbers and integers a relatively small set of standardized numerical types has been established, which are used more or less everywhere and behave in almost the same way. The issue of integer overflow remains problematic, though.
  • Software: In the old days software was written for a specific machine, one CPU architecture and one OS. Today we have common platforms on different kinds of hardware: Linux runs on almost any physical and virtual hardware from mobile phones and routers to supercomputers with practically the same kernel. Java, Ruby, Perl, Scala and some other programming languages are available on a range of platforms and provide some kind of abstract platform. And the web is often a good way to develop applications once for a large range of devices.
  • File system: At least we now have a common understanding of what a file system looks like. There are some OS specific specialties, but the general idea is still the same. It is possible to share file systems between different operating systems by using technologies like Samba.
  • GNU-Tools: The GNU tools (bash, ls, cp, mv, …) have become the standard on Linux boxes and they are way superior to the traditional Unix tools of the same role and name that we can still find in Solaris, for example. You can (and should) install them on any Unix and even, via Cygwin, on MS-Windows.

For many of us, interoperability today means interoperability between Linux (and possibly some other rare Posix/Unix-like systems) and Win32/Win64 (MS-Windows).

Experienced Linux users are used to having the forward slash („/") as file path separator, while the backslash is used for escaping special characters. In the MS-Windows world we often see the backslash („\") as file path separator. That is enforced in the CMD window, because it does not pass through the forward slash. My experience with low level Win32/Win64 libraries shows that they understand both variants equally well. Ruby, Perl, Java, Cygwin and others support the forward slash anyway. So there is hardly ever any need to check the OS and use backslash or forward slash depending on that, other than for cmd/bat scripts, but who wants those for more than five lines? I strongly recommend just using the forward slash also under MS-Windows when writing programs in Java, Perl, Scala, Ruby, … It also makes life easier, because the backslash often has to be doubled and it is sometimes hard to keep track of how many backslashes need to be written, and to read that.
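
Just to illustrate the point in Java: the same path with forward slashes works on Linux and on MS-Windows; the concrete path is only an example.

import java.io.File;

public class ForwardSlashExample {
    public static void main(String[] args) {
        // forward slashes are understood by the Java runtime on MS-Windows as well
        File file = new File("C:/Users/Public/example.txt");
        System.out.println(file.getAbsolutePath() + " exists: " + file.exists());
    }
}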

The line ending is a bit tricky. Linux and Unix use just a „Linefeed" („\n" = Ctrl-J). For MS-DOS and MS-Windows the combination „Carriage Return + Linefeed" („\r\n" = Ctrl-M Ctrl-J) has become the default. Most of today's programs do not care and understand both variants more or less equally well. Only Notepad gets confused with linefeed-only files, but Notepad is a bad choice anyway; better editors (gvim, emacs, ultraedit, scite, …) exist. On the other hand we get problems with the MS-Windows line termination in the case of executable scripts. Usually they contain a first line like „#!/usr/bin/ruby". The OS uses that as a hint on how to execute them, in this case by calling /usr/bin/ruby. If the line ends with Ctrl-M Ctrl-J, then the OS tries to find a program „/usr/bin/ruby^M" (^M = Ctrl-M = „\r"), which of course does not exist and we just get an obscure error message.

It is easy to do the conversion:

$ perl -i~ -p -e 's/\r//g;' script

Or the other direction:

$ perl -i~ -p -e 's/\n/\r\n/g;' textfile

For those who use subversion there are ways to enforce a certain way of line endings. Even git supports this.


Daylight Saving

In many countries of Europe we have to readjust our watches and clocks today, unless they do it automatically.

It is interesting that dealing with this has always been a great challenge for software engineers, and a very high two digit percentage of software that in some way or other deals with time does not handle it correctly. Operating systems and standard software are doing a very good job on this, but specialized software that is written today very often does not properly handle the switch. It usually does not matter, because it just results in effects like rare phone calls being charged for an hour too much, to give an example. And really critical software is properly tested for this, I hope. But the software developers who are able to deal with this properly are far fewer than 100%, and even those who are at least able to accept that they do not know and are prepared to listen to somebody who does are less than 100%. And daylight saving is really a very, very minor, invisible side issue for most software projects. They have a task to perform and usually they do that task well enough… So we will continue to develop software that does not really handle daylight saving properly.

One more reason to stop this changing of clocks twice a year, especially since the saving of energy that was once mentioned as an advantage does not seem to be significant.


Java Properties Files and UTF-8

Java uses a nice pragmatic file format for simple configuration tasks and for internationalization of applications. It is called Java properties file or simply „.properties file". It contains simple key-value pairs. For most configuration tasks this is useful and easy to read and edit. Nested configurations can be expressed by simply using dots („.") as part of the key. This was introduced already in Java 1.0. For internationalization there is a simple way to create properties files with almost the same name, but a language code just before the .properties suffix. The concept is called „resource bundle". Whenever a language specific string is needed, the program just knows a unique key and performs a lookup.

The unpleasant part of this is that these files are, in the style of the 1990s, encoded in ISO-8859-1, which only covers a few languages in western, central and northern Europe. For other languages, as a workaround, a \u followed by the 4-digit hex code can be used to express the UTF-16 encoding of a character, but this is not in any way readable or easy to edit. Usually we want to use UTF-8, or in some cases real UTF-16, without this \u-hack.

A way to deal with this is the native2ascii converter, which can convert UTF-8 or UTF-16 to the format of properties files. By using some .uproperties files, which are UTF-8, and converting them to .properties files using native2ascii as part of the build process, this can be addressed. It is still a hack, but properly done it should not hurt too much, apart from the work it takes to get this working. I would strongly recommend making sure the converted and unconverted files never get mixed up. This is extremely important, because it is not easily detected in the case of UTF-8 with typical central European content, but it creates the ugly errors that we are used to seeing, like „sch�ner Zeichensalat" instead of „schöner Zeichensalat". And we only discover it when the files are already quite messed up, because at least in German the umlaut characters are only a small fraction of the text, but still annoying if messed up. So I would recommend another suffix to make this clear.
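
For plain configuration files (not resource bundles) there is a related workaround that avoids the conversion step: Properties.load(Reader) accepts any encoding, so a UTF-8 file can be read directly. A small sketch, with a made-up file name and key:

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

public class Utf8PropertiesExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(
                Paths.get("config.properties"), StandardCharsets.UTF_8)) {
            props.load(reader); // keys and values are read as UTF-8, no \u-escapes needed
        }
        System.out.println(props.getProperty("greeting"));
    }
}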

The bad thing is that most JVM languages have been kind of „lazy" (which is usually a good thing) and have used some of Java's infrastructure for this, thus inheriting the problem from Java.

Another way to deal with this is to use XML files, which are by default UTF-8 and which can be configured to be UTF-16. With some development work or a search for existing implementations there should be ways to do the internationalization this way.

Typically some process needs to be added, because translators are often non-IT people who use some tool that displays the texts in the original languages and accepts the translations. For good translations, the translator should actually use the software to see the context, but this is another topic for the future. Possibly there needs to be some conversion from the data provided by the translator into XML, .uproperties, .properties or whatever is used. This should be automated by scripts or even by the build process and should merge new translations properly with existing ones.

Anyway, Java 9 will be helpful with this issue. Finally, properties files that are used as resource bundles for internationalization can be UTF-8 in Java 9.

