Unicode and C

It is a common practice in C to use arrays of char as strings. The 0 is used as end marker.

The whole thing was created like that in the 1970s and at that time it was kind of cool to get away with one less language feature and to express it in terms of others instead. And people did not think enough about the necessity to express more than ISO 646 IRV (commonly called ASCII) as string content.

This extended out of the box to 8 bit character sets like ISO 8859-1 or KOI-8, that are identical to ISO 646 in the lower 128 characters and contain an extension in the upper 128 characters. But fortunately we have moved ahead and now Unicode with its encodings UTF-8, UTF-16 and UTF-32.

How can we deal with this in C?

UTF-8 just works out of the box, because the byte 0 is only used to encode the code point U+0000. So the null termination can be kept as it is and a lot of functionality remains valid. Some issues arise, because in UTF-8 things like finding the logical length of a string, not its memory consumption or finding the nth code point, not the nth byte, require UTF-8-logic to be applied and to parse the whole string at least from the beginning to the desired position or the usage of an indexing facility. So a lot of non-trivial string functionality of the standard library will just not be as easy as people thought it would be in the 1970s and subsequently not work as needed. Libraries for better UTF-8-support in C can be found, leaving the „native“ C strings with UTF-8 content only for usage in interfaces that require them. I have not yet explored such libraries, but it would be interesting to find out how powerful and useful they are.

At the time when Unicode came out, it seemed to be sufficient to have 16 bits per character instead of 8. Java was built on this assumption. C added a wchar_t to allow for this and just required it to be „long enough“. So Linux uses 32 bits and MS-Windows 16 bits. This is not too bad, because programming in C for MS-Windows and for Linux is anyway quite different, unless we abstract the differences into a library, which would then also include a common string definition and string handling functionality. While the Linux wchar_t is sufficient, it really wastes a lot of memory, which is often undesirable, if we go the extra effort to program in C in order to gain performance. The Windows-wchar_t is „kind of sufficient“, as are the Java-Strings, because we can really do a lot with assuming that Unicode is only 16 bit or with UTF-16 and ignoring the complexities of that, that are in principal the same as for UTF-8, but can be ignored with less disadvantages most of the time. The good news is, that wchar_t is well supported by standard library functions.

Another way is to use char16_t and char32_t, that have a clear definition of their length, but much less library support.

Probably these facilities are sufficient for software whose string handling is relatively trivial. For more ambitious string handling in terms of functionality and performance, it will be necessary to find third party libraries or to write them.

Links

Share Button

Tips and Tricks: Typing Unicode

I found this on netzwolf.info:

You can enter arbitrary Unicode characters (more precisely code points) in X11 on Linux, if you know their Hex-Code:

  • Press Shift-Ctrl (keep them pressed)
  • Press also the letter u
  • Release the Ctrl-Key
  • Release the u-key
  • Keep the Shift-key pressed
  • Enter the Hex-Code of the Code point with the number of hex digits needed
  • Release the shift key

Let’s try with the Cyrillic Alphabet, more precisely with its Unicode Block:

  • Ѐ U+0400
  • Ё U+0401
  • Ђ U+0402
  • Љ U+0409
  • А U+0410
  • Б U+0411
  • В U+0412
  • Г U+0413
  • Д U+0414

Not very fast, but it works quite well for a few characters. Just open the table in one window and use this key sequence in another one.

For frequently used characters it is a good idea to remap the keyboard with xmodmap.

Share Button

Tip: how to make Thunderbird use ISO Date Format

After an upgrade, my Thnderbird started to use a weird US-date format and forgot my setting for this. I did not find any setting either. But this seems to work fine:

Create a shell script

#!/bin/bash

export LC_TIME=sv_SE.utf8
export LC_DATE=sv_SE.utf8

exec /usr/bin/thunderbird

and make sure that this gets called when thunderbird is started instead of directly /usr/bin/thunderbird.

Share Button

Logging

Deutsch

Software often contains a logging functionality. Usually entries one or sometimes multiple lines are appended to a file, written to syslog or to stdout, from where they are redirected into a file. They are telling us something about what the software is doing. Usually we can ignore all of it, but as soon as something with „ERROR“ or worse and more visible stack traces can be found, we should investigate this. Unfortunately software is often not so good, which can be due to libraries, frameworks or our own code. Then stack traces and errors are so common that it is hard to look into or to find the ones that are really worth looking into. Or there is simply no complete process in place to watch the log files. Sometimes the error shows up much later than it actually occurred and stack traces do not really lead us to the right spot. More often than we think logging actually introduces runtime errors, that were otherwise not present. This is related to a more general concept, which is called observer effect, where logging actually changes the business logic.

It is nice that log files keep to some format. Usually they start with a time stamp in ISO-format, often to the millisecond. Please add trailing zeros to always have 3 digits after the decimal point in this case. It is preferable to use UTC, but people tend to stick to local date and time zones, including the issues that come with switching to and from daylight saving time. Usually we have several processes or threads that run simultaneously. This can result in a wild mix of logging entries. As long as even multiline entries stay together and as long as beginning and end of one multiline entry can easily be recognized, this can be dealt with. Tools like splunk or simple Perl, Ruby or Python scripts can help us to follow threads separately. We could actually have separate logs for each thread in the first place, but this is not a common practice and it might hit OS-limitations on the number of open files, if we have many threads or even thousands of actors as in Erlang or Akka. Keeping log entries together can be achieved by using an atomic write, like the write system call in Linux and other Posix systems. Another way is to queue the log entries and to have a logger thread that processes the queue.

Overall this area has become very complex and hard to tame. In the Java world there used to be log4j with a configuration file that was a simple properties file, at least in the earlier version. This was so good that other languages copied it and created some log4X. Later the config file was replaced by XML and more logging frame works were added. Of course quite a lot of them just for the purpose of abstracting from the large zoo of logging frameworks and providing a unique interface for all of them. So the result was, that there was one more to deal with.

It is a good question, how much logic for handling of log files do we really want to see in our software. Does the software have to know, into which file it should log or how to do log rotation? If a configuration determines this, but the configuration is compiled into the jar file, it does have to know… We can keep our code a bit cleaner by relying on program functionality without code, but this still keeps it as part of the software.

Log files have to please the system administrator or whoever replaced them in a pure devops shop. And in the end developers will have to be able to work with the information provided by the logs to find issues in the code or to explain what is happening, if the system administrator cannot resolve an issue by himself. Should this system administrator have to deal with a different special complex setup for the logging for each software he is running? Or should it be necessary to call for developer support to get a new version of the software with just another log setting, because the configurations are hard coded in the deployment artifacts? Interesting is also, what happens when we use PAAS, where we have application server, database etc., but the software can easily move to another server, which might result in losing the logs. Moving logs to another server or logging across the network is expensive, maybe more expensive than the rest of this infrastructure.

Is it maybe a good idea to just log to stdout, maintaining a decent format and to run the software in such a way that stdout is piped into a log manager? This can be the same for all software and there is one way to configure it. The same means not only the same for all the java programs, but actually the same for all programs in all languages that comply to a minimal standard. This could be achieved using named pipes in conjunction with any hard coded log file that the software wants to use. But this is a dangerous path unless we really know what the software is doing with its log files. Just think of what weird errors might happen if the software tries to apply log rotation to the named pipe by renaming, deleting, creating new files and so on. A common trick to stop software from logging into a place where we do not want this is to create a directory with the name of the file that the software usually uses and to write protect this directory and its parent directory for the software. Please find out how to do it in detail, depending on your environment.

What about software, that is a filter by itself, so its main functionality is to actually write useful data to stdout? Usually smaller programs and scripts work like this. Often they do not need to log and often they are well tested relyable parts of our software installation. Where are the log files of cp, ls, rm, mv, grep, sort, cat, less,…? Yes, they do tend to write to stderr, if real errors occur. Where needed, programs can turn on logging with a log file provided on the command line, which is also a quite operations friendly approach. Named pipes can help here.

And we had a good logging framework in place for many years. It was called syslog and it is still around, at least on Linux.

A last thought: We spend really a lot of effort to get well performing software, using multiple processes, threads or even clusters. And then we forget about the fact that logging might become the bottle neck.

Share Button

LF and CR LF in git and svn

Typically we observe that line endings of text file turn out to be „linefeed“ (LF or „\n“ or 0x0a or Ctrl-J) on Linux and Unix-like systems (including MacOSX), while they are „carriage return and linefeed“ (CR LF or „\r\n“ or 0x0d 0x0a or Ctrl-M Ctrl-J) on MS-Windows. See the little obstacles of interoperability.

This can become annoying and there is little reason for this. Most tools for editing, compiling and working with these files just understand both variants. It does become annoying when diffs are created between different files and even more so when scripts are turning out to have the „CR LF“ ending and the script interpreter given in the first line is not found, because the system tries to find one that has the otherwise invisible „CR“ in its file name. It also becomes kind of messy, when multiple CR-characters are present. CR-characters are annoying even on MS-Windows itself as soon as we use cygwin. Since in most cases the target system is a Linux system anyway and we just waste space with the unnecessary CR-character, it is actually in most cases a good idea to agree on not having these CR-characters at all in certain kinds of files.

The easiest way is to just set up git or subversion to change files from CR LF to LF on commit. So the repository only contains „clean“ files, at least from that moment onward.

This is accomplished in subversion by applying the following command on the files that we want to be kept on „LF“ only:

svn pset svn:eol LF

and then committing.

Git has a way to achieve this using the .gitattributes configuration file. So the .gitattributes file can contain something like this:

* text=auto eol=lf

Remark

Of course I recommend to use git instead of svn for new projects and to consider migrating existing projects from svn to git. But for this special aspect svn provides a slightly more powerful and more intuitive tool than git.

Links

Share Button

The little obstacles of interoperability

Deutsch

A lot of things in today’s IT landscape have been unified and interoperability is much better than 20 years ago.

Some examples:

  • Networking: Today networking is TCP/IP. Even the physical cables with RJ45/Ethernet and the wireless networks have been standardized and all kinds of devices can use the same networks. In the old days there were tons of mutually incompatible proprietary network technologies like BitNet (IBM), NetBios (MS), DecNet (DEC), IPX (Novell),….
  • Character Sets: Today we have Unicode and a few standard encodings. At least for Web and Emails we have ways to provide meta information about the actual encoding. This area is not totally free of problems, but on a very good way. In earlier days we had to deal with different EBCDIC-encodings or with character sets that only fully supported English langauge and a few languages using the same alphabet without additional letters like „ä“, „ö“, „ü“, „ø“, „ñ“,… So there we have encountered a great progress.
  • Numbers: For floating point numbers and integers a relatively small set of standardized numerical types has been established, that are used more or less everywhere and behave almost in the same way. The issue of integer overflow remains problematic, though.
  • Software: In the old days software was written for a specific machine, one CPU architecture and one OS. Today we have common platforms on different kind of hardware: Linux runs on almost any physical and virtual hardware from mobile phones and routers to super computers with practically the same kernel. Java, Ruby, Perl, Scala and some other programming languages are available on a range of platforms and provide some kind of abstract platform. And the web is often a good way to develop applications once for a large range of devices.
  • File system: At least we now have a common understanding what a file system looks like. There are some OS specific specialties, but the general idea is still the same. It is possible to share file systems between different operating systems by using technologies like Samba.
  • GNU-Tools: The GNU-Tools (bash, ls, cp, mv,……..) have become the standard on Linux boxes and they are way superior to the traditional Unix tools of the same role and name, that we can still find in Solaris, for example. You can (and should) install them on any Unix and even via cygwin on MS-Windows.

Interoperability is today for many of us interoperability between Linux (and possibly some other rare Posix/Unix-like systems) and Win32/Win64 (MS-Windows).

Experienced Linux users are used to having the forward slash („/“) as file path separator. The backslash is used for escaping special characters. In the MS-Windows-world we often see the backslash („\“) as file path separator. That is enforced in the CMD-window, because it does not pass through the forward slash. My experience with low level Win32/Win64 libraries shows that they understand both variants equally well. Anyway do Ruby, Perl, Java, Cygwin and others support the forward slash. So there is hardly ever any need to check the OS and use backslash or forward slash depending on that, other than for cmd/bat-scripts, but who wants them for more then five lines? I strongly recommend just using the forward slash also under MS-Windows when writing programs in Java, Perl, Scala, Ruby, … It makes life also easier, because the backslash often has to be doubled and it is sometimes hard to keep track of how many backslashes need to be written and to read it.

The line ending is a bit tricky. Linux and Unix use just a „Linefeed“ („\n“=Ctrl-J). For MS-DOS and MS-Windows the combination „Carriage-Return+Linefeed“ („\r\n“=Ctrl-M Ctrl-J) has become the default. Most of today’s programs do not care and understand both variants more or less equally well. Only Notepad does get confused with Linefeed-only files, but notepad is a bad choice anyway. Better Editors (gvim, emacs, ultraedit, scite, …) exist. On the other hand we get problems with the MS-Windows-line-termination in the case of executable scripts. Usually the contain a first line like „#/usr/bin/ruby“. The OS uses that as hint on how to execute them, in this case by calling /usr/bin/ruby. If the line ends with Ctrl-M Ctrl-J, then the OS tries to find a program „/usr/bin/ruby^M“ (^M = Ctrl-M = „\r“), which of course does not exist and we just get an obscure error message.

It is easy to do the conversion:

$ perl -i~ -p -e ’s/\r//g;‘ script

Or the other direction:

$ perl -i~ -p -e ’s/\n/\r\n/g;‘ textfile

For those who use subversion there are ways to enforce a certain way of line endings. Even git supports this.

Share Button

Encryption of Disks

Today we should use encryption of disks for many situations.

I recommend at least encrypting disks of portable computers that contain the home directory and portable USB disks. They can easily get stolen or lost and it is better if the thief does not have easy access to the content. We should even consider encrypting swap partitions.

There are many ways to do this on different operating systems and actually I only know how to do it for Linux. A possible approach for Windows is to run MS-Windows in a virtual box inside Linux and just profit from the Linux-based encryption. That is what I do, but I do not use my MS-Windows very much. About Apple computers I have no knowledge, please go to the site of somebody else for encryption of disks for them. I know that there is an option available for this, but I do not know how to use it and how good it is.

I prefer to rely on open source solutions for security related issues, because it is harder (but not impossible) to put in malicious components into the software and it is easier to find and to fix them. This is a general point that serious security specialists tend to make that it is better to rely on good and well maintained open source software for security than on closed software of which we do not know the wanted and unwanted backdoors and vulnerabilities.

The way it works in Linux is that we encrypt a disk partition. It can then be accessed after providing a password. It is possible to provide several alternative passwords. The tools to use are dm-crypt, a kernel module, LUKS, cryptsetup, and cryptmount.

It can be done like this (example session) for an external drive that appears as /dev/sdc. Please be extremely careful not to erase any data that you still need or hide data behind a password that you do not know…


# check the partitions
$ fdisk -l /dev/sdc

# encrypt the partition and provide a password:
$ cryptsetup luksFormat /dev/sdc1

# access the partition
$ cryptsetup luksOpen /dev/sdc1 encrypted-external-drive-1

# format it with whatever file system you want to use
$ mkfs.ext4 /dev/mapper/sdc1 encrypted-external-drive-1
# or
$ mkfs.btrfs /dev/mapper/sdc1 encrypted-external-drive-1
# or whatever you prefer..

Now each time the disk is mounted, the password needs to be provided.

The issue which file system is best might be worth writing about in the future, it is not in this article.

Links

Share Button

Microsoft stops Windows Mobile

It seems that Microsoft is ending the development of Windows Mobile. After having tried with some effort to get into the mobile operating system business, it seems that the market share is now less than 1%, with Android being >85% and iOS close to 15% according to IDC. Because the market is so large, it would be possible to run a profitable business even with a low market share, but this is probably hard for a big company and it seems more attractive to concentrate on other areas. It is good to have some good mobile apps available for the platform and that is an area where Android and iOS shine, while doing a third app for Windows phone is a bit unusual. A funny detail is that a retired guy, who was once the founder of Microsoft and who is still associated with that company by some people uses an Android phone for himself. But he is retired, so that is no longer too important.

The story is a bit weird, though… In 2010 Stephen Elop became the boss of Nokia. At this time Nokia had a market share of around 50% in mobile phones covering a wide range from tiny „non-smart“ phone to high end smart phones. They were mostly using Symbian as OS, but the transition to Maemo and MeeGo, like Android Linux variants, was on a good way and it would have been worth seeing where this might go. At this time it was already quite clear that MS Windows phone/Windows mobile was a failure. All efforts concerning Linux-based systems were stopped and Symbian was announced as being a dead end and the strategy was to move to MS-Windows phone only. Most likely this was done because Stephen Elop had more loyalty to Microsoft than to the company that he was running. To my knowledge this never became a case for the courts, but one might assume some criminal energy behind this. And some stupidity of the stock holders, who selected this person as CEO. Some time later the mobile phone branch of Nokia went down and was bought for very little money by Microsoft. After that acquisition it was further downsized and will probably go to zero soon, because Microsoft does not have interest to develop new hardware.

As it seems, there are mobile phones with the brand „Nokia“ again. HMD, a company in Finland designs them, pays to the company Nokia some money to use their brand. And of course they use Android.

It seems that the issue of having to write a mobile app twice because of Android and iOS has become a bit less pressing. Firstly, currently with Android and iOS, there are two dominating platforms that kind of allow neglecting all the others. And besides mobile web applications that can pretty much behave as rich clients through modern JavaScript frameworks, there are „fake apps“ that are actually pretty much browsers without a visible URL bar and programmed to only surf their home site. And even native apps are now increasingly being developed in Swift for iOS and Kotlin for Android, which seem to be quite similar, at least at a conceptional level. There are very interesting mobile operating systems like Sailfish OS, that are worth looking into, but for a market share of about 85% supporting Android only is sufficient and for a market share of 99% Android and iOS together are sufficient in many cases.

Share Button

Shell Scripts

Shell scripts can be useful for writing small stuff like combining a few commands to pipes or doing a bit of „back ticking“. Even simple loops and if-conditions are possible. And if we want, it is almost a full programming language. A bit hard to tame, maybe, but quite a lot of stuff is possible. Those who like to know more about it may look into startup scripts of typical java software. Often a .bat and a .sh file are provided, where the right jvm is found, the classpath and the execution path and maybe some other environment are put together. In the end the .sh-file is quite a long and unreadable horror story and the .bat file is even much worse, because the cmd-language is just a lot more primitive and less capable and requires even worse hacks.

There are ways to make shell scripts more readable, which by themselves are truly admirable, but I think that route is wrong. We can learn all the Shell functionalities and understand bit by bit even more complex shell scripts, but I think for non trivial shell scripts it is time to switch to real programming languages instead. Scripting languages, of course, for example Perl, Ruby, Python or Lua. We may still execute „shell commands“, that are actually programs in /bin, /usr/bin or /usr/local/bin where they are powerful and more concise than writing purely in that programming language. But a magic for putting together a classpath is much cleaner in the Perl programming language than in pure bash (or worse cmd/bat).

This is of course another example of the Golden Hammer anti pattern. We should balance our tool box. Not add specific tools for making any minor task a bit easier on the expense of supporting one more tool, but keep a broad range of tools that in conjunction are very powerful. For example I would retire awk and sed and use either Perl or Ruby instead. We only have to keep them around because a lot of system tools that are just there still rely on them, but for a team I would deprecate awk and sed for new scripts or even for enhancing existing scripts. Bash would be ok only for small scripts, you can invent a line number or a maximum complexity, but for very short scripts I think bash is a legitimate tool.

Switch to Perl, Perl6, Ruby or… when you encounter any of the following:

  • The scripts is getting kind of long (>= 100 lines)
  • You find yourself modularizing it with functions
  • You find yourself using non trivial perl, ruby, sed or awk within the script, for example regex-stuff
  • The script need interaction
  • The scripts needs arrays, numbers or other types
  • More than one or two trivial if-statements or loop-statements are needed
  • Database access is done by the script (SQL or NoSQL)
  • String encoding becomes relevant
  • Quoting levels become an issue

This post was inspired by a similar post on the Isoblog by Kris. And the Shell Style Guide of Google is quite good especially in limiting the area where shell scripts are acceptable.

Share Button

Bring your own Device

This issue is quite controversial and it applies to laptops, tablets and smart phones.
Usually the „bringing“ is not really an issue, you can have anything in your bags and connect it via the mobile phone network as long as it does not absorb the working time.
But usually this implies a bit more.
There are some advantages in having company emails and calendar on a smart phone. This is convenient and useful. But there are some security concerns that should be taken serious. How is the calendar and the emails accessed? How confidential are the emails? Do they pass through servers that we do not trust? What happens, if a phone gets lost?
This is an area, where security concerns are often not taken too serious, because it is cool for top manager to have such devices. And they can just override any worries and concerns, if they like.
This can be compensated by being more restrictive in other areas. 😉
Anyway, the questions should be answered. In addition, the personal preferences for a certain type of phone are very strong. So the phone provided by the company might not be the one that the employee prefers, so there is a big desire to use the own phone or one that is similar to the own phone, which depends on the question of who pays the bills, how much of private telephony is allowed on the company phone and if there are work related calls to abusive times.
Generally the desirable path is to accept this and to find ways to make this secure.

The other issue is about the computer we work with. For some kind of jobs it is clear that the computer of the company is used, for example when selling railroad tickets or working in the post office or in a bank serving customers.

It shows that more creative people and more IT-oriented people like to have more control on the computer they work with.
We like to have hardware that is powerful enough to do the job. We like to be able to install software that helps us do our job. We like to use the OS and the software that we are skilled with. Sometimes it is already useful to be able to install this on the company computer or in a virtual computer within the company computer. Does the company allow this? It should, with some reasonable guidelines.

Some companies allow their employees to use their own laptops instead. They might give some money to pay for this and expect a certain level of equipment for that. Or just allow the employees to buy a laptop with their own money and use it instead of the company computer. They will do so and happily spend the money, even though it is wrong and the company should pay it. But the pain of spending some of the own money is for many people less than the pain of having to use crappy company equipment.

This rises the question of the network drive Q:, the outlook, MS-Word, MS-Excel,…
Actually this is not so much an issue, at least for the group we are talking here. Or becoming less of an issue.
Drive Q: can quite well be accessed from Linux, if the company policies allow it. But actually modern working patterns do not need this any more.
We can use a Wiki, like MediaWiki or Confluence for documentation. This is actually a bit better in many cases and I would see a trend in this direction, at least for IT-oriented teams.
Office-Formats and Email are more and more providing Web-Applications that can be used to work with them on Linux, for example. And MS-Office is already available for Linux, at least for Android, which is a Linux Variant. It might or might not come for Desktop Linux. LibreOffice is most of the time a useful replacement. Maybe better, maybe almost as good, depending on perspective… And there is always the possibility to have a virtual computer running MS-Windows for the absolutely mandatory MS-Windows-programs, if they actually exist. Such an image could be provided and maintained by the company instead of a company computer.

It is better to let the people work. To allow them to use useful tools. To pay them for bringing their own laptop or to allow them to install what they want on the company laptop. I have seen people who quit their job because of issues like this. The whole expensive MS-Windows-oriented universe that has been built in companies for a lot of money proves to be obsolete in some areas. A Wiki, a source code repository, … these things can be accessed over the internet using ssh or https. They can be hosted by third parties, if we trust the third party. Or they can be hosted by the company itself. Some companies work with distributed teams…

It is of course important to figure out a good security policy that allows working with „own“ devices and still provide a sufficient level of security. Maybe we just have to get used to other ways of working and to learn how to solve the problems that they bring us. In the end of the day we will see which companies are more successful. It depends on many factors, but the ability to provide a innovative and powerful IT and to have good people working there and actually getting stuff done is often an important factor.

Share Button