LF and CR LF in git and svn

Typically we observe that line endings of text files turn out to be „linefeed“ (LF or „\n“ or 0x0a or Ctrl-J) on Linux and Unix-like systems (including MacOSX), while they are „carriage return and linefeed“ (CR LF or „\r\n“ or 0x0d 0x0a or Ctrl-M Ctrl-J) on MS-Windows. See also the little obstacles of interoperability.

This can become annoying, and there is little reason for it. Most tools for editing, compiling and working with these files understand both variants. It does become annoying when diffs are created between files with different line endings, and even more so when scripts turn out to have the „CR LF“ ending and the script interpreter given in the first line is not found, because the system tries to find one that has the otherwise invisible „CR“ in its file name. It also becomes kind of messy when multiple CR-characters are present. CR-characters are annoying even on MS-Windows itself as soon as we use cygwin. Since in most cases the target system is a Linux system anyway and the unnecessary CR-characters just waste space, it is in most cases a good idea to agree on not having them at all in certain kinds of files.

The easiest way is to just set up git or subversion to change files from CR LF to LF on commit. So the repository only contains „clean“ files, at least from that moment onward.

This is accomplished in subversion by applying the following command to the files that we want to keep on „LF“ only:

svn propset svn:eol-style LF <files>

and then committing.
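If this should happen automatically for newly added files, subversion’s client-side auto-props can help. A sketch of what the relevant sections in ~/.subversion/config could look like (the file patterns are just examples):

[miscellany]
enable-auto-props = yes

[auto-props]
*.sh = svn:eol-style=LF
*.java = svn:eol-style=LF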

Git has a way to achieve this using the .gitattributes configuration file, which can contain something like this:

* text=auto eol=lf
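For more fine-grained control the .gitattributes file also allows per-pattern rules. A sketch, with the file patterns again just being examples:

* text=auto
*.sh text eol=lf
*.bat text eol=crlf
*.png binary

Files that were committed with „CR LF“ earlier can then be converted, in newer git versions, with:

$ git add --renormalize .
$ git commit -m "normalize line endings"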

Remark

Of course I recommend using git instead of svn for new projects and considering migrating existing projects from svn to git. But for this special aspect svn provides a slightly more powerful and more intuitive tool than git.

The little obstacles of interoperability

A lot of things in today’s IT landscape have been unified and interoperability is much better than 20 years ago.

Some examples:

  • Networking: Today networking is TCP/IP. Even the physical cables with RJ45/Ethernet and the wireless networks have been standardized and all kinds of devices can use the same networks. In the old days there were tons of mutually incompatible proprietary network technologies like BITNET (IBM), NetBIOS (MS), DECnet (DEC), IPX (Novell), …
  • Character Sets: Today we have Unicode and a few standard encodings. At least for the web and emails we have ways to provide meta information about the actual encoding. This area is not totally free of problems, but on a very good path. In earlier days we had to deal with different EBCDIC encodings or with character sets that only fully supported the English language and a few languages using the same alphabet without additional letters like „ä“, „ö“, „ü“, „ø“, „ñ“, … So we have seen great progress here.
  • Numbers: For floating point numbers and integers a relatively small set of standardized numerical types has been established, that are used more or less everywhere and behave almost in the same way. The issue of integer overflow remains problematic, though.
  • Software: In the old days software was written for a specific machine, one CPU architecture and one OS. Today we have common platforms on different kinds of hardware: Linux runs on almost any physical and virtual hardware from mobile phones and routers to super computers with practically the same kernel. Java, Ruby, Perl, Scala and some other programming languages are available on a range of platforms and provide some kind of abstract platform. And the web is often a good way to develop applications once for a large range of devices.
  • File system: At least we now have a common understanding of what a file system looks like. There are some OS specific specialties, but the general idea is still the same. It is possible to share file systems between different operating systems by using technologies like Samba.
  • GNU-Tools: The GNU-Tools (bash, ls, cp, mv, …) have become the standard on Linux boxes and they are way superior to the traditional Unix tools with the same role and name that we can still find in Solaris, for example. You can (and should) install them on any Unix and even, via cygwin, on MS-Windows.

For many of us, interoperability today means interoperability between Linux (and possibly some other rarer Posix/Unix-like systems) and Win32/Win64 (MS-Windows).

Experienced Linux users are used to having the forward slash („/“) as file path separator, while the backslash is used for escaping special characters. In the MS-Windows world we often see the backslash („\“) as file path separator. That is enforced in the CMD window, because it does not pass the forward slash through. My experience with low level Win32/Win64 libraries shows that they understand both variants equally well. Ruby, Perl, Java, Cygwin and others support the forward slash anyway. So there is hardly ever any need to check the OS and use backslash or forward slash depending on that, other than for cmd/bat scripts, but who wants them for more than five lines? I strongly recommend just using the forward slash also under MS-Windows when writing programs in Java, Perl, Scala, Ruby, … It also makes life easier, because the backslash often has to be doubled and it is sometimes hard to keep track of how many backslashes need to be written and how to read the result.
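As a tiny illustration, a Perl snippet like this works on MS-Windows just as well with forward slashes (C:/temp/demo.txt is just a hypothetical path):

my $path = "C:/temp/demo.txt";   # no need to write "C:\\temp\\demo.txt"
open(my $fh, ">", $path) or die "cannot open $path: $!";
print $fh "hello\n";
close($fh);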

The line ending is a bit tricky. Linux and Unix use just a „linefeed“ („\n“=Ctrl-J). For MS-DOS and MS-Windows the combination „carriage return+linefeed“ („\r\n“=Ctrl-M Ctrl-J) has become the default. Most of today’s programs do not care and understand both variants more or less equally well. Only Notepad gets confused with linefeed-only files, but Notepad is a bad choice anyway. Better editors (gvim, emacs, ultraedit, scite, …) exist. On the other hand we get problems with the MS-Windows line termination in the case of executable scripts. Usually they contain a first line like „#!/usr/bin/ruby“. The OS uses that as a hint on how to execute them, in this case by calling /usr/bin/ruby. If the line ends with Ctrl-M Ctrl-J, then the OS tries to find a program „/usr/bin/ruby^M“ (^M = Ctrl-M = „\r“), which of course does not exist, and we just get an obscure error message.
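Such files are easy to recognize, for example with the GNU tools (script.rb is just a hypothetical file name, and the exact output of file varies between versions):

$ file script.rb
script.rb: Ruby script, ASCII text executable, with CRLF line terminators
$ head -1 script.rb | cat -A
#!/usr/bin/ruby^M$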

It is easy to do the conversion:

$ perl -i~ -p -e 's/\r//g;' script

Or the other direction:

$ perl -i~ -p -e 's/\r?\n/\r\n/g;' textfile
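The same conversion can be applied to a whole directory tree, for example (the pattern *.txt is just an example):

$ find . -name '*.txt' -exec perl -i~ -p -e 's/\r//g;' {} +

Where installed, the tools dos2unix and unix2dos do the same job.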

For those who use subversion there are ways to enforce a certain way of line endings. Even git supports this.

Shell Scripts

Shell scripts can be useful for writing small stuff like combining a few commands into pipes or doing a bit of „back ticking“. Even simple loops and if-conditions are possible. And if we want, it is almost a full programming language. A bit hard to tame, maybe, but quite a lot of stuff is possible. Those who would like to know more about it may look into the startup scripts of typical Java software. Often a .bat and a .sh file are provided, in which the right JVM is located, the classpath and the execution path are put together and maybe some other environment variables are set. In the end the .sh file is quite a long and unreadable horror story, and the .bat file is even worse, because the cmd language is just a lot more primitive and less capable and requires even worse hacks.

There are ways to make shell scripts more readable, which by themselves are truly admirable, but I think that route is wrong. We can learn all the shell functionalities and understand bit by bit even more complex shell scripts, but I think for non-trivial shell scripts it is time to switch to real programming languages instead. Scripting languages, of course, for example Perl, Ruby, Python or Lua. We may still execute „shell commands“, which are actually programs in /bin, /usr/bin or /usr/local/bin, where they are more powerful and more concise than writing purely in that programming language. But the magic for putting together a classpath is much cleaner in Perl than in pure bash (or worse, cmd/bat), as the sketch below shows.
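A minimal sketch in Perl of such classpath magic, assuming the jar files live in a hypothetical lib directory:

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my $libdir = "lib";                        # hypothetical directory containing the jars
my @jars;
find(sub { push @jars, $File::Find::name if /\.jar$/; }, $libdir);
my $sep = $^O eq 'MSWin32' ? ';' : ':';    # the classpath separator differs between OSes
print "java -cp ", join($sep, @jars), " Main\n";   # Main is a placeholder class name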

This is of course another example of the golden hammer antipattern. We should balance our tool box: not add specific tools that make some minor task a bit easier at the expense of supporting one more tool, but keep a broad range of tools that in conjunction are very powerful. For example I would retire awk and sed and use either Perl or Ruby instead. We only have to keep them around because a lot of system tools that are just there still rely on them, but for a team I would deprecate awk and sed for new scripts or even for enhancing existing scripts. Bash would be OK only for small scripts; one can define a maximum line count or complexity, but for very short scripts I think bash is a legitimate tool.

Switch to Perl, Perl6, Ruby or… when you encounter any of the following:

  • The script is getting kind of long (>= 100 lines)
  • You find yourself modularizing it with functions
  • You find yourself using non-trivial perl, ruby, sed or awk within the script, for example regex-stuff
  • The script needs interaction
  • The script needs arrays, numbers or other types
  • More than one or two trivial if-statements or loop-statements are needed
  • Database access is done by the script (SQL or NoSQL)
  • String encoding becomes relevant
  • Quoting levels become an issue

This post was inspired by a similar post on the Isoblog by Kris. And Google's Shell Style Guide is quite good, especially in limiting the area where shell scripts are acceptable.

tmp-directories

On all computers we have some concept of a tmp-directory. Typically it is /tmp on Linux and Unix systems and something like C:/TEMP plus some subdirectory in each user's home directory on MS-Windows.

In terms of software development this tends to be some dark area. Programs like to create some files there, store some stuff and then maybe remove it, maybe not. And we do not know for sure when we can delete these files, and we actually do not want to care. Linux and Unix systems sometimes clear their tmp-directories on reboot, while providing an additional /var/tmp-directory that survives reboot. Sometimes the tmp-directory is backed by shared memory, so it is kind of a RAM disk, but usually stored in the swap partition (or swap file) of our OS. This cleanup on reboot does not help much, though, when we want to keep our system running for a long time.

These days most computers are somehow dedicated. Either they are virtual computers that run exactly one server application or a set of closely related server applications, or it is a mobile phone, tablet or desktop computer that is typically used by only one person. But still we should not forget that the system should allow being used by several applications and by several users. So sharing the same tmp-directory for everyone can cause some conflicts. The Unix and Linux family has a way of setting file permissions for the tmp-directory itself and for its entries that stops users from reading, changing or deleting each other's files, but there is still some contention for the namespace of this one directory, which is usually quite elegantly bypassed by using smart naming or by having the OS or a library create unique names, as the sketch below shows. I would not consider it ideal, though. On the other hand, sometimes we might actually want to use the tmp-directory to share something between users or between processes, where this one tmp-directory might come in handy.
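A minimal sketch in Perl of the approach with unique names, using the core module File::Temp:

use strict;
use warnings;
use File::Temp qw(tempfile);

# create and open a uniquely named file in the system tmp-directory;
# UNLINK removes it automatically when the program exits
my ($fh, $filename) = tempfile("myapp-XXXXXX", TMPDIR => 1, UNLINK => 1);
print $fh "transient data\n";
print "using $filename\n";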

The approach of having a separate tmp-directory in each home directory and in a subdirectory of each server application's installation is tempting, because it separates namespaces, makes it possible to disallow reading the directory entries by others and does not mix totally unrelated stuff in one directory. But there is a drawback to this. We usually have different storage technologies. Some are optimized for reading, maybe even avoiding redundancy, because the system can be reinstalled. Some see sporadic writing, some are strictly read-only. And some see a lot of reading and writing. Some data is transient, some can be easily restored and some data needs to be stored redundantly to be safe. Depending on that we should aim to put it on flash disks or on a different RAID setup of hard disks. This is getting harder with virtualization, but eventually we can get to the point where virtual computers have disks with different characteristics that are mapped to the appropriate hardware.

So there is no real good answer to this question, but I think that a tmp-directory that is separate from the home directory, but specific to each user, would be the best approach. Will this change? Probably not so easily. But maybe in some distant future.

MS-Windows-Encodings with CMD: Bug or Feature?

Whoever is working with MS-Windows should know these black windows with CMD running in them, even though they are not really popular. The Unix and Linux guys hate them, because they are really primitive compared to their shells. Windows guys like to work graphically, or they prefer PowerShell or bash with cygwin. Linux and Unix have the equivalent of these windows, but usually they are white. Since the colors can be configured in any way on both systems, this is of no relevance.

NT-based MS-Windows systems (NT 3.x, 4.x, 2000, XP, Vista, 7, 8, 10) have several subsystems in which programs run, for example Win64, Win32 (or Wow64 on 64-bit systems), Win16, cygwin (if installed), DOS, … Because programs for the DOS subsystem are typically started in a CMD window, and because some of the DOS commands have equally named and similarly behaving counterparts in the CMD window, the CMD window is sometimes called DOS window, which is just incorrect. Actually this black window comes into existence in many situations: whenever a program is started that has input or output (stdin, stdout, stderr), a black window is provided around it, if no redirection is in place. This applies for CMD. Under Linux (and Unix) with X11 it is the other way round: you start the program that provides the window, and it automatically starts the default shell within that window, unless something else is stated.

Now I recommend an experiment. You just need an MS-Windows installation with any graphical editor like emacs, gvim, ultraedit, textpad, scite, or even notepad. And a cmd-window.

  • Please type these commands, do not use copy/paste.
  • In the cmd-window cd into a directory you may write in.
  • echo "xäöüx" > filec.txt. Yes, there are ways to type these letters even with an American keyboard. 🙂
  • Open filec.txt with a graphical editor. How do the umlauts look?
  • Use the editor to create a second file fileg.txt in the same directory with contents like this: yäöüy.
  • View it in CMD:
  • type fileg.txt
  • How do the umlauts look?

It is a feature or a bug that all common MS-Windows versions put the umlauts at different code positions than the graphical editors do. If you know how to fix this, let me know.

What has happened? In the early 80es MS-DOS came into existence. At that time standards for character encoding were not very good. Only ASCII or ISO-646-IRV existed, which was at least a big step ahead of EBCDIC. But this standardized only the lower 128 characters (7 bit) and lacked characters for almost any language other than English. Attempts were made to squeeze a small number of these additional letters into the positions of less important characters like „@“, „[“, „~“, „$“ etc. And software vendors started to make use of the upper 128 characters. Commodore, Atari, MS-DOS, Apple, NeXT, TeX and „any“ software came up with a specific way to do that, often specific to a language region.

These solutions were incompatible with each other between different software systems, sometimes even between versions or language versions of the same software. Remember that at that time networks were unusual, and where they existed, they were proprietary to the operating system, with bridge solutions being extremely difficult to implement. Even floppy disks (the three-dimensional incarnations of the save button) had proprietary formats. So it did not hurt so much to have incompatible encodings.

But relatively early, X11, which became the typical graphical system for Unix and later Linux, started to use standard encodings like the ISO-8859-x family, UTF-8 and UTF-16. Linux was already on ISO-8859-1 in version 0.99 in the early 90es and never tried to invent its own character encoding. Thank god for that…

Today all relevant systems have moved to the Unicode standard and standardized encodings like ISO-8859-x, UTF-8, UTF-16, … But MS-Windows has done that only partially. The graphical system is using modern encodings or at least Cp1252, which is a decent approximation. But the text based system with the black window, in which CMD is running, is still using encodings from the MS-DOS times more than 30 years ago, like Cp850. This results in a break within the system, which is at least very annoying when working with cygwin or CMD windows.

Those who have a lot of courage can change this in the registry. Just change the entries for OEMCP and OEMHAL in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage simultaneously. One of them is for input, the other one for output. So if you change only one, you will even get inconsistencies within the window… Sleep well with these nightmares. 🙂
Research on the internet has revealed that some have tried to change to UTF-8 (CP65001) and got a system that could not even boot as a result. Try it with a copy of a virtual system without too much risk, if you like… I have not verified this, so maybe it is just bad rumors to create damage for a great company that has brought us this interesting zoo of encodings within the same system. But anyway, try it at your own risk.
Maybe something like chcp and chhal can work as well. I have not tried that either…
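Unlike the registry change, chcp only affects the current console session, so trying it is fairly harmless. A sketch of what this could look like (the output text depends on the system language):

chcp
Active code page: 850
chcp 1252
Active code page: 1252

After switching to 1252 the console should use the same encoding as the graphical system, at least for this session and provided a TrueType console font is selected.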

It is up to you if you consider this whole issue a bug or a feature.
