Why am I learning Python

To be honest, what can be done with Python can also be done with Perl or Ruby. I am not working in areas where library support is much better for Python than for Perl or Ruby. I like Perl and Ruby very much, and I am somewhat skeptical about Python. But there are some points that make it worth knowing Python in addition to Perl and Ruby, not instead of them.

I strongly recommend using a real programming language like Perl, Ruby or Python (and you can add some more) instead of Bash scripts once a certain complexity is exceeded. Try reading the pure Bash scripts that are used to start Maven, Tomcat or other useful software. Often there is a CMD script as well, and that is the real pure horror. Python serves this purpose well enough, as do the other two, of course.

It is always good to learn a new language once in a while, because new languages extend our horizon and even help us to get better with the languages we prefer. And why not challenge the preferences…

There is a good case for allowing a toolbox of languages, not „only Java“ or „only C#“ or „only C“ or even „only Perl“, whatever you like… Combining several languages into a useful toolbox is the right way to go. Such a toolbox could contain A and B, where A ∈ { C, C++, Java, C#, F#, Scala, Clojure, … } and B ∈ { Perl, Perl 6, Ruby, Lua, Python, … }. Usually it is a good idea to make it slightly larger, but it is also good to find a consensus on which set of languages to concentrate on.

I would for example discourage using sed and awk, because they can quite easily be replaced by Perl, and I would limit Bash to very trivial scripts. There are some cases in which an awk or sed script is a bit shorter than the same thing in Perl or Ruby, but this does not justify maintaining the extra knowledge, while having Perl on top of Java justifies it easily. So the toolbox should be big enough to cover everything, but it does not have to be too redundant, and there can be recommendations about which tool to use for a certain class of purposes, as long as these recommendations are reasonable. This makes it easier to maintain each other's code.

Now there are many projects where the spot of B is taken by Python. So in order to be a good team player, it might be useful to be able to work with the Python scripts, write in this language and contribute, instead of spending too much time talking about why Perl or Ruby or Lua or whatever is better. Which it might be. Or which might be more a matter of taste. Here is what big sites are using as A, B, C, …

Now out of these scripting languages, Python is for sure a successful contender. This results in good libraries, but also in a higher likelihood of Python occupying spot B.

Now we have tools like Jenkins, Kubernetes, Docker, cloud computing platforms, Spark and simply certain Linux distributions, each of which might come with its own preferred set of scripting languages that are well supported for performing certain tasks. Dealing with them can be delegated to one or two people in the team or kept to a minimum, but this might become a factor of increasing importance. It might force us to have multiple „Bs“ or multiple „As“.

And there are certain areas where Python is simply strong and has become the language of choice. It seems to have become the successor of Fortran for many if not most areas of numerical calculation, even though there will probably always be a niche for powerful compiled languages like Fortran and C when the ultimate performance is needed. But typically the library is written in C with Python bindings, so we get most of the performance as well. Data science also seems to mostly opt for Python as the general purpose language besides R, SQL and SAS. Even bioinformatics, which was a stronghold of Perl for many years, now prefers Python… Yes, it does hurt someone who likes Perl, but it is true… So to be able to work in many interesting areas, it is useful to know some Python. So I started learning it. I am using a Russian translation of the book Programming Python.

I might write a bit more about the language, once I have some more experience with it.

Meaningless Whitespace in Textfiles

We use different file formats that are more or less tolerant of certain changes. The best-known example is white space in text files.

In some programming languages the kind of white space used (space, newline, carriage return, form feed, tabulator, vertical tab) has no meaning, as long as some white space is present. Examples of this are Java, Perl, Lisp or C. White space that is part of string content is always significant, but white space that separates elements of the program can be any combination of one or more of the white space characters in the lower 128 positions (ISO-646, often referred to as ASCII or 7bit ASCII). It is of course recommended to have a coding standard which gives some guidelines on when to use newlines, whether tabs or spaces are preferred (please spaces) and how to indent. But this is just about human readability, and the compiler does not really care. Line numbers are somewhat meaningful in compiler and runtime error messages and in stack traces, so putting everything into one line would do harm beyond readability, but there is a wide range of layouts that are all correct and equivalent. Btw. many teams limit lines to 80 characters, which was a valid choice 30 years ago, when some terminals were only 80 characters wide and 132 character wide terminals were just coming up. But as a hard limit it is a joke today, because not many of us would be able to work efficiently with a VT100 terminal anyway. Very long lines might be harder to read, so a limit of around 120 or 160 characters might still be a reasonable idea…

Languages like Ruby and Scala put slightly more meaning into white space, because in most cases a semicolon can be omitted if a newline follows and not just horizontal white space. And Perl (Perl 5) is for sure so hard to parse that only its own implementation can properly format it or even recognize which white space is part of a literal string. Special cases like having program code in a string, which is parsed and then executed, should be ignored here.

Now we put these program files into a source code management system, usually Git. Some teams still use legacy systems like Subversion, SourceSafe, ClearCase or CVS, and there are some newer systems that are probably about as powerful as Git, but I never saw them in use. Git computes a hash of each file's content (SHA-1 by default, not MD5), which implies that any minor change will result in a new version, even if it is just white space. Now this does not hurt too much if we agree on the same formatting and on the same line ending (hopefully LF only, not CR LF, even on MS-Windows). But our tooling does not distinguish between significant changes and insignificant formatting-only changes. This gets worse if users have different IDEs, which they should have, because everyone should use the IDE or editor with which he or she is most efficient, and the formal description of the preferred formatting is not shared between editors or differs slightly.
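To make the point about hashing concrete, here is a minimal Python sketch of how Git derives the ID of a file's content in a default SHA-1 repository: it hashes a short header ("blob", the length, a NUL byte) followed by the raw bytes. The function name git_blob_id and the two sample snippets are only illustrative assumptions.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the blob ID Git computes for this content (SHA-1 repositories)."""
    # Git hashes the header "blob <length>\0" followed by the raw file bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two C-like lines that mean the same to the compiler, formatted differently.
compact = b"int x=1;\n"
spaced = b"int x = 1;\n"

print(git_blob_id(compact))
print(git_blob_id(spaced))
print(git_blob_id(compact) == git_blob_id(spaced))  # False: formatting alone changes the ID
```

The hash sees only bytes, so an added space or a CR LF line ending is as much a "change" as a modified statement.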

I think that each programming language should come with a command line diff tool and a command line formatting tool that obey a standard calling interface and can be plugged into editors and into source code management systems like Git. Then the same mechanisms work for C, Java, C#, Ruby, Python, Fortran, Clojure, Perl, F#, Scala, Lua or your favorite programming language.

I can imagine two ways of working. In the first, we have a standard format and possibly individual formats for each developer. During „git commit“ the file is brought into the standard format before it is shown to Git, so meaningless white space changes disappear. During checkout the file can optionally be brought into the preferred format of the developer. And yes, there are ways to deal with deliberate formatting that for some reason should be kept verbatim, and to treat comments and of course all kinds of string literals differently. Remember, the formatting tool comes from the same source as the compiler and fully understands the language.
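As a rough sketch of the normalizing step, Python's own parser can stand in for the language-aware formatter: parse the source and re-emit it in one canonical layout. The name to_standard_format is hypothetical, ast.unparse requires Python 3.9 or newer, and it discards comments, which a real tool of the kind described here would have to preserve.

```python
import ast


def to_standard_format(source: str) -> str:
    """Re-emit Python source in one canonical layout (comments are lost)."""
    # Parsing and unparsing removes all layout choices of the developer.
    return ast.unparse(ast.parse(source)) + "\n"


# The same function as two developers with different habits might write it.
alice = "def add(a,b):\n        return a+b\n"
bob = "def add( a, b ):\n  return (a + b)\n"

# Both versions collapse to the same text, so Git would see no change.
print(to_standard_format(alice) == to_standard_format(bob))
```

Git's clean and smudge filters are one existing place where such a formatter could be hooked in for commit and checkout.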

The other approach leaves the formatting up to the developer and only creates a new version when the diff tool of the language signals that there is a relevant change.
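A language-aware check of this kind can be sketched in a few lines of Python, again using the standard ast module as the "sister tool" of the interpreter. The function name same_program is an assumption for illustration. Note that this particular sketch treats comment-only changes as irrelevant, which a real tool might decide differently.

```python
import ast


def same_program(old_source: str, new_source: str) -> bool:
    """True if both sources parse to the same syntax tree."""
    # ast.dump omits line and column positions by default, so only the
    # structure and the literal values are compared, not the layout.
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


old = "def add(a,b):\n    return a+b\n"
reformatted = "def add(a, b):\n\n    # sum of both arguments\n    return a + b\n"
changed = "def add(a, b):\n    return a - b\n"

print(same_program(old, reformatted))  # True: only layout and a comment differ
print(same_program(old, changed))      # False: the operator changed
```

White space inside string literals stays significant here, because the literal's value is part of the syntax tree.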

I think that we should strive for this approach. It is no rocket science: diff tools and formatting tools of this kind have been around for many decades. It would just be necessary to go the extra mile and create sister diff and formatting tools for the compiler (or interpreter) and to actually integrate these into build environments, IDEs, editors and Git. It would save a lot of time and leave more time for solving real problems.

Is there any programming language that actually does this already?

How to handle XML? Is XML just the new binary with a bit more bloat? Can we do a generic handling of all XML or should it depend on the Schema?
