Just run it twice

Often we use some kind of „clustered“ environment to run our software.

This promises higher performance and better availability.

And the frameworks seem to suggest that it is just a matter of starting it twice and it will magically work correctly.

There is nothing wrong with investing some thought in this issue. It can actually go quite wrong otherwise…

So some questions to think about:

Where are the data? Does each service have its own set of data? Or do they share the data? Or is there some kind of synchronization mechanism between the copies of data? Or is some data shared and some data as copy for each instance of the service?

Do we gain anything in terms of performance or is the additional power of the second instance eaten up by the overhead of synchronizing data? Or if data is only stored once, does this become the bottleneck?

Then there is an issue with sending the requests to the right service. Usually it is a good idea to use something like „sticky sessions“ to keep a whole session or collections of related requests on one instance. Even if the protocol is „stateless“ and „restful“.

Is there some magic caching that happens automatically, for example in persistence frameworks like Hibernate? What does this mean when running two instances? Do we really understand what is happening? Or do we trust that hibernate does it correctly anyway? I would not trust hibernate (or any other JPA implementation) on this issue.

What about transactions? If storage is not centralized, we might need to do distributed transactions. What does that mean?

Now messaging can become fun. Modern microservice architectures favor asynchronous over synchronous communication, where it can be applied. That means some kind of messaging or transmission of „events“ or whatever it is called. Now events can be subscribed to. Do we want them to be processed exactly once, at least once or by every instance? How do we make sure it is happening as we need it? Especially the „exactly once“ case is tricky, but of course it can be done.
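As an illustration of the „at least once plus deduplication“ approach, here is a minimal sketch in Java. It assumes the broker may redeliver messages and that each message carries a unique id; the class and method names are made up for this example, and in a clustered setup the set of processed ids would have to live in a shared store (for example a table with a unique constraint) instead of memory.

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class DeduplicatingConsumer {

    // ids of messages that have already been processed; in a real clustered
    // setup this would be a shared store with a unique constraint, not memory
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    public void onMessage(String messageId, String payload) {
        // add() returns false if the id was already seen, so redeliveries are skipped
        if (!processedIds.add(messageId)) {
            return;
        }
        handle(payload);
    }

    private void handle(String payload) {
        System.out.println("processing " + payload);
    }
}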

How do we handle tasks that run like once in a certain period of time, like cronjobs in Linux? Do we need to enforce that they run exactly once or is it ok to run them on each instance? If so, is it ok to run them at the same time?

Do we run the service multiple times on the productive system, but only a single instance on the test and development systems?

Running the service twice or more times is of course something we need to do quite often and it will become more common. But it is not necessarily easy. Some thinking needs to be done. Some questions need to be asked. And we need to find answers to them. Consider this as a starting point for your own thinking processes for your specific application landscape. Get the knowledge, if it is not yet in the team, by learning or by involving specialists who have experience…


Combining multiple scans

When images have been scanned multiple times, maybe there is a way to construct from them an image that is better than any of the individual scans.

In this case it is assumed that one scan has a higher resolution, but another scan got the colors better.

It has already been found out, which two scans belong to the same original.

Also one of them has already been rotated to the desired position.

The rotation can easily be found. Since the images are roughly rectangular and the width is roughly 1.5 times the height, only two rotations have to be tried. The images are compared pixel by pixel and the one that has less deviation is probably the right one.

Now the orientation is already the same. And when scaled to the same size, the images should be exactly the same, apart from inaccuracies. But that is not the case. They are scaled slightly differently and they are shifted a bit.

Now the shift and scaling can be expressed by four numbers as x=a*u+b and y=c*v+d, where (x,y) and (u,v) are coordinates of the two images. It could be a matrix, if rotation were also involved, and this would work the same way, but professional scans do have shift and scaling problems and much less of a rotation problem.

So to estimate a, b, c and d, it is possible to go through all the pixels (u,v) of one image. Now the pixels (x,y) = (u+du, v+dv) are compared for combinations of small du and dv. The result is a sum of squares of differences s in a range between 0 and 3*65536. Low values are good, meaning that the coordinates (u,v,x,y) are a good match and need to be recorded with a high weight. This is achieved by weighting them with (3*65536-s)^2. Then it is just two linear regressions to calculate a, b, c and d. Of course, the points near the borders need special care, but it is possible to omit a small margin near the border to avoid this issue.
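As a small sketch of the weighted regression step, assuming the matching coordinate pairs and their weights have already been collected (all names here are illustrative), the coefficients a and b of x = a*u + b can be estimated like this; the same code gives c and d for y = c*v + d:

public final class WeightedRegression {

    /** Returns {a, b} minimizing the sum of w[i] * (x[i] - (a*u[i] + b))^2. */
    public static double[] fit(double[] u, double[] x, double[] w) {
        double sw = 0, swu = 0, swx = 0, swuu = 0, swux = 0;
        for (int i = 0; i < u.length; i++) {
            sw   += w[i];
            swu  += w[i] * u[i];
            swx  += w[i] * x[i];
            swuu += w[i] * u[i] * u[i];
            swux += w[i] * u[i] * x[i];
        }
        // standard weighted least squares formulas for a straight line
        double a = (sw * swux - swu * swx) / (sw * swuu - swu * swu);
        double b = (swx - a * swu) / sw;
        return new double[] { a, b };
    }
}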

Now the points (u,v) of one image are iterated and the approximate match (x,y) on the other image is found. Now for (r,g,b) there can be a function to transform them. To start, it can be linear again, so this will be three linear regressions for r, g, and b. Or it could involve a matrix multiplied to the input vector. Or some function, then a matrix multiplication then vector addition and then the inverse function. The result has to be constrained to RGB-values between 0 and 255.

Now in the end, for each pixel in one image there is a function that changes the colors more towards what the other image has. This can be applied to the full resolution image. Also, a weighted average of the original values and the calculated values can be used to find the best results. Then this can be applied to the whole series.

A few experiments with the function and a bit of research on good functions for this might be done to improve the result. At the end of the day there is a solution that is good enough and works. Or maybe a few variants and then some manual work to pick the best.


Pagination of Database Query Results

This article is highly inspired by the blog post We need tool support for keyset pagination.
Please consider reading the original first and then my interpretation and additional thoughts about this idea.

We have a typical database backed web application. It can be a rich client. It can be a NoSQL database. Whatever, but for the moment maybe a transactional SQL database and a web application will be used as the most common and best understood example.

We construct a query whose result set is so large that we do not want to transfer, render and display it all at once. Think of Google, where you might have millions of hits and only see the first 10 or 100 or so. Now you can navigate through the result set, usually by going to the next pages successively.

Nice web applications tell you exactly how many records there are. They then calculate how to split this up into a large number of pages and allow you to go to an arbitrary page, or at least to some of the pages (last, next 10 pages), immediately.

There are several problems with this. First of all: the set you are basing this on changes with time. This could be fixed by creating some temporary snapshot in memory, in the database or in some kind of db-cache and keeping that around for some time. This is expensive and usually not worth the price, so we live with some misbehavior.

The second problem is the count. Count actually needs to do a full scan of the result set; there is no decent shortcut. Now it would usually be good enough to estimate the size of the result set, and that is exactly what Google does. Sometimes the estimation is ridiculously bad, but we have gotten used to it. That is much better than using the real count, because it really speeds up the application by a great factor and improves user experience overall. Of course it is a bit more challenging to program an estimation and only worth it for big result sets. We could still jump to any of the next 10 pages by just increasing the size of the current query by a factor of 11 and showing the required segment. Since we sort the result by something, we could just get the first N results sorted backwards and then wrap that in a query that sorts them forward. So dropping the count can help for performance without too much loss in functionality.

Now the next problem is that, having navigated through a lot of pages, the usual way of doing a query and then constraining it with some mechanism like „rownum“, „row_number“ or „limit/offset“ to the subset we want to display always requires obtaining the result set from the beginning up to our page. Usually that does not matter, because we navigate through the application manually and that means not too many pages from the beginning. But it can become an issue, if the application is heavily used.

Now we sort by something. Usually not the id, but some business relevant keys or a combination. They do not have to be unique, but we can always add the ID (or more generally the primary key) as the last sort key to make it unique. Then the next page can be found by just adding a condition based on our sort key. Assuming we want to show n records, the query might be

SELECT * FROM TABX 
  WHERE cond1 AND cond2 
  ORDER BY A, B, C ASC 
  LIMIT n;

Now the last record of the result has A=a1, B=b1, C=c1.
Then the next page can be obtained by

SELECT * FROM TABX 
  WHERE cond1 AND cond2 
   AND (A>a1 OR A=a1 AND B>b1 OR A=a1 AND B=b1 AND C>c1) 
  ORDER BY A, B, C ASC 
  LIMIT n;

Skipping three pages and getting the fourth of the following pages is done like this:

SELECT * FROM TABX 
  WHERE cond1 AND cond2 
   AND (A>a1 OR A=a1 AND B>b1 OR A=a1 AND B=b1 AND C>c1) 
  ORDER BY A, B, C ASC
  LIMIT n OFFSET 3*n;

We do use (or simulate) OFFSET here, but in a more intelligent way, because we do not force it to work harder than necessary by rebasing our query on what we already know.
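As a hedged sketch of how the keyset query above could look with plain JDBC in Java (table and column names are taken from the example; cond1 and cond2 are left out, and the LIMIT syntax assumes a database like PostgreSQL or MySQL):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class KeysetPagination {

    /** Reads the next n rows after the last seen key (a1, b1, c1). */
    public ResultSet nextPage(Connection conn, long a1, long b1, long c1, int n)
            throws SQLException {
        String sql = "SELECT * FROM TABX"
                + " WHERE (A > ? OR A = ? AND B > ? OR A = ? AND B = ? AND C > ?)"
                + " ORDER BY A, B, C"
                + " LIMIT ?";
        PreparedStatement ps = conn.prepareStatement(sql);
        ps.setLong(1, a1);
        ps.setLong(2, a1);
        ps.setLong(3, b1);
        ps.setLong(4, a1);
        ps.setLong(5, b1);
        ps.setLong(6, c1);
        ps.setInt(7, n);
        return ps.executeQuery();
    }
}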

We get a bit more consistency, because new records being inserted in the beginning will not be seen, but they will at least not disturb the succession of our result set. Maybe that is worth more than the performance gain, because we get it with a reasonable effort. Caches, snapshots or even keeping it in „session memory“ are not really serious approaches to this issue; they will just sooner or later blow up our application in production, at the worst possible moment.


JSON instead of Java Serialization: The solution?

We start recognizing that Serialization is not such a good idea.

It is cool and can really work on a wide range of objects, even including complex and cyclic reference graphs. And it was essential for some older Java frameworks like EJB and RMI, which allowed remote access to Java objects and classes.

But it is no longer the future; Oracle will soon deprecate it and later remove it. And it will happen this time, even though they really keep stuff around for a long time due to compatibility requirements.

Just to recap: it opens up security issues, it opens up hidden behavior and makes it harder to reason about code, it creates tight coupling between remote components and it can result in bugs that only occur at runtime and cannot be discovered at compile time. In short, it is not resilient.

So we need something else. Obvious candidates are XML, YAML and JSON. XML is of course an option and is powerful enough to do many things, but often a bit too clumsy and too much boilerplate, so we try to move away from it. YAML and JSON kind of do the same thing, but it seems that JSON is winning the race and we all need to know JSON and many of us tend to skip YAML.

So why not use JSON? It is easy, it has good libraries and we can even find databases that work with JSON.

What JSON can express very well are scalars, lists and maps and combinations of these. This is quite exactly what we have in Perl, JavaScript or Clojure as basic building blocks. These languages support object oriented programming, but for simple stuff we go with these basic building blocks. And objects can be modelled as (hash-)maps, with the attribute names as keys. Actually JSON is valid JavaScript code.

We do have to change our thinking when moving from Java Serialization to JSON. JSON does not store any serializable object but just data. Maybe that is enough and that is what we actually want. It totally works in heterogeneous environments, where we are using different programming languages or different implementations.

There are good libraries. I have tried two, Jackson and GSON, which both work well; recently mostly Jackson. It is important to think in terms of Clojure, JavaScript or Perl without objects. So we lose type information, which can be considered good or bad, but if we can live with it, we avoid the tight coupling. JavaBeans are expressed exactly the same as a HashMap with the attribute names as keys. We can provide the top level class when deserializing, but at the child levels the library will not be able to figure that out if it relies on runtime information.

Example Code

Here it has been tried out. Find the full example code on GitHub.

A class that contains all kinds of stuff. Not prepared for really putting in nulls, but it is just experimental code…


package net.itsky.jackson;

import java.util.List;
import java.util.Map;
import java.util.Set;

public class TestObject {
    private Long l;
    private String s;
    private Boolean b;
    private Set set;
    private List list;
    private Map map;

    public TestObject(Long l, String s, Boolean b, Set set, List list, Map map) {
        this.l = l;
        this.s = s;
        this.b = b;
        this.set = set;
        this.list = list;
        this.map = map;
    }

    public TestObject() {
        // only for framework purposes
    }

    public Long getL() {
        return l;
    }

    public String getS() {
        return s;
    }

    public Boolean getB() {
        return b;
    }

    public Set getSet() {
        return set;
    }

    public List getList() {
        return list;
    }

    public Map getMap() {
        return map;
    }

    @Override
    public String toString() {
        return getClass().getSimpleName() + "("
                + "l=" + l + " (" + l.getClass() + ") "
                + " s=\"" + s + "\" (" + s.getClass() + ") "
                + " b=" + b + " (" + b.getClass() + ") "
                + " set=" + set  + " (" + set.getClass() + ") "
                + " list=" + list + " (" + list.getClass() + ") "
                + " map=" + map + " (" + map.getClass() + "))";
    }
}

And this is used for running everything. To play around more, it should probably be moved to tests…

package net.itsky.jackson;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.ObjectWriter;
import com.google.common.collect.ImmutableList;
import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;

import java.io.StringReader;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class App {

    public static void main(String[] args) {
        try {
            Set s1 = ImmutableSet.of(1, 2, 3);
            Set s2 = ImmutableSet.of(1, 2, 3);
            Map m1 = ImmutableMap.of("A", "abc", "B", 3L, "C", s1);
            List l1 = ImmutableList.of("i", "e", "a", "o", "u");
            TestObject t1 = new TestObject(30303L, "uv", true, s2, l1, m1);
            Map m2 = ImmutableMap.of("r", "r101", "s", 202, "t", t1);
            List l2 = ImmutableList.of("ä", "ö", "ü", "å", "ø");
            Set s3 = ImmutableSet.of("x", "y", "z");
            TestObject t2 = new TestObject(40404L, "ijk", false, s3, l2, m2);
            ObjectMapper mapper = new ObjectMapper();
            ObjectWriter writer = mapper.writerWithDefaultPrettyPrinter();
            System.out.println("t2=" + t2);
            String json = writer.writeValueAsString(t2);
            System.out.println("json=" + json);
            StringReader stringReader = new StringReader(json);
            TestObject t3 = mapper.readValue(stringReader, TestObject.class);
            System.out.println("t3=" + t3);
        } catch (Exception ex) {
            RuntimeException rex;
            if (ex instanceof RuntimeException) {
                rex = (RuntimeException) ex;
            } else {
                rex = new RuntimeException(ex);
            }
            throw rex;
        }
    }
}

And here is the output:

t2=TestObject(l=40404 (class java.lang.Long)  s="ijk" (class java.lang.String)
  b=false (class java.lang.Boolean)  
set=[x, y, z] (class com.google.common.collect.RegularImmutableSet)
  list=[ä, ö, ü, å, ø] (class com.google.common.collect.RegularImmutableList)
  map={r=r101, s=202, 
t=TestObject(l=30303 (class java.lang.Long)
  s="uv" (class java.lang.String)  b=true (class java.lang.Boolean)
  set=[1, 2, 3] (class com.google.common.collect.RegularImmutableSet)
  list=[i, e, a, o, u] (class com.google.common.collect.RegularImmutableList)
  map={A=abc, B=3, C=[1, 2, 3]} (class com.google.common.collect.RegularImmutableMap))}
 (class com.google.common.collect.RegularImmutableMap))
json={
  "l" : 40404,
  "s" : "ijk",
  "b" : false,
  "set" : [ "x", "y", "z" ],
  "list" : [ "ä", "ö", "ü", "å", "ø" ],
  "map" : {
    "r" : "r101",
    "s" : 202,
    "t" : {
      "l" : 30303,
      "s" : "uv",
      "b" : true,
      "set" : [ 1, 2, 3 ],
      "list" : [ "i", "e", "a", "o", "u" ],
      "map" : {
        "A" : "abc",
        "B" : 3,
        "C" : [ 1, 2, 3 ]
      }
    }
  }
}
t3=TestObject(l=40404 (class java.lang.Long)  s="ijk" (class java.lang.String)  
b=false (class java.lang.Boolean)  
set=[x, y, z] (class java.util.HashSet)  
list=[ä, ö, ü, å, ø] (class java.util.ArrayList)  
map={r=r101, s=202, 
t={l=30303, s=uv, b=true, 
set=[1, 2, 3], 
list=[i, e, a, o, u], 
map={A=abc, B=3, C=[1, 2, 3]}}}
  (class java.util.LinkedHashMap))

Process finished with exit code 0

So the immediate object and its immediate attributes were deserialized properly to what we provided. But everything inside went to maps, lists and scalars.

The intermediate JSON does not carry the type information at all, so this is the best that can be done.
Often that is exactly what we want. If not, we need to find something else or see if we can tweak JSON to carry type information.
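If type information is really needed, Jackson can be asked to embed it in the JSON. A minimal sketch, assuming we control the classes (the annotation is Jackson's standard mechanism, the fields here are purely illustrative); note that this re-introduces exactly the kind of coupling to Java class names discussed above:

import com.fasterxml.jackson.annotation.JsonTypeInfo;

// The concrete class name is written into the JSON as an extra "@class"
// property, so nested objects can be deserialized to their original types
// instead of LinkedHashMap.
@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@class")
public class TypedTestObject {
    public Long l;
    public String s;
}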

It will be interesting to explore other serialization protocols…



Phone Numbers and E-Mail Addresses

Most data that we deal with are strings or numbers or booleans and combinations of these into classes and collections. Dates can be expressed as string or number, but have enough specific logic to be seen as a fourth group of data. All these have interesting aspects, some of which have been discussed in this blog already.

Now phone numbers are, by a naïve approach, numbers or strings, but very soon we see that they have their own specific aspects. The same applies to email addresses, which can be represented as strings.

Often projects go by their own „simplified“ specification of what an email address or a phone number is and how to parse, compare and render them. At the end of the day the simplification is harder to tame than the real solution, because it needs to be maintained and specified by the project team rather than being based on a proven library. And once in a while „edge cases“ occur that cannot be ignored and that make the „home grown“ library even more complex.

Behind phone numbers and email addresses there are well defined and established standards and they are hard to understand thoroughly within the constrained time budget of a typical „business project“, because the time should be allocated to enhancing the business logic and not to reinventing the basics. Unless there is a real need to do so, of course.

Just to give an idea: When phone numbers are parsed or provided by user input, they can start with a „+“ sign or use some country specific logic to express to which country they belong. And then „+1“, for example, does not stand for the United States alone, but also for Canada and some smaller countries that are in some way associated with the United States or Canada. Further analysis of the number is required to know about that. The prefix for international numbers is often „00“, but in the United States it is „011“, and there were and are some other variants that are still frequently used. Some people like to write something like „+49(0)431 77 88 99 11 1“ instead of „+49 431 77 88 99 11 1“. We can constrain the input to the variants we happen to think of and force the supplier of data to comply, but why bother? Why not accept legitimate formats, as long as they are correct and unambiguous?

Now for email addresses there is the famous one-page regular expression to recognize correct email addresses, which is even by itself not totally complete. Find it at the bottom of the article…

Of course it includes some rarely used variants of email addresses that were once used and have not been completely abolished officially, but it is hard to draw an exact border for this.

So the general recommendation is to find a good library for working with email addresses and phone numbers. Maybe the library can even, to some extent, eliminate input strings that formally comply with the format but are known to be incorrect, by knowing about numbering schemes worldwide or about email domains or even by performing lookups.

Another strong recommendation is to store data like email addresses and phone numbers in a technical format, that is, in the case of phone numbers, always starting with a „+“ followed by digits only. For input, any positioning of spaces is accepted; for output, the library knows how to format it correctly. This allows selecting by the numbers without dealing with complex formatting, by just using the technical format in the query as well.

For Java (and thus for many JVM languages), C++ and JavaScript there is an excellent library from Google for dealing with phone numbers. For emails, something like the Apache Commons Validator email validator is a way to go.
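A minimal sketch of how these two libraries can be used in Java (the phone number is taken from the example above, the region „DE“ and the email address are illustrative assumptions):

import com.google.i18n.phonenumbers.PhoneNumberUtil;
import com.google.i18n.phonenumbers.PhoneNumberUtil.PhoneNumberFormat;
import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;
import org.apache.commons.validator.routines.EmailValidator;

public class ContactDataCheck {

    public static void main(String[] args) throws Exception {
        PhoneNumberUtil util = PhoneNumberUtil.getInstance();
        // accepts various input formats and parses them relative to a default region
        PhoneNumber parsed = util.parse("+49 431 77 88 99 11 1", "DE");
        // technical format: "+" followed by digits only
        String technical = util.format(parsed, PhoneNumberFormat.E164);
        System.out.println(technical);

        // formal check of an email address (no lookup, no verification mail)
        boolean valid = EmailValidator.getInstance().isValid("someone@example.com");
        System.out.println(valid);
    }
}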

Keep in mind that for email addresses and phone numbers, the ultimate way of verification is to send a link or a code to the address or number, which the user then needs to enter. At the end of the day it is insufficient to rely only on formal verification without this final step.

But still issues remain for transforming data into a canonical technical format for storing it, formatting data for display etc. And there is a huge added value if we can reliably recognize formally false entries early, when the user can still easily react to it, rather than waiting for an email/SMS/phone call to be processed, which may fail when the user is no longer on our „registration site“. And we can process data which has already been verified by a third party, but still we want to parse it to recognize obvious errors.

The concrete libraries may be outdated by the time you are reading this, or they may not be applicable for the language environment that you are using, but please make an effort to find something similar.

So, please use good libraries, which are likely to be found for the environment that you are using, and write yourself what creates value for your project or organization. Unless your goal is really to write a better library, better invest the time into areas where there are still no good libraries around.

And as always, you may understand email addresses and phone numbers as an example for a more general idea.


E-Mail Regex

Source: https://emailregex.com/:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?: \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\ ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?: (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n) ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t] )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*) *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\]( ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(? :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(? :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<> @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(? 
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:( ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(? :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n) ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<> @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t] )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?: \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t]) *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\ .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:( ?:\r\n)?[ \t])*))*)?;\s*)


Ranges of Dates and Times

In Software we often deal with ranges of dates and times.

Let us look at it from the perspective of an end user.

When we say something like „from 2019-03-07 to 2019-03-10“ we mean the set of all timestamps t such that

    \[\text{2019-03-07} \le t < \text{2019-03-11}\]

or more accurately:

    \[\text{2019-03-07T00:00:00}+TZ \le t < \text{2019-03-11T00:00:00}+TZ\]

What is important is that we mean to include the whole 24-hour day of 2019-03-10. By the way, please try to get used to the ISO date format even when writing normal human readable texts; it just makes sense…

Now when we are not talking about dates, but about times or instants of time, the interpretation is different.
When we say something like „from 07:00 to 10:00“ or „from 2020-03-10T07:00:00+TZ to 2020-04-11T09:00:00+TZ“, we actually mean the set of all timestamps t such that

    \[givenDate\text{T07:00:00}+TZ \le t < givenDate\text{T10:00:00}+TZ\]

or

    \[\text{2020-03-10T07:00:00}+TZ \le t < \text{2020-04-11T09:00:00}+TZ,\]

respectively. It is important that we have to add one day in the case of date-only information (accuracy of one day) and we do not in the case of finer grained date/time information. The question whether the upper bound is included or not is not so important in our everyday life, but it turns out that commonly the most useful way is not to include the upper bound. If you prefer to have all options, it is a better idea to employ an interval library, i.e. to find one or to write one. But for most cases it is enough to exclude the upper limit. This guarantees disjoint adjacent intervals, which is usually what we want. I have seen people write code that adds 23:59:59.999 to a date and compares with ≤ instead of <, but this is an ugly hack that needs a lot of boilerplate code and a lot of time to understand. Use the exclusive upper limit, because we have it.

Now the requirement is to add one day to the upper limit to get from the human readable form of date-only ranges to something computers can work with. It is a good thing to agree on where this transformation is made. And to do it in such a way that it even behaves correctly on those dates where daylight saving starts or ends, because adding one day might actually mean „23 hours“ or „25 hours“. If we need to be really very accurate, sometimes leap seconds need to be added.

Just another issue has come up here: local time is much harder than UTC. We need to work with local time on all kinds of user interfaces for humans, with very few exceptions like for pilots, who actually work with UTC. But local date and time is ambiguous for one hour every year and at least a bit special to handle on the two days where daylight saving starts or ends. Convert dates to UTC and work with that internally. And convert them to local date on all kinds of user interfaces where it makes sense, including documents that are printed or provided as PDFs, for example. When we work with dates without time, we need to add one day to the upper limit and then round it to the nearest some-date T00:00:00+TZ for our timezone TZ, or know when to add 23, 24 or 25 hours, respectively, which we do not want to know. So we need to use modern time libraries like the java.time classes in Java, for example.
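A minimal sketch with java.time of how the human readable range „from 2019-03-07 to 2019-03-10“ becomes a half-open interval of instants, adding one day to the upper limit exactly once; the time zone used here is an illustrative assumption:

import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneId;

public class DateRangeExample {

    public static void main(String[] args) {
        ZoneId zone = ZoneId.of("Europe/Zurich");
        LocalDate from = LocalDate.parse("2019-03-07");
        LocalDate to = LocalDate.parse("2019-03-10");

        // lower bound inclusive, upper bound exclusive (one day added exactly here)
        Instant lower = from.atStartOfDay(zone).toInstant();
        Instant upperExclusive = to.plusDays(1).atStartOfDay(zone).toInstant();

        Instant t = Instant.parse("2019-03-10T22:30:00Z");
        boolean inRange = !t.isBefore(lower) && t.isBefore(upperExclusive);
        System.out.println(inRange);
    }
}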

Working with date and time is hard. It is important to avoid making it harder than it needs to be. Here some recommendations:

  • Try to use UTC for the internal use of the software as much as possible
  • Use local date or time or date and time in all kinds of user interfaces (with few exceptions)
  • Add one day to the upper limit and round it to the nearest midnight of local time exactly once in the stack
  • Exclude the upper limit in date ranges
  • Use ISO-date formats even in the user interfaces, if possible



UUIDs revisited

UUIDs have proven useful in many circumstances.
We have basically two main variants:

  • The UUID is calculated as a combination of the Ethernet-MAC-address, the timestamp and a counter.
  • The UUID is calculated using a good random number generator

While variant 1 provides for good uniqueness, there are some issues with it. Today we use mostly virtualized servers, which means that the MAC address is coming from a configuration file and is no longer guaranteed to be worldwide unique. And we give away some information with the UUID that we do not necessarily want to give away.

Variant 2 can be proven to have an acceptably low risk of collisions, but this is only true when using really good random number generators, which cannot always be guaranteed. Also it introduces an uncertainty in an area where we do not need it. We need to worry about this uniqueness, at least a little bit, which is unnecessary.

So the question is whether we can rethink variant 1.

Assume that our software runs on our server farm. There may be a few hundred or a few thousand or even millions of virtual or physical servers. Now the organization does have a way to uniquely identify its servers. Of course we only need to consider the servers that are relevant for the application. Maybe an ID for the service instance instead of the server is even better. We may assume a numerical ID or something like that, or have a table to map IP addresses and the like to such an ID. Some thinking is still required on how to do this. We can fill the digits that we do not need with random numbers.

Putting this ID in place of the MAC address solves the issue of the configurable MAC address.

The next problem, that timestamps can be abused to find out something that should not be found out, could be resolved by running the timestamp part or even the ID part (including a random number) through a symmetric encryption or simply some bijective function that is kept secret.
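A minimal sketch of such a generator in Java, assuming a numeric service instance id is available; the bit layout chosen here is purely illustrative, and a real implementation would also have to take care of the version and variant bits of the UUID standard and of the encryption of the timestamp part mentioned above:

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class ServiceUuidGenerator {

    private final long serviceInstanceId;          // unique within the organization
    private final AtomicLong counter = new AtomicLong();

    public ServiceUuidGenerator(long serviceInstanceId) {
        this.serviceInstanceId = serviceInstanceId;
    }

    public UUID next() {
        // timestamp and counter in the upper half, service instance id in the lower half
        long most = (System.currentTimeMillis() << 16) | (counter.incrementAndGet() & 0xFFFF);
        long least = serviceInstanceId;
        return new UUID(most, least);
    }
}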

In many circumstances there is nothing wrong with customizing the UUID-generation to some „local“ standard, if this is well understood and carefully implemented.


Flashsort in Scala

There is now also an implementation of Flashsort in Scala.

In order to cover the sorting of a part of an array, which is needed as a step of flashsort, a heapsort implementation in Scala that can be constrained to a part of an array has been included as well. Heapsort was chosen because it can sort in place and it has a guaranteed performance of O(n log(n)). Mergesort or quicksort would have been reasonable choices as well. Some implementations even use insertion sort for this step, because the sections are small.
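Just to illustrate the idea of a heapsort that is constrained to a section of an array, here is a plain Java sketch (not the Scala code from the linked repository), sorting only the range from (inclusive) to to (exclusive):

public final class SectionHeapSort {

    public static void sort(int[] a, int from, int to) {
        int n = to - from;
        // build the max-heap over the section
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(a, from, i, n);
        }
        // repeatedly move the maximum to the end of the remaining section
        for (int end = n - 1; end > 0; end--) {
            swap(a, from, from + end);
            siftDown(a, from, 0, end);
        }
    }

    private static void siftDown(int[] a, int from, int i, int n) {
        while (2 * i + 1 < n) {
            int child = 2 * i + 1;
            if (child + 1 < n && a[from + child + 1] > a[from + child]) {
                child++;
            }
            if (a[from + i] >= a[from + child]) {
                return;
            }
            swap(a, from + i, from + child);
            i = child;
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}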



Flashsort in Ruby


There is a simple implementation of Flashsort in Ruby, after an implementation in C has already been provided. The C implementation is typically faster than the libc function qsort, but this always depends on the data and on how well the metric function has been written, which is needed on top of the comparison function for Flashsort. You can think of this metric function as some kind of monotonic hash function. So we have

    \[\bigwedge_{a,b: a\le b} m(a) \le m(b) \]

This additionally needed function or method is not readily available, apart from numerical values, so we really have to invest some time into writing it. This makes the use of Flashsort a bit harder. A good metric function is crucial for good performance, but for typical text files quite trivial implementations already outperform classical O(n log n) algorithms like heapsort, quicksort and mergesort for larger amounts of data.
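To give an idea of such a „quite trivial“ metric, here is a sketch in Java (the implementations mentioned above are in Ruby and C, so this is only an illustration): it maps the first bytes of a string to a non-negative long, so that byte-wise ordering of the strings implies ordering of the metric values:

import java.nio.charset.StandardCharsets;

public final class StringMetric {

    /** Monotonic with respect to byte-wise comparison of the ISO-8859-1 encoded strings. */
    public static long metric(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.ISO_8859_1);
        long result = 0;
        for (int i = 0; i < 7; i++) {
            int b = i < bytes.length ? (bytes[i] & 0xFF) : 0;
            result = (result << 8) | b;
        }
        return result;
    }
}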

This blog article shows other sorting algorithms for Ruby.


Indexing of Arrays and Lists

We index arrays with integers. Lists also, at least the ones that allow random access. And sizes of collections are also integers.
This allows for 2^{31}-1=2147483647 entries in Java and typical JVM languages, because integers are actually considered to be 32 bits. Actually we could think of one more entry, using indices 0..2147483647, but then we would not be able to express the size in a signed integer. This stuff is quite deeply built into the language, so it is not so easy to break out of this. And 2’000’000’000 entries are a lot and take a lot of time to process. At least it was a lot in the first few years of Java. There should have been an unsigned variant of integers, which would in this case allow for 4’000’000’000 entries when indexing by a uint32, but that would not really solve this problem. C uses 64 bit integers for indexing of arrays on 64 bit systems.

It turns out that we would like to be able to index arrays using long instead of int. Now changing the Java arrays in a way that they could be indexed by long instead of int would break a lot of compatibility. I think this is impossible, because Java claims to retain very good backward compatibility, and this reliability of both the language and the JVM has been a major advantage. Now a second type of arrays, indexed by long, could be added. This would imply even more complexity for APIs like reflection that have to deal with all cases for parameters, where it already hurts that the primitives are not objects and arrays are such half-objects. So it will be interesting what we can find in this area in the future.

For practical use it is a bit easier. We can already be quite happy with a second set of collections, let them be called BigCollections, that have sizes that can only be expressed with long and that are indexed, where applicable, with longs. Now it is not too hard to program a BigList by internally using an array of arrays or an array of arrays of arrays and doing some arithmetic to calculate the internal indices from the long (int64) index given in the API. Actually we can buy some performance gain when resizing happens, because this structure, if well done, allows for more efficient resizing. Based on this all kinds of big collections could be built.
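A minimal sketch of the idea in Java, here for a fixed-size long-indexed structure backed by an array of arrays; the class name and the chunk size are illustrative choices:

public class BigIntArray {

    private static final int CHUNK_BITS = 20;              // about one million entries per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final long size;
    private final int[][] chunks;

    public BigIntArray(long size) {
        this.size = size;
        int chunkCount = (int) ((size + CHUNK_SIZE - 1) >> CHUNK_BITS);
        chunks = new int[chunkCount][];
        for (int i = 0; i < chunkCount; i++) {
            long remaining = size - ((long) i << CHUNK_BITS);
            chunks[i] = new int[(int) Math.min(CHUNK_SIZE, remaining)];
        }
    }

    public long size() {
        return size;
    }

    public int get(long index) {
        // upper bits select the chunk, lower bits the position within the chunk
        return chunks[(int) (index >> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    public void set(long index, int value) {
        chunks[(int) (index >> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }
}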
