Binary Data

Having discussed to some extent about strings and text data, it is time to look at the other case, binary data.

Usually we think of arrays of bytes or sequences of bytes stored in some media.

Why bytes? The 8-bit-computers are no longer so common, but the byte as a typical unit of binary data is still present. Actually this is how files on typical operating systems work, they have a length in bytes and that number of bytes as content. When we use the file system, we have to deal with this.

But assuming a perfect lossless world, the typical smallest unit of information is not a byte, but a bit. So we could actually assume that we have an array of bits of a given length $l$ and the length does not need to be a multiple of 8. This might be subtle, but it does describe the most basic case. Padding with 0-bits does not usually do the job, but there need to be other provisions to deal with the bit-length and the storage length. In case of files this could be achieved by storing the last three bits $b$ of the bit length in the first byte of the file and using the file size $s$ the bit length $l$ can be calculated as $8(s-1)+b$ .

How about memory? Often the array of bytes is just fine. The approach to have an array of booleans as bits usually hurts, because typical programming languages reserve 32 bit for each boolean and we actually want a packed array. On the other hand we have 64bit-computers and we would like to use this when moving around data, so actually thinking of an array of 64-bit-integers plus information about how many of the bits of the last entry actually count might be better. This does bring up the issue of endianness at some point.

But we actually want to use the data. In the good old C-days we could use quite a powerful trick: We read an array of bytes and casted the pointer to a pointer to a struct. Then we could access the struct. Poor mans serialization, but it was efficient and worked especially well for records with fixed size, for both reading and writing. Off course variable length records can be dealt with as well by imposing rules that say something about the type of the next record based on the contents of the current record.

The Perl programming language and subsequently the Ruby programming language had much more powerful and flexible mechanisms using methods of functions called pack and unpack.

Schreibe einen Kommentar

Antworten abbrechen