Random-Access and UTF-8


It is convenient to be able to use random-access files and to move efficiently to any byte position for reading or writing.

This still works for text files that have a fixed number of bytes per character, for example exactly one, exactly two or exactly four bytes per character: the byte offset of the n-th character is simply n times the character width. Strictly speaking, we should prefer the term "code point" here.
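As a small illustration of the fixed-width case, here is a minimal sketch in Python. It assumes a file encoded as UTF-32 (little endian, no byte order mark), so every code point takes exactly four bytes; the function name `read_char_fixed_width` is made up for this example.

```python
import io

def read_char_fixed_width(f, index, encoding="utf-32-le", width=4):
    """Read the code point at character position `index` from a binary
    file object whose encoding uses exactly `width` bytes per code point."""
    f.seek(index * width)          # offset is a simple multiplication
    return f.read(width).decode(encoding)

# An in-memory file stands in for a real file opened with open(path, "rb").
f = io.BytesIO("héllo".encode("utf-32-le"))
print(read_char_fixed_width(f, 1))  # é
```

No scanning is needed: a single seek and a single read of `width` bytes are enough.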

Now the de facto standard for text files is UTF-8, a variable-length encoding that uses one to four bytes per code point. In many languages most characters take just one byte. But an arbitrary byte position may fall in the middle of the byte sequence that encodes a single character (code point) rather than at the beginning of a character. How can that be handled without reading the whole file up to that position?

It is not that bad, because UTF-8 is self-synchronizing. For any byte it can be seen whether it is the first byte of a sequence or a continuation byte: first bytes start with the bits 0 or 11, continuation bytes start with the bits 10. So by moving forward or backward by at most three bytes, the nearest character boundary can be found. Going to an approximate character position is therefore quite feasible, provided the average number of bytes per character for that language is known.
But when we want to go neither to an approximate character position nor to an almost exact byte position, but to an exact character position, which is usually the requirement, things get hard.
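The resynchronization step can be sketched as follows. This is only an illustration, assuming a valid UTF-8 file opened in binary mode; the helper names `is_continuation`, `align_backward` and `align_forward` are invented for this example.

```python
import io

def is_continuation(byte):
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (byte & 0xC0) == 0x80

def align_backward(f, pos):
    """Return the offset of the first byte of the character that
    contains byte position `pos`."""
    while pos > 0:
        f.seek(pos)
        b = f.read(1)
        if not b or not is_continuation(b[0]):
            break
        pos -= 1
    return pos

def align_forward(f, pos):
    """Return the offset of the next character start at or after `pos`
    (or the end-of-file offset if there is none)."""
    while True:
        f.seek(pos)
        b = f.read(1)
        if not b or not is_continuation(b[0]):
            return pos
        pos += 1

# "a€b" is encoded as 61 | E2 82 AC | 62, so "€" occupies bytes 1 to 3.
f = io.BytesIO("a€b".encode("utf-8"))
print(align_backward(f, 2))  # 1
print(align_forward(f, 2))   # 4
```

In valid UTF-8 either loop takes at most three steps, so the cost is constant. What this does not tell us is which character number we have landed on.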

We either need to read the file from the beginning to be sure, or we need an indexing structure that lets us start in the middle and fully read only a small section of the file. With some extra effort this can be done as long as data is only appended to the end and never overwritten in the middle of the file.
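One possible shape for such an indexing structure is a sparse table that records the byte offset of every `step`-th character. The sketch below is an assumption about how this could look, not a reference implementation; `build_index`, `char_length`, `read_char` and the parameter `step` are hypothetical names, and the file is assumed to contain valid UTF-8.

```python
import io

def build_index(f, step=1024):
    """Scan the file once and record the byte offset of every
    `step`-th character (characters 0, step, 2*step, ...)."""
    f.seek(0)
    offsets, chars, pos = [], 0, 0
    while True:
        chunk = f.read(65536)
        if not chunk:
            return offsets
        for b in chunk:
            if (b & 0xC0) != 0x80:       # first byte of a character
                if chars % step == 0:
                    offsets.append(pos)
                chars += 1
            pos += 1

def char_length(first_byte):
    """Number of bytes in the sequence that starts with `first_byte`."""
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not the first byte of a UTF-8 sequence")

def read_char(f, offsets, n, step=1024):
    """Return character number `n`, reading fewer than `step`
    characters after seeking to the nearest indexed offset."""
    f.seek(offsets[n // step])
    skip = n % step
    while True:
        first = f.read(1)
        if not first:
            raise IndexError("character position beyond end of file")
        rest = f.read(char_length(first[0]) - 1)
        if skip == 0:
            return (first + rest).decode("utf-8")
        skip -= 1

text = "aä€😀" * 10                      # 1 + 2 + 3 + 4 = 10 bytes per group
f = io.BytesIO(text.encode("utf-8"))
offsets = build_index(f, step=4)
print(read_char(f, offsets, 6, step=4))  # €
```

A smaller `step` makes lookups faster but the index larger. As long as the file only grows at the end, the existing entries stay valid and only the newly appended bytes need to be scanned to extend the table; overwriting in the middle could shift every offset behind the change.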

So dealing with UTF-8 and random access is much harder than with plain bytes. The indexing structures need to be maintained in memory while the file is accessed, or even stored as part of the file.
