Mein kleiner blog – Page 5 – Yet another person who thinks he has something to tell the world

Finished reading: Die Diktatur der Dummen (The Dictatorship of the Stupid)

In the subtitle Ms. Witzer fittingly adds: “Wie unsere Gesellschaft verblödet, weil die Klügeren immer nachgeben” (“How our society is dumbing down because the smarter ones always give in”). Yet the content of the book has little to nothing to do with title and subtitle. Using numerous examples, it shows the reader how society, the state and ultimately our whole lives are controlled and steered, more or less subtly, by corporations and interest groups. These are not acting stupidly at all, so as far as the title goes we can give the all-clear: we are not ruled by the stupid but by some rather clever people. The latter merely manipulate the former so that nothing about this will change in the future. That is particularly bad luck for those who belong to neither of the two groups. The book is presumably aimed at exactly these people.

Unfortunately an intelligent reader will not have much fun reading it: the writing style is clumsy, and in many places I would have wished for more critical distance, while Ms. Witzer gives free rein to her emotions. She does all this from her current standing; apparently she has achieved quite a bit in life and can now afford to lean back and relax. Not everyone has that luxury. So while Ms. Witzer explains to me what I can or should change and why things used to be better in the old days (grandpa telling war stories), I keep working in a job where I do the opposite. Just as Ms. Witzer once did herself; after all, I too would like to reach a standing similar to hers one day.

Document management with ecodms

At home I have had a paperless office for almost 10 years: all incoming mail is scanned and processed digitally. So far I have kept things simple, throwing the scanned documents into one big folder and relying on tools like grep for searching. Now friends have recommended ecodms to me as a comfortable alternative for home use. It is free for private users and comes with full-text search in addition to the obligatory tagging.

Installation and setup are easy, and there is a good manual. The system works in two stages: all new documents first land in an inbox, where they can be tagged and then archived. I have about 3500 documents and do not want to feed them in one by one. For this there are templates: you define which properties a document has, such as text it contains, and how it should then be tagged, and such a document can be archived automatically. For the initial load I therefore created a kind of wildcard template: whatever the text, archive it. A few tests work, so I throw my entire collection at ecodms.

That then takes quite a while. ecodms runs its own OCR, which can easily take a good half hour per document. The user is not told how far along the import is. At some point it seems to be done and I am astonished: a good 600 documents are stuck in the inbox. ecodms does not say why the template did not archive them automatically. It gets even more interesting when you look at the archived documents: I have not found a function for counting them, but ecodms assigns consecutive numbers, and the highest one I can find is just under 2500. A whopping 1000 documents have simply vanished. There are no error messages to be found. Since ecodms does not detect duplicates either, I refrain from repeating the import.

The OCR in ecodms is based on Tesseract. I last tested that in the middle of last year and rejected it; the quality of the results was anything but convincing. How does Tesseract fare with ecodms? Not much better: a few full-text searches return meagre results. This is particularly painful because the imported PDFs had already been run through OCR during scanning, with considerably better results.

So in the discipline of “importing legacy data” ecodms has not convinced me so far. Let's see how it fares in everyday use. I am still willing to give it a chance.

Dynamic protobuf

This article assumes some basic knowledge of protobuf and Qt.

Protobuf is a nice library for data serialization. It is fast and efficient. The API is ugly (which is not uncommon for Google products), but usable. I have played a bit with protobuf, aiming to replace a self-written serializer in a C++ Qt project. As usual, the difficulties start with deserialization (maybe that is why all these tools are named “serializers”; I am not aware of any product named “deserializer”). For my current project (a client-server application based on message passing) protobuf has two drawbacks:

No transport mechanism
No dynamic deserialization

The first point means protobuf does not define any method to send or receive serialized messages over the wire; it does not even define a way to send or receive a sequence of messages over one connection. That is not a major problem and can be fixed with a few lines of code.
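One common way to get those "few lines of code" is length-prefixed framing. The following is only a sketch under my own assumptions, not the code from the project: the names `frameMessage` and `extractFrames` are made up for illustration, the length is a 4-byte big-endian integer, and `std::string` stands in for whatever byte buffer the transport uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Prepend a 4-byte big-endian length to one serialized message.
std::string frameMessage(const std::string& payload) {
    assert(payload.size() <= 0xFFFFFFFFu); // the length must fit into 4 bytes
    const std::uint32_t n = static_cast<std::uint32_t>(payload.size());
    std::string out;
    out.push_back(static_cast<char>((n >> 24) & 0xFF));
    out.push_back(static_cast<char>((n >> 16) & 0xFF));
    out.push_back(static_cast<char>((n >> 8) & 0xFF));
    out.push_back(static_cast<char>(n & 0xFF));
    out += payload;
    return out;
}

// Remove every complete frame from the front of `buffer` and return the payloads.
// An incomplete trailing frame stays in `buffer` until more bytes arrive.
std::vector<std::string> extractFrames(std::string& buffer) {
    std::vector<std::string> messages;
    std::size_t pos = 0;
    while (buffer.size() - pos >= 4) {
        const auto* p = reinterpret_cast<const unsigned char*>(buffer.data() + pos);
        const std::uint32_t n = (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
                                (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
        if (buffer.size() - pos - 4 < n) break; // frame not complete yet
        messages.emplace_back(buffer, pos + 4, n); // copy n payload bytes
        pos += 4 + n;
    }
    buffer.erase(0, pos); // drop everything that was consumed
    return messages;
}
```

The receiver appends whatever arrives on the connection to one buffer and calls `extractFrames()` after each read, so several messages can share one connection regardless of how the bytes are chunked in transit.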
The second point is more important: the old mechanism generated a Qt-signal from the receiving function with a signature like this:

signals:
  void gotMessage(const QVariant& v);

The current message was wrapped into a QVariant. Our old serializer (we used our own message file compiler comparable to the protoc compiler) generated the necessary wrapper code and the right qRegisterMetaType() calls, so each message got a unique type-id. For the wire transfer we used a QDataStream, first sending the type-id and then the serialized message. The receiving side was able to create the right type from this id (wrapped into a QVariant) and fill it from the serialized message. Each client (in Qt speak: the slot connected to the signal above) was able to restore the original type:

void handleMessage(const QVariant& v) {
  if (v.userType() == ExpectedMessage::typeId) {
    // QVariant::value() needs the target type as template argument
    ExpectedMessage msg = v.value<ExpectedMessage>();
    handleExpectedMessage(msg);
  }
}

v.userType() returns the type-id transferred over the wire, where ExpectedMessage::typeId is the “static” type-id returned from qRegisterMetaType(). So each module can register itself as a listener (in Qt: connect) for the gotMessage() signal, filter out the interesting message types and handle them.

This is what I mean by “dynamic deserialization”: feed the deserializer with some data and get back a parsed instance of the right message class. In protobuf each message declared in a .proto file becomes a C++ class derived from a basic Message class. That is the same mechanism we used in our implementation, which makes porting a bit easier. Protobuf does not guess the message type: to read a message, protobuf needs to know which kind of message it has to read. In an application one has to instantiate the right Message subclass and read the data:

ExpectedMessage msg;
msg.ParseFromArray(...);

ExpectedMessage is a class declared as a message in the .proto file and derived from google::protobuf::Message. But what I would like to have is:

Message *msg = magicallyReadTheMessage();
if (name(msg) == "expectedmessage") {
  ExpectedMessage *sub = dynamic_cast<ExpectedMessage*>(msg);
  handleSubclass(sub);
}

So is this possible with protobuf too? Fortunately yes, with two small tricks: we include the .proto file in our application and use some magic from the protoc compiler.

The first thing to do is to make the source code of the .proto file available at runtime. In the Qt world we just create a resource file and include the .proto file. (In a real-world application, where we create different executables for server and client, we put this resource and all the message handling code into a library and link it to both client and server.) Now we read the .proto file at startup and generate a type-id for each message, much as qRegisterMetaType() did in the old version:

QFile data(":/demo.proto");
if (!data.open(QIODevice::ReadOnly | QIODevice::Text)) {
  qFatal("cannot read proto resource file");
  return;
}
QByteArray protoText = data.readAll();

Now we have a string with our messages and can parse it (credit goes to https://cxwangyi.wordpress.com/2010/06/29/google-protocol-buffers-online-parsing-of-proto-file-and-related-data-files/):

using namespace google::protobuf;
using namespace google::protobuf::io;
using namespace google::protobuf::compiler;

FileDescriptorProto file_desc_proto;

ArrayInputStream proto_input_stream(protoText.data(), protoText.size());
Tokenizer tokenizer(&proto_input_stream, NULL);
Parser parser;
if (!parser.Parse(&tokenizer, &file_desc_proto)) {
  qFatal("Cannot parse .proto file");
}

Now we can read all message types from file_desc_proto and generate a unique id for each message. In fact we push each message name onto a vector and use the index as id:

    using MessageNames = std::vector<std::string>;
    MessageNames messageNames;
    for(int i=0; i < file_desc_proto.message_type_size();++i) {
        const DescriptorProto& dp = file_desc_proto.message_type(i);
        qDebug() << i << dp.name().c_str();
        messageNames.push_back(dp.name());
    }

All this stuff goes into an initialization function which has to be called at startup. Note that I have ignored the package prefix (protobuf messages are declared inside a package and get names following the schema package.subpackage.MessageName).
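The name-to-id lookup that such a registry needs can be sketched with the standard library alone. This is an illustration under my own assumptions: the article's code calls a one-argument `findIndexOf()` that presumably works on a shared vector, while the version here takes the vector explicitly so it stays self-contained. `qualifiedName()` and the package name `demo` are hypothetical and only show how the ignored package prefix could be added.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

using MessageNames = std::vector<std::string>;

// Build the fully qualified name, e.g. package "demo" and message "Ping"
// give "demo.Ping". An empty package leaves the name unchanged.
std::string qualifiedName(const std::string& package, const std::string& name) {
    return package.empty() ? name : package + "." + name;
}

// Return the index (used as type-id) of `name` in `names`, or -1 if it is unknown.
MessageNames::difference_type findIndexOf(const MessageNames& names,
                                          const std::string& name) {
    const auto it = std::find(names.begin(), names.end(), name);
    return it == names.end() ? -1 : std::distance(names.begin(), it);
}
```

A linear search is fine for a handful of message types; with many types a `std::unordered_map` from name to index would be the obvious replacement.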

Sending a message is straightforward:

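void sendMsg(const Message& msg) {
    // findIndexOf() looks up a message with this name in the messageNames vector;
    // the id type must match the one recvMsg() reads back
    MessageNames::difference_type id = findIndexOf(msg.GetTypeName());
    QByteArray data(reinterpret_cast<const char*>(&id), sizeof(id));
    std::string s = msg.SerializeAsString();
    // pass the length explicitly: the serialized payload may contain '\0' bytes
    data.append(s.data(), static_cast<int>(s.size()));
    sendBuffer(data); // send a QByteArray however you like...
}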

The corresponding receive method may look like this:

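void recvMsg(const QByteArray &data) {
  MessageNames::difference_type idx;
  if (data.size() < static_cast<int>(sizeof(idx))) return; // too short to hold a type-id
  std::memcpy(&idx, data.constData(), sizeof(idx)); // read the whole id, not only its first byte (needs <cstring>)
  if (idx < 0 || idx >= static_cast<MessageNames::difference_type>(messageNames.size())) return; // unknown type-id
  const std::string& name = messageNames[idx];
  const google::protobuf::Descriptor *desc = google::protobuf::DescriptorPool::generated_pool()->FindMessageTypeByName(name);
  if (!desc) return; // name is not known to the generated pool
  const google::protobuf::Message *protoMsg = google::protobuf::MessageFactory::generated_factory()->GetPrototype(desc);
  google::protobuf::Message* resultMsg = protoMsg->New();
  // only notify the listeners if the payload could be parsed
  if (resultMsg->ParseFromArray(data.constData()+sizeof(idx), data.size()-sizeof(idx)))
    emit handleReceivedMsg(*resultMsg);
  delete resultMsg;
}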

After receiving the message type id and looking up the message name in our messageNames vector, we look up a Descriptor for this message and obtain a prototype of this message type by calling GetPrototype(). The protoMsg is already an instance of the right subclass. Each protobuf message has a virtual New() method that creates a fresh instance of its type. We use this to create our own instance and fill it with the received data. Finally we inform all our listening clients about the new message (in Qt speak: emit a signal).
One remaining drawback is that we need to delete the message after processing it, so we cannot use this signal in queued connections. In my case this is not a problem.

The client (Qt: the receiving slots) can now filter for the right messages:

void handleReceivedMsg(const Message& msg) {
  if (msg.GetTypeName() == "TheRightMessage") { // name as declared in the .proto file, plus the package prefix if there is one
    const TheRightMessage& trm = dynamic_cast<const TheRightMessage&>(msg);
    handleTheRightMessage(trm);
  }
}

This code looks very similar to the old code we started with.

Another nice feature of the inclusion of the .proto file is the possibility to create a hash of the contained .proto messages and send it to the server on connect. So we can ensure that both client and server use a compatible .proto declaration.
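Such a compatibility check could look like the following sketch. It is an assumption on my part, not the project's code: I use the 64-bit FNV-1a hash because it needs no library (in a Qt application `QCryptographicHash` would be a natural alternative), and the names `protoFingerprint` and `isCompatible` are invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// 64-bit FNV-1a hash over the raw bytes of the .proto text.
std::uint64_t protoFingerprint(const std::string& protoText) {
    std::uint64_t hash = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char c : protoText) {
        hash ^= c;
        hash *= 1099511628211ULL; // FNV prime, wraps around modulo 2^64
    }
    return hash;
}

// Server side: compare the fingerprint sent by the client on connect
// with the fingerprint of the server's own .proto text.
bool isCompatible(std::uint64_t clientFingerprint, const std::string& serverProtoText) {
    return clientFingerprint == protoFingerprint(serverProtoText);
}
```

Hashing the raw text is strict: even a changed comment or whitespace yields a different fingerprint. That errs on the safe side, but it rejects declarations that would in fact still be wire compatible.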

Of course one could send the message names (instead of the type-ids). But sending long strings instead of short integers is a waste of bandwidth, and receiving a fixed-size integer is much easier.
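The send and receive functions above copy the id as raw memory, so its width and byte order depend on the platform. If client and server may run on different architectures, a fixed wire format is safer. The sketch below is one possible variant and rests on my own assumptions: a 4-byte big-endian id, and the helper names `encodeTypeId` and `decodeTypeId`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::size_t kTypeIdSize = 4; // bytes on the wire, independent of the platform

// Encode a type-id as exactly four bytes, most significant byte first.
std::string encodeTypeId(std::uint32_t id) {
    std::string out(kTypeIdSize, '\0');
    for (std::size_t i = 0; i < kTypeIdSize; ++i) {
        out[i] = static_cast<char>((id >> (8 * (kTypeIdSize - 1 - i))) & 0xFF);
    }
    return out;
}

// Read the type-id from the first four bytes of `data`.
// Returns false if `data` is too short to contain one.
bool decodeTypeId(const std::string& data, std::uint32_t& id) {
    if (data.size() < kTypeIdSize) return false;
    id = 0;
    for (std::size_t i = 0; i < kTypeIdSize; ++i) {
        id = (id << 8) | static_cast<unsigned char>(data[i]);
    }
    return true;
}
```

The receiver always reads `kTypeIdSize` bytes first and treats the rest of the buffer as the serialized message.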

As the size of each protobuf message may differ (protobuf puts a lot of effort into reducing the size of messages), I consider it a good idea to add a sentinel to each message.
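A minimal sketch of such a sentinel, assuming it is a fixed marker appended after the serialized message; the four marker bytes and the function names are arbitrary choices for illustration. A receiver that does not find the marker at the end of a message knows that sender and receiver have lost synchronization.

```cpp
#include <cassert>
#include <string>

// Arbitrary 4-byte marker that terminates every message on the wire.
const std::string kSentinel("\xDE\xAD\xBE\xEF", 4);

// Append the sentinel to a serialized message.
std::string withSentinel(const std::string& payload) {
    return payload + kSentinel;
}

// Check that `frame` ends with the sentinel and remove it.
// Returns false and leaves `frame` untouched if the sentinel is missing.
bool stripSentinel(std::string& frame) {
    if (frame.size() < kSentinel.size()) return false;
    const std::string::size_type start = frame.size() - kSentinel.size();
    if (frame.compare(start, kSentinel.size(), kSentinel) != 0) return false;
    frame.erase(start);
    return true;
}
```

A sentinel alone cannot delimit messages, because the marker bytes may also occur inside a payload; it works as a consistency check on top of a known message length.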

A short demo (using byte buffers to simplify the message transfer part) can be found at https://github.com/valpo/protodemo.