Thursday, March 31, 2011

Parsing huge data with C++

In my job, I need to parse different kinds of data files from different data sources. Sometimes I parse them by writing C++ code directly (with the help of Qt and Boost :D), sometimes manually with a helper program. I must note that the data types are so different from each other that it is hard to create a common interface for all of them. But I want to do this job in a more generic way. I am planning to write a library to convert them, and it should be easy to add new parsers in the future. I am also planning to call the other helper programs from inside my program instead of running them manually. My question is: what kind of architecture or pattern do you suggest? The basic condition is that the library must be extensible via new classes or DLLs, and also configurable. By the way, the data can be in text, ASCII, or something like CSV (comma-separated values), and most of the formats are specific to a certain kind of data.

From Stack Overflow
  • Not to blow my own trumpet, but my small open-source utility CSVfix has an extensible architecture based on deriving new C++ classes with a very simple interface. I did consider using a plugin architecture with DLLs, but it seemed like overkill for such a simple utility. If you are interested, you can get the binaries & sources here.

  • I'd suggest a 3-part model, where the common data-format is a String which should be able to contain every value:

    • Reader: In this layer the values are read from the source (e.g. a CSV file) using some sort of file-format descriptor. The values are then stored in some sort of intermediate data structure.
    • Connector/Converter: This layer is responsible for mapping the reader data to the writer fields.
    • Writer: This layer is responsible for writing a specific data structure to the target (e.g. another file format or a database).

    This way you can write different Readers for different input files.

    I think the hardest part would be creating the definition of the intermediate storage format/structure so that it is future-proof and flexible.
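
    The three layers above could be sketched roughly as follows. This is only a minimal illustration, not code from the answer: the names Record, SimpleCsvReader, FieldMapper, KeyValueWriter and convertStream are all made up here, the CSV handling assumes a header row and no quoted fields, and the intermediate structure is simply a map from field name to string value.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Intermediate format (assumption): one record maps field names to string values.
using Record = std::map<std::string, std::string>;

// Reader layer: turns a source format into intermediate records.
class Reader {
public:
    virtual ~Reader() = default;
    virtual std::vector<Record> read(std::istream& in) = 0;
};

// Writer layer: turns intermediate records into a target format.
class Writer {
public:
    virtual ~Writer() = default;
    virtual void write(const std::vector<Record>& records, std::ostream& out) = 0;
};

// Hypothetical reader: CSV with a header row, no quoting or escaping.
class SimpleCsvReader : public Reader {
public:
    std::vector<Record> read(std::istream& in) override {
        std::vector<Record> records;
        std::string line;
        if (!std::getline(in, line)) return records;   // empty input
        const std::vector<std::string> header = split(line);
        while (std::getline(in, line)) {
            if (line.empty()) continue;
            const std::vector<std::string> cells = split(line);
            Record rec;
            for (std::size_t i = 0; i < header.size() && i < cells.size(); ++i)
                rec[header[i]] = cells[i];
            records.push_back(rec);
        }
        return records;
    }

private:
    static std::vector<std::string> split(const std::string& line) {
        std::vector<std::string> cells;
        std::stringstream ss(line);
        std::string cell;
        while (std::getline(ss, cell, ',')) cells.push_back(cell);
        return cells;
    }
};

// Connector/Converter layer: renames reader fields to writer fields.
// Fields that are not listed in the mapping are dropped.
class FieldMapper {
public:
    explicit FieldMapper(std::map<std::string, std::string> mapping)
        : mapping_(std::move(mapping)) {}

    Record convert(const Record& in) const {
        Record out;
        for (const auto& entry : mapping_) {
            const auto it = in.find(entry.first);
            if (it != in.end()) out[entry.second] = it->second;
        }
        return out;
    }

private:
    std::map<std::string, std::string> mapping_;   // reader field -> writer field
};

// Hypothetical writer: one "field=value" line per field, in key order.
class KeyValueWriter : public Writer {
public:
    void write(const std::vector<Record>& records, std::ostream& out) override {
        for (const Record& rec : records)
            for (const auto& field : rec)
                out << field.first << '=' << field.second << '\n';
    }
};

// Glue: any Reader can be combined with any Writer through the mapper.
void convertStream(Reader& reader, const FieldMapper& mapper, Writer& writer,
                   std::istream& in, std::ostream& out) {
    std::vector<Record> mapped;
    for (const Record& rec : reader.read(in)) mapped.push_back(mapper.convert(rec));
    writer.write(mapped, out);
}
```

    Supporting a new input file type then means adding one more Reader subclass; the mapper and the writers stay untouched.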

  • One method I used for defining the data structure in my datafile read/write classes is to use std::map<std::string, std::vector<std::string>, string_compare>, where the key is the variable name and the vector of strings is the data. While this is expensive in memory, it does not lock me down to only numeric data. This method also allows for different lengths of data within the same file.

    I had the base class implement this generic storage, while the derived classes implemented the reader/writer capability. I then used a factory to get to the desired handler, using another class that determined the file format.
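
    A rough sketch of that arrangement might look like the code below. It is an illustration under several assumptions, not the answerer's actual code: string_compare is not defined in the answer, so a case-insensitive ordering is assumed here, and DataFile, KeyListFile, CsvFile, FormatDetector and makeHandler are hypothetical names, as are the two toy file formats.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <istream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// ASSUMPTION: the answer does not show string_compare; a case-insensitive
// ordering is used here purely for illustration.
struct string_compare {
    bool operator()(const std::string& a, const std::string& b) const {
        return std::lexicographical_compare(
            a.begin(), a.end(), b.begin(), b.end(),
            [](unsigned char x, unsigned char y) {
                return std::tolower(x) < std::tolower(y);
            });
    }
};

// Base class: owns the generic storage (variable name -> column of strings).
class DataFile {
public:
    using Storage = std::map<std::string, std::vector<std::string>, string_compare>;
    virtual ~DataFile() = default;
    virtual void read(std::istream& in) = 0;
    bool has(const std::string& name) const { return data_.count(name) != 0; }
    // Throws std::out_of_range if the variable does not exist.
    const std::vector<std::string>& column(const std::string& name) const {
        return data_.at(name);
    }
protected:
    Storage data_;
};

// Hypothetical format: "name: v1 v2 v3" per line; columns may differ in length.
class KeyListFile : public DataFile {
public:
    void read(std::istream& in) override {
        std::string line;
        while (std::getline(in, line)) {
            const std::string::size_type colon = line.find(':');
            if (colon == std::string::npos) continue;   // skip malformed lines
            std::vector<std::string>& col = data_[line.substr(0, colon)];
            std::istringstream values(line.substr(colon + 1));
            std::string v;
            while (values >> v) col.push_back(v);
        }
    }
};

// Hypothetical format: CSV with a header row, no quoting or escaping.
class CsvFile : public DataFile {
public:
    void read(std::istream& in) override {
        std::string line;
        if (!std::getline(in, line)) return;
        const std::vector<std::string> header = split(line);
        for (const std::string& name : header) data_[name];   // create empty columns
        while (std::getline(in, line)) {
            const std::vector<std::string> cells = split(line);
            for (std::size_t i = 0; i < header.size() && i < cells.size(); ++i)
                data_[header[i]].push_back(cells[i]);
        }
    }
private:
    static std::vector<std::string> split(const std::string& line) {
        std::vector<std::string> cells;
        std::stringstream ss(line);
        std::string cell;
        while (std::getline(ss, cell, ',')) cells.push_back(cell);
        return cells;
    }
};

enum class Format { KeyList, Csv };

// Separate class that decides the file format (toy heuristic: a comma in the
// first line means CSV).
struct FormatDetector {
    static Format detect(const std::string& firstLine) {
        return firstLine.find(',') != std::string::npos ? Format::Csv : Format::KeyList;
    }
};

// Factory: maps a detected format to the matching handler.
std::unique_ptr<DataFile> makeHandler(Format format) {
    switch (format) {
        case Format::Csv:     return std::unique_ptr<DataFile>(new CsvFile);
        case Format::KeyList: return std::unique_ptr<DataFile>(new KeyListFile);
    }
    return nullptr;
}
```

    A caller would peek at the first line, pass it to FormatDetector::detect, ask makeHandler for the handler, and then work only with the DataFile interface. Adding a format means one new derived class plus one more case in the factory.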
