Friday, May 6, 2011

Best practices for iterating over MASSIVE CSV files in PHP ...

Ok, I'll try and keep this short, sweet and to-the-point.

We do massive GeoIP updates to our system by uploading a MASSIVE CSV file to our PHP-based CMS. This thing usually has more than 100k records of IP address information. Now, doing a simple import of this data isn't an issue at all, but we have to run checks against our current regional IP address mappings.

This means that we must validate the data, compare and split overlapping IP addresses, etc. And these checks must be made for each and every record.

Not only that, but I've just created a field mapping solution that would allow other vendors to implement their GeoIP updates in different formats. This is done by applying rules to IP records within the CSV update.

For instance, a rule might look like:

if 'countryName' == 'Australia' then send to the 'Australian IP Pool'

There might be multiple rules that have to be run, and every rule must be applied to each IP record. For instance, 100k records checked against 10 rules would be 1 million iterations; not fun.
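
To give an idea of the shape of the loop, here is a stripped-down sketch of the per-record rule pass (the rule array and the assignToPool() stub are simplified stand-ins, not our actual CMS code):

    <?php
    // Simplified stand-in for the per-record rule pass described above.
    function assignToPool(array $record, $pool) {
        // placeholder: the real CMS queues the record for the named pool
    }

    $rules = array(
        array('field' => 'countryName', 'equals' => 'Australia', 'pool' => 'Australian IP Pool'),
        array('field' => 'countryName', 'equals' => 'Canada',    'pool' => 'Canadian IP Pool'),
    );

    $handle = fopen('geoip_update.csv', 'r');
    $header = fgetcsv($handle);                  // first row holds the column names

    while (($row = fgetcsv($handle)) !== false) {
        $record = array_combine($header, $row);  // map column names onto values

        // every rule runs against every record -- this is the 100k x N problem
        foreach ($rules as $rule) {
            if ($record[$rule['field']] === $rule['equals']) {
                assignToPool($record, $rule['pool']);
            }
        }
    }
    fclose($handle);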

We're finding 2 rules for 100k records takes up to 10 minutes to process. I'm fully aware that the bottleneck here is the sheer number of iterations that must occur for a successful import; I'm just not fully aware of any other options we may have to speed things up a bit.

Someone recommended splitting the file into chunks, server-side. I don't think this is a viable solution as it adds yet another layer of complexity to an already complex system. The file would have to be opened, parsed and split. Then the script would have to iterate over the chunks as well.

So, question is, considering what I just wrote, what would the BEST method be to speed this process up a bit? Upgrading the server's hardware JUST for this tool isn't an option unfortunately, but they're pretty high-end boxes to begin with.

Not as short as I thought, but yeah. Halps? :(

From stackoverflow
  • Perform a BULK IMPORT into a database (SQL Server is what I use). The BULK IMPORT literally takes seconds, and 100,000 records is peanuts for a database to crunch business rules on. I regularly perform similar data crunches on a table with over 4 million rows, and it doesn't take the 10 minutes you listed.

    EDIT: I should point out, yeah, I don't recommend PHP for this. You're dealing with raw DATA; use a DATABASE. :P (Rough sketch below.)

    ceejayoz : Bingo. Great feature!
    Some Canuck : What a bummer -- I never did get an "accepted answer" on this.
    Wilhelm Murdoch : My fault, buddy! :)
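
    (For reference, a rough sketch of what the BULK INSERT route could look like when driven from the PHP side via the pdo_sqlsrv driver; the table, file path and rule UPDATE below are made up for illustration, and the SQL Server account needs bulk-load rights plus read access to the file. The point is that the heavy lifting happens as set-based SQL, not in a PHP loop.)

        <?php
        // Rough sketch: stage the raw CSV with BULK INSERT, then apply each
        // rule as one set-based statement instead of 100k per-row checks.
        // All names and paths here are illustrative.
        $pdo = new PDO('sqlsrv:Server=localhost;Database=geoip', 'user', 'pass');

        $pdo->exec("BULK INSERT dbo.geoip_staging "
                 . "FROM 'C:\\imports\\geoip_update.csv' "
                 . "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)");

        // A rule like the Australia example then becomes a single statement:
        $pdo->exec("UPDATE dbo.geoip_staging
                    SET pool = 'Australian IP Pool'
                    WHERE countryName = 'Australia'");
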
  • One thing you can try is running the CSV import under command-line PHP. It generally provides faster results.

  • The simple key to this is keeping as much work out of the inner loop as possible.

    Simply put, anything you do in the inner loop is done "100K times", so doing nothing is best (but certainly not practical), and doing as little as possible is the next best bet.

    If you have the memory, for example, and it's practical for the application, defer any "output" until after the main processing. Cache any input data if practical as well. This works best for summary data or occasional data.

    Ideally, save for the reading of the CSV file, do as little I/O as possible during the main processing.

    Does PHP offer any access to the Unix mmap facility? That is typically the fastest way to read files, particularly large files.

    Another consideration is to batch your inserts. For example, it's straightforward to build up your INSERT statements as simple strings, and ship them to the server in blocks of 10, 50, or 100 rows. Most databases have some hard limit on the size of the SQL statement (like 64K, or something), so you'll need to keep that in mind. This will dramatically reduce your round trips to the DB.

    If you're creating primary keys through simple increments, do that en masse (in blocks of 1000, 10000, whatever). This is another thing you can remove from your inner loop.

    And, for sure, you should be processing all of the rules at once for each row, not running the records through once for each rule.
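
    A sketch of that batching idea, using mysqli against a made-up ip_pool table (the column names and the per-row checks are placeholders):

        <?php
        // Collect rows and flush one multi-row INSERT per block instead of
        // one round trip per record. Table/column names are illustrative.
        function flush_batch(mysqli $db, array &$values) {
            if (empty($values)) {
                return;
            }
            $db->query('INSERT INTO ip_pool (ip_start, ip_end, country) VALUES '
                     . implode(',', $values));
            $values = array();                 // reset for the next block
        }

        $db        = new mysqli('localhost', 'user', 'pass', 'geoip');
        $batchSize = 100;
        $values    = array();

        $handle = fopen('geoip_update.csv', 'r');
        fgetcsv($handle);                      // skip the header row

        while (($row = fgetcsv($handle)) !== false) {
            // ... validation and all rule checks happen here, once per row ...
            $values[] = sprintf("('%s','%s','%s')",
                $db->real_escape_string($row[0]),
                $db->real_escape_string($row[1]),
                $db->real_escape_string($row[2]));

            if (count($values) >= $batchSize) {
                flush_batch($db, $values);
            }
        }
        flush_batch($db, $values);             // don't forget the final partial block
        fclose($handle);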

  • If you are using PHP to do this job, switch the parsing to Python, since it is WAY faster than PHP at this sort of thing; that switch alone should speed up the process by 75% or even more.

    If you are using MySQL you can also use the LOAD DATA INFILE operator; I'm not sure if you need to check the data before you insert it into the database, though.
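
    A sketch of the LOAD DATA INFILE route from PHP, loading into a staging table and then applying a rule as a single UPDATE (table and column names are made up, and local_infile has to be enabled on both the client and the server):

        <?php
        // Load the raw CSV into a staging table, then run rules as set-based SQL.
        // Names are illustrative; requires local_infile to be enabled.
        $db = new mysqli('localhost', 'user', 'pass', 'geoip');

        $db->query("LOAD DATA LOCAL INFILE '/tmp/geoip_update.csv'
                    INTO TABLE geoip_staging
                    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
                    LINES TERMINATED BY '\\n'
                    IGNORE 1 LINES
                    (ip_start, ip_end, countryName)");

        // The Australia rule from the question then becomes one statement:
        $db->query("UPDATE geoip_staging
                    SET pool = 'Australian IP Pool'
                    WHERE countryName = 'Australia'");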

  • Have worked on this problem intensively for a while now. And yes, the better solution is to only read in a portion of the file at any one time, parse it, do validation, do filtering, then export it, and then read the next portion of the file. I would agree that this is probably not a job for PHP, although you can do it in PHP as long as you have a seek function, so that you can start reading from a particular location in the file. You are right that it does add a higher level of complexity, but it's worth that little extra effort.

    If your data is clean, i.e. delimited correctly, string-qualified, free of broken lines, etc., then by all means bulk upload into a SQL database. Otherwise you want to know where, when and why errors occur, and to be able to handle them.
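
    A sketch of that chunked approach in PHP, using ftell/fseek to remember where the last slice ended (process_row() is a stand-in for the validation and rule logic):

        <?php
        // Process the file one slice at a time; return the byte offset to
        // resume from, or false when the whole file has been consumed.
        function process_chunk($path, $offset, $maxRows) {
            $handle = fopen($path, 'r');
            fseek($handle, $offset);               // resume where the last chunk ended
            if ($offset === 0) {
                fgetcsv($handle);                  // skip the header row on the first chunk
            }

            $rows = 0;
            while ($rows < $maxRows && ($row = fgetcsv($handle)) !== false) {
                process_row($row);                 // validation, overlap checks, rules...
                $rows++;
            }

            $next = feof($handle) ? false : ftell($handle);
            fclose($handle);
            return $next;
        }

        function process_row(array $row) {
            // placeholder for the real per-record work
        }

        // Drive it in a loop here, or re-invoke the script per chunk.
        $offset = 0;
        do {
            $offset = process_chunk('geoip_update.csv', $offset, 10000);
        } while ($offset !== false);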

  • I may be too late, but have you considered writing an application in a natively compiled language to act as a backend? Furthermore, I haven't seen your code, but you are evidently doing something wrong in how you manage your data if it takes that long.

  • 100k records isn't a large number. 10 minutes isn't a bad processing time for a single thread. The amount of raw work to be done in a straight line is probably about 10 minutes, regardless of whether you're using PHP or C. If you want it to be faster, you're going to need a more complex solution than a while loop.

    Here's how I would tackle it:

    1. Use a map/reduce solution to run the process in parallel. Hadoop is probably overkill. Pig Latin may do the job. You really just want the map part of the map/reduce problem, i.e. you're forking off a chunk of the file to be processed by a sub-process. Your reducer is probably cat. A simple version could be having PHP fork a process for each 10K-record chunk, wait for the children, then re-assemble their output (see the sketch after this list).
    2. Use a queue/grid processing model. Queue up chunks of the file, then have a cluster of machines checking in, grabbing jobs and sending the data somewhere. This is very similar to the map/reduce model, just using different technologies, plus you could scale by adding more machines to the grid.
    3. If you can write your logic as SQL, do it in a database. I would avoid this because most web programmers can't work with SQL on this level. Also, SQL is sort of limited for doing things like RBL checks or ARIN lookups.
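
    A crude sketch of option 1 under CLI PHP with the pcntl extension (chunk sizes, file names and the process_lines() stub are illustrative, and it assumes one record per line):

        <?php
        // Fork one worker per 10K-record chunk; each child writes its results
        // to its own file, the parent waits, and a final pass (or plain `cat`)
        // re-assembles the pieces. Requires CLI PHP with pcntl.
        function process_lines($file, $skip, $count, $outFile) {
            $in  = fopen($file, 'r');
            $out = fopen($outFile, 'w');
            for ($i = 0; $i < $skip + 1 && fgets($in) !== false; $i++) {
                // skip the header plus the rows owned by earlier chunks
            }
            for ($i = 0; $i < $count && ($row = fgetcsv($in)) !== false; $i++) {
                // ... validation / rule checks for this record would go here ...
                fputcsv($out, $row);
            }
            fclose($in);
            fclose($out);
        }

        $file      = 'geoip_update.csv';
        $totalRows = 100000;                   // or count the lines up front
        $chunkSize = 10000;
        $children  = array();

        for ($start = 0; $start < $totalRows; $start += $chunkSize) {
            $pid = pcntl_fork();
            if ($pid === -1) {
                die("fork failed\n");
            }
            if ($pid === 0) {                  // child: handle one chunk, then exit
                process_lines($file, $start, $chunkSize, "out_{$start}.csv");
                exit(0);
            }
            $children[] = $pid;                // parent: keep forking
        }

        foreach ($children as $pid) {
            pcntl_waitpid($pid, $status);      // wait for every worker to finish
        }
        // re-assemble out_*.csv however suits the import step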
