Thursday, March 31, 2011

Open-source production data for developers?

I'm building a website that will be an open-source, user-contributed content kind of thing, and I think if developers had access to nightly production SQL dumps, they'd be more likely to check out the code from GitHub and play with it.

In line with that idea, I'm considering either:

  • Not collecting private user information at all, using OpenID for accounts and making heavy use of memcache for things like session authentication.
  • Anonymizing sensitive data before publishing.

Sometimes I get carried away with "wouldn't it be cool if...?" ideas, so I'm hoping for a sanity check here. Any obvious flaws in either approach? Is this a sane idea?

From stackoverflow
  • Sounds like a pretty good idea. The one thing you have to be careful with, though, is security, since attackers will know the exact schema of your DB. That isn't impossible to deal with (most open source projects are in the same position), but you will need to put a little extra emphasis on security, since a potential SQL injection becomes much easier to exploit.

    Another thing is to make doubly sure that the sensitive data is anonymized. Also, some people may (wrongly) claim that their copyright on user-submitted content is being violated, so you may want to specify a CC license or something similar, just to make everything extra clear and prevent future headaches (even if you're right anyway).

    : Thanks for the response. Both are great points, and the CC license is a good idea.
    Jason Baker : If it's open source, hackers will know your database schema anyway. It just might take more work.
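
To illustrate the SQL injection point above: when the schema is public, the usual defence is to never build SQL by string concatenation and to pass user input as bound parameters instead. Here is a minimal sketch using Python's standard sqlite3 module. The articles table, its columns, and both function names are hypothetical, not taken from the asker's project.

```python
import sqlite3


def find_articles_by_author(conn: sqlite3.Connection, author: str) -> list:
    """Return (id, title) rows for one author using a bound parameter."""
    # The "?" placeholder sends `author` to SQLite as a value, so quotes or
    # SQL keywords inside it are never parsed as part of the statement.
    cursor = conn.execute(
        "SELECT id, title FROM articles WHERE author = ? ORDER BY id",
        (author,),
    )
    return cursor.fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, author TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (title, author) VALUES (?, ?)",
        [("Hello", "alice"), ("World", "bob")],
    )
    return conn
```

With this shape, an input such as alice' OR '1'='1 is simply treated as an odd author name that matches nothing, whether or not the attacker knows the table layout.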
  • Speaking generally, I think you should do both. Any private data you collect is simply a liability for you, and not just because you intend to publish your databases. The less you can collect, the better.

    By the same token, however, you probably realize that it is not just IDs and passwords that are sensitive. Remember the AOL search data leak? Or the Netflix Prize dataset release? Even without IDs, people managed to figure out the real identities of some of the accounts, simply by piecing together trails of user behavior and correlating them with data from other places. Some people are embarrassed by their search histories and their movie rentals. Go figure.

    Therefore, I think the general rule should be to collect as little as possible, and anonymize what is left. Even if you don't store the identity of the person corresponding to a certain account, you may want to scramble the record of what each login did, so that activity trails are harder to link together.

    On the other hand, there are some cases where you simply don't care about this kind of privacy. In Wikipedia, for example, pretty much everything you can do on the site is public anyway. At least, everything which gets recorded in the database. If the information is already available through the API, there is no point in hiding it in a database download.

    : Thanks mate. That's some good food for thought.
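
One way to "anonymize what is left" is to drop sensitive columns from the export and replace each login with a keyed pseudonym. The sketch below is only one possible approach, using Python's standard hmac and sqlite3 modules. The posts table, its columns, and the function names are hypothetical. A keyed hash is used because a plain hash of a guessable login can be reversed by hashing candidate names.

```python
import hashlib
import hmac
import sqlite3


def pseudonym(secret_key: bytes, login: str) -> str:
    """Map a login to a stable token that cannot be recomputed without the key."""
    digest = hmac.new(secret_key, login.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def export_public_rows(conn: sqlite3.Connection, secret_key: bytes) -> list:
    """Return (pseudonym, body) pairs; the email column is never exported."""
    rows = conn.execute(
        "SELECT login, email, body FROM posts ORDER BY id"
    ).fetchall()
    # Hypothetical schema: login and email are sensitive, body is public.
    return [(pseudonym(secret_key, login), body) for login, _email, body in rows]
```

The key has to stay out of the repository and out of the dump. Using a fresh key for every dump would stop pseudonyms from being linked across nightly exports, at the cost of developers losing continuity between dumps. Note that this does nothing about the behavioral re-identification described above: a pseudonym with a distinctive activity trail can still be matched against outside data.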
  • In addition to collecting less data and anonymizing the data you do collect, you could add a bit/flag for the users to select whether their data is included or not. You could make it a CC license flag to give users the warm'n'fuzzies while filling your need.

    : I like the idea of a CC license flag. Pretty cool. Thanks for the response.
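
The per-user flag could be applied at export time with a simple join. This is a rough sketch, assuming a hypothetical users table with a share_in_dump column and a posts table keyed by user_id. None of these names come from the original question.

```python
import sqlite3


def exportable_posts(conn: sqlite3.Connection) -> list:
    """Return (id, body) for posts whose author opted in to the public dump."""
    # Only rows whose owner set share_in_dump = 1 are selected; rows from
    # users who did not opt in are left out of the result entirely.
    return conn.execute(
        "SELECT p.id, p.body FROM posts AS p "
        "JOIN users AS u ON u.id = p.user_id "
        "WHERE u.share_in_dump = 1 "
        "ORDER BY p.id"
    ).fetchall()
```

Defaulting the flag to 0 would make inclusion opt-in, which seems the safer reading of the suggestion.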
