Saturday, January 29, 2011

Architecture for highly available MySQL with automatic failover in physically diverse locations

I have been researching high availability (HA) solutions for MySQL between data centers.

For servers located in the same physical environment, I have preferred dual master with heartbeat (floating VIP) using an active passive approach. The heartbeat is over both a serial connection as well as an ethernet connection.

Ultimately, my goal is to maintain this same level of availability but between data centers. I want to dynamically failover between both data centers without manual intervention and still maintain data integrity.

There would be BGP on top. Web clusters in both locations, which would have the potential to route to the databases between both sides. If the Internet connection went down on site 1, clients would route through site 2, to the Web cluster, and then to the database in site 1 if the link between both sites is still up.

With this scenario, due to the lack of physical link (serial) there is a more likely chance of split brain. If the WAN went down between both sites, the VIP would end up on both sites, where a variety of unpleasant scenarios could introduce desync.

Another potential issue I see is difficulty scaling this infrastructure to a third data center in the future.

The network layer is not a focus. The architecture is flexible at this stage. Again, my focus is a solution for maintaining data integrity as well as automatic failover with the MySQL databases. I would likely design the rest around this.

Can you recommend a proven solution for MySQL HA between two physically diverse sites?

Thank you for taking the time to read this. I look forward to reading your recommendations.

  • Note that you probably cannot use BGP, as the smallest routable block is 4k, a /22, good luck getting one. Probably a DNS based solution is needed.

    Martin : +1 for a dose of reality. You can use a well managed DNS service like UltraDNS and its site monitoring service "SiteBacker"to get you most of the way there.
    Warner : We already have BGP in place. This is outside of the scope of my question.
    Tom O'Connor : No, the smallest routable block is /24. Actually, no.. The smallest physically routable block is /28, but you're likely to get ignored by everyone. The smallest prefix that will get listened to is /24.
  • What about using a VLAN to tie all the servers at the two (or more) data centers together. You could then use CARP for automatic failover. Use database replication to keep everything in sync.

    If you own the data centers you can ensure each data center has multiple WAN uplinks.

    Warner : That was my first thought. Introducing layer 2 to such a degree would require a top down approach between both sites. Other server roles that have redundancy using LinuxHA would have to have similar implementations, such as the firewalls. Otherwise there would be routing issues. Ultimately, even with multiple WAN uplinks between both sites, my comfort level is substantially lower with that than it is with both serial and ethernet uplinks. That is more risk than I can tolerate. Moreover, it seems that there should be a more ideal solution.
    From Matt
  • Your first stage should be to upgrade your current HA solution to one that uses OpenAIS as the Cluster membership layer: this will give you a lot of flexibility, and given low latency links between sites, might be able to reach across. PaceMaker and RHEL Clustering support this.

    For automatic data center failover, you really need a third site to act as a tie-breaker, otherwise your sites will not be able to distinguish between inter-site routing problems and remote site failure. Microsoft has some surprisingly good web-casts covering the area:

    Windows Server 2008 multi-site clustering

    Obviously the exact technology doesn't map onto the Linux domain, but the concepts are the same.

    Warner : Thanks. I'll look into this.
    From Martin
  • You will face the "CAP" theorem problem. You cannot have consistency, availability and partition-tolerance at the same time.

    DRBD / MySQL HA relies on synchronous replication at the block device level. This is fine while both nodes are available, or if one suffers a temporary fault, is rebooted etc, then comes back. The problems start when you get a network partition.

    Network partitions are extremely likely when you're running at two datacentres. Essentially, neither party can distinguish a partition from the other node failing. The secondary node doesn't know whether it should take over (the primary has failed) or not (the link is gone).

    While your machines are in the same location, you can add a secondary channel of communication (typically a serial cable, or crossover ethernet) to get around this problem - so the secondary knows when the primary is GENUINELY down, and it's not a network partition.


    The next problem is performance. While DRBD can give decent** performance when your machines have a low-latency connection (e.g. gigabit ethernet - but some people use dedicated high speed networks), the more latency the network has, the longer it takes to commit a transaction***. This is because it needs to wait for the secondary server (when it's online) to acknowledge all the writes before saying "OK" to the app to ensure durability of writes.

    If you do this in different datacentres, you typically have several more milliseconds latency, even if they are close by.

    ** Still much slower than a decent local IO controller

    *** You cannot use MyISAM for a high availability DRBD system because it doesn't recover properly/ automatically from an unclean shutdown, which is required during a failover.

    Warner : I appreciate your time and thoughts. You described some of the issues I'm attempting to avoid very well. Ideally, I'd like to keep the advantages of active/passive dual master for maintenance and quick failover while minimizing the risk of data corruption. I'd think someone out there has found an acceptable solution.
    Matt Simmons : Indeed. Data does not want to be two places at once.
    From MarkR
  • Sorry this is yet another network aside, but a thought for down the road...

    For the split brain scenario you mentioned, you could have redundant links between to two sites as well to decrease the chance of this happening.

    Warner : I've been going back and forth on that. First, I wrote it off entirely as too risky. Now, I'm reconsidering. Realistically, data corruption risk with even two fully diversified paths is pretty high. It's on my short list right now.
  • Giving a correct answer might be hard depending on the amount of data you have, the amount of servers you want to fit this in, etc. That being said, my answer might not be one, or at least the one you are looking for.

    There is no proven solution for multiple site with MySQL. But there is solution that works. As some pointed out, yes DRDB does work fine but has its limit or possible issue depending of your setup.

    Will you ever need a third site (another datacenter)? If so, how much time and money will you have to do this?

    Considering each time you add a master/slave/dns server, backups, ... you add yourself a server to manage, what is your management capacity in terms of number of servers? If you can define this number, you may have to throw away some possible solutions and work towards those that will fit with your numbers so that management does not become a bottleneck.

    Considering datacenters don't go down often, multiple site means load balancing and some DNS hacking, is this going to be in the same datacenter? If so, if one datacenter goes down for whatever reason you will run into issue because a good part of your DNS and loadbalancing is going to be in this datacenter.

    So you may have to plan that split brain situation. For almsot each possible setup, the way to resolve a spit brain situation is different. Also, each solution takes X amount of time.
    It may also be far more easier to plan to use 3 datacenter from the start. I'm no MySQL expert but I've heard that in production it was easier to have 3 Masters than 2 if you ever run into issue.

    One thing that may help you is load balancing service offered by some networking vendor like Zeus, have a look here There is probably many more offering this kind of service. I am sure it comes at a price but sometimes lets you cut down on some other things.

    Good luck!

    Warner : Data is relatively small, all things considered. A couple hundred gigabytes for the sake of discussion. Third site, probably. If necessary, I'm willing to compromise the architecture for a better solution now and revisit later for a third. "Management bottleneck" or other administrative concerns are outside of the scope of the question. Redundancy will be in place for all production technologies. The focus here is MySQL.
    From Embreau
  • DRBD is not a recommended solution for remote data centers, since it requires bandwidth that can affect the speed of your database and replication. The recommended solution is Master - Master Replication. The only issue with this is that you auto increment fields need to be staggered.

    If you require a truly HA solution for MySQL you would have to go with MySQL Cluster because DRBD can not give you data integrity in case of failures.

    From cargom98
  • I've found blog posts about options available in MySQL and its pros and cons. http://mysqlha.blogspot.com/2010/04/consistency-across-wan.html

    From ms
  • Overcoming the lack of a serial cable is actually really easy, you use a thing from the dark ages called a modem - you have one at each end and then run Heartbeat over the PPP link. You can also use frame relay. Both methods will fix any worries you have with layer1/2 redundant paths.

    However that being said - DRBD running over any link with much more than about 300µs (note thats 0.3ms) latency becomes ridiculous very quickly.

    You would be better served by using standard MySQL replication, and LinuxHA over PPP & eth to do the fail overs.

    At least that is what I have done for clients in the past.

    Warner : Interesting idea. I've used dial-up as failover on a PtP before. While I don't think it would completely eliminate the CAP theorem issue, I do believe this could be supplemental to making split brain less likely to occur. Difficult to create the same level of confidence as created by a several foot direct physical connection.

0 comments:

Post a Comment