ESOS HA Options

We have three supported methods for achieving high availability in ESOS dual-head configurations:
- Replication using DRBD
- Linux md-cluster RAID1 (mirror) arrays
- Broadcom/Avago/LSI Syncro hardware RAID controllers
These three choices would be used for the underlying back-end storage technology. You can then run Logical Volume Manager (LVM) on top of these back-end storage devices to provision logical storage devices.
Why Limited Choices

Why isn't ZFS, or bcache, or technology X supported for dual-head (HA) ESOS arrays? It's simple: we have to be able to open and access the underlying back-end storage device concurrently from more than one node. You can't mount a standard Linux file system (e.g., XFS) on more than one node at a time. You can't activate a ZFS storage pool on more than one node at a time.
"Well, vendor X supports running ZFS in a cluster/HA environment, why can't ESOS do it?" Right, we could do that, but those solutions trick the initiators and/or present a faux block device on the non-active/non-primary node, and they use an ALUA policy that tells the initiators that they "shouldn't" read or write data on a path where the target is not in an accessible mode. But what if they do? SCST is supposed to deny these requests and make them fail at that level, but what if the ALUA states are wrong for an instant, and a write is accepted on a dummy device but is actually discarded? Corruption, that's what happens. In ESOS we're playing it safe. In the SCST project there is work being done to better support those types of active/passive back-end storage devices, but it's not ready yet. So will we support things like ZFS in a cluster eventually? Probably, yes. But again, for now, we keep the choices limited for performance and safety reasons.
Replication via DRBD

This HA method requires two (2) completely "independent" storage servers. As an example, let's say we want a 4U 72-drive SAS storage server filled with 2 TB drives. You would use two of these, and data would be replicated synchronously between the two storage servers (each with a matching storage amount/configuration). So you can still make use of things like hardware RAID controllers, RAID5/6, or MD software RAID with HBAs, or whatever local storage configuration you like on the storage servers.
- Full redundancy, as you have two fully functional servers with all the storage in each; you could even lose a RAID array completely on one node, and the data would still be intact since the second node has the same storage
- Proven reliability with LINBIT's DRBD -- used in data centers throughout the world for critical high-end projects
- Could even satisfy some minor DR/BC requirements if you put the storage nodes in different rooms or buildings (depending on the connectivity needed)
- Can hamper performance, especially random I/O (IOPS) performance, as every write has to be synchronized on both nodes
- Requires a full matching (recommended) set of hardware, which costs more
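As a rough sketch, a dual-primary DRBD resource definition for this kind of setup might look like the following. The resource name, hostnames, backing devices, and addresses here are all illustrative placeholders, and option placement varies between DRBD versions, so consult the drbd.conf documentation for your release:

```
# /etc/drbd.d/r0.res -- hypothetical example; adjust names, devices,
# and addresses for your environment
resource r0 {
    net {
        protocol C;              # synchronous replication
        allow-two-primaries;     # both nodes may open the device
    }
    on esos-node1 {
        device    /dev/drbd0;
        disk      /dev/sda;      # local backing RAID volume
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on esos-node2 {
        device    /dev/drbd0;
        disk      /dev/sda;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}
```

With a resource like this, the DRBD device (/dev/drbd0 here) becomes the back-end storage device that both ESOS nodes can access, and on which you would layer LVM.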
Linux md-cluster RAID1 Arrays

This is the popular Linux software RAID, but with support for opening/accessing (running) an MD RAID1 array from more than one node concurrently. It's a relatively new addition to the Linux kernel, but has been deemed stable since Linux 4.9.x (what we use in ESOS). Parity levels (e.g., RAID5 or RAID6) are not supported with md-cluster -- only RAID1 (mirroring). But RAID1 is the highest performing option for redundancy, and it can be combined with LVM striped logical devices to create a RAID10 configuration. The hardware options for setups employing md-cluster RAID1 arrays would be external SAS enclosures that support dual I/O controllers, using dual-domain SAS disks (HDD or SSD). There are also dual-head SAS CiB (cluster-in-a-box) options which put a shared set of SAS disks and two server nodes in a single chassis -- very slick. And for the extreme HA high-performance option, there are NVMe CiB units (using dual-port NVMe drives).
- Using md-cluster RAID1 is the best option for performance; there is some overhead for writes, but it's tolerable -- reads get the performance of two drives, and writes the performance of a single drive
- Simple configuration, and uses the well-known and highly regarded Linux MD software RAID stack
- Makes use of dual-port NVMe drives, dual-domain SAS drives, and dual I/O module SAS enclosures which can even be daisy-chained to get a lot of storage out of one set of dual-head ESOS storage controllers
- It's RAID1 (mirroring), so you lose half of your storage, which can be expensive
- Requires the higher-end NVMe and SAS drives which are typically more costly
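Creating an md-cluster RAID1 array looks much like creating a normal MD array, with the addition of a clustered write-intent bitmap. The device names and array name below are placeholders, and the cluster infrastructure (corosync/dlm) must already be running on both nodes:

```shell
# Command sketch only -- run against your actual shared dual-port
# devices, with cluster services up on both nodes.

# On the first node, create the mirror with a clustered bitmap:
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      --bitmap=clustered --name=ha-mirror \
      /dev/disk/by-id/shared-disk-a /dev/disk/by-id/shared-disk-b

# On the second node, assemble the same array so both nodes
# can access it concurrently:
mdadm --assemble /dev/md0 \
      /dev/disk/by-id/shared-disk-a /dev/disk/by-id/shared-disk-b
```

The `--bitmap=clustered` option is what distinguishes an md-cluster array from a standard MD RAID1 array; without it, the array cannot safely be assembled on more than one node.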
The Syncro Hardware RAID Controller

Ah, the Syncro... it never even made it to its prime. This is a hardware SAS RAID controller "kit" which includes two MegaRAID-based RAID controller cards, one for each server. All of the communication between the two nodes for the HA portion is handled by the adapters themselves via SAS connectivity. Unfortunately, following the Avago/Broadcom merger, the Syncro line of products was killed off. It's such a simple solution, and works well for creating ESOS dual-head arrays, so if you are still able to find one of these controller kits, it'd be a worthwhile investment. Broadcom may not be producing/selling these kits, but they are still obligated to support them for at least the next few years. For the hardware used with these, you'd choose either a SAS CiB (cluster-in-a-box), or two storage servers each with a Syncro controller connected to dual I/O SAS enclosures with dual-domain SAS drives (SSD or HDD). The SAS enclosures can be daisy-chained, and each Syncro controller kit supports up to (120) devices.
- Simple hardware RAID solution that provides high availability -- just plug in the adapters, and provision MegaRAID virtual drives as usual
- Supports parity RAID levels (e.g., RAID5 or RAID6) as well as the traditional RAID0 / RAID1 levels, and even the nested RAID levels (e.g., RAID50, RAID10)
- Performance is great, although storage should only be accessed by one of the nodes (the owning node)
- The Syncro kit is somewhat expensive (~ $5,000) and requires the typically more costly dual-domain SAS drives
- Must use SAS drives (HDD or SSD) that are on the hardware compatibility list
- It's EOL'd and support is now limited, and no more units are being produced
Logical Devices and Active/Active Configurations

While the above HA methods in ESOS configure the underlying storage devices and make usable RAID volumes available to both ESOS storage nodes, they don't address creating (perhaps smaller) logical devices from those RAID or back-end storage devices. And of course, you don't need to create sub-volumes from the large storage devices; you could just map those storage devices directly to LUNs. But if you do want to create sub-volume logical storage devices, ESOS fully supports this using Logical Volume Manager (LVM).
Using LVM, you'll create physical volumes (PVs) from the DRBD or md-cluster RAID devices, and then create one or more volume groups (VGs). Finally, you'll create your logical volumes (LVs) on the LVM volume groups. These can be striped logical volumes, creating a RAID0 effect. This can be combined with either HA storage technology, which may increase performance by striping data across redundant storage devices.
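The PV/VG/LV steps above can be sketched as follows. The device names, volume group name, and sizes are hypothetical, and a clustered LVM setup also requires the appropriate cluster locking to be configured so both nodes see consistent metadata:

```shell
# Command sketch only -- /dev/md0 and /dev/md1 stand in for your
# HA back-end devices (md-cluster arrays or DRBD devices).

# Create physical volumes on the HA back-end devices:
pvcreate /dev/md0 /dev/md1

# Group them into a single volume group:
vgcreate vg_ha /dev/md0 /dev/md1

# Create a striped logical volume across both PVs; striping over
# two RAID1 mirrors yields a RAID10-style layout:
lvcreate --stripes 2 --stripesize 64K --size 500G \
         --name lv_lun0 vg_ha
```

The resulting logical volume (/dev/vg_ha/lv_lun0 in this sketch) is what you would then configure as an SCST device and map to LUNs.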
You can create an "active/active" storage configuration, where some devices are accessed primarily from one node, and another set of devices is primarily accessed from the opposing node. It's up to the user to "round-robin" the placement or mapping of these devices in different SCST device groups (which control the ALUA pathing policy). An active/active configuration would be good if you have matching hardware, with the same performance capability on each node.
You can also create an "active/passive" configuration, where all devices are primarily accessed from only one node. This type of setup would be good if you have asymmetrical nodes, where one node is faster or more capable performance-wise than the other.
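To make the ALUA idea concrete, here is a rough sketch of what one SCST device group might look like in an SCST configuration file. The device group, target group, device, and target names are all illustrative; in an active/active setup you would define a second device group with the active/nonoptimized roles reversed:

```
# Illustrative excerpt -- names and IQNs are placeholders
DEVICE_GROUP esos_dg1 {
        DEVICE lv_lun0

        # Paths through this node are preferred:
        TARGET_GROUP tg_node1 {
                group_id 1
                state active

                TARGET iqn.2018-01.esos.node1:tgt
        }

        # Paths through the other node are advertised as nonoptimized:
        TARGET_GROUP tg_node2 {
                group_id 2
                state nonoptimized

                TARGET iqn.2018-01.esos.node2:tgt
        }
}
```

The `state` values are what initiators see via ALUA and use to choose preferred paths; swapping `active` and `nonoptimized` between the two target groups in a second device group is how the "round-robin" placement described above is expressed.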