Working with RAID

2007-01-08

We’ve been working with a giant backup server with 8 disks in a complex series of RAID arrays.

Last week, two disks in the same 7-disk RAID 5 array failed within 24 hours of each other, causing the array to fail (and the whole server to stop responding).

When we rebooted the computer, /proc/mdstat reported the RAID in question as inactive, with two disks missing:

md2 : inactive hde[0] hdk3[6] hdi1[5] hda1[4] hdm[1] 1367569600 blocks

It was missing sda3 and hdc1.
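One way to double-check which members are absent is to compare the devices named on the mdstat line against the list you expect. The following is only a sketch: the function name missing_members is made up for illustration, and the expected member list is reconstructed from this post rather than read from the system.

```shell
# Sketch: print every expected member that does not appear on an mdstat line.
# Usage: missing_members "<mdstat line>" dev1 dev2 ...
missing_members() {
    line=$1
    shift
    # Keep only tokens shaped like "name[slot]" and strip the "[slot]" suffix.
    present=$(printf '%s\n' "$line" | tr ' ' '\n' |
        sed -n 's/^\([a-z0-9]*\)\[[0-9]*].*$/\1/p')
    for dev in "$@"; do
        printf '%s\n' "$present" | grep -qxF "$dev" || printf '%s\n' "$dev"
    done
}

# The line reported by /proc/mdstat above.
mdstat_line='md2 : inactive hde[0] hdk3[6] hdi1[5] hda1[4] hdm[1] 1367569600 blocks'

# Expected members (assumed from this post); prints the ones not listed.
missing_members "$mdstat_line" hde hdm sda3 hdc1 hda1 hdi1 hdk3
```

On a live system the line would come from /proc/mdstat itself, for example via grep '^md2 ' /proc/mdstat.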

Those disks had other partitions on the system that were working fine. We also ran a few read tests on the partitions in question, and they seemed to be fine. Hm.

We ran:

mdadm --examine /dev/sda3

The superblock on sda3 reported that sda3 itself was healthy, but that hdc1 was faulty and removed.

The same check on hdc1 reported that it and all the other disks were healthy.

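So it appears that hdc1 went down first, followed by sda3: hdc1's superblock stopped being updated once it dropped out of the array, so it never recorded any later failure.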
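The event counter in each superblock offers a second way to check the order of failure, since a member that drops out earlier stops being updated and ends up with a lower count. The sketch below assumes that mdadm --examine prints an "Events :" line, either as a plain number or in a "0.N" form depending on the metadata version. The helper name event_count is hypothetical, and the sample lines are illustrative input (using the count for sda3 that appears in the forced-assembly output further down), not output captured from this server.

```shell
# Sketch: read `mdadm --examine` output on stdin and print the event count.
# Handles both "Events : 36300" and "Events : 0.36300" (assumed formats).
event_count() {
    awk -F: '/Events/ {
        gsub(/[ \t]/, "", $2)           # drop the padding around the value
        n = split($2, parts, /[.]/)     # keep what follows the last dot, if any
        print parts[n]
        exit
    }'
}

# On the real system (needs root) one could compare members like this:
#   for dev in /dev/sda3 /dev/hdc1; do
#       printf '%s %s\n' "$dev" "$(mdadm --examine "$dev" | event_count)"
#   done

# Illustrative input only.
printf '         Events : 0.36300\n' | event_count
```

The member with the lowest count is the one that has been out of the array longest.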

We began the recovery with the --re-add option:

mdadm /dev/md2 --re-add /dev/hdc1
mdadm /dev/md2 --re-add /dev/sda3

Then, we tried to bring it up again:

0 root@iz:~# mdadm --assemble /dev/md2 --uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: device /dev/md2 already active - cannot assemble it
1 root@iz:~#

Whoops. Stop the array first:

0 root@iz:~# mdadm --stop /dev/md2
mdadm: stopped /dev/md2
0 root@iz:~#

Then, try again:

0 root@iz:~# mdadm --assemble /dev/md2 --uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: /dev/md2 assembled from 5 drives - not enough to start the array.
1 root@iz:~#
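
The "not enough to start the array" message follows from how RAID 5 works: it stores one disk's worth of parity, so an array of N members can run degraded with N-1 of them but not with fewer. A minimal sketch of that check (the function name raid5_can_start is made up for illustration and is not part of mdadm):

```shell
# Sketch: succeed if a RAID 5 array with $1 members can start with $2 present.
raid5_can_start() {
    total=$1
    available=$2
    # RAID 5 survives the loss of exactly one member.
    [ "$available" -ge $((total - 1)) ]
}

raid5_can_start 7 5 || echo "5 of 7 members: cannot start"
raid5_can_start 7 6 && echo "6 of 7 members: can start degraded"
```

With only 5 of the 7 members accepted, md2 was one short of the minimum.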

Still not working. Try again with force:

0 root@iz:~# mdadm --assemble /dev/md2 --force \
--uuid=15a4aefd:d0a95db7:934e8ae1:fce89514
mdadm: forcing event count in /dev/sda3(2) from 36300 upto 36308
mdadm: clearing FAULTY flag for device 0 in /dev/md2 for /dev/sda3
mdadm: /dev/md2 has been started with 6 drives (out of 7).
0 root@iz:~#

Bingo. We're not sure why it only took one of the two disks, but we chose to copy the data off the array in a hurry and worry about that later.
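
For what it's worth, the forced-assembly message also shows how far behind sda3's superblock was when mdadm accepted it. The sketch below pulls the two event counts out of that line and prints the difference; the helper name event_gap is hypothetical, and the only input is the line quoted above.

```shell
# Sketch: read mdadm's "forcing event count" line on stdin and print
# how many events the forced member was behind.
event_gap() {
    sed -n 's/.*forcing event count in .* from \([0-9]*\) upto \([0-9]*\).*/\1 \2/p' |
    while read -r old new; do
        echo $((new - old))
    done
}

echo 'mdadm: forcing event count in /dev/sda3(2) from 36300 upto 36308' | event_gap
```

For this array the gap is 8 events, which suggests (but does not prove) that little was written to the array between sda3 dropping out and the array stopping.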