Monday, January 24, 2011

mdadm raid1 fails to resync

Hello, I'm trying to solve this problem I'm having with an mdadm raid1.

I have an Ubuntu 9.04 server running on a two-drive software raid1 managed with mdadm. Yesterday, one of the drives failed, so I replaced it with a brand new drive of the same size. I removed the faulty drive, copied the partition from the remaining good drive to the new drive and then added it to the raid. It re-synced and the system worked fine, until the drive that hadn't failed was also marked as failed.

Now I had the raid running solely on the new drive. So I purchased another drive and repeated the procedure above. So now I had 2 brand new drives and the raid was syncing. However, after a few minutes I checked /proc/mdstat and the raid was no longer syncing.

mdadm --detail /dev/md1 shows: (sdb is the first new drive, and sdc is the second new drive)

root@dola:/home/jjaramillo# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90
  Creation Time : Sat Dec 20 00:42:05 2008
     Raid Level : raid1
     Array Size : 974711680 (929.56 GiB 998.10 GB)
  Used Dev Size : 974711680 (929.56 GiB 998.10 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Jun  2 10:09:35 2010
          State : clean, degraded
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

           UUID : bba497c6:5029ba0b:bfa4f887:c0dc8f3d
         Events : 0.5395594

    Number   Major   Minor   RaidDevice State
       2       8       35        0      spare rebuilding   /dev/sdc3
       1       8       19        1      active sync   /dev/sdb3

I've tried removing and re-adding the drive a few times, but the same thing happens. The raid fails to resync. I've looked at /var/log/messages, and found the following:

Jun 2 07:57:36 dola kernel: [35708.917337] sd 5:0:0:0: [sdb] Unhandled sense code
Jun 2 07:57:36 dola kernel: [35708.917339] sd 5:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 2 07:57:36 dola kernel: [35708.917342] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Jun 2 07:57:36 dola kernel: [35708.917346] Descriptor sense data with sense descriptors (in hex):
Jun 2 07:57:36 dola kernel: [35708.917348] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 2 07:57:36 dola kernel: [35708.917357] 00 43 9e 47
Jun 2 07:57:36 dola kernel: [35708.917360] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
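
The sense bytes in that log can be decoded by hand to find the sector that could not be read. The following is a minimal sketch, assuming the standard SCSI descriptor-format layout (byte 0 is 0x72, the first descriptor starts at byte 8, and the last 8 bytes of an information descriptor carry the failing LBA). The function name `decode_sense_lba` is made up for illustration, and the two hex lines from the log are joined into one string:

```shell
#!/bin/sh
# Hypothetical helper: print the failing LBA from descriptor-format sense data.
# Argument: the sense bytes as one space-separated hex string.
decode_sense_lba() {
    # Split the string into positional parameters ($1 = byte 0, $2 = byte 1, ...).
    set -- $1
    [ "$#" -ge 20 ] || { echo "need at least 20 sense bytes" >&2; return 1; }
    # Byte 0 must be 72: current error, descriptor format.
    [ "$1" = "72" ] || { echo "not descriptor-format sense data" >&2; return 1; }
    # Byte 8 is the first descriptor's type; 00 is the information descriptor.
    [ "$9" = "00" ] || { echo "no information descriptor" >&2; return 1; }
    # Skip to byte 12; bytes 12..19 are the 64-bit information field.
    shift 12
    printf '%d\n' "0x$1$2$3$4$5$6$7$8"
}

# Sense bytes copied from the kernel log above.
decode_sense_lba "72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 00 43 9e 47"
# prints 4431431
```

If that layout assumption holds, the unreadable sector is LBA 4431431 counted from the start of sdb, not from the start of the sdb3 partition.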

So it looks like there's some kind of read error on sdb (the first new drive). My question is, what would be the best approach to get the raid up and running again? I've thought about dd'ing /dev/md1 to a blank hard drive, then re-creating the raid from scratch and loading the data back, but there may be an easier solution.

Any help would be appreciated.

  • RE:

    I removed the faulty drive, copied the partition from the remaining good drive to the new drive and then added it to the raid.

    You shouldn't be copying partitions on your own.

    The only thing you should have to do is put the new drive into your system, and use mdadm to add it to your raid group.

    If you really did do a copy (i.e. a `dd if=/dev/good_disk of=/dev/new_disk`), you probably wound up copying the raid UUIDs or whatever else mdadm uses to tell which disk is which, and then it gets confused.

    JuanD : Sorry, I worded that wrong... I meant, I copied the partition scheme from disk to disk, with sfdisk... Thx
    Tom O'Connor : The command you want for copying the partition table is `sfdisk -d /dev/gooddrive | sfdisk /dev/newdrive`
  • Install the new hard drive, partition it as Tom O'Connor suggested, and then use mdadm to repair the array. See the `--add` option under "For Manage mode:" in the mdadm man page:

    mdadm /dev/md0 --add /dev/sda1 
    

    You may have to "--fail" the first replacement drive first.

    From AndreasM
  • You shouldn't attempt to prepare the new drive in any meaningful way unless your raid constituents are actually disk PARTITIONS, not the disks themselves. In that case, you would create a partition on the new drive that is the same size as the one on the remaining active disk.

    You never need to touch the old drive at all -- it's assumed to be failed and unreliable.

    The correct procedure is to remove the broken drive, add a new, empty drive, and then use mdadm to add that new drive to the array. You'd do it something like this:

    mdadm --add /dev/md0 /dev/<newdrive>
    

    The kernel will then sync the new drive into the array, copying the data from the one remaining good drive.

    From tylerl
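
Put together, the steps suggested in the answers above could be sketched roughly as follows. This is only an illustration: the helper names `run` and `replace_member` and the `DRY_RUN` switch are made up here, and the device names are taken from the question as placeholders. By default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the replacement procedure; nothing is executed unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical wrapper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper.
# Arguments: array, surviving disk, blank replacement disk, partition number.
replace_member() {
    array="$1"
    good_disk="$2"
    new_disk="$3"
    part="$4"

    # Copy the partition table (not the data) from the good disk to the new one.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ sfdisk -d $good_disk | sfdisk $new_disk"
    else
        sfdisk -d "$good_disk" | sfdisk "$new_disk"
    fi

    # Add the new partition; the kernel then syncs it from the remaining member.
    run mdadm "$array" --add "${new_disk}${part}"

    # Show the resync progress.
    run cat /proc/mdstat
}

replace_member /dev/md1 /dev/sdb /dev/sdc 3
```

Note that this does nothing about the read error on sdb reported in the question; a resync that reads from a member with unreadable sectors may still stop partway.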
