05 January 2017

Sysadmin's tale: Fixing broken logical volume in Linux

Dear respected readers

This time I'd like to share something related to troubleshooting in Linux. What makes me want to share it is not how sophisticated the steps were, but how awfully simple they were! Please read on.

Ok, so it started when I was about to leave my office. Suddenly my phone buzzed, and it turned out to be a WhatsApp message from my colleague, saying "we need your help". Anyway, before we proceed, let me state one thing: I'm doing my best to reconstruct the situation from memory, so I beg your pardon if something is missing. I hope it still delivers the message.

......Oh and, one more thing, lately when I read "we need your help", I began to think it actually means "we're in deep shit". Hehehehe

The situation was as follows: my co-worker said that a SAN connection had dropped for no clear reason. That broke the multipath setup, and eventually one or more mount points got unmounted. The goal: put the mount point back ASAP.

The clock was ticking and I didn't have time to look for the root cause, so I shifted my attention to finding a way to restore it. My co-worker said that the missing device, /dev/mapper/mpaths, was back online, which made things a bit easier. But he also said the LVM should come back by itself. "Right?" Ehm well, not really....

Then I asked a few things, hoping to learn what the setup looked like right before the disaster happened:
Me (M): "how many logical volumes are missing?"
Co-worker (C): "one"
M: "and the number of volume groups?"
C: "one"
M: "and the physical volume?"
C: "yes, one, again"
M: "so you're saying that on this mpaths, as best you can recall, there is only one physical partition and it is formatted as a physical volume?"
C: "yup, indeed"
M: "and how big is this single partition?"
C: "not sure, but I guess it is as big as the disk"

Note that the last question is critical. Why? Because my plan was to recreate the partition using fdisk. If I somehow created a smaller physical partition, there was a chance I'd put the entire logical volume in jeopardy. And what if it was bigger? Maybe no danger. But ideally, I had to make it exactly as big as it was.

One more thing: we didn't know the start sector of the partition. A wrong start sector changes where the partition's data begins and, again, breaks the logical volume. So, it was like tossing the dice in the air. Got the right number? Good. Wrong number? ...hehhehe, need to say? :D
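To make the start sector worry a bit more concrete, here is a minimal sketch of the arithmetic involved. It assumes 512-byte sectors unless told otherwise, and the function names are made up for this illustration. Recent fdisk versions typically start the first partition at sector 2048, which is exactly 1 MiB into the disk, while very old tools often used sector 63.

```shell
# Hypothetical helpers, for illustration only.

# Convert a start sector to a byte offset.
# Usage: sector_to_bytes START_SECTOR [SECTOR_SIZE]   (sector size defaults to 512)
sector_to_bytes() {
    echo $(( $1 * ${2:-512} ))
}

# Succeed (exit status 0) when the start sector sits on a 1 MiB boundary.
# Usage: is_mib_aligned START_SECTOR [SECTOR_SIZE]
is_mib_aligned() {
    [ $(( $1 * ${2:-512} % 1048576 )) -eq 0 ]
}
```

For example, `is_mib_aligned 2048 && echo aligned` prints "aligned", while sector 63 fails the check. If the old partition started at 63 and the new one starts at 2048, LVM would look for its label in the wrong place.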

Since there weren't many options left, I asked the team leader for approval while explaining the risk of my method. He answered okay. (Actually not okay, but we were left with no better choice, so my method was the best bet before we had to pull from backups. Oh, forgot to say, this is THE backup server. Double burn, eh? heheheh )

Do backup first:
dd if=/dev/sda of=/mnt/backup/sda.img bs=32K
dd if=/dev/sdb of=/mnt/backup/sdb.img bs=32K

I don't exactly recall why I backed up sda and sdb. It was probably my instinct to be safe. Better safe than sorry, right?
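In the same better-safe-than-sorry spirit, a small sketch of a more targeted backup: copy only the first few MiB of the device that is about to be repartitioned. That region holds the partition table and, when the partition starts near the beginning of the disk, typically the LVM label and metadata as well. The function name and the 4 MiB default are assumptions for this illustration, not something from the original incident.

```shell
# Hypothetical helper, for illustration only.
# Copy the first N MiB of a device (or image file) to a backup file.
# Usage: backup_device_head SOURCE DEST [MIB]   (MIB defaults to 4)
backup_device_head() {
    src=$1
    dest=$2
    mib=${3:-4}
    # bs is given in plain bytes so it does not rely on size suffixes
    dd if="$src" of="$dest" bs=1048576 count="$mib" 2>/dev/null
}
```

It could be called as `backup_device_head /dev/mapper/mpaths /mnt/backup/mpaths-head.img` before touching fdisk. Writing the file back with dd would undo a bad partitioning attempt, as long as nothing beyond those first MiB was changed.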

Do partitioning:

fdisk /dev/mapper/mpaths

create new primary partition

size? Make it as big as it can be. Luckily fdisk handles the alignment for us. And as a bonus, fdisk also calculates the biggest possible size for us.

don't forget to write the change

run partprobe.

Then, run a series of:
pvscan && vgscan && lvscan

then check:
pvs && vgs && lvs

And thank God, it was back!!

Oh, like in a movie, here is the twist: before I did all the above, I actually cheated and checked the first few lines of:
strings -a /dev/mapper/mpaths

Then I saw that the LVM descriptor was still there, unharmed! It also mentioned that the one and only logical volume there (I forgot the name) was based on a physical volume named mpaths1. So my co-worker's information was indeed correct. Thus once the partition was back, there was no need to recreate the physical volume and so on. They were automagically restored by themselves!
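The same peek-at-the-raw-device idea can also help with the unknown start sector. As a hedged sketch, not what was done that day: an LVM2 physical volume carries a label beginning with the string LABELONE, and by default that label sits in the second 512-byte sector of the PV. So finding LABELONE and stepping back one sector gives a guess for where the partition began. The function name is made up, GNU grep is assumed for the -a, -b and -o options, and the guess is wrong if the PV was created with a non-default label sector.

```shell
# Hypothetical helper, for illustration only.
# Guess the start sector of an LVM2 physical volume inside a device or image.
# Assumes 512-byte sectors and the default label location (sector 1 of the PV).
# Usage: guess_pv_start_sector DEVICE [SECTORS_TO_SCAN]
guess_pv_start_sector() {
    dev=$1
    sectors=${2:-8192}   # 8192 sectors = only the first 4 MiB are scanned
    # grep -b -o prints "BYTE_OFFSET:MATCH" for each match; keep the first one
    offset=$(dd if="$dev" bs=512 count="$sectors" 2>/dev/null \
        | LC_ALL=C grep -a -b -o -F 'LABELONE' | head -n 1 | cut -d: -f1)
    [ -n "$offset" ] || return 1
    # The label lives one sector after the start of the PV
    echo $(( offset / 512 - 1 ))
}
```

On a disk where the lost partition began at sector 2048, `guess_pv_start_sector /dev/mapper/mpaths` would be expected to print 2048. A result of 0 would hint that the whole disk was used as the physical volume, with no partition at all.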

Breaking news: the other day, a similar situation happened. This time, the volume group reported that one of its physical volumes was missing.

Again, same interview as above. And from checking the first physical partition of the volume group using "strings", it said the missing one was, let's name it, /dev/sdf

Okay, /dev/sdf. Why is there no number suffix, like sdf1 or something like that? My guess is that the sysadmin was using the entire disk as a physical volume.

Again, I needed permission from the leaders. Got it, then I simply ran:
pvcreate /dev/sdf

Run the series of magic again:
pvscan && vgscan && lvscan && pvs && vgs && lvs

Thank God, again! It was back!
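A plain pvcreate is what was run here. For cases where the volume group still reports the physical volume as missing afterwards, LVM also offers a route that re-creates the PV with its old UUID from a metadata backup. The sketch below only prints the commands instead of running them. It assumes the metadata backup lives in the default /etc/lvm/backup directory, and the function name, UUID, and volume group name are placeholders.

```shell
# Hypothetical helper, for illustration only.
# Print (do not run) the commands for re-creating a missing PV with its old UUID.
# Usage: print_pv_restore_plan DEVICE PV_UUID VG_NAME
print_pv_restore_plan() {
    device=$1
    pv_uuid=$2
    vg_name=$3
    # Assumption: LVM's automatic metadata backup is in its default location
    backup_file="/etc/lvm/backup/$vg_name"
    printf 'pvcreate --uuid %s --restorefile %s %s\n' "$pv_uuid" "$backup_file" "$device"
    printf 'vgcfgrestore -f %s %s\n' "$backup_file" "$vg_name"
    printf 'vgchange -ay %s\n' "$vg_name"
}
```

Printing first keeps with the "plan, then backup, then execute" order: the plan can be reviewed (and approved by the leaders) before any of it touches the disk. The old UUID of the missing PV can be read from the backup file itself.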

Moral of the story: you don't need to master an "out of this planet scripting language" or be "able to read hex codes" to solve a problem. All you need is:
1. Calm down. I have never heard of, seen, or read about somebody who could tackle an issue while screaming all over the place....
2. Try to reconstruct WHY it happened
3. Based on #2 (your hypotheses), create a plan
4. Before executing the plan, create a BACKUP!
5. Execute your plan. (This is why it is called a plan: to be executed. Even if your plan sounds sophisticated, if you don't have the guts to do it, its value is zero, my friend)

Hope it helps you in your day-to-day operations. Until next time!


Mulyadi Santosa
