A (possibly) simple fix for GRUB error: file not found


2013-05-14    »   debian, grub, software

After a recent kernel and GRUB upgrade, my Debian system started acting a bit weird. Feeling too tired to spend more than a few minutes diagnosing it, I went for the Microsoft option: let’s reboot that sucker back to health.

Please note: the short version of this post is that GRUB wasn’t actually at fault. It just presented that way because of a weird history on one of the drives in this machine. There’s a small chance you’ve stumbled across this page looking for answers that most of the other search results for ‘_grub error file not found_’ don’t supply. It may in fact be your BIOS.

Anyway … back to the story in the proper order.

After midnight, when you’re making good progress on developing some scripts, is a very bad time to reboot to this screen:

GRUB loading.
Welcome to GRUB!

error: file not found.
Entering rescue mode...
grub rescue>

The first instincts (or primal fears) are that something horrible has happened to your boot loader, or perhaps your initrd.img, though the fact that GRUB wasn’t even getting as far as its menu suggests it wasn’t the latter.

Googling this error turns up lots of (usually) good advice about running update-initramfs -u -k all, which may or may not break your system even more badly, depending on your starting position. Other high-ranking advice is to downgrade grub, maybe initramfs, perhaps even the kernel. Only the first is likely to change (not necessarily fix) this problem.

However, the first basic step in troubleshooting this problem is to run, from the grub rescue> prompt above:

ls

… which will list all the available disks and partitions, in a format like:

(hd0,msdos1) (hd0,msdos2) (hd1,msdos1) (hd1,msdos2)

… and so on. If your boot and root partitions are listed, that’s the first good sign. If you don’t have an offline or off-machine copy of your file system layout, you can identify your devices by running commands like this (the trailing slash is important):

ls (hd0,msdos1)/

… for each device, and working out from the listings which partition is /boot and which is root (assuming you have separate partitions for both).

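Once you’ve found them, you can simply point GRUB at the /boot partition, (hd0,msdos1) in this example, and load the normal module. With a separate /boot partition, GRUB’s files sit under /grub on that partition rather than /boot/grub:

set prefix=(hd0,msdos1)/grub
set root=(hd0,msdos1)
insmod normal
normal

Note that GRUB’s root variable is the device GRUB reads its own files from, not your Linux root file system; the kernel learns where that lives from grub.cfg once normal mode has loaded.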

But … this presupposes that your root file system is not encrypted, which certainly covers the majority of installations, but not mine: GRUB’s ls can’t see inside my LUKS partition, so all the above confirmed to me was that /boot was visible, populated and mountable. This should have been the big hint, but it was late at night. (That’s my excuse, and stick to it I shall.)

Booting a rescue disk and running cryptsetup luksOpen against my root partition let me confirm straight away that my encrypted / file system was just fine too. That’s when I finally started looking further down the stack for the root cause.
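
For anyone repeating that rescue-disk check, here is a rough sketch of it as a shell function. It assumes the encrypted root is /dev/sdf2, as on this machine; the helper names check_root_fs and run, the mapper name rootcheck, the mount point /mnt/rootcheck and the DRY_RUN switch are all hypothetical. It mounts read-only to keep the check as hands-off as possible.

```shell
#!/usr/bin/env bash
# Sketch only: open and inspect an encrypted root file system from a rescue
# environment. Device, mapper name and mount point are assumptions.

# Run a command, or just print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# check_root_fs [device] [mapper-name] [mount-point]  (hypothetical helper)
check_root_fs() {
    local dev="${1:-/dev/sdf2}"      # encrypted root partition on this machine
    local name="${2:-rootcheck}"     # hypothetical device-mapper name
    local mnt="${3:-/mnt/rootcheck}" # hypothetical mount point
    local status

    # Prompts for the LUKS passphrase; stop here if the volume won't open.
    run cryptsetup luksOpen "$dev" "$name" || return 1

    # Mount read-only and list the top level: a populated / is a good sign.
    run mkdir -p "$mnt" &&
        run mount -o ro "/dev/mapper/$name" "$mnt" &&
        run ls "$mnt"
    status=$?

    # Tidy up whatever got set up, then report the result of the check.
    run umount "$mnt" 2>/dev/null
    run cryptsetup luksClose "$name"
    return "$status"
}
```

As root in the rescue shell, source the file and run DRY_RUN=1 check_root_fs /dev/sdf2 to review the commands first, then check_root_fs /dev/sdf2 to actually run them.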

Hopping into the BIOS on a crazy off-chance, I noticed that, for reasons that weren’t clear, the boot order configuration had changed. It didn’t even look like the factory default: bizarrely, the two drives it thought were bootable were /dev/sdb and /dev/sdi. Perversely, I boot off /dev/sdf1 for /boot and /dev/sdf2 (encrypted) for the root file system, and my GRUB lives in the MBR of /dev/sdf.

What had made things worse, and led me on a rather pointless descent into the joys of GRUB rescue prompts, was an old, long-forgotten GRUB boot sector living in the MBR of /dev/sdb. The BIOS had presumably picked up on it (hence offering that drive), but it pointed to a partition that had long since been erased, on a drive that these days is in fact part of a crypto-RAID1 array.
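
To hunt for that kind of leftover boot code, you can inspect the first sector of each disk. Below is a rough sketch, assuming a legacy BIOS (not UEFI) machine: most such BIOSes treat a disk as bootable when its first 512-byte sector ends with the bytes 0x55 0xAA, and GRUB’s boot sector code contains the string "GRUB", which the script uses only as a heuristic. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: report which disks carry legacy-BIOS boot code in their MBR.
# Assumes a BIOS (non-UEFI) machine; reading real disks needs root.

# Print the two bytes at offset 510 as hex ("55aa" means a boot signature).
mbr_signature() {
    dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n'
}

# Heuristic: succeed if the first sector contains the string "GRUB".
mbr_has_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | grep -aq 'GRUB'
}

# survey_mbrs DISK...  (pass whole disks, not partitions)
survey_mbrs() {
    for dev in "$@"; do
        if [ "$(mbr_signature "$dev")" != "55aa" ]; then
            printf '%s: no boot signature\n' "$dev"
        elif mbr_has_grub "$dev"; then
            printf '%s: bootable, GRUB boot code\n' "$dev"
        else
            printf '%s: bootable, other boot code\n' "$dev"
        fi
    done
}
```

As root, source the file and run survey_mbrs /dev/sd?. A "GRUB boot code" line for a disk you don’t boot from, like /dev/sdb here, is a candidate for exactly the same trap.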

So, forcing my overly graphical BIOS to offer me a boot option for my actual boot disk, and then changing the boot order to start with that disk, solved it. Annoying, frustrating, and a smidge weird.