Clobbering grub.conf is Bad

I’m sharing this in the hope of saving someone from an unwelcome surprise.

Background

I recently upgraded an Exadata system from 11.2.3.2.1 to 11.2.3.3.1. Apart from what turns out to be a known bug[1] that resulted in the patching of the InfiniBand switches “failing”, it all seemed to go without a snag. That was until I decided to do some node failure testing…

Having forced a node eviction I got on with something else while the evicted compute node (database server to non-Exadata folk) booted. After what seemed like a reasonable amount of time, and at an appropriate break in my other work, I attempted to connect to the previously evicted host. No joy. I connected to the ILOM console to see what was going on only to find something like this:

[Screenshot: grub prompt]

My first thought was, “Is there any conceivable way that causing a node eviction through generation of very heavy swap activity could be responsible for this?”

Investigation

Attempting to boot the host from the grub prompt using the kernel installed as part of 11.2.3.3.1 worked without any issues. Once the host was up I looked at /boot/grub/grub.conf. It was empty (zero bytes). I checked all compute nodes and found the same. Obviously this must have happened after the last reboot of the hosts, otherwise they would have failed to boot beyond the grub prompt.
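A check along these lines would have flagged the problem before a reboot exposed it. This is only a minimal sketch: the function name check_grub_conf is my own invention, and the default path is simply the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: report whether a grub config file exists and is non-empty.
# Defaults to the path discussed in this post; pass another path to try it safely.
check_grub_conf() {
    conf="${1:-/boot/grub/grub.conf}"
    if [ -s "$conf" ]; then
        # -s is true only if the file exists and has a size greater than zero
        echo "OK: $conf is non-empty"
    else
        echo "WARNING: $conf is missing or empty"
        return 1
    fi
}
```

Running check_grub_conf on every compute node after patching, and again after the first post-upgrade reboot, is one way to catch a zero-byte file while the host is still up.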

I raised an SR as this seemed like a big deal to me. Not that it wasn’t recoverable, but because it is a gremlin lying in wait to bite when least welcome. It’s easy to imagine a situation where having upgraded to 11.2.3.3.1 the environment is put back to use and at some point later a node is evicted or rebooted for another reason and it doesn’t boot back into the OS. That would suck. I expect my hosts to be in a state that allows them to boot cleanly; if that’s not the case then I want to know about it and be prepared.

The initial response in the SR was that I should run the upgrade again without the “Relax and Recover”[2] entry in grub.conf, stating, “… as there is some suspicion this might be related.”

Having restored the pre-upgrade backup on one of the compute nodes, I ran the upgrade again and started to investigate in detail. Rather than give a blow-by-blow account of that investigation, I’ll cut straight to the final conclusion.

Culprit

misceachboot, which is part of the Exadata “validations” framework and, as the name suggests, runs every time an Exadata compute node boots, clobbers /boot/grub/grub.conf shortly after host startup if an entry for “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” is found in grub.conf. This is consistently repeatable with the simple test of copying the backup of grub.conf created by dbnodeupdate.sh over the empty grub.conf and rebooting.

As I’ve stated in the SR with Oracle, this appears to be something that would affect all Exadata systems that are upgraded from 11.2.3.2.1 to 11.2.3.3.1. Oracle Support have created bug 19428028, which is not visible at the time I write this.

If someone else had run into this problem, I would have wanted them to share it publicly so that I didn’t get caught out by it. Hence this blog post.

I would be very interested to hear from other Exadata users that have upgraded to 11.2.3.3.1, particularly if it was from 11.2.3.2.1, and whether or not they have seen the same problem.

Update (20th August 2014)

Having found more time to investigate I believe I’ve found the exact cause of the issue…

In the comments of this post I previously stated:

… the “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” entry is definitely the trigger.

Well, that’s not true!

Also, having taken the time to identify exactly what part of the image_functions code truncates grub.conf, it is now possible to be confident that the other 11.2.3.2.1 systems I have access to will not be affected.

So anyway, here are the important points:

Point 1

Within image_functions a function named “image_functions_remove_from_grub” is defined that includes the following:

perl -00 -ne '/vmlinuz-$ENV{EXA_REMOVE_KERNEL_FROM_GRUB} / or print $_' -i $grub_conf

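The “-00” part is particularly relevant. It puts Perl into “paragraph mode”, in which the input is read one paragraph at a time and paragraphs are separated by one or more empty lines, that is, two or more consecutive newline characters. A line that contains only spaces is not empty, so it does not end a paragraph.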

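The Perl command has the effect of removing any paragraph from $grub_conf (defined as /boot/grub/grub.conf) that contains the string “vmlinuz-” followed by the value of EXA_REMOVE_KERNEL_FROM_GRUB and a space. EXA_REMOVE_KERNEL_FROM_GRUB is an environment variable; the single quotes stop the shell from expanding it, and Perl reads it through %ENV instead.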

Point 2

For a reason I have yet to identify, the grub.conf files on the compute nodes of the particular Exadata system that had been upgraded to 11.2.3.2.1 had spaces appended to the end of lines. The number of spaces appended was not consistent across nodes, and I’ll continue to try to identify what was responsible. Anyway, this meant that each break between entries in grub.conf was not simply a newline character, but rather a line with a number of spaces before the newline character.
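Lines like that are invisible in most editors, so it helps to search for them explicitly. The following is a rough sketch; the function name find_whitespace_only_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list lines that look blank but contain only spaces or tabs.
# Prints "line_number:content" for each hit; returns non-zero if there are none.
find_whitespace_only_lines() {
    grep -n '^[[:space:]]\{1,\}$' "$1"
}
```

For example, find_whitespace_only_lines /boot/grub/grub.conf prints nothing on a file whose separator lines are truly empty, and prints the line numbers of any separators that are padded with spaces.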

End Result

I probably don’t need to explain, but in case it isn’t obvious: because of point 2, grub.conf contains no truly empty lines, so the Perl command in point 1 sees the entire file as a single paragraph. That paragraph contains the kernel referenced by EXA_REMOVE_KERNEL_FROM_GRUB, so the whole paragraph is removed and grub.conf is left empty.

Other Points

The incorrect assertion that it was the 2.6.18-308.24.1.0.1.el5 kernel entry that triggered the problem is the result of misceachboot repeatedly attempting to remove kernel 2.6.18-308.24.1.0.1.el5. If I’ve followed the logic in misceachboot correctly, the attempt to remove kernel 2.6.18-308.24.1.0.1.el5 happens on each boot because the rpm for that kernel is still “installed” even though the kernel files have been removed from /boot (and the entry removed from grub.conf). The rpm is left in place because of a dependency between fuse-2.7.4-8.0.5.el5.x86_64 and kernel 2.6.18-308.24.1.0.1.el5. By contrast, the attempt to remove kernel 2.6.32-400.21.1.el5uek is only performed once (assuming it is successful), after the first reboot post upgrade to 11.2.3.2.1, during which the rpm is completely removed along with the kernel files (and the entry removed from grub.conf).

Footnotes

1 – The bug is known and documented in MOS ID 1614149.1; however, there is no mention of it in MOS ID 1667414.1, which, being entitled “Exadata 11.2.3.3.1 release and patch (17636228)” and having a section of “Known Issues”, seems like a reasonable place to expect it to be referenced. I have suggested to Oracle Support that an update to 1667414.1 referencing 1614149.1 would be a good idea. That was on 4th August 2014 and so far there’s been no update.
2 – Relax and Recover (rear) is a very useful bare-metal recovery tool for Linux.

2 thoughts on “Clobbering grub.conf is Bad”

  1. Andy Colvin

    Interesting…did this issue only show up with the “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” entry in the grub.conf or for any entry? I haven’t seen this crop up with any of the applications of 11.2.3.3.1. If it’s only with the 2.6.18 entry, most people shouldn’t be affected, since 11.2.3.2.1 should be running the UEK unless you’re on a V2.

    1. Martin (post author)

      Thanks for the comment. I tested with a number of different grub.conf files and the “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” entry is definitely the trigger for the file being blanked out.

      The system where the problem was encountered is a V2, but the X2-2 I have access to looks likely to hit the same issue. Both V2 and X2-2 systems have the following entries in grub.conf when at 11.2.3.2.1:

      Oracle Linux Server (2.6.18-308.24.1.0.1.el5)
      Oracle Linux Server (2.6.32-400.21.1.el5uek)
      Relax and Recover

      The “Relax and Recover” entry is obviously a customisation in this environment.

      Both the V2 and X2-2 systems have been running with UEK (2.6.32-400.21.1.el5uek) when at 11.2.3.2.1. I don’t recall any specific actions to keep the RHEL kernel installed, so assumed that everyone on 11.2.3.2.1 would have the kernel present and the corresponding entry in grub.conf even if they don’t use it.
