I’m sharing this in the hope of saving someone from an unwelcome surprise.
I recently upgraded an Exadata system from 11.2.3.2.1 to 11.2.3.3.1. Apart from what turned out to be a known bug that resulted in the patching of the InfiniBand switches “failing”[1], it all seemed to go without a snag. That’s until I decided to do some node failure testing…
Having forced a node eviction, I got on with something else while the evicted compute node (database server to non-Exadata folk) booted. After what seemed like a reasonable amount of time, and at an appropriate break in my other work, I attempted to connect to the previously evicted host. No joy. I connected to the ILOM console to see what was going on, only to find the host sitting at a bare grub prompt rather than booting the OS.
My first thought was, “Is there any conceivable way that causing a node eviction through generation of very heavy swap activity could be responsible for this?”
Attempting to boot the host from the grub prompt using the kernel installed as part of 11.2.3.2.1 worked without any issues. Once the host was up I looked at /boot/grub/grub.conf. It was empty (zero bytes). I checked all compute nodes and found the same. Obviously this must have happened after the last reboot of the hosts, otherwise they would not have booted beyond the grub prompt.
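Because the symptom is silent until the next reboot, it’s worth checking for it proactively. Here is a minimal sketch of such a check; the function and its messages are my own illustration, not anything shipped with Exadata:

```shell
#!/bin/sh
# Sketch: warn if grub.conf is missing or zero bytes -- the symptom
# described above. Intended to be run on each compute node.
check_grub_conf() {
  f="${1:-/boot/grub/grub.conf}"
  if [ ! -s "$f" ]; then
    echo "WARNING: $f is missing or empty - host may not boot cleanly"
    return 1
  fi
  echo "OK: $f has content"
  return 0
}
```

To cover the whole fleet you could wrap the same test in dcli, e.g. `dcli -g dbs_group -l root 'test -s /boot/grub/grub.conf && echo OK || echo EMPTY'`, where `dbs_group` is whatever compute-node group file you use on your system.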
I raised an SR as this seemed like a big deal to me. Not because it isn’t recoverable, but because it is a gremlin lying in wait to bite when least welcome. It’s easy to imagine a situation where, having upgraded to 11.2.3.3.1, the environment is put back into use and at some point later a node is evicted or rebooted for another reason and doesn’t boot back into the OS. That would suck. I expect my hosts to be in a state that allows them to boot cleanly; if that’s not the case then I want to know about it and be prepared.
Having restored the pre-upgrade backup[2] on one of the compute nodes, I ran the upgrade again and started to investigate in detail. Rather than give a blow-by-blow account of that investigation, I’ll cut straight to the final conclusion.
misceachboot (part of the Exadata “validations” framework that, as the name suggests, runs every time an Exadata compute node boots) clobbers /boot/grub/grub.conf shortly after host startup if an entry for “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” is found in grub.conf. This is consistently repeatable with the simple test of copying the backup of grub.conf created by dbnodeupdate.sh over the empty grub.conf and rebooting.
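Since the trigger is a specific title line in grub.conf, it can be checked for directly before rebooting. A minimal sketch (the helper function is my own, not part of any Oracle tooling; the title string is the one quoted above):

```shell
#!/bin/sh
# Sketch: does this grub.conf contain the title line that causes
# misceachboot to clobber the file on the next boot?
has_trigger_entry() {
  grep -q 'Oracle Linux Server (2\.6\.18-308\.24\.1\.0\.1\.el5)' "$1"
}
```

For example: `has_trigger_entry /boot/grub/grub.conf && echo "at risk of being clobbered"`.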
As I’ve stated in the SR with Oracle, this appears to be something that would affect all Exadata systems that are upgraded from 11.2.3.2.1 to 11.2.3.3.1. Oracle Support have created bug 19428028, which is not publicly visible at the time of writing.
Had someone else run into this problem before me, I would have liked them to share it publicly so that I didn’t get caught out by it. Hence this blog post.
I would be very interested to hear from other Exadata users that have upgraded to 11.2.3.3.1, particularly if it was from 11.2.3.2.1, and whether or not they have seen the same problem.
1 – The bug is known and documented in MOS ID 1614149.1; however, there is no mention of it in MOS ID 1667414.1, which, being entitled “Exadata 11.2.3.3.1 release and patch (17636228)” and having a “Known Issues” section, seems like a reasonable place to expect it to be referenced. I have suggested to Oracle Support that an update to 1667414.1 referencing 1614149.1 would be a good idea. That was on 4th August 2014 and so far there’s been no update.
2 – Relax and Recover (rear) is a very useful bare-metal recovery tool for Linux.