I’ve previously written about a problem I encountered when kdump is configured to write to an NFS location with UEK (in Exadata software version 18.104.22.168.1). I’m pleased to report that the root cause of the problem has been identified and there is a very simple workaround.
There were some frustrating times working this particular SR, the most notable being a response that was effectively, “It works for me (and so I’ll just put the status to Customer Working).”
After a bit more to-ing and fro-ing it emerged that the environment where Oracle had demonstrated kdump could write to NFS had the NFS server on the same subnet as the host where kdump was being tested. After a quick test of my own, using a second compute node in the Exadata system as the NFS server, I confirmed that kdump was able to write to an NFS location on the same subnet in my environment as well.
Soon after reporting the above test in the SR I was pointed to MOS note 1533611.1, which unfortunately is not publicly available (yet) and so I cannot read it… The crux of the issue is that the network interface configuration files have BOOTPROTO=none and kdump is not handling this appropriately, which results in an incomplete network configuration for bond1 when switching to the dump kernel during a crash.
The fix: Change BOOTPROTO=none to BOOTPROTO=static
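As a rough illustration of the change, here is a minimal Python sketch that rewrites the BOOTPROTO line in the text of an interface configuration file. The function name `apply_workaround` and the sample file contents are hypothetical, invented for this example; only the `BOOTPROTO=none` to `BOOTPROTO=static` substitution and the `bond1` interface come from the problem described above. It works on a string rather than on files under /etc, so it can be tried without touching a real system.

```python
import re

# Matches an uncommented BOOTPROTO=none line, with optional matching quotes
# around the value and optional trailing whitespace.
_BOOTPROTO_NONE = re.compile(r'^BOOTPROTO=(["\']?)none\1\s*$')


def apply_workaround(ifcfg_text: str) -> str:
    """Return ifcfg_text with BOOTPROTO=none replaced by BOOTPROTO=static.

    All other lines, including commented-out BOOTPROTO lines and other
    BOOTPROTO values such as dhcp, are left unchanged.
    """
    out = []
    for line in ifcfg_text.splitlines():
        match = _BOOTPROTO_NONE.match(line)
        if match:
            quote = match.group(1)  # preserve whatever quoting was used
            out.append(f"BOOTPROTO={quote}static{quote}")
        else:
            out.append(line)
    result = "\n".join(out)
    # Keep the trailing newline if the input had one.
    if ifcfg_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "DEVICE=bond1\nBOOTPROTO=none\nONBOOT=yes\n"
    print(apply_workaround(sample), end="")
```

Since the fault lies in how the dump kernel's network configuration is generated, I would assume the kdump initrd has to be regenerated after the edit before the change takes effect; that step is not shown here.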
“static” does not appear to be a formally valid value for BOOTPROTO. Looking for more information about the behaviour, I only got as far as Red Hat BZ#805803 and BZ#802928, neither of which I can access directly, but I can see a summary here and here respectively.
In conclusion, it appears that the issue is actually a kdump bug, or more specifically a mkdumprd bug. Thankfully the workaround is simple; it just took a long time to get to it.