Clobbering grub.conf is Bad

I’m sharing this in the hope of saving someone from an unwelcome surprise.

Background

I recently upgraded an Exadata system from 11.2.3.2.1 to 11.2.3.3.1. Apart from what turns out to be a known bug[1] that resulted in the patching of the InfiniBand switches “failing”, it all seemed to go without a snag. That is, until I decided to do some node failure testing…

Having forced a node eviction I got on with something else while the evicted compute node (database server to non-Exadata folk) booted. After what seemed like a reasonable amount of time, and at an appropriate break in my other work, I attempted to connect to the previously evicted host. No joy. I connected to the ILOM console to see what was going on only to find something like this:

[Screenshot: the GRUB prompt]

My first thought was, “Is there any conceivable way that causing a node eviction through generation of very heavy swap activity could be responsible for this?”

Investigation

Attempting to boot the host from the grub prompt using the kernel installed as part of 11.2.3.3.1 worked without any issues. Once the host was up I looked at /boot/grub/grub.conf. It was empty (zero bytes). I checked all compute nodes and found the same. Obviously this must have happened after the last reboot of the hosts, otherwise they would have failed to boot beyond the grub prompt.

I raised an SR as this seemed like a big deal to me. Not that it wasn’t recoverable, but because it is a gremlin lying in wait to bite when least welcome. It’s easy to imagine a situation where having upgraded to 11.2.3.3.1 the environment is put back to use and at some point later a node is evicted or rebooted for another reason and it doesn’t boot back into the OS. That would suck. I expect my hosts to be in a state that allows them to boot cleanly; if that’s not the case then I want to know about it and be prepared.

The initial response in the SR was that I should run the upgrade again without the “Relax and Recover”[2] entry in grub.conf, stating, “… as there is some suspicion this might be related.”

Having restored the pre-upgrade backup on one of the compute nodes, I ran the upgrade again and started to investigate in detail. Rather than give a blow-by-blow account of that investigation, I’ll cut straight to the final conclusion.

Culprit

misceachboot (part of the Exadata “validations” framework which, as the name suggests, runs every time an Exadata compute node boots) clobbers /boot/grub/grub.conf shortly after host startup if an entry for “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” is found in grub.conf. This is consistently repeatable with the simple test of copying the backup of grub.conf created by dbnodeupdate.sh over the empty grub.conf and rebooting.

As I’ve stated in the SR with Oracle, this appears to be something that would affect all Exadata systems that are upgraded from 11.2.3.2.1 to 11.2.3.3.1. Oracle Support have created bug 19428028, which is not visible at the time I write this.

If someone else had run into this problem, I would have liked them to share it publicly so that I didn’t get caught out by it. Hence this blog post.

I would be very interested to hear from other Exadata users that have upgraded to 11.2.3.3.1, particularly if it was from 11.2.3.2.1, and whether or not they have seen the same problem.

Update (20th August 2014)

Having found more time to investigate I believe I’ve found the exact cause of the issue…

In the comments of this post I previously stated:

… the “Oracle Linux Server (2.6.18-308.24.1.0.1.el5)” entry is definitely the trigger.

Well, that’s not true!

Also, having taken the time to identify exactly what part of the image_functions code truncates grub.conf, it is now possible to be confident that the other 11.2.3.2.1 systems I have access to will not be affected.

So anyway, here are the important points:

Point 1

Within image_functions a function named “image_functions_remove_from_grub” is defined that includes the following:

perl -00 -ne '/vmlinuz-$ENV{EXA_REMOVE_KERNEL_FROM_GRUB} / or print $_' -i $grub_conf

The “-00” part is particularly relevant. This invokes “paragraph mode”, in which records (“paragraphs”) are separated by one or more completely blank lines, i.e., consecutive newline characters.

The Perl command has the effect of removing any paragraph from $grub_conf (defined as /boot/grub/grub.conf) that contains the string “vmlinuz-$ENV{EXA_REMOVE_KERNEL_FROM_GRUB} “, where EXA_REMOVE_KERNEL_FROM_GRUB is a shell variable.
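
As a quick throwaway illustration of paragraph mode (my own test, not something taken from image_functions; the file name and kernel versions are just examples), two grub-style stanzas separated by a genuinely empty line are treated as two separate records, so only the matching one is removed:

$ printf 'title A\n\tkernel /vmlinuz-2.6.18 ro root=LABEL=/\n\ntitle B\n\tkernel /vmlinuz-2.6.32 ro root=LABEL=/\n' > /tmp/grub.test
$ EXA_REMOVE_KERNEL_FROM_GRUB=2.6.18 perl -00 -ne '/vmlinuz-$ENV{EXA_REMOVE_KERNEL_FROM_GRUB} / or print $_' -i /tmp/grub.test
$ cat /tmp/grub.test    # only the "title B" stanza remains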

Point 2

For a reason I have yet to identify, the grub.conf files on the compute nodes of the particular Exadata system that had been upgraded to 11.2.3.2.1 had spaces appended to the ends of lines. The number of spaces appended was not consistent across nodes and I’ll continue to try to identify what was responsible. Anyway, this meant that each break between entries in grub.conf was not simply a newline character, but rather a line with a number of spaces before the newline character.

End Result

I probably don’t need to explain, but in case it isn’t obvious: the combination of points 1 and 2 above means that the entire contents of grub.conf are seen as a single paragraph by the Perl command, and as that paragraph contains the kernel referenced by the shell variable $EXA_REMOVE_KERNEL_FROM_GRUB it is removed from grub.conf, resulting in an empty file.
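
To see the failure case for yourself, repeat the illustration from Point 1 but make the separator line contain spaces rather than being truly empty, which is effectively what the affected grub.conf files looked like. The whole file is now a single paragraph, it matches, nothing is printed back, and the file ends up at zero bytes:

$ printf 'title A\n\tkernel /vmlinuz-2.6.18 ro root=LABEL=/\n   \ntitle B\n\tkernel /vmlinuz-2.6.32 ro root=LABEL=/\n' > /tmp/grub.test
$ EXA_REMOVE_KERNEL_FROM_GRUB=2.6.18 perl -00 -ne '/vmlinuz-$ENV{EXA_REMOVE_KERNEL_FROM_GRUB} / or print $_' -i /tmp/grub.test
$ ls -l /tmp/grub.test    # zero bytes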

Other Points

The incorrect assertion that the 2.6.18-308.24.1.0.1.el5 kernel entry was the trigger is the result of misceachboot repeatedly attempting to remove kernel 2.6.18-308.24.1.0.1.el5. If I’ve followed the logic in misceachboot correctly, the attempt to remove kernel 2.6.18-308.24.1.0.1.el5 happens on each boot because the rpm for that kernel is still “installed” even though the kernel files have been removed from /boot (and the entry removed from grub.conf); this is done because of a dependency between fuse-2.7.4-8.0.5.el5.x86_64 and kernel 2.6.18-308.24.1.0.1.el5. By contrast, the attempt to remove kernel 2.6.32-400.21.1.el5uek is only performed once (assuming it is successful), after the first reboot following the upgrade to 11.2.3.3.1, during which the rpm is completely removed along with the kernel files (and the entry removed from grub.conf).

Footnotes

1 – The bug is known and documented in MOS ID 1614149.1; however, there is no mention of it in MOS ID 1667414.1, which, being entitled “Exadata 11.2.3.3.1 release and patch (17636228)” and having a “Known Issues” section, seems like a reasonable place to expect it to be referenced. I have suggested to Oracle Support that an update to 1667414.1 referencing 1614149.1 would be a good idea. That was on 4th August 2014 and so far there’s been no update.
2 – Relax and Recover (rear) is a very useful bare-metal recovery tool for Linux.

Changes in Oracle Linux 7

The following is a list of changes I noted during installation of Oracle Linux 7. I imagine the same will apply to Red Hat Enterprise Linux 7, but I haven’t verified that to be the case.

Setting the Hostname

I was accustomed to updating /etc/sysconfig/network with the hostname of my machine as documented here for Oracle Linux 7. I did this and was then surprised to find that my hostname was not picked up. On returning to the documentation the following seemed worth looking into:

For more information, see /usr/share/doc/initscripts*/sysconfig.txt.

I’d never looked at this file in the past, so found the information very interesting. If you’ve never looked either I suggest you do… Anyway, the key text is:

obsoleted values from earlier releases:
.
.
.
  HOSTNAME=<fqdn by default, but whatever hostname you want>
    This is now configured in /etc/hostname.

If you were following closely you’ll have noticed that the Oracle Linux 7 documentation includes an example for /etc/sysconfig/network that is not applicable to Oracle Linux 7 :-)
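
For completeness, on Oracle Linux 7 the hostname simply lives in /etc/hostname and hostnamectl will set it for you; the host name below is just an example:

$ sudo hostnamectl set-hostname ol7-test.example.com
$ cat /etc/hostname
ol7-test.example.com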

File System Options

Mention of XFS as the default file system is something that had caught my eye before I’d even downloaded the ISO. During my first installation of Oracle Linux 7 I did what I have previously recommended to others: select “I will configure partitioning” (only it was labelled differently in earlier versions). I was very interested to see the “New mount points will use the following partition scheme” drop-down. The options are listed below along with what they translated to on my single disk VM:

  • Standard Partition: Creates 3 partitions and uses XFS as the file system for both /boot (partition 1) and / (partition 3). Partition 2 is used for swap.
  • BTRFS: Creates 3 partitions and uses XFS as the file system for /boot (partition 1). BTRFS is used for /home and /, both of which are subvolumes created from a volume (partition 3). Partition 2 is used for swap.
  • LVM: Creates 2 partitions and uses XFS as the file system for /boot (partition 1). Partition 2 is used as an LVM physical volume, which is split into logical volumes for swap and /. XFS is the file system on /.
  • LVM Thin Provisioning[1]: As for LVM, but creates an LVM “thin pool” from which the logical volume for / is created.

Text Installer

The text installer is still available in Oracle Linux 7, but a significant change is that attempting to install on a machine with less than 1G RAM will no longer force the (reduced functionality) text installer.

Footnotes

1 – I completely missed the introduction of LVM thin provisioning in 6.4 – I couldn’t find any mention of it in the Oracle Linux 6 documentation, but it is covered here for Red Hat 6.4.

Define Your Own Role for Database Target Access in EM12c

Scenario

  1. Enterprise Manager 12c (EM) installed and agents rolled out to database servers
  2. Access to EM offered to development teams with the primary purpose of allowing them to investigate application related database performance issues

Documentation

The EM documentation covers a selection of privileges you might want to grant to users in database targets in order to allow them to be used for accessing EM functionality. The privileges mentioned are:

  1. SELECT ANY DICTIONARY
  2. CREATE SESSION
  3. EXECUTE on DBMS_WORKLOAD_REPOSITORY
  4. OEM_ADVISOR

It interests me that OEM_MONITOR is not mentioned.

Target Credentials

The development team have access to the EM console (via Active Directory credentials) and in order to access the Performance Pages for a particular database they need to either know the username and password of a user that has sufficient privileges in the database or have access to a Named Credential that has sufficient privileges in the database[1].

Testing

Past experience has shown that CREATE SESSION and SELECT_CATALOG_ROLE allow access to the Performance Pages. Unfortunately the mention of granting SELECT_CATALOG_ROLE caused the security folk to say, “No, no, no.”[2]

Quite reasonably a user with only CREATE SESSION hits the following when attempting to access a database target:

[Screenshot: EM database target access denied error]

Creating an exact copy of SELECT_CATALOG_ROLE, named EM_SELECT_CATALOG_ROLE, and granting this to a user results in the same error.[3] This points to the role itself, rather than the underlying privileges, being checked.

The following test was carried out:

  1. Create a guaranteed restore point
  2. Drop the SELECT_CATALOG_ROLE role
  3. Create an empty SELECT_CATALOG_ROLE role
  4. Grant SELECT_CATALOG_ROLE to test user

Connecting to the database target worked. Once connected there is a lot of functionality that is not available because the underlying object privileges are not granted to the user. This is something that can be addressed in order to enable the functionality that a given user is permitted to access.

Investigation

Enabling SQL Trace for a session connecting to a database target from within EM in order to verify the above testing shows the following SQL statements hit the database target:

  1. select sign(count(*)) from session_privs where privilege = 'SELECT ANY DICTIONARY'
  2. select sign(count(*)) from session_privs where privilege = 'SELECT ANY TABLE'
  3. select role from session_roles

The SQL alone doesn’t make it explicit what privilege validation is performed, but we can take a good guess based on the documentation and earlier test results.

Problem Summary

Connecting to a database target from within EM checks for a number of privileges, the least contentious of which for the security team is SELECT_CATALOG_ROLE; however, even this is not permitted for developers. The problem remains: how do we move forward and get the development teams the access they will find so valuable?

Oracle Support

At the request of the client an SR was raised, which has turned into an enhancement request:

Ability to use an appropriately privileged custom role for access to a database target via Enterprise Manager 12c

The Hack

The enhancement request may not get very far and even if it does then it will take some time to reach this environment, so other options were considered…

The privilege check can easily be tricked by creating a view owned by the database target user:

create view <EM USER>.session_roles as
  (select role
     from sys.session_roles
    union
   select 'SELECT_CATALOG_ROLE'
     from dual);

In case it isn’t obvious, <EM USER> will now always appear to have SELECT_CATALOG_ROLE, at least to itself. This allows the EM privilege checks to be passed and the connection to the database target to be successful. Without additional privileges the user will get a lot of blank screens, however, if the user has the required privileges for a given page then EM works.

Footnotes

1 – Named Credentials are a great feature, however, be mindful that a user can execute ad-hoc SQL in EM via the SQL Worksheet screen.
2 – Whether or not granting SELECT_CATALOG_ROLE to developers is a reasonable thing to do is beyond the scope of this post.
3 – The intention was that privileges would be removed once there had been opportunity to discuss with the security team exactly which of the privileges SELECT_CATALOG_ROLE has that they did not want developers to have.

Working Around an ASM Startup Problem

Scenario

I recently hit the problem covered in MOS note “ASM Instance Is Not Coming Up ORA-00064 (1,4468736,Kfchl Array) Kfchl Array (Doc ID 1328620.1)” mid-way through a Grid Infrastructure upgrade from 11.2.0.2 BP16 to 11.2.0.4 BP3 on Exadata. Specifically, it was the application of prerequisite patch 17783101 (required for downgrades) to node 3 that hit the problem. At this point nodes 1 and 2 had been successfully patched with the prerequisite patch and rootupgrade.sh for 11.2.0.4 had been run without issue on these nodes, i.e., the cluster was in a rolling upgrade.

The approach used to resolve the problem (detailed below) is applicable to similar issues on non-Exadata systems, so even if you don’t work on Exadata this might be interesting.

The support note cites a few options to fix the problem of ASM not starting:

  1. Recreate the spfile from a pfile where the offending parameter (__shared_pool_size) has been removed – This would allow the startup problem for +ASM3 to be addressed, but there are a couple of points to note: a) “create spfile from pfile” would need to use a different diskgroup for the spfile than the one the original spfile resides in (at least as an intermediate step) due to the limitation of not being able to create an ASM parameter file in the same diskgroup as an in-use ASM parameter file[1]; b) “create spfile from pfile” updates the Grid Plug and Play (GPnP) profile on all nodes to point to the newly created spfile, so any restarted ASM instance would be running from a different spfile than the one currently in use by the running ASM instances. Neither of these are catastrophic consequences, but I was keen to resolve the issue without needing to restart ASM/Grid Infrastructure on any node other than node 3.
  2. Update the spfile using “alter system reset” from a running ASM instance on another node to remove the offending parameter (__shared_pool_size) for the problem instance – That was not possible due to the command returning ORA-32000 (write to SPFILE requested but SPFILE is not modifiable) as a result of the rolling upgrade that was in flight.

A third solution is mentioned that is very similar to number 1. It also requires the creation of a pfile followed by recreation of the spfile from the pfile. That was out for the same reasons.

It seems worth commenting that the support note states that it applies to “Oracle Exadata Storage Server Software – Version 11.2.2.1.0 to 11.2.2.3.0 [Release 11.2]”, but the system where this was encountered was running 11.2.3.2.1, so that is not accurate.

What next?

Step 1

The first thing I tried was to start +ASM3 from a pfile in which I’d manually removed the “__shared_pool_size” entry for that instance – it worked. ASM started and then everything else sprang into life.
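
In outline it looked something like the following. The /tmp/asm.ora path is the one referenced again below; the pfile content came from the existing spfile (e.g. via “create pfile from spfile” on a healthy node) with the +ASM3.__shared_pool_size line deleted, so treat this as a sketch rather than a recipe:

$ vi /tmp/asm.ora          # delete the +ASM3.__shared_pool_size entry
$ sqlplus / as sysasm <<EOF
startup pfile='/tmp/asm.ora'
EOF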

The above got the database instances on node 3 up, but left me with a problem. If Grid Infrastructure was restarted, which it would be during the pending second attempt to apply the prerequisite patch and again during the execution of rootupgrade.sh as part of the actual upgrade to 11.2.0.4, then ASM would fail due to ORA-00064 again.

An interesting situation.

Step 2

After a bit of pondering I realised that an option to try would be updating the GPnP profile only on node 3 using gpnptool. This would allow +ASM3 to be started via “crsctl start crs” using the pfile I’d already used to successfully start it, but leave the GPnP profile on all other nodes pointing to the current spfile stored in ASM.

Updating the GPnP Profile:

[+ASM1@c01db03 ~]$ gpnptool edit -asm:asm_spf=/tmp/asm.ora -p=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -o=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -ovr
Resulting profile written to "/u01/app/11.2.0.2/grid/gpnp/c01db03/profiles/peer/profile.xml".
Success.
[+ASM1@c01db03 ~]$

Signing the Profile:

[+ASM1@c01db03 ~]$ gpnptool sign -p=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -o=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -ovr -w=file:$ORACLE_HOME/gpnp/$(hostname -s)/wallets/peer -rmws
Resulting profile written to "/u01/app/11.2.0.2/grid/gpnp/c01db03/profiles/peer/profile.xml".
Success.
[+ASM1@c01db03 ~]$

After the profile change ASM (and the rest of Grid Infrastructure) was restarted using “crsctl stop crs” followed by “crsctl start crs”. The Grid Infrastructure stack started as expected, which meant the patching and upgrade work could continue.

Problem solved with no cluster-wide outage or need to restart ASM/Grid Infrastructure on any node other than the one where the problem was encountered.

Step 3

The important “step 3”, once the upgrade to 11.2.0.4 had been completed, was to use “alter system reset” to remove the offending parameter from the spfile for +ASM3, reverse the update to the GPnP profile, and restart +ASM3 so that all ASM instances referenced the same spfile again. Note that the reversal of the GPnP profile change needed to be run from the 11.2.0.4 Grid Infrastructure home as it was performed after the upgrade completed.
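
The reversal was essentially the same pair of gpnptool edit/sign commands shown earlier, run with ORACLE_HOME set to the 11.2.0.4 Grid Infrastructure home and with asm_spf pointed back at the spfile in ASM (the spfile path below is the one shown in footnote 1 and is obviously specific to this environment):

$ gpnptool edit -asm:asm_spf=+FRA/c01/ASMPARAMETERFILE/registry.253.843340551 -p=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -o=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -ovr
$ gpnptool sign -p=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -o=$ORACLE_HOME/gpnp/$(hostname -s)/profiles/peer/profile.xml -ovr -w=file:$ORACLE_HOME/gpnp/$(hostname -s)/wallets/peer -rmws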

All done.

Footnotes

[1] – Attempting to create an ASM spfile in a diskgroup containing an in-use (by any ASM instance) spfile, including via an alias, results in either ORA-32002 or ORA-17502, depending on whether or not the diskgroup is specified for the spfile:

SYS@+ASM1> !asmcmd spget
+FRA/c01/ASMPARAMETERFILE/registry.253.843340551

SYS@+ASM1> !asmcmd find --type asmparameterfile + \*
+DATA/c01/ASMPARAMETERFILE/REGISTRY.253.825178021
+FRA/c01/ASMPARAMETERFILE/REGISTRY.253.843340551

SYS@+ASM1> create spfile from pfile='/tmp/asm.ora';
create spfile from pfile='/tmp/asm.ora'
*
ERROR at line 1:
ORA-32002: cannot create SPFILE already being used by the instance

SYS@+ASM1> create spfile='+FRA' from pfile='/tmp/asm.ora';
create spfile='+FRA' from pfile='/tmp/asm.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file +FRA
ORA-15268: internal Oracle file +FRA.253.1 already exists.

SYS@+ASM1> create spfile='+FRA/asmspfile.ora' from pfile='/tmp/asm.ora';
create spfile='+FRA/asmspfile.ora' from pfile='/tmp/asm.ora'
*
ERROR at line 1:
ORA-17502: ksfdcre:4 Failed to create file +FRA/asmspfile.ora
ORA-15268: internal Oracle file +FRA.253.1 already exists.

SYS@+ASM1>

Internet Access with VirtualBox & Host-only Networks (on OS X Mavericks)

Introduction

When creating VMs on my laptop I like to configure the minimum number of network interfaces. I also tend to end up with environments where I want multiple VMs to be able to see each other, see the internet and see my physical host. It seems many people using VirtualBox use the approach of having a “Host-only Adapter” interface and a “NAT” interface. The only reason I have for not liking this is that it is possible for a “Host-only Adapter” to access the wider world via the physical host, and therefore I see the NAT interface as surplus to requirements.

In the past, when running Snow Leopard, I’d worked out that enabling “Internet Sharing” on the Mac allowed my VMs with a “Host-only Adapter” to be routed out through whatever network my Mac was connected to (assuming the network, resolv.conf, etc. were appropriately configured on the VM). I’ve suggested this approach to others in the past without questioning what OS X version they were running. I had the odd report of it not working and people resorting to adding a NAT interface, but didn’t have the opportunity to investigate.

Anyway, I got a new computer recently and it came with OS X Mavericks. On creating my first VM in VirtualBox I opted for my preferred approach of using “Host-only Adapter”. I then spent a while working out how to get my Mac to NAT the VMs. Here is the solution I came up with.

Commands

1. Enable IP forwarding:

$ sudo sysctl -w net.inet.ip.forwarding=1

If you want this to be persistent across reboots then you can add it to /etc/sysctl.conf.
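
Something along these lines will do it (the file may not exist yet on a fresh install):

$ echo "net.inet.ip.forwarding=1" | sudo tee -a /etc/sysctl.conf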

2. Edit the pfctl configuration file (/etc/pf.conf) adding the following line below “nat-anchor”:

nat on en0 from vboxnet0:network -> (en0)

The above assumes that your Mac is connected to the internet over Airport/WiFi (en0) and that you want to allow the first VirtualBox “Host-only Network” (vboxnet0) to be NAT’d.

It is also possible to use natd & ipfw, as covered here, but they are deprecated in Mavericks, so you should probably adopt pfctl now.

3. Once pf.conf has been modified the file needs to be loaded:

$ sudo pfctl -f /etc/pf.conf

4. … And you need to ensure that PF is enabled:

$ sudo pfctl -e

The VMs will now be able to access the internet via their “Host-only Adapter” through the physical host.

Result

This gives me a situation where my Mac can see all my VMs, my VMs can see each other and, importantly, my VMs can get out to the internet – all with a single interface on each VM. KISS.

[Update] Persistent Through Reboot

As mentioned above the sysctl change can be made persistent through reboots by writing it to /etc/sysctl.conf, which you will probably need to create unless you’ve already been tinkering.

The changes to /etc/pf.conf will remain in place through restarts, but there are two changes required to ensure that PF is brought up automatically after startup. The first change is covered in section 2 of this Mavericks Server knowledge base article from Apple. The part that is relevant is:

$ sudo defaults write /System/Library/LaunchDaemons/com.apple.pfctl ProgramArguments '(pfctl, -f, /etc/pf.conf, -e)'

The chmod & plutil commands that follow the above in the article were not required in my case as the permissions on the file were already appropriate and the plist file was already in XML format. That said, there is no harm in running them.

The other change you will need to make is to modify the syntax used in /etc/pf.conf. At the point PF is started the “vboxnet0” interface will not exist and therefore PF will not be able to determine the required network information for it, which results in PF not being enabled. In order to avoid this problem it is necessary to switch to the following syntax (assuming you haven’t changed vboxnet0 from the default configuration):

nat on en0 from 192.168.56.0/24 -> (en0)

If you have changed from the default configuration or use multiple “Host-only Networks” in your VirtualBox environment then I’d imagine you can work out how to match the above to your environment.

Kdump to NFS in UEK (Solution)

I’ve previously written about a problem I encountered when kdump is configured to write to an NFS location with UEK (in Exadata software version 11.2.3.2.1). I’m pleased to report that the root cause of the problem has been identified and there is a very simple workaround.

There were some frustrating times working this particular SR, the most notable being a response that was effectively, “It works for me (and so I’ll just put the status to Customer Working).”

After a bit more to-ing and fro-ing it emerged that the environment where Oracle had demonstrated kdump could write to NFS had the NFS server on the same subnet as the host where kdump was being tested. After a quick test of my own, using a second compute node in the Exadata system as the NFS server, I confirmed that kdump was able to write to an NFS location on the same subnet in my environment as well.

Soon after reporting the above test in the SR I was pointed to MOS note 1533611.1, which unfortunately is not publicly available (yet) and so I cannot read it… The crux of the issue is that the network interface configuration files have BOOTPROTO=none and kdump is not handling this appropriately, which results in an incomplete network configuration for bond1 when switching to the dump kernel during a crash.

The fix: Change BOOTPROTO=none to BOOTPROTO=static
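
In other words, in the interface configuration files that kdump depends on (bond1 and its slave interfaces in this case; the file name below is illustrative and will vary by environment):

$ sed -i 's/^BOOTPROTO=none$/BOOTPROTO=static/' /etc/sysconfig/network-scripts/ifcfg-bond1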

The part of all this that I find interesting is that the documentation for RHEL 5 and RHEL 6, and even RHEL 7, all list the following options for BOOTPROTO:

  • none
  • bootp
  • dhcp

“static” does not appear to be a formally valid value. In an attempt to find more information about the behaviour I only got as far as Red Hat BZ#805803 and BZ#802928, neither of which I can access directly, but I can see a summary here and here respectively.

In conclusion, it appears that the issue is actually a kdump bug; more specifically, a mkdumprd bug. Thankfully the workaround is simple; it just took a long time to get to it.

Hidden Perils of Thin-style Service Name Syntax with UCP & FCF

I’ve touched on this issue before over at my previous blog and also talked about it at UKOUG Tech13. The presentation at Tech13 was more than 90% demos, so it hasn’t left a lasting record of the issue, and the previous blog post doesn’t go into significant detail. I think it is worth highlighting what I see as a potentially very significant issue when configuring Oracle’s Universal Connection Pool (UCP) for Fast Connection Failover (FCF).

I’ll cut straight to the chase: if you use Thin-style Service Name Syntax (aka Easy Connect) in the URL for UCP when using FCF, then you are leaving yourself open to wondering why your connection pool has not opened any connections on an instance where you have just started the service it uses. If you have enabled logging for UCP (11.2) then you will find yourself looking at the following:

oracle.ucp.jdbc.oracle.OracleDatabaseInstanceInfoList.getConnectionToNamedInstance URL invalid for connecting to named instance

The scenario in which you encounter this is specific: A service is started on an instance that didn’t offer the service at the time the connection pool was established.

A few examples of when this can happen are:

  1. You have a service that is defined as preferred on some instances and available on others, one of the instances where the service is preferred dies, and the service starts on an available instance.
  2. You have a service that is preferred on an instance, but that instance is not up and therefore not offering the service at the time the application connection pool is established.
  3. You have a service that has failed over to an available instance at the time the application connection pool is established and you then want to move the service from an available instance to a preferred instance without forcing a disconnect.

Returning to the error message “URL invalid for connecting to named instance” and considering the full syntax for Easy Connect, specifically the option to include the “instance_name”, this seems like a JDBC bug (as I think I recall Tim Hopkins suggesting).

So what’s the answer? Simply ensure that your clients use an Oracle Net connection descriptor. In case it needs stating: there is no problem with using the SCAN in the Oracle Net connection descriptor. I currently believe that using LDAP or TNS entries will also work fine as they ultimately result in the client having an Oracle Net connection descriptor, but I haven’t found the time to test them yet.
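
To make the difference concrete, here are the two URL styles side by side (host, port and service name are placeholders):

Thin-style Service Name Syntax (Easy Connect), exposed to the problem described above:

jdbc:oracle:thin:@//scan.example.com:1521/myservice

Full Oracle Net connection descriptor, which avoids it:

jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=scan.example.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=myservice)))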

Anyway, I live in a world where restarting your application or relying on inactiveConnectionTimeout in order to establish connections to all instances that offer the relevant service would not be seen as acceptable, and rightly so. We have the technology available to allow us to move database services around without the application feeling it. You just need to make sure you have everything set up exactly how it needs to be.

It’s worth mentioning that Martin Bach recently published a post on Application Continuity where he also encountered a problem using the Thin-style Service Name Syntax.

12c Improvement in Database Service Management

Removing a service via srvctl has not historically resulted in the service being fully removed from the database; it would still be visible in DBA_SERVICES, as shown below in an 11.2.0.3 database:

Create the service:

$ srvctl add service -d orcl -s demo -r "ORCL1,ORCL2"

Have a look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCL

Start the service:

$ srvctl start service -d orcl -s demo

Have another look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCL
demo

Stop the service:

$ srvctl stop service -d orcl -s demo

Remove the service:

$ srvctl remove service -d orcl -s demo

Have yet another look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCL
demo

So as you can see the service remains in the database and is visible through DBA_SERVICES.

The only way I’m aware of to remove the service from the database is via DBMS_SERVICE.DELETE_SERVICE:

SYS@ORCL1> exec dbms_service.delete_service('demo') 

PL/SQL procedure successfully completed.

SYS@ORCL1>

A final look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCL

I can imagine you’re thinking, “So what?”

Well, this can have some knock on implications as I recently demonstrated during my presentation on Fast Connection Failover at UKOUG Tech13.

Anyway, I’m pleased to report that Oracle have done a bit of tidying up in 12c and now when you remove a service via srvctl it is also removed from the database (most of the time).

Repeating the tests above in a 12.1.0.1 database [Strictly I should be using the new syntax for srvctl in the 12c example that follows, but as the old syntax still works and is consistent with the 11.2.0.3 example above I've stuck with it here]:

Create the service:

$ srvctl add service -d orcl -s demo -r "ORCL1,ORCL2"

Have a look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCLXDB
ORCL

Start the service:

$ srvctl start service -d orcl -s demo

Have another look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
demo
ORCLXDB
ORCL

Stop the service:

$ srvctl stop service -d orcl -s demo

Have another look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
demo
ORCLXDB
ORCL

It’s still there.

Remove the service:

$ srvctl remove service -d orcl -s demo

Have yet another look in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
ORCLXDB
ORCL

SYS@ORCL1> 

It’s gone.

As I write this I realise most people (even those working with Oracle databases) aren’t going to be that thrilled by this minor improvement, but it pleases me.

The question that immediately occurred to me is, “So what happens if the database is down when the service is removed?”

Add and start the service:

$ srvctl add service -d orcl -s demo -r "ORCL1,ORCL2"
$ srvctl start service -d orcl -s demo

Verify that it exists in DBA_SERVICES:

SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
demo
ORCLXDB
ORCL

Stop the database, remove the service and start the database:

$ srvctl stop database -d orcl 
$ srvctl remove service -d orcl -s demo
$ srvctl start database -d orcl

Reconnect to the database and have a look in DBA_SERVICES:

SYS@ORCL1> conn / as sysdba
Connected.
SYS@ORCL1> select name service_name from dba_services;

SERVICE_NAME
----------------------------------------------------------------
SYS$BACKGROUND
SYS$USERS
demo
ORCLXDB
ORCL

SYS@ORCL1> 

So in this situation you’re left to clear things up with DBMS_SERVICE yourself.

At the risk of going overboard with testing scenarios, I also wondered what would happen if only the local instance was down. I’ll save you the output and simply state that if the local instance is down then the service is still cleaned up in the database (by oraagent.bin).

Kdump to NFS Broken in UEK

There were problems that affected UEK and NFS when 11.2.3.2.1 was initially released (as covered by Andy Colvin). As mentioned in the comments of Andy’s post: Oracle released an updated ISO with fixes for this problem (patch 16432033).

There were also problems with kdump not functioning after 11.2.3.2.1 as listed in “Issue 1.17” of Oracle MOS “Exadata 11.2.3.2.1 release and patch (14522699) for Exadata 11.1.3.3, 11.2.1.2.x, 11.2.2.2.x, 11.2.2.3.x, 11.2.2.4.x, 11.2.3.1.x, 11.2.3.2.x (Doc ID 1485475.1)”.

After updating to 11.2.3.2.1 using the updated ISO (with NFS fixes) and installing the later version of kexec-tools as per the fix for “Issue 1.17”, I was not able to get kdump to write to NFS during a crash.
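
For context, pointing kdump at NFS on these EL5-based images is just a couple of lines in /etc/kdump.conf along these lines (the server name and path are made up, and the exact directives depend on the kexec-tools version):

net nfs-server.example.com:/export/crash
path /exadata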

I had previously tested kdump to NFS when running Exadata software version 11.2.2.4.2, so I knew that worked. 11.2.2.4.2 uses the RHEL kernel (“non-UEK” as Oracle would have you say) so I decided to test 11.2.3.2.1 after switching to the RHEL kernel, which also involves reverting the kexec-tools to the earlier version. That confirmed the issue was only evident when using UEK. Time for an SR…

Bug 17730336 – UNABLE TO KDUMP OVER NFS WITH KEXEC-TOOLS-2.0.3-3.0.4.EL5.X86_64.RPM INSTALLED

The bug listed above has been raised, but no fix has been supplied yet.

Not being able to direct kdump to NFS is probably not going to be your biggest worry, but definitely something to be aware of if you’re running UEK.

Note that I have only tested with 2.6.32-400.21.1.el5uek. Oracle have confirmed they have reproduced the issue with 2.6.39-400.126.1.el5uek. Other UEK kernels may or may not be affected.

Sudo Keystroke Optimisation

If, like me and a couple of others I’ve spoken to recently, you were not previously aware of “sudo -i”[1], then you might be interested to know that you can save yourself two keystrokes by switching from:

sudo su -

To:

sudo -i

From the man page:

-i, --login

Run the shell specified by the target user’s password database entry as a login shell. This means that login-specific resource files such as .profile or .login will be read by the shell. If a command is specified, it is passed to the shell for execution via the shell’s ‑c option. If no command is specified, an interactive shell is executed. sudo attempts to change to that user’s home directory before running the shell. The command is run with an environment similar to the one a user would receive at log in. The Command Environment section in the sudoers(5) manual documents how the ‑i option affects the environment in which a command is run when the sudoers policy is in use.

Simple, but useful if you’re someone that enjoys the quest to reduce keystrokes.

Footnotes
[1] – This post was originally written about “-s”, but as pointed out by Paul in the comments this is not equivalent to “sudo su -”.