An Unusual Problem Starting Oracle Clusterware
Andy Kerber (@dbakerber), Senior Consultant
NOTE: The environment described below is SUSE 12 / Oracle RAC 12.1.0.2.
A few days ago we ran into a really strange problem at one of our sites. We were putting together a test system, and a mistake was made in our three-node cluster configuration: it was configured with too many HugePages. This resulted in a series of crashes before the problem was resolved.
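As a quick sanity check for this kind of misconfiguration, the sketch below reports how much physical memory is reserved for HugePages. It assumes the standard Linux /proc/meminfo fields (MemTotal, HugePages_Total, Hugepagesize); the function name hugepages_report is made up for illustration, and what counts as "too many" depends on your SGA sizing, so treat the percentage as a hint only.

```shell
# Report how much physical memory is reserved for HugePages.
# Usage: hugepages_report [meminfo_file]   (defaults to /proc/meminfo)
# hugepages_report is a hypothetical helper name, not an Oracle or OS tool.
hugepages_report() {
    local meminfo="${1:-/proc/meminfo}"
    awk '
        /^MemTotal:/        { mem_kb  = $2 }
        /^HugePages_Total:/ { pages   = $2 }
        /^Hugepagesize:/    { page_kb = $2 }
        END {
            if (mem_kb == 0) { print "MemTotal not found in input"; exit 1 }
            huge_kb = pages * page_kb
            printf "HugePages reserve %.0f kB of %.0f kB (%d%%)\n", huge_kb, mem_kb, int(huge_kb * 100 / mem_kb)
        }
    ' "$meminfo"
}
```

Running hugepages_report with no argument reads the live /proc/meminfo on the node. A reservation that leaves little memory for the operating system and the Clusterware processes is worth a second look.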
Once that was resolved and HugePages was set correctly, Oracle Clusterware would come up completely on only two of the three nodes. On the third node, it came up only partially:
oracle@trac1:~> crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4529: Cluster Synchronization Services is online
CRS-4534: Cannot communicate with Event Manager
We spent hours going round and round on this. We rebooted the server, analyzed the Clusterware log, and tried everything else we could think of. This was in the Clusterware alert log:
2017-07-13 16:35:13.387 [OCTSSD(13691)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 13691
2017-07-13 16:35:13.389 [OCTSSD(13691)]CRS-2405: The Cluster Time Synchronization Service on host trac1 is shutdown by user
2017-07-13 16:35:13.390 [OCTSSD(13691)]CRS-8504: Oracle Clusterware OCTSSD process with operating system process ID 13691 is exiting
2017-07-13 16:35:51.675 [ORAROOTAGENT(13363)]CRS-5019: All OCR locations are on ASM disk groups [OCR], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/oracle/app/oracle/diag/crs/trac1/crs/trace/ohasd_orarootagent_root.trc".
We could not figure it out. We were not shutting down the CTSS process. Google indicated that this sort of thing is often caused by the inability to create a pid file. But the file referenced in those reports was always located in /oracle/app/22.214.171.124/grid/crs/init.
On our service request (SR), we asked Oracle for the location of the pid file for CTSS, and they gave us this location: $ORACLE_BASE/crsdata/trac1/output/octssd.pid. The file was present.
Finally, after struggling with this for a couple of days, and following the instructions given by Oracle Support, we ran strace on 'crsctl start res ora.asm -init':
root@trac1: strace -f -o /tmp/crstrc.log crsctl start res ora.asm -init
In the log file, /tmp/crstrc.log, we found a message that the process was unable to open the file /oracle/app/126.96.36.199/grid/ctss/init/trac1.pid.
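An strace log from a Clusterware command can be very long, so it helps to filter it down to the open calls that failed. The following is a minimal sketch, assuming the usual strace line format in which a failed call ends in "= -1" followed by an errno name. The function name failed_opens is hypothetical, and the pattern deliberately ignores calls that strace splits into "unfinished" and "resumed" halves.

```shell
# List each failed open/openat call in an strace log as "ERRNO path".
# Usage: failed_opens /tmp/crstrc.log
# failed_opens is a hypothetical helper name, not part of strace.
failed_opens() {
    # Keep only open()/openat() lines whose return value is -1,
    # then reduce each one to the errno name and the quoted path.
    grep -E '(^|[[:space:]])(open|openat)\(.*\) = -1 E[A-Z0-9]+' "$1" |
        sed -E 's/.*"([^"]+)".*= -1 (E[A-Z0-9]+).*/\2 \1/' |
        sort -u
}
```

Most entries in such a list are harmless, because programs routinely probe for optional files. The interesting lines are the ones under the Grid home or $ORACLE_BASE.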
Well, from there I remembered several blogs about having problems with the same file, but under CRS rather than CTSS, so I attempted the same solution. As root I ran:
root@trac1: touch /oracle/app/188.8.131.52/grid/ctss/init/trac1.pid
Then I shut down Clusterware.
Next I started Clusterware, and everything came up fine. The lines in the alert log were as expected this time:
2017-07-16 13:42:12.378 [OCTSSD(29082)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 29082
2017-07-16 13:42:13.374 [OCTSSD(29082)]CRS-2403: The Cluster Time Synchronization Service on host trac1 is in observer mode.
2017-07-16 13:42:13.470 [OCTSSD(29082)]CRS-2407: The new Cluster Time Synchronization Service reference node is host .
2017-07-16 13:42:13.471 [OCTSSD(29082)]CRS-2401: The Cluster Time Synchronization Service started on host trac1.
2017-07-16 13:42:32.663 [OSYSMOND(29296)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 29296
2017-07-16 13:42:33.684 [CRSD(29336)]CRS-8500: Oracle Clusterware CRSD process is starting with operating system process ID 29336
The problem was resolved. I mentioned in the notes of our SR that Oracle should update their documentation on this subject, because it does mention pid files, but primarily just the files under CRS. One thing that would be really helpful to see from Oracle as well is a list of all the locations where Clusterware puts files. Among the locations I have learned to check so far are:
/tmp/.oracle
/var/tmp/.oracle
$ORA_CRS_HOME/<process name (crs, ctss, etc)>/init
$ORACLE_BASE/crsdata/<node_name>/output
No doubt there are more locations, but these are the directories I have looked at to date.
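Checking the init directories by hand on every node gets tedious, so here is a small sketch that looks for <node_name>.pid under $ORA_CRS_HOME/<process name>/init. The function name check_pid_files is hypothetical, and the default process list (crs and ctss) covers only the two directories mentioned in this post; pass other process names as extra arguments if your installation has them.

```shell
# Check for <node>.pid under <grid_home>/<process>/init for each process.
# Usage: check_pid_files GRID_HOME NODE [PROCESS...]
# check_pid_files is a hypothetical helper; the default process list
# (crs, ctss) is an assumption based only on the directories named above.
check_pid_files() {
    local grid_home="$1" node="$2" missing=0 proc pid_file
    shift 2
    [ "$#" -gt 0 ] || set -- crs ctss
    for proc in "$@"; do
        pid_file="$grid_home/$proc/init/$node.pid"
        if [ -e "$pid_file" ]; then
            echo "ok      $pid_file"
        else
            echo "MISSING $pid_file"
            missing=1
        fi
    done
    # Non-zero exit status if any pid file was not found.
    return "$missing"
}
```

For the node in this post, the call would be check_pid_files "$ORA_CRS_HOME" trac1. The function only reports; it does not create anything, so the decision to touch a missing file stays with you.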
In this blog I described an unusual situation that we ran into when starting Oracle Clusterware. Hopefully, sharing our resolution of the issue will help lead you in the right direction if you discover that you are missing a pid file.