Recovering from DRAMA errors
The observing system uses DRAMA for inter-process communication. DRAMA can
fail in several interesting ways, all of which disable a part of the system.
Here's a procedure for recovering the DRAMA links by doing a full restart.
Symptoms
You should run the recovery procedure if any of these problems come up:
- The autoguider display complains about DRAMA communications being
uninitialized. This is most likely to occur when starting the system.
- If the run command (or bias, flat, glance etc.) complains that it gets
no response from any of the DAS servers. This can happen spontaneously
after successful runs; it may be connected with starting the autoguider
display.
- If you get messages from any part of the system complaining of
buffer-size problems, too many files being open, or errors concerning
semaphores. These are all to do with DRAMA.
- If you have to restart the autoguider VME computer for any reason.
- If the system isn't working and you don't know of another recovery
procedure (but see `anti-symptoms', below).
- If it's cloudy and you've nothing better to do. Nobody knows for certain
whether gratuitous restarts help or hinder. The case against says that
a working system shouldn't be disturbed. The case in favour says that
regular restarting removes gradual corruption of the DRAMA message net
and the system would fail eventually anyway.
Anti-symptoms
There are certain well understood problems where a full restart is a waste
of time:
- If the DAS' FOX rack floods, abort any exposures in progress, then give
the command dasreset
on the console of the DAS computer (the VT220 terminal next to the
observer's bench). Don't give the command at the SYS> or TO>
prompts because it doesn't work on the system computer. Now link the
DAS back into the overall system by giving the command
startobssys
at the observer's or TO's console.
- Comms. failures to the shutter/filter controller aren't DRAMA failures.
You need to recover the terminal server in the Cass. cluster.
- If you have an unwanted run in progress, or if your run has hung, please
try to clear it with the abort command (or the abort button on the WFC
GUI). The abort has worked if the run button pops out (if it stays
blue and pushed in, then the run is still in progress)
and the percentage-written-to-file displays change from blue to green
or red. If you find you can't abort, do the restart procedure.
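The choice between a specific fix and the full restart can be written down as a small lookup. The following is only a sketch in Python: the symptom keys and the function name recommend are invented for illustration and are not part of the observing system; the actions are the ones described in the text above.

```python
FULL_RESTART = "full restart procedure"

# Anti-symptoms: problems for which the text gives a cheaper, specific fix.
# The dictionary keys are hypothetical labels, not names used by the system.
SPECIFIC_FIXES = {
    "fox_rack_flood": "abort exposures, dasreset at the DAS console, "
                      "then startobssys at the observer's or TO's console",
    "shutter_filter_comms": "recover the terminal server in the Cass. cluster",
    "hung_or_unwanted_run": "try abort first; full restart only if abort fails",
}


def recommend(symptom):
    """Return the recovery action suggested by the text for a symptom label."""
    if symptom in SPECIFIC_FIXES:
        return SPECIFIC_FIXES[symptom]
    # Listed DRAMA symptoms and unrecognised failures both fall back to the
    # full restart, following the "no other recovery procedure" bullet.
    return FULL_RESTART


if __name__ == "__main__":
    for label in ("shutter_filter_comms", "no_response_from_das_servers"):
        print(label, "->", recommend(label))
```

Anything that is not a known anti-symptom falls through to the full restart, which mirrors the advice in the symptoms list.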
The procedure
The first stage is to eliminate any programs that are holding DRAMA connections
to or in the system computer:
The second stage is to bring the various sub-systems on-line in the best
order:
- Wait 60 seconds for the autoguider to reboot.
- At the DAS console, enter the command
startobssys
and wait until you see the `loaded' message: there should be one
of these messages for IDS and four for WFC.
- At the SYS> prompt, enter the command
startobssys
and wait for the SYS> prompt to come back, signalling the end of the
start-up sequence. This loads the server programs on the system
computer and takes about three minutes.
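As an aid to memory, the second stage can be laid out as an ordered checklist. This is a sketch in Python, not a tool that exists on the system: the Step class and render_checklist function are hypothetical, and the script drives nothing; it only prints the steps, in the order given above, with the sign that each has finished.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    where: str      # console or prompt where the operator acts
    action: str     # what to type or do
    done_when: str  # how the text says you know the step has finished


# The three second-stage steps, in the order the procedure gives them.
SECOND_STAGE = [
    Step("autoguider", "wait 60 seconds", "autoguider has rebooted"),
    Step("DAS console", "startobssys",
         "'loaded' messages seen: one for IDS, four for WFC"),
    Step("SYS> prompt", "startobssys",
         "SYS> prompt returns (about three minutes)"),
]


def render_checklist(steps):
    """Return the steps as numbered lines an operator can tick off."""
    return [
        f"{number}. [{step.where}] {step.action} -> {step.done_when}"
        for number, step in enumerate(steps, start=1)
    ]


if __name__ == "__main__":
    print("\n".join(render_checklist(SECOND_STAGE)))
```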
Mistakes to avoid
There are a few errors of timing and ordering which you need to avoid for the
recovery to work reliably:
- Omitting to reboot the autoguider. There is circumstantial evidence
that the autoguider rack is involved in all the DRAMA failures, DRAMA
for VxWorks being rather weaker than DRAMA for Solaris or VMS. It is
quite likely that `ghost' connections held open by the autoguider can
prevent DRAMA on the system computer from restarting.
- Omitting to close down the DAS. Corruption of DRAMA may have propagated
to the DAS computer even if there are no overt symptoms there. Or the
DAS may have caused the corruption.
- Not allowing the autoguider enough time to reboot.
- Doing startobssys from the TO> prompt at the same time as doing it from
the SYS> prompt. The two start-up scripts will fight over the resources
and one of them must lose.
- Doing startobssys at the TO> prompt when no-one is logged in
on lpss13's console. If no windowing system is available on lpss13,
then the observer's displays don't come up.
- Starting the central programs on lpss13 before starting the autoguider,
DAS or TCS. In this case, you'll not have DRAMA links to the satellite
sub-systems (the connection displays on the WFC mimic will show orange).
Run startobssys again to reconnect.
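Several of these mistakes are errors of order or timing, so they can be checked against a record of what was done and when. The sketch below, in Python, assumes a hand-written list of (seconds, event) pairs; the event names and the check_restart function are invented for illustration and correspond to nothing logged by the real system. The 60-second figure is the autoguider wait from the procedure.

```python
MIN_AUTOGUIDER_WAIT_S = 60  # from the procedure: wait 60 seconds for the reboot


def check_restart(events):
    """Return a list of mistakes found in a record of a restart.

    events is a list of (seconds_since_start, name) pairs. The names
    (das_shutdown, autoguider_reboot, das_startobssys, sys_startobssys)
    are hypothetical labels chosen for this sketch.
    """
    times = {name: t for t, name in events}
    problems = []

    reboot = times.get("autoguider_reboot")
    das_start = times.get("das_startobssys")
    sys_start = times.get("sys_startobssys")

    if reboot is None:
        problems.append("autoguider was not rebooted")
    if "das_shutdown" not in times:
        problems.append("DAS was not closed down")

    # The first startobssys of either kind must wait for the autoguider.
    starts = [t for t in (das_start, sys_start) if t is not None]
    if reboot is not None and starts:
        if min(starts) - reboot < MIN_AUTOGUIDER_WAIT_S:
            problems.append("autoguider was not given 60 s to reboot")

    # The central programs must come up after the DAS, not before it.
    if sys_start is not None and (das_start is None or das_start > sys_start):
        problems.append("central programs started before the DAS")

    return problems


if __name__ == "__main__":
    hasty = [(0, "autoguider_reboot"), (30, "sys_startobssys")]
    for problem in check_restart(hasty):
        print(problem)
```

An empty list means none of the checked mistakes was found; it does not cover the mistakes that depend on who is logged in on lpss13 or on two scripts running at once.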