R2D2 Fault Recovery
Possible failures on Startup
Failure at startup may mean no data all night, if repair
requires visiting the Tower and no other member of ING staff is
present to supervise your safety.
To avoid this, shutdown and crashes must be correctly handled at
the end of each session.
Here is a list of symptoms to distinguish between different faults that can show up at the start of observing and a description of how to address them.
- Telescope fails to slew: coordinates do not change on the GUI, differential
coordinates remain the same on the server feed, and the message
#3 Cannot Perform Slew is seen on the server feed.
- This occurs if the Server was not shut down at the end of the previous session. Once the Mount receives the Park command, the monitor cannot be restarted from the GUI. The mount must be initialised by carefully following every step of the procedures
Shutdown and then Startup in the Manual.
- If the USB camera cannot connect, you get a crash on the server with messages
"Creating camera...Device not found...Segmentation fault".
- This will probably require reseating the USB connector on the PC. Since that means going to the Tower, it is worth trying a soft reboot of the PC (sudo reboot) first.
- Caught system exception TRANSIENT --- unable to contact the
server. This is a Corba error and
requires restart of service, with this command:
- If it fails to contact the atlas database, the program will
stop there and the cause will be evident. Try ping atlas or
check status of network nameserver. If either fails, report network problems to Luis or Alegria.
- If the GUI starts up with an additional pop-up with the message
Unable to establish communication with DIMM Server, this looks like a Corba
failure. However, two occurrences have required rebooting the mount. This
requires going to the Tower and pressing the rocker switch on the mount.
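The symptom strings quoted in the list above can be matched mechanically. The following is only a sketch: the function name classify_startup_fault and the wording of the hints are hypothetical, and only the quoted message fragments come from this page. It assumes each message arrives as a single line of Server output.

```shell
#!/bin/sh
# Hypothetical helper: map one line of Server or GUI output to the recovery
# hint given in the startup fault list. Only the matched message fragments
# are taken from this page; names and hint wording are illustrative.
classify_startup_fault() {
    case "$1" in
        *"Cannot Perform Slew"*)
            echo "mount: follow Shutdown then Startup in the Manual" ;;
        *"Device not found"*)
            echo "camera: try soft reboot, then reseat USB connector" ;;
        *"TRANSIENT"*)
            echo "corba: restart the service" ;;
        *"communication with DIMM Server"*)
            echo "dimm-link: may need mount reboot at the Tower" ;;
        *)
            echo "unknown" ;;
    esac
}
```

For example, classify_startup_fault "#3 Cannot Perform Slew" prints the mount hint. Anything that does not match falls through to "unknown" and should be diagnosed by hand.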
Pointing problems:
If you suspect pointing problems, first check the server message spool to see whether the program is searching and repeatedly says "No stars founds" (literally). Next, verify that sky conditions are clear or that other DIMMs are producing data. If both of these are true then it may still be a focus problem, so check the response to the [Focus+] button in the GUI in the server message spool. If this shows no response (it may even terminate the server), then report it in the Fault database.
Next, check that the time being updated on the GUI is UTC. If it is a few minutes wrong, check the time on dimmserver using the "date" command and force a sync to ntpserver2 or 3 if necessary. Put this information in a fault report. If the GUI time is exactly 1 hour wrong, report it as a Fault.
Pressing Stop Loop, waiting for parking to complete and then pressing Start Loop again is a reasonable response to bad pointing if none of the above clarifies the problem.
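The clock check described above distinguishes a drift of a few minutes from an error of exactly 1 hour. A minimal sketch of that decision follows. The function name classify_clock_offset is hypothetical, and the 60 second tolerances are assumptions made here because the GUI time is read by eye. The input is the GUI time minus true UTC, in seconds.

```shell
#!/bin/sh
# Hypothetical helper: classify the offset (seconds) between the time shown
# on the GUI and true UTC. The 60 s tolerances are assumptions, not values
# taken from this page.
classify_clock_offset() {
    offset=$1
    # Work with the absolute value so fast and slow clocks are treated alike.
    if [ "$offset" -lt 0 ]; then
        offset=$(( 0 - offset ))
    fi
    if [ "$offset" -le 60 ]; then
        echo "ok"
    elif [ "$offset" -ge 3540 ] && [ "$offset" -le 3660 ]; then
        echo "one-hour: report as Fault"
    else
        echo "drift: check date on dimmserver, sync NTP, file fault report"
    fi
}
```

The current UTC time in seconds is available from date -u +%s, so an offset can be formed by subtracting that from the GUI time converted to the same units.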
Failures while monitor is running:
- If the GUI freezes, i.e. is unresponsive, preventing any shutdown commands being sent to the Server, don't panic! The Dome can be closed and the monitor left running until tracking automatically stops. However, control may be recoverable and a clean shutdown carried out by starting a second GUI (on the same machine or another, see Start Client, point 3 above). From this second GUI, the Stop Loop and Park commands can be sent, but please note the Server needs to be restarted following a Park. See GUI operations, points 4 and 5 above.
- If the Server window freezes, check how long the program has been searching for a star without result. If this has accumulated to about 1 hour, the cause of the crash is likely to be the memory leak. This requires not only a hard restart of the PC but also a reset of the focuser.
If however the Server terminates, returning the prompt, while the telescope is slewing or tracking, it can be restarted without difficulty. The monitor processes will have stopped during the crash, but the mount will finish the slew and continue tracking and its new status will be read by the Server. The GUI, once restarted should work as before.
Server termination error messages
- Message
Terminate after throwing an instance of 'std_out_of_range' what( ):vector::_M_range_check:__n where n is a very large number. This is caused by low flux found in 1 image during analysis and can be caused by bad seeing on a fainter star such as Alpheratz or Regulus.
- Solution: Exit GUI, Restart Server then GUI, press [Start Loop]
- Message
Calculated box out of ccd dimensions, stopping machine . Caused by failing to find one of the star images during the centering phase during cloud or bad seeing on a fainter star.
- Solution: Exit GUI, Restart Server then GUI, press [Start Loop]
- Message
terminate called after throwing an instance of boost_exception_detail::clone_impl<.... . Crash in the calculation phase, probably caused by a missing image.
- Solution: Exit GUI, Restart Server then GUI, press [Start Loop]
- The above crashes may also occur if the Focuser has failed for a long time to correct the focus. A Timeout on the focuser itself will not crash the Server. If you see errors in the Server related to the Focuser, or otherwise suspect problems with it, check the Delta centroid plot on the
R2D2 Chart web page. The red points show the mean_y value and should be in the range -15 to -25. If they are not, the data quality may be affected and the Focuser would require a reset.
- mvIMPACT::acquire::EValTooSmall - communication problem with USB camera. May require a reboot, see below.
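The mean_y range quoted in the list above (-15 to -25) can be checked with a one-line comparison. This is a sketch only: the function name mean_y_in_range is hypothetical, the limits are treated as inclusive by assumption, and the value is assumed to be read off the Chart page by the operator.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when a mean_y reading lies within the range
# -25 to -15 (treated as inclusive, which is an assumption), exit 1 otherwise.
# awk is used so that non-integer readings compare correctly.
mean_y_in_range() {
    awk -v y="$1" 'BEGIN { exit !(y >= -25 && y <= -15) }'
}
```

A reading of -18.4 passes, so "mean_y_in_range -18.4 && echo focus OK" prints the message, while a reading of -10 fails and would point towards a Focuser reset.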
Initialization of hardware:
Soft Reboot of Dimmserver PC:
> sudo reboot (enter password)
This will not cause the focuser to lose its parameters or position.
Hard Restart of Dimmserver PC:
This is the only way to recover from a PC fatal crash, e.g., caused by the memory leak (see top of page). The Hard Reset appears to result in the focuser resetting its position to 0 (although physically it has not moved) and in losing its software settings, requiring the reset described here.
- Browse to URL masspdu.ing.iac.es and enter using apc/apc
- Click Device Manager tab
- Click Control (left menu)
- From Control Action menu, select "Reboot Immediate"
- Tick outlet no. 7, dimmserver PC power (currently
on)
- Click Next button
- Click Accept, having checked that the action says "dimmserver PC
power selected for Reboot Immediate"
- After about 1 minute, the PC will be up again and
accepting logins
- Log off APC connection (to allow other users to log in)
Shutdown and Restart of Mount control box:
- Soft shutdown of Mount control box using handset:
- Menu button, navigate with arrow keys (top right of main
keypad) down to Settings submenu, Enter
- Navigate down to the bottom using the arrow keys to the last item,
Shutdown Mount, Enter
- Press Enter again to confirm
OR
- Hard reset by holding down rocker switch on unit for 10
seconds to power down
- Wait 30 seconds before startup
- Power up by pressing rocker switch on unit for 1 second to