Troubleshooting the IMPB Observing System

This page is part of the ING manual WHT-INGRID-2: INGRID IMPB Software Operations Manual

Diagnosing Problems

Before doing anything, please check that the condition you are experiencing is not a known problem. If it is, please use the suggested workaround.

For unexpected or new fault conditions, your first port of call is the observing system diagnostic commands. These allow you to verify the state of the TCS link, to check that the Unix filesystem is OK, and to verify that the SDSU and EPICS Controller are in a state ready for observing. Using these commands you should be able to get a good idea of where the problem lies.

For hardware or computer infrastructure problems, please take a look at the Computer Infrastructure diagram. For problems that seem more software-related, take a look at the software environment.

Known problems and workarounds

The table below provides a list of all known problems with INGRID. This list may look long, but don't worry: in general, people have experienced few difficulties when using INGRID. The intention here is to provide as much information as possible to minimise any loss of observing time.

Please note: this list applies to the observing system release: ISP:V2.7, RTD:V2.1, SCRIPTS:V1.13, SDSU:V3.1 and to the TCS Release W14-1-10.

Problem Symptom Comment and Workaround

1. Bringing up the Vax ICL system when the TCS is already running will cause it to switch into "DECNET" communications mode. This will break the Observing System's DRAMA link to the TCS and cause failure of all subsequent TCS commands.
If you type 'tcsstat' at the IMPB Observing System prompt this will correctly diagnose the cause of the failure.
Don't restart the ICL system when INGRID is running (it's ok if it's already running: in fact you will need it to set up the instrument light path at the start of the night).
Don't type: "TEL_LINK_TO_TCS" either: this will cause the same problem.
If the system has got into this state, the simplest solution is to shut down the TCS and restart it. This will bring the TCS up in its default DRAMA communications mode.
An alternative solution (that allows you to keep the TCS running) is as follows:
1. Stop any existing TCS TELD task by closing the appropriate window at the TO's X terminal.
2. At the TCS prompt, type: 'COMMS DECNET' followed by 'COMMS DRAMA'. Ignore any worrying messages!
4. At the TO's X terminal, log on to lpas4 again using the WHT_LOGIN account.
5. Manually restart the TELD task by going into the "Options" menu and selecting the TELD option.
6. To confirm that the observing system link has been successfully re-established, type 'tcsstat' at the IMPB Observing System prompt.
The TCS startup procedures are currently under review. We hope to come up with a simplified solution before long.

2. The link to the Telescope Control System (TCS) is fragile. Occasionally, communication is lost between the Observing System and the TCS. This will cause all TCS-related activities to fail, including dithering and the collection of telescope information for the observational data files.
This is the only part of the Observing System that relies on DRAMA communications.
A good source of diagnostic information is provided by the 'tcsstat' command. This will generally provide guidance on the best course of action, whether it be restarting the TCS, the associated TELD task or the Observing System.
Note: you should very rarely need to reboot any computers.
If you need to bring up the Observing System for a second time you no longer need to shut it down beforehand. Just reissue 'startobssys'.

3. The offset command is unreliable. The offset command fails: it may appear to work, but repeated use of it is dangerous because the DRAMA messaging system runs out of buffer space. The dither scripts have now been recoded to use the TCS SLOWOFF facility, so unless you specifically want an offset there is no longer a problem.
If you need to do an offset manually, you may use the SLOWOFF facility. Type:
cmd TCS@LPAS4 SLOWOFF <x_offs> <y_offs> <rate>
The <rate> parameter should be specified in arcsec/second.
Note: cmd is a Unix system alias, not a core part of the Observing System. It may not be supported in the future, so bear this in mind when incorporating it into any scripts.
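If offsets are issued from a script, it may help to build the command line in one place. The following is a minimal Python sketch and is not part of the Observing System: the function name slowoff_command is hypothetical, and it only formats the command quoted above, leaving execution to the caller.

```python
def slowoff_command(x_offs, y_offs, rate):
    """Return the SLOWOFF command line for the given offsets and rate.

    rate is in arcsec/second, as stated in the manual. The offsets are
    passed through unchanged; their units are whatever SLOWOFF expects.
    """
    if rate <= 0:
        raise ValueError("rate must be a positive number of arcsec/second")
    # 'g' formatting drops trailing zeros, so 10.0 is written as 10.
    return "cmd TCS@LPAS4 SLOWOFF {:g} {:g} {:g}".format(x_offs, y_offs, rate)


print(slowoff_command(10, -5, 2.5))
```

Keeping the alias in a single helper also limits the changes needed if cmd is withdrawn in a later release.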

4. TCS writes incorrect RA and DEC headers. When executing the dither scripts, the telescope RA and DEC header items are sometimes half what they should be. We believe this is caused by performing data acquisition cycles (i.e. run/scratch) before the telescope has settled properly, i.e. while the TCS display says that either the telescope or the CASS rotator is "MOVING".
It is usually obvious from the Observing Log when this has happened.
Make sure nobody causes the telescope to move when a dither script is being executed.
A long term solution is still being investigated.
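When checking a long Observing Log, a simple test for "about half the expected value" can help pick out affected frames. The sketch below is only an illustration: the function name looks_halved, the tolerance and the sample RA values are all hypothetical, and it assumes the coordinates have already been read from the log as plain numbers.

```python
import math


def looks_halved(recorded, expected, rel_tol=0.01):
    """Return True if recorded is close to half of expected."""
    if expected == 0:
        # Half of zero is zero, so a halved value cannot be told apart.
        return False
    return math.isclose(recorded, expected / 2.0, rel_tol=rel_tol)


# Hypothetical RA values in degrees: the target and three logged frames.
target_ra = 150.0
logged_ra = [150.001, 75.0005, 149.999]
suspect = [i for i, ra in enumerate(logged_ra) if looks_halved(ra, target_ra)]
print(suspect)  # prints [1]
```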

5. TCS commands time out before telescope movement has been completed. Depending on the telescope's position on the sky, some telescope movements may take a long time to complete. The timeouts in the TCS "TELD" task are currently too short. This means that observing scripts may not synchronise properly with telescope movement. The 'gocat' command is vulnerable to this. Additionally, if you use SLOWOFF to slew across large areas of sky (>400 arcsec) at a slow rate you may run into difficulties.
The new TCS release W14-1-11 should cure this. Check whether it has been made operational yet.
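To judge in advance whether a slow offset is likely to outlast a timeout, the move time can be estimated as distance divided by rate. The sketch below rests on assumptions the manual does not confirm: that the offsets are in arcsec, that the rate applies along the straight-line path, and that the timeout is known. The TELD timeout is not given here, so timeout_s is a placeholder supplied by the caller, and both function names are hypothetical.

```python
import math


def slowoff_duration(x_offs, y_offs, rate):
    """Estimate the seconds a SLOWOFF move takes (straight-line assumption)."""
    if rate <= 0:
        raise ValueError("rate must be a positive number of arcsec/second")
    return math.hypot(x_offs, y_offs) / rate


def may_time_out(x_offs, y_offs, rate, timeout_s):
    """Return True if the estimated move time exceeds timeout_s."""
    return slowoff_duration(x_offs, y_offs, rate) > timeout_s


# A 500 arcsec diagonal move at 2 arcsec/second takes about 250 seconds.
print(slowoff_duration(300, 400, 2))
# 120 s is a placeholder timeout, not the real TELD value.
print(may_time_out(300, 400, 2, timeout_s=120))
```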

6. SDSU Controller sometimes responds incorrectly to commands or locks up. There are currently (18/11/00) two problems that we know about:
1. Occasionally the SDSU Controller fails to send back the replies expected of it, e.g. when queried for a camera id, it replies with error status.
2. The SDSU Controller sometimes gets into a 'lockup state' where it stops responding altogether.
If this happens the dither scripts (or higher-level scripts that use them) will fail. First you need to determine whether the SDSU Controller has locked up or not.
Type: 'detstat'
If the system replies with normal status information all is well - continue observing where you left off.
If the system comes back with an error condition you will need to reset and reload the SDSU Controller software. To do this:
a/ Type 'startobssys -interactive'
b/ Respond with N to all questions, except the one relating to initialising the SDSU.
Having reloaded the SDSU software, confirm all is well with a 'detstat', then continue observing where you left off.
A long term solution is still being investigated.
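The check-then-reload procedure above can be sketched as a small decision function. This is an illustration only: recover_sdsu is a hypothetical name, and the two callables stand in for an operator running 'detstat' and 'startobssys -interactive', because the manual does not describe a programmatic interface to either command.

```python
def recover_sdsu(detstat_ok, reload_sdsu):
    """Follow the check-then-reload procedure once.

    detstat_ok: callable returning True when 'detstat' shows normal status.
    reload_sdsu: callable that performs the SDSU reset and reload.
    Returns a short description of the outcome.
    """
    if detstat_ok():
        return "ok: continue observing"
    reload_sdsu()
    if detstat_ok():
        return "recovered: continue observing"
    return "still failing after reload"


# Stand-in callables: the first status check fails, the second succeeds.
states = iter([False, True])
print(recover_sdsu(lambda: next(states), lambda: None))
```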

7. SDSU Controller temperature information is sometimes incorrect. Typing 'detstat' may display alarmingly high temperatures, such as 253K. This information will also be recorded in the observational data files. If this happens during the Daily Checks you could try reloading the SDSU software ('startobssys -interactive') or power cycling the SDSU Controller.
Of course, if the problem doesn't go away it's a good idea to check that the INGRID cryostat really IS still cold!
If this happens at night time you can ignore the problem unless you're worried about the incorrect info being in your headers. Don't worry: INGRID image quality is not affected.
A long term solution is still being investigated.

8. EPICS Controller is unreliable. Sometimes the commands which move the INGRID mechanism wheels fail, i.e. filter, fwheel1, fwheel2 or pstop return with error status.
This problem has been traced to the EPICS mechanism controller, which sometimes fails, returning CAR_ERROR status.
The Observing System has now been changed to perform a single retry on mechanism failure. So far (18/11/00) this has recovered the situation in all cases, making the problem invisible to Observers.
A long term solution is still being investigated.
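The single-retry behaviour described above follows a common pattern, sketched here in Python. This is not the Observing System's actual code: move_with_single_retry is a hypothetical name, the mechanism command is represented by a callable, and the "OK" success status is invented for the example. Only CAR_ERROR comes from the text above.

```python
def move_with_single_retry(move):
    """Call move(); if it reports CAR_ERROR, try exactly once more."""
    status = move()
    if status == "CAR_ERROR":
        # One retry only: a second failure is returned to the caller.
        status = move()
    return status


# Stand-in mechanism command that fails once and then succeeds.
results = iter(["CAR_ERROR", "OK"])
print(move_with_single_retry(lambda: next(results)))
```

Limiting the retry to one attempt keeps a genuinely broken mechanism from being driven repeatedly while still hiding a transient failure.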
