TCS Alpha crashes
Introduction
These notes are mainly aimed at computing staff who have
to investigate why an Alpha has crashed. The section about general
recovery procedures may be of interest to the Duty Engineer.
General Recovery Procedures
Try to establish whether the Alpha is responding. From
a Sparc:
ping lpasn
Check on the system console (the monitor by the Alpha in
the WHT, the VT220 in the INT) whether there is any
activity. When an Alpha crashes it writes messages to the
screen and then dumps memory to disk (the crash dump).
After writing the crash dump, the Alpha reboots by itself.
If there is no response at the console, interrupt the
system by pressing the halt (reset) button or control-P
on the console keyboard. Then follow the instructions to
take a crash dump.
If there is still no response, power the Alpha off and
back on. Bear in mind that it may also be necessary to
power cycle CAMAC.
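The first check above (does the Alpha answer a ping?) could be scripted along the following lines. This is only a sketch: the function name alpha_responds is made up for illustration, lpas1 stands in for the real lpasn host name, and it assumes a ping that exits by itself after an answer or a timeout, as the plain "ping lpasn" form above implies. On systems where ping runs until interrupted, the timeout_s limit ends the wait and the host is then reported as not responding.

```python
import subprocess


def alpha_responds(host, timeout_s=30, runner=subprocess.run):
    """Return True if `ping host` exits with status 0 within timeout_s seconds.

    `runner` defaults to subprocess.run and can be swapped for a fake
    in tests. The command is the plain `ping <host>` used in these notes.
    """
    try:
        result = runner(
            ["ping", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # No answer within the time limit, or ping could not be started.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # "lpas1" is a placeholder; substitute the real Alpha host name.
    host = "lpas1"
    state = "responding" if alpha_responds(host) else "NOT responding"
    print(f"{host} is {state}")
```

A False result only says that the network interface did not answer; the console check and the later steps are still needed to decide whether the machine has crashed.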
Common causes of a Crash
INVEXCEPTN
Investigating a Crash
Log in as SYSTEM or another privileged account.
1) SHOW SYSTEM shows processes running
and how long the system has been up.
2) Check the operator log:
SET DEF DSA0:[SYS0.SYSMGR]
TYPE/PAGE OPERATOR.LOG
Previous logs can be viewed by giving a file version number:
TYPE/PAGE OPERATOR.LOG;-1
3) The system error log records hardware errors and crashes:
SET DEF DSA0:[SYS0.SYSERR]
Convert the format, then analyze the converted file:
ANAL/ERR/ELV CONVERT ERRLOG.SYS
ANAL/ERR/SINCE=dd-mmm-yyyy ERRLOG.CVT
SHO ERROR
will show device errors
An explanation of the messages can be found in the
OpenVMS System Messages manual.
4) CLUE analyzes the dump when the machine is rebooted.
When a crash happens, the machine writes memory to
the dump file and then reboots. Sometimes the
(duty) engineer power cycles the Alpha or presses its
reset button before the dump has been completed.
SET DEF DSA0:[SYS0.SYSCOMMON.SYSERR]
DIR /SINCE=dd-mmm-yyyy CLUE*.*;* /DATE
TYPE/PAGE CLUE$LPASn_ddmmyy_hhmm.LIS
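The placeholders in the step 4 commands (dd-mmm-yyyy and CLUE$LPASn_ddmmyy_hhmm.LIS) can be filled in mechanically. The following sketch does so for a given node name and timestamp. The helper names are made up for illustration, LPAS1 is a placeholder node name, and the timestamp is assumed to be whatever date and time appear in the listing name, so check it against the DIR output.

```python
from datetime import datetime

# Month abbreviations are spelled out so the result does not depend on locale.
_MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
           "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def vms_date(stamp):
    """Format a date as dd-mmm-yyyy, the form used with /SINCE."""
    return f"{stamp.day:02d}-{_MONTHS[stamp.month - 1]}-{stamp.year:04d}"


def clue_listing_name(node, stamp):
    """Build CLUE$<node>_ddmmyy_hhmm.LIS for a node name and timestamp."""
    return f"CLUE${node.upper()}_{stamp:%d%m%y_%H%M}.LIS"


def clue_commands(node, stamp):
    """Return the three DCL commands from step 4 with placeholders filled in."""
    return [
        "SET DEF DSA0:[SYS0.SYSCOMMON.SYSERR]",
        f"DIR /SINCE={vms_date(stamp)} CLUE*.*;* /DATE",
        f"TYPE/PAGE {clue_listing_name(node, stamp)}",
    ]


if __name__ == "__main__":
    # Placeholder node name and timestamp, for illustration only.
    for line in clue_commands("LPAS1", datetime(2007, 3, 12, 14, 5)):
        print(line)
```

The commands are only printed, not executed; they are meant to be pasted into a privileged session on the Alpha.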
5)
TCS Software Manager
Last modified: FJG 12 Mar 2007