NIRSPEC Reliability Improvement Progress report: 12
Mar 2004
Overview:
During this report period, significant progress has been made on a number
of fronts.
It appears that time lost on sky is decreasing.
We are ready to test what we hope will be the first release version of
the new keyword and rotator server codes.
Investigation of the power supplies is well underway.
Ia. Speeding recovery from server crashes:
The recovery script was finished about six weeks ago.
Time lost to weather has been high for some time now, so we may
still be in the regime of small-number statistics when evaluating
our success. Compared to last summer, when we first
started working on (what became) this task, time lost has decreased by
about a factor of two or three. Since the last meeting,
the recovery script has been run three times at night.
On the first two occasions, the script correctly diagnosed the problem
as not being a server crash and took the appropriate action.
No time on sky was lost as a result. On the third occasion,
the script failed because the iBoot device addressing the computer room
black box seemed to be off-line.
IIa. Upgrade instrument host:
This task (as befits a background task) has not seen much progress.
Work has started, though, on replacing explicit references to waimea with
an environment variable that will allow future hot swaps.
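As a minimal sketch of the idea, assuming a Python client and a hypothetical
variable name NIRSPEC_HOST (the report does not name the actual variable or
the language of the existing codes), the host lookup could be centralized so
that a swap needs only a change to the environment:

```python
import os

# Hypothetical variable name; the report does not say what the real one is called.
HOST_ENV_VAR = "NIRSPEC_HOST"
# Current host named in the report, used here as the fallback.
DEFAULT_HOST = "waimea"


def instrument_host(environ=None):
    """Return the instrument host name, preferring the environment variable."""
    env = os.environ if environ is None else environ
    # Treat an unset or blank variable as "not configured".
    host = env.get(HOST_ENV_VAR, "").strip()
    return host or DEFAULT_HOST


if __name__ == "__main__":
    print(instrument_host())
```

Falling back to the current host keeps existing setups working while the
explicit references are being removed.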
IIb. Correlation research:
This task continues as a pure background task.
IIc. Crash-free periods:
It appears we have exhausted this avenue of research and we now consider
this task complete.
IId. Characterize communications chain:
This task is now finished. No glaring deficiency
was found in the communications chain hardware which would account for
the frequency of server crashes witnessed over the past year.
IIe. Examine power supplies:
Prompted by NIRSPEC's sensitivity to power glitches, this task got underway
during this report period.
The instrument is on a double-conversion UPS. In theory,
no transients should make it through to NIRSPEC.
The computer room UPS switches from line to battery during interruptions.
Analysis of a memory card should show whether recent power glitches due to the
snowstorm created transients.
No evidence has been observed that any of the power supplies is overburdened.
Documentation is scanty, but we have found one PSU that is drawing much
more than the quoted figure (about 3 A versus 1 A).
Further investigation is planned.
IIIa. Reduce communications traffic:
This task accounted for most of our effort during this report period.
Going into our most recent engineering night, the new keyword server
and rotator server codes worked in stationary mode but not PA mode.
From the half night of engineering on March 3:
The codes now work for most of the sky in PA mode. Above
roughly 80 deg elevation, PA mode was failing.
The problem with high elevations is now thought to be understood and corrected.
No server crashes were encountered despite running the instrument in the
manner thought to make them most likely.
Work on removing the handling of temperature data by the transputers has
been deferred.
Lack of time for testing on sky is a concern. The next scheduled
night is June 25.
Issues and Concerns:
The shortage of engineering time continues to be our biggest concern.
The E-TAC was very helpful in granting the time on January 27 and
March 3.
We are investigating a trade with NIRC2 (March 30 for June 25).
Without more engineering time, options to maintain progress may include
taking time from observers and/or releasing the new server codes without
further testing.