NIRSPEC Reliability Improvement Progress report: 4
Feb 2004
Overview:
A month ago the project was asked to make significant progress on three
tasks. The tasks and their status are:
Fast/smart recovery script is finished.
Characterization of the communications chain is nearly complete (about
2 more days' work remaining).
The engineering night on January 27 was very successful.
The new server and rotator codes appear to work well in stationary mode.
The problems encountered with PA mode are understood and should be fixed
soon.
We have added a new task to examine NIRSPEC power supplies.
Three tasks (upgrading host, correlation research, crash free periods)
have moved further into the background.
Ia. Speeding recovery from server crashes:
This task is now complete. The fast/smart recovery script was
released in time for the late January NIRSPEC run and seems to be working
well. Depending on the nature of the server crash,
recovery should now be possible in 5 to 15 minutes. Just
as importantly, the script requires almost no decisions by the observer,
and initial reactions have been very positive. The failure
to recover swiftly from two crashes on the night of February
1 was unrelated to the script itself: a link to a library
file necessary for operation of the iBoots was incorrect but has now been
fixed.
IIa. Upgrade instrument host:
This task has moved further into the background. Any
work that does take place in the coming months will consist of replacing
explicit references to waimea in high level scripts with a variable host
name, followed by testing to confirm functionality.
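As an illustration only, the substitution described above could be sketched as follows. The variable name NIRSPEC_HOST, the helper names, and the sample script line are hypothetical; only the host name waimea comes from this report, and the real scripts may need a different approach.

```python
import re
from typing import List

HARD_CODED_HOST = "waimea"         # host name mentioned in this report
HOST_VARIABLE = "${NIRSPEC_HOST}"  # hypothetical variable reference

# Match the host name only as a whole word, so longer names are left alone.
_HOST_PATTERN = re.compile(r"\b" + re.escape(HARD_CODED_HOST) + r"\b")


def parameterize_host(script_text: str) -> str:
    """Replace whole-word references to the hard-coded host with the variable."""
    return _HOST_PATTERN.sub(lambda match: HOST_VARIABLE, script_text)


def remaining_references(script_text: str) -> List[int]:
    """Return 1-based line numbers that still mention the hard-coded host."""
    return [
        number
        for number, line in enumerate(script_text.splitlines(), start=1)
        if _HOST_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = "rsh waimea start_server\n"  # hypothetical script line
    print(parameterize_host(sample), end="")
    print(remaining_references(sample))
```

Running remaining_references over each script after conversion gives a quick check that no explicit reference was missed before functional testing begins.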
IIb. Correlation research:
This task continues as a pure background task. Recent
crashes have reinforced our suspicion that a significant fraction of server
crashes occur during times of high rotator demand.
In addition, our interest is piqued by the correlation between power glitches
and server crashes.
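One simple way to quantify such a correlation is to count the crashes that fall within a short window after a logged event. The sketch below assumes crash and glitch times are available as timestamps; the function name, the five-minute window, and the sample times are illustrative assumptions, not values from our logs.

```python
from datetime import datetime, timedelta
from typing import List


def crashes_following_events(
    crashes: List[datetime],
    events: List[datetime],
    window: timedelta = timedelta(minutes=5),  # assumed window, not a measured value
) -> List[datetime]:
    """Return the crashes that occur within `window` after any event."""
    return [
        crash
        for crash in crashes
        if any(timedelta(0) <= crash - event <= window for event in events)
    ]


if __name__ == "__main__":
    # Made-up timestamps, for illustration only.
    glitches = [datetime(2004, 1, 1, 22, 0)]
    crashes = [datetime(2004, 1, 1, 22, 2), datetime(2004, 1, 2, 3, 0)]
    matched = crashes_following_events(crashes, glitches)
    print(len(matched), "of", len(crashes), "crashes followed a glitch")
```

The same helper could be applied to the start times of periods of high rotator demand in place of the glitch times.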
IIc. Crash free periods:
Despite our continued search, no further NIRSPEC hardware or software changes
have been found which coincide with the rapid increase in the rate of server
crashes last spring. Our interest in NIRSPEC's
susceptibility to power glitches has led us to note, however, that the current
Keck II instrument UPS was brought into service around the same time.
IId. Characterize communications chain:
Delivery of the fiber attenuator and SCSI analyser near the end of December
allowed work to begin during this report period. Work was hampered
somewhat by weather and unrelated fiber work, but despite this the task
is now nearly complete.
The transmit fiber appears to be tolerant of the addition of 4-5 dB of
extra attenuation.
The receive fiber is somewhat more sensitive but still has 3-4 dB of headroom.
No anomalous messages were seen via the SCSI analyser during server crashes
induced by extreme attenuation on either fiber.
Continued testing with the attenuator is desirable as fibers and connectors
are upgraded.
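To put the headroom figures above in perspective, added attenuation in dB can be converted to the fraction of optical power that still reaches the receiver using the standard relation fraction = 10^(-dB/10). The short sketch below applies that relation to the dB values quoted above; the function name is ours, and nothing here is an additional measurement.

```python
def remaining_power_fraction(attenuation_db: float) -> float:
    """Fraction of optical power left after `attenuation_db` of added loss."""
    return 10 ** (-attenuation_db / 10.0)


if __name__ == "__main__":
    # dB values quoted in this report for the receive and transmit fibers.
    for label, db in (("receive, low", 3.0), ("receive, high", 4.0),
                      ("transmit, low", 4.0), ("transmit, high", 5.0)):
        print(f"{label}: {db} dB -> {remaining_power_fraction(db):.2f} of original power")
```

By this relation, 3 dB of added loss corresponds to roughly half the original power, so both fibers continued to work with well under half of their normal signal.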
IIe. Examine power supplies:
This is a new task which we have recently planned and hope to start in
the coming weeks. Historically, NIRSPEC has been extremely
sensitive to power glitches, with such events almost always causing a server
crash. The possibility that one or more components in
the communications chain, the transputers, or a motor is inadequately supplied
or buffered will be investigated.
The first sub-task, naturally, is to check what is on the UPS and what is
not.
If we are satisfied that everything that should be on the UPS is, we will
then compare supplies with their ratings.
A third sub-task will be to measure supplied and drawn power during operation.
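The three sub-tasks above could eventually feed a simple audit table. The following is a minimal sketch under stated assumptions: the record fields, the 80% load margin, and the supply names and wattages are all hypothetical placeholders, since no inventory or measurements exist yet.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Supply:
    name: str               # hypothetical label for a supply
    rated_watts: float      # nameplate rating
    measured_watts: float   # power drawn during operation
    on_ups: bool            # is it actually fed from the UPS?
    should_be_on_ups: bool  # should it be fed from the UPS?


def audit_supplies(
    supplies: List[Supply],
    max_load_fraction: float = 0.8,  # assumed safety margin, not a project figure
) -> List[Tuple[str, str]]:
    """Return (name, problem) pairs for supplies that need attention."""
    problems = []
    for supply in supplies:
        if supply.should_be_on_ups and not supply.on_ups:
            problems.append((supply.name, "not on UPS"))
        if supply.measured_watts > max_load_fraction * supply.rated_watts:
            problems.append((supply.name, "load above margin"))
    return problems


if __name__ == "__main__":
    # Placeholder entries, for illustration only.
    inventory = [
        Supply("supply_a", 100.0, 50.0, on_ups=True, should_be_on_ups=True),
        Supply("supply_b", 100.0, 90.0, on_ups=True, should_be_on_ups=True),
        Supply("supply_c", 100.0, 10.0, on_ups=False, should_be_on_ups=True),
    ]
    for name, problem in audit_supplies(inventory):
        print(name, problem)
```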
IIIa. Reduce communications traffic:
Work on the new keyword server and rotator codes designed to reduce communications
volume is well along and much progress was made during this report period.
A half night of engineering was made available for further testing on January
27.
An arrangement with the first-half observer allowed her to have most of
the night for science in exchange for using the new code, with occasional
interruptions by us to work on/with it.
The new code appears to work fine in rotator stationary mode and was used
successfully this way for much of the night with no crashes.
A (now understood) problem was encountered using the rotator in PA mode.
We estimate this problem will be solved with another week's work.
The next available night for engineering testing appears to be March 3.
Work on removing the handling of temperature data by the transputers is
still tentatively scheduled to begin in February but the start has slipped
slightly (by two weeks).
Issues and Concerns:
Maintaining progress will be challenging as budgeted effort levels for
several team members drop, and support duties ramp up for Lyke and
Hill.
There is still no significant downturn in the time-lost statistics.
We are still constrained by a shortage of engineering time.
The E-TAC was very helpful in granting the time on January 27.
The next possible (not yet granted) time is a half night on March 3.
We have let it be known we are prepared to take advantage of any time made
available by the laser team should they have to give up time due to weather.