NIRSPEC Reliability Improvement Progress report: 7 Jan 2004
Overview:
A month ago, we described the project as being at something of a crossroads.
To some extent this is still true, since phase I is taking longer
to complete than estimated, and we are still in the early stages of phase
III. The past month included the Christmas holiday slowdown,
and some team members took vacation time. Despite this, there has
been some noteworthy progress. We have focussed our efforts
on three of the six tasks which make up this project.
Ia. Speeding recovery from server crashes:
In terms of nearness to completion, this task is in roughly the same state
as at the last report. That summary understates, however, how much work
was accomplished during the reporting period.
Of the two remaining subtasks, one is still active and the other has been
deferred.
Fast/smart recovery script.
This subtask is taking longer to complete than estimated in the previous
report.
Extensive testing continues to reveal the need for additional decision
branches to deal with various "gotchas" encountered during recovery
from server crashes.
Currently we hope to have the script available for release by mid-January.
Server restarts possible without motor re-inits?
Since the last report we have (roughly) estimated that such restarts may
be possible only about 20 percent of the time.
Pressure to finish other tasks has led to the decision to defer work
on this task until March.
IIa. Upgrade instrument host:
This task has moved into the background, with little progress during the
reporting period and none anticipated until mid-January at the earliest.
We do plan to return this task to the foreground, though, as soon as
resources are available.
Currently a large number of scripts, GUIs, parameter files, etc. have
the name of the host hard-wired to "waimea" instead of reading it from
a variable that can be set in one place.
Finding all these files, replacing "waimea" with something like $NIRSPECHOST,
and confirming functionality will be a gradual process.
Much care will need to be taken, and work done only when NIRSPEC is off
sky.
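As an illustration only, the survey step described above could be sketched
in Python along the following lines. The function names, the whole-word
matching rule and the default arguments are assumptions made for this
sketch; they are not part of the existing NIRSPEC software. The sketch only
reads files and reports where the host name appears, so each hit can be
reviewed by hand before any change is made.

```python
import os
import re


def find_hardwired_host(root, host="waimea"):
    """Return (path, line_number, line_text) for each line that names host.

    Hypothetical helper: walks the tree under root and reads every file
    as text. Nothing is modified.
    """
    # Match the host as a whole word so that e.g. "waimeabay" is not reported.
    pattern = re.compile(r"\b" + re.escape(host) + r"\b")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable files are skipped rather than aborting the survey.
                continue
    return hits


def substitute_host(line, host="waimea", variable="$NIRSPECHOST"):
    """Return line with whole-word occurrences of host replaced by variable."""
    pattern = r"\b" + re.escape(host) + r"\b"
    # A function is used as the replacement so the variable text is
    # inserted literally, whatever characters it contains.
    return re.sub(pattern, lambda _match: variable, line)
```

Calling find_hardwired_host on a directory of scripts gives a list that can
be worked through gradually, and substitute_host shows the proposed
replacement for a single line without touching the file on disk.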
Progress could (in theory) be much more rapid if we simply named the Ultra
60 "waimea" and swapped it in. This presents some complications
in terms of system management, though, and would represent something of a
band-aid approach.
Results of the Jan 5 engineering night imply that upgrading the host will
facilitate our efforts at reducing communications traffic.
IIb. Correlation research:
This task continues as a pure background task.
IIc. Crash free periods:
Investigation of day logs has not (yet?) revealed any hardware changes
that represent a smoking gun. Investigation of other logs and
software records continues, although largely as a background task.
IId. Characterize communications chain:
Although steady progress has been made on this task, the schedule has had
to be revised due to the later-than-anticipated delivery of the fiber attenuator
and the difficulty in obtaining a SCSI analyser.
The attenuator capable of correctly testing the tolerances of the fibers
to signal loss has been delivered.
A SCSI analyser has been obtained.
Three days of testing and measurement are on the summit work schedule for
the week of January 5.
It is anticipated that testing with the attenuator may cause server crashes.
If so, this will allow further, more realistic testing of the recovery script.
IIIa. Reduce communications traffic:
A month ago, a new version of the server which reduces the frequency of
motor commands had just been finished and awaited testing.
At that time, work on the rotator code was envisioned as beginning in early
to mid January.
Day testing of the new server was finished during this report period but
revealed the need for changes to the rotator server code.
Work on the rotator code therefore began a few weeks earlier than scheduled.
The changes to the rotator code to date are largely for compatibility
with the new server.
These changes have been tested as well as possible during the day.
We still envision a much larger effort on the rotator code which will make
it consistent with other Keck rotator codes and (hopefully) realize some
reduction in communications volume.
Work on removing the handling of temperature data by the transputers is
still tentatively scheduled to begin in February.
The engineering night on Jan 5 indicated that more work is needed on the
rotator code. The new server code appeared to function
well and may be more resistant to crashes. It requires
a new version of the rotator code, though, and so is not ready for release.
Issues and Concerns:
A shortage of engineering time continues to be our biggest concern.
Nov 26 was lost to weather, no time became available
in mid-December, and Jan 5 was hampered by guider problems.
Jan 5 was very useful but underscored the need for more on-sky testing.
No other engineering time is currently scheduled. A half night
in March is the next available "TBD". Our only other option
(currently) is to hope for some time as a backup programme for other engineering.