NIRSPEC Reliability Improvement Progress report: 12 Nov 2003
Overview:
The tasks comprising this project fall into one of three phases.
The goals are, respectively: speeding recovery from server crashes,
determining the cause(s) of these crashes, and reducing their
frequency. The numbers preceding the task summaries
below reflect this division.
Ia. Speeding recovery from server crashes:
This task is proceeding well and subtasks are being completed in a timely
fashion. Three out of the four subtasks have been completed
and work has started on the fourth. Barring significant
redirections of time spent by AH and JL, we hope to finish this last
item by December.
iBoot power control devices:
Devices were installed, complete with routers, by the third week of October.
Control scripts were written and tested during the last week of October and
were ready for use with the NIRSPEC run starting November 1.
Two unanticipated operational complications cost about 1.5 hours.
Both have been fixed.
Devices and scripts are working well. Time from "pull
the trigger" to "back in business" is now about 10 minutes.
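The internals of the control scripts are not described in this report. As a purely hypothetical sketch of the basic off/pause/on cycle (the command names, delay, and transport callable are all assumptions; the real command protocol depends on the iBoot model and firmware), the core of such a script could look like:

```python
import time

def power_cycle(send, off_delay=5.0, sleep=time.sleep):
    """Cycle a remote power outlet: switch off, pause, switch on.

    `send` is whatever callable delivers a command string to the
    power-control device (HTTP, telnet, ...); it is a placeholder,
    not the real iBoot protocol.
    """
    send("off")
    sleep(off_delay)  # give the server's power supplies time to discharge
    send("on")

# Exercised here with a fake transport so the sketch is self-contained:
issued = []
power_cycle(issued.append, sleep=lambda s: None)
print(issued)  # prints ['off', 'on']
```

Injecting the transport and the sleep function keeps the cycle logic testable without touching real hardware.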
Simple diagnostic popup.
This work was accomplished considerably ahead of schedule.
tkLogger was already present and just needed a simple typo fix to work.
We have found two log file strings, either of which seems to indicate a server
crash.
Guesswork has now been largely removed from diagnosing a server crash.
This alone is probably saving on the order of 10 minutes.
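To illustrate the detection idea (the marker strings below are invented stand-ins; the two strings actually found in the NIRSPEC logs are not reproduced in this report), the check amounts to a simple case-insensitive scan of the log:

```python
# Hypothetical marker strings -- stand-ins for the two strings
# actually found in the log files.
CRASH_MARKERS = ("server not responding", "connection reset by peer")

def find_crash_lines(log_lines, markers=CRASH_MARKERS):
    """Return the log lines that contain any crash marker."""
    return [line for line in log_lines
            if any(m in line.lower() for m in markers)]

sample = [
    "12:01 exposure complete",
    "12:03 ERROR: server not responding",
    "12:04 retrying...",
]
hits = find_crash_lines(sample)
print(hits)  # prints only the ERROR line
```

Once a marker line is seen, the popup is raised; no judgment call is required of the observer.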
Fast/smart recovery script.
Work on this subtask is just starting.
The intention is to incorporate the existing scripts and to make recovery
decisions faster and more reliably than observers can.
The script will most likely prompt observers for decisions that a human
must make.
It is hoped that this work will help avoid human error, which accounted
for a significant fraction of the time lost during the past week.
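The intended decision structure (automate the unambiguous steps, prompt the observer only for judgment calls) can be sketched as follows; the conditions, actions, and prompt text here are all hypothetical, not the planned script's actual logic:

```python
def recovery_action(crash_detected, hardware_ok, ask=input):
    """Pick the next recovery step.

    Unambiguous cases are decided automatically; the observer is
    prompted (via `ask`) only for a judgment call a script cannot
    settle. The conditions and actions are illustrative only.
    """
    if not crash_detected:
        return "no action"
    if hardware_ok:
        return "power-cycle"  # the unambiguous, automatable case
    answer = ask("Possible hardware fault; call support? [y/n] ")
    return "call support" if answer.strip().lower().startswith("y") else "power-cycle"

# With canned answers, so the sketch runs non-interactively:
print(recovery_action(False, True))                     # no action
print(recovery_action(True, True))                      # power-cycle
print(recovery_action(True, False, ask=lambda p: "y"))  # call support
```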
IIa. Jumpering part of the communications chain:
By eliminating about half the communications chain we hope to determine
if that part of the chain includes a hardware limitation.
The chain will be shortened by placing a clone of waimea on right nas.
This effort is proceeding well but the schedule is very tight.
We were assigned an engineering night about three weeks earlier than requested,
forcing us to try to be ready for on-sky testing that much sooner.
Network configuration and OS installation were completed for the clone
by the third week of October.
Enclosure for waimea2
Our intention is to have an enclosure on right nas for waimea2 to eliminate
environmental interaction with the dome.
If the enclosure is not ready, the fallback option is to run the single
engineering night without it and revert to waimea in the computer room.
Preliminary software installation was completed the first week of November.
Testing
Two days of testing are currently planned with the clone in the computer
room.
Three days of daytime testing are allocated for the clone on right nas.
The final hurdle is testing under real observing conditions on the night
of November 26.
IIb. Correlation research:
There is some suspicion that server crashes may correlate with one or more
"variables" most of which involve how the instrument is being used.
Brainstorming a list of all conceivable variables: this subtask
is complete.
Compile statistics.
Work has not started on this subtask yet. In this regard
we are behind our target schedule.
We may wish to consider some redirection of resources to get this subtask
back on track.
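When the statistics are compiled, the core computation for each candidate variable is a crash-rate tally per value of that variable. A minimal sketch (the variable values and example numbers are invented):

```python
from collections import Counter

def crash_rates(observations):
    """observations: (variable_value, crashed) pairs for one variable.

    Returns the fraction of uses that ended in a crash, broken down
    by the variable's value, so values with unusually high rates
    stand out as correlation candidates.
    """
    totals, crashes = Counter(), Counter()
    for value, crashed in observations:
        totals[value] += 1
        crashes[value] += crashed  # bool counts as 0 or 1
    return {v: crashes[v] / totals[v] for v in totals}

# Invented example data for one hypothetical variable:
obs = [("movie mode", True), ("movie mode", True),
       ("movie mode", False), ("long exposures", False)]
print(crash_rates(obs))
```

Small samples will need care in interpretation; a high rate from a handful of nights is weak evidence on its own.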
IIc. Crash free periods:
Historically, there have been some periods where server crashes have been
less frequent. This raises the question of what (if anything)
was done to create and end these periods.
This task is currently envisioned to consist of mining the day and night
logs for information. The schedule calls for both of
these subtasks (day logs and night logs) to start in December.
IId. Characterize communications chain:
Presently, although we suspect the communications chain may be overburdened,
we do not know what its tolerances and capabilities are.
A purchase order has been placed for an attenuator capable of correctly
measuring the tolerance of the fibers to signal loss.
During gaps in the December NIRSPEC observing schedule it is planned
to measure fiber tolerances and communications bandwidth.
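The bandwidth side of that measurement is, in essence, timing a bulk transfer across the link. As a self-contained illustration (both ends run on the loopback interface here; measuring the real chain would mean running the sink at the far end of the fiber, and the port and payload sizes are arbitrary choices):

```python
import socket
import threading
import time

def measure_bandwidth(n_bytes=4_000_000, port=50007):
    """Rough one-way TCP throughput in MB/s over loopback.

    A background thread drains everything the client sends; the
    client times how long it takes to push `n_bytes` through.
    """
    def sink():
        with socket.create_server(("127.0.0.1", port)) as srv:
            conn, _ = srv.accept()
            with conn:
                while conn.recv(65536):
                    pass  # drain until the sender closes

    threading.Thread(target=sink, daemon=True).start()
    time.sleep(0.2)  # let the sink start listening
    chunk, sent = b"x" * 65536, 0
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as c:
        while sent < n_bytes:
            c.sendall(chunk)
            sent += len(chunk)
    return sent / (time.perf_counter() - start) / 1e6

bw = measure_bandwidth()
print(f"loopback throughput: {bw:.1f} MB/s")
```

Loopback numbers are only an upper bound; the interesting figure is the same test run end-to-end over the actual chain.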
IIIa. Reduce communications traffic:
Based on the suspicion that the communications chain is overburdened, we have
already planned this phase III task.
Only one subtask is planned so far: removing temperature polling from
the communications traffic. The start date is scheduled for
December 1.
Issues and Concerns:
No significant downturn yet in time lost on sky.
A detailed comparison to historical numbers is still being calculated.
13 hours were lost during the first week of November.
Of this time, 9 hours were directly attributed to NIRSPEC.
Of the 9 hours attributed to NIRSPEC, one could argue that
4 hours were due to human error. An additional 1.5 hours (as
noted above) represented to some extent "teething pains" for the new recovery
procedures.
The amount of time lost to human error implies we need to do a better
job of informing both observers and OAs about how to use the new recovery
procedures.
The task to jumper part of the communications chain is on a very tight
schedule and there is some risk that we will not be fully ready for the
engineering night on November 26.
Unforeseen network, software, or operating system complications could
derail the schedule.
The enclosure may not be ready, forcing us to run without it and then
move the clone back down to the computer room.
We may wish to negotiate with following observers to trade time
back and forth.
We are behind schedule on the task to search for correlations between
server crashes and system/observer/telescope variables.
Some redirection of resources may be necessary.
The suspicion that crashes are triggered by overburdening the communications
chain is becoming strong.
We may want to re-evaluate
the effort level assigned to the task aimed at reducing communications
traffic.