NIRSPEC Reliability Improvement Progress report: 12 Nov 2003
Overview:
The tasks comprising this project fall into one of three phases.
The goals are, respectively: speeding recovery from server crashes,
determining the cause(s) of these crashes, and reducing their
frequency. The numbers preceding the task summaries
below reflect this division.
Ia. Speeding recovery from server crashes:
This task is proceeding well and subtasks are being completed in a timely
fashion. Three out of the four subtasks have been completed
and work has started on the fourth. Barring significant
redirections of time spent by AH and JL, we hope to finish this last
item by December.
iBoot power control devices:
Devices were installed, complete with routers, by the third week of October.
Control scripts were written and tested during the last week of October and
were ready for use with the NIRSPEC run starting November 1.
Two unanticipated operational complications cost about 1.5 hours.
Both have been fixed.
Devices and scripts are working well. Time from "pull
the trigger" to "back in business" is now about 10 minutes.
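The internals of the control scripts are not described in this report. As a purely hypothetical sketch of the basic off/pause/on cycle (the command names, delay, and transport callable are all assumptions; the real command protocol depends on the iBoot model and firmware), the core of such a script could look like:

```python
import time

def power_cycle(send, off_delay=5.0, sleep=time.sleep):
    """Cycle a remote power outlet: switch off, pause, switch on.

    `send` is whatever callable delivers a command string to the
    power-control device (HTTP, telnet, ...); it is a placeholder,
    not the real iBoot protocol.
    """
    send("off")
    sleep(off_delay)  # give the server's power supplies time to discharge
    send("on")

# Exercised here with a fake transport so the sketch is self-contained:
issued = []
power_cycle(issued.append, sleep=lambda s: None)
print(issued)  # prints ['off', 'on']
```

Injecting the transport and the sleep function keeps the cycle logic testable without touching real hardware.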
Simple diagnostic popup.
This work was accomplished considerably ahead of schedule.
tkLogger was already present and just needed a simple typo fix to work.
We have found two log file strings, either of which seems to indicate a server
crash.
Guesswork has now been largely removed from diagnosing a server crash.
This alone is probably saving on the order of 10 minutes.
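To illustrate the detection idea (the marker strings below are invented stand-ins; the two strings actually found in the NIRSPEC logs are not reproduced in this report), the check amounts to a simple case-insensitive scan of the log:

```python
# Hypothetical marker strings -- stand-ins for the two strings
# actually found in the log files.
CRASH_MARKERS = ("server not responding", "connection reset by peer")

def find_crash_lines(log_lines, markers=CRASH_MARKERS):
    """Return the log lines that contain any crash marker."""
    return [line for line in log_lines
            if any(m in line.lower() for m in markers)]

sample = [
    "12:01 exposure complete",
    "12:03 ERROR: server not responding",
    "12:04 retrying...",
]
hits = find_crash_lines(sample)
print(hits)  # prints only the ERROR line
```

Once a marker line is seen, the popup is raised; no judgment call is required of the observer.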
Fast/smart recovery script.
Work on this subtask is just starting.
The intention is to incorporate the existing scripts and to make recovery
decisions faster and more reliably than observers can.
The script will most likely prompt observers for decisions that a human
must make.
It is hoped that this work will help avoid human error, which accounted
for a significant fraction of the time lost during the past week.
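The intended decision structure (automate the unambiguous steps, prompt the observer only for judgment calls) can be sketched as follows; the conditions, actions, and prompt text here are all hypothetical, not the planned script's actual logic:

```python
def recovery_action(crash_detected, hardware_ok, ask=input):
    """Pick the next recovery step.

    Unambiguous cases are decided automatically; the observer is
    prompted (via `ask`) only for a judgment call a script cannot
    settle. The conditions and actions are illustrative only.
    """
    if not crash_detected:
        return "no action"
    if hardware_ok:
        return "power-cycle"  # the unambiguous, automatable case
    answer = ask("Possible hardware fault; call support? [y/n] ")
    return "call support" if answer.strip().lower().startswith("y") else "power-cycle"

# With canned answers, so the sketch runs non-interactively:
print(recovery_action(False, True))                     # no action
print(recovery_action(True, True))                      # power-cycle
print(recovery_action(True, False, ask=lambda p: "y"))  # call support
```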
IIa. Jumpering part of the communications chain:
By eliminating about half the communications chain we hope to determine
if that part of the chain includes a hardware limitation.
The chain will be shortened by placing a clone of waimea on right nas.
This effort is proceeding well but the schedule is very tight.
We were assigned an engineering night about three weeks earlier than requested,
forcing us to try to be ready for on-sky testing that much sooner.
Network configuration and OS installation were completed for the clone
by the third week of October.
Enclosure for waimea2
Our intention is to have an enclosure on right nas for waimea2 to eliminate
environmental interaction with the dome.
If the enclosure is not ready, the fallback option is to run the single
engineering night without it and revert to waimea in the computer room.
Preliminary software installation was completed the first week of November.
Testing
Two days of testing are currently planned with the clone in the computer
room.
Three days of daytime testing are allocated for the clone on right nas.
The final hurdle is testing under real observing conditions on the night
of November 26.
IIb. Correlation research:
There is some suspicion that server crashes may correlate with one or more
"variables" most of which involve how the instrument is being used.
Brainstorming a list of all conceivable variables: this subtask
is complete.
Compile statistics.
Work has not started on this subtask yet. In this regard
we are behind our target schedule.
We may wish to consider some redirection of resources to get this subtask
back on track.
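When the statistics are compiled, the core computation for each candidate variable is a crash-rate tally per value of that variable. A minimal sketch (the variable values and example numbers are invented):

```python
from collections import Counter

def crash_rates(observations):
    """observations: (variable_value, crashed) pairs for one variable.

    Returns the fraction of uses that ended in a crash, broken down
    by the variable's value, so values with unusually high rates
    stand out as correlation candidates.
    """
    totals, crashes = Counter(), Counter()
    for value, crashed in observations:
        totals[value] += 1
        crashes[value] += crashed  # bool counts as 0 or 1
    return {v: crashes[v] / totals[v] for v in totals}

# Invented example data for one hypothetical variable:
obs = [("movie mode", True), ("movie mode", True),
       ("movie mode", False), ("long exposures", False)]
print(crash_rates(obs))
```

Small samples will need care in interpretation; a high rate from a handful of nights is weak evidence on its own.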
IIc. Crash free periods:
Historically, there have been some periods where server crashes have been
less frequent. This raises the question of what (if anything)
was done to create and end these periods.
This task is currently envisioned to consist of mining the day and night
logs for information. The schedule calls for both of
these subtasks (day logs and night logs) to start in December.
IId. Characterize communications chain:
Presently, although we suspect the communications chain may be overburdened,
we do not know what its tolerances and capabilities are.
A purchase order has been placed for an attenuator capable of correctly
measuring the tolerance of the fibers to signal loss.
During gaps in the December NIRSPEC observing schedule it is planned
to measure fiber tolerances and communications bandwidth.
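The bandwidth side of that measurement is, in essence, timing a bulk transfer across the link. As a self-contained illustration (both ends run on the loopback interface here; measuring the real chain would mean running the sink at the far end of the fiber, and the port and payload sizes are arbitrary choices):

```python
import socket
import threading
import time

def measure_bandwidth(n_bytes=4_000_000, port=50007):
    """Rough one-way TCP throughput in MB/s over loopback.

    A background thread drains everything the client sends; the
    client times how long it takes to push `n_bytes` through.
    """
    def sink():
        with socket.create_server(("127.0.0.1", port)) as srv:
            conn, _ = srv.accept()
            with conn:
                while conn.recv(65536):
                    pass  # drain until the sender closes

    threading.Thread(target=sink, daemon=True).start()
    time.sleep(0.2)  # let the sink start listening
    chunk, sent = b"x" * 65536, 0
    start = time.perf_counter()
    with socket.create_connection(("127.0.0.1", port)) as c:
        while sent < n_bytes:
            c.sendall(chunk)
            sent += len(chunk)
    return sent / (time.perf_counter() - start) / 1e6

bw = measure_bandwidth()
print(f"loopback throughput: {bw:.1f} MB/s")
```

Loopback numbers are only an upper bound; the interesting figure is the same test run end-to-end over the actual chain.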
IIIa. Reduce communications traffic:
Based on the suspicion that the communications chain is overburdened, we have
already planned this phase III task.
Only one subtask is planned so far: removing temperature polling from
the communications traffic. The start date is scheduled for
December 1.
Issues and Concerns:
No significant downturn yet in time lost on sky.
A detailed comparison to historical numbers is still being calculated.
13 hours were lost during the first week of November.
Of this time, 9 hours were directly attributed to NIRSPEC.
Of the 9 hours attributed to NIRSPEC, one could argue that
4 hours were due to human error. An additional 1.5 hours (as
noted above) represented to some extent "teething pains" for the new recovery
procedures.
The amount of time lost to human error implies we need to do a better
job of informing both observers and OAs about how to use the new recovery
procedures.
The task to jumper part of the communications chain is on a very tight
schedule and there is some risk that we will not be fully ready for the
engineering night on November 26.
Unforeseen network, software, or operating system complications could
derail the schedule.
The enclosure may not be ready, forcing us to run without it and then
move the clone back down to the computer room.
We may wish to negotiate with following observers to trade time
back and forth.
We are behind schedule on the task to search for correlations between
server crashes and system/observer/telescope variables.
Some redirection of resources may be necessary.
The suspicion that crashes are triggered by overburdening the communications
chain is becoming strong.
We may want to re-evaluate
the effort level assigned to the task aimed at reducing communications
traffic.