NIRSPEC Update

NIRSPEC Reliability Improvement
Progress report: 6 May 2004

Overview:

With the impending release of new server codes that (it is hoped) will greatly reduce server crashes, the project finds itself at a crossroads. Several options exist, and this report examines some of them in the context of the project's history, goals, budget and schedule.

Project History and Plan:

By mid-summer of 2003 it was clear that the rate of server crashes had risen considerably and significant time was being lost on sky. What would become the first phase of the NIRSPEC reliability improvement project began around this time. Most of the current team members were involved from this early point.

With the increased resources that resulted from formal project status in October 2003, work on phase I was able to accelerate, while phase II was planned and proceeded in parallel. A third phase was envisioned at this point but detailed planning would require the results of phase II.

Overall goals would be achieved via phase III and are described below. At the outset, due to the nature of the project, the budget and schedule were educated guesses. To date, we have overspent the original manpower estimate of 0.79 FTE by about 25 percent, for a total of roughly 1.0 FTE. None of the $10K set aside for procurement has been spent. We are currently one month behind the originally envisioned schedule.

Phase I: Speed recovery from server crashes.

No specific goals were set for this phase but time lost on sky was reduced by a factor of 2 or 3 via four developments:

Improved understanding, documentation and strategy for recovery from server crashes.
Ability to remotely power cycle communications devices.
Faster, less ambiguous indication of a crash having occurred.
Faster, automated recovery via an intelligent script.

Work on this phase was performed slightly under budget but slightly behind schedule.

Work began around mid-summer 2003 and, once incorporated into the project, was scheduled for completion by January 2004.
The majority of work was completed close to schedule.
Formal completion extended to late January 2004 as testing and use revealed a few complications.

Phase II: Determine cause(s) of server crashes.

The goals for this phase were straightforward. We cannot yet tell with certainty whether they have been met.

We believe that most server crashes result from an overburdening of the communications chain by a high frequency of motor commands.
We suspect that some fraction of server crashes may result from power fluctuations.

Work on this phase was performed under budget and on schedule (assuming the goals have been met).

Originally, this phase consisted of a number of tasks, some of which were to extend through January 2004.
By November 2003, confidence in having found a smoking gun was high enough that work began on a solution.
As confidence grew that we had met the goals of this phase, some of the tasks moved into the background, which helped maintain the budget.

Phase III: Prevent server crashes.

It is unknown whether the goals of this phase have been met. They are by and large those of the project as a whole:

No ticket greater than 10 minutes.
Average ticket about 3 minutes or less.
Average time lost per night comparable to HIRES.
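
As an illustration only, the first two goals could be checked against logged ticket durations along the following lines. This is a minimal sketch, assuming that a ticket is recorded as a time lost in minutes and that "about 3 minutes" is treated as a hard threshold; the function name and the sample durations are hypothetical and do not come from project data. The comparison with HIRES is not covered because no HIRES figure is given in this report.

```python
from statistics import mean

MAX_TICKET_MIN = 10.0   # goal: no ticket greater than 10 minutes
MEAN_TICKET_MIN = 3.0   # goal: average ticket about 3 minutes or less (assumed hard limit)


def check_ticket_goals(ticket_minutes):
    """Return pass/fail flags for the two ticket goals.

    ticket_minutes: list of ticket durations in minutes (assumed format).
    """
    if not ticket_minutes:
        raise ValueError("need at least one ticket")
    return {
        "max_ok": max(ticket_minutes) <= MAX_TICKET_MIN,
        "mean_ok": mean(ticket_minutes) <= MEAN_TICKET_MIN,
    }


if __name__ == "__main__":
    # Hypothetical durations, for illustration only.
    sample = [1.5, 2.0, 4.0, 2.5]
    print(check_ticket_goals(sample))
```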

Work on this phase was performed over budget and behind schedule.

The first attempt at predicting the schedule for this phase estimated completion by April 30, 2004. At that point though, this was little more than a guess.
As results from phase II started guiding subsequent iterations on the schedule, completion by March 2004 was estimated.
Slip with respect to later schedules can be attributed to work on the rotator server and to limited engineering time.

Proceeding From Here:

As mentioned above, there is some question as to how we proceed from here. A number of project tasks remain to be completed:

Release the new server codes. They appear to function as well as the current codes but there remains some concern with the overall look, feel and display of the rotator GUI.
Gather statistics on crash frequency after the new code release to confirm or refute that an improvement has resulted.
Upgrade the host. This task is still on the schedule as a background task and is a reliability issue in that a failure has the potential to cost many nights on sky.
Investigate the transputer power supply, which appears to be out of spec.
Await (or create) power fluctuations and look for transients in output from instrument UPS.
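
For the crash statistics task above, one possible way to judge whether the new codes have reduced the crash rate is sketched below. It is a rough sketch under stated assumptions, not an adopted method: crashes are assumed to occur independently at a constant rate per night within each period, so that, if the rate is unchanged, the share of all crashes falling after the release follows a binomial distribution. The function name and the example counts are hypothetical.

```python
from math import comb


def p_value_fewer_crashes(crashes_before, nights_before, crashes_after, nights_after):
    """One-sided conditional test for a drop in crash rate.

    Assumption: crashes in each period follow a Poisson process with a
    constant rate per night. Under the null hypothesis of equal rates,
    and given the total number of crashes, the count after the release is
    Binomial(total, nights_after / (nights_before + nights_after)).
    Returns the probability of seeing crashes_after or fewer by chance.
    """
    if nights_before <= 0 or nights_after <= 0:
        raise ValueError("both periods must contain at least one night")
    total = crashes_before + crashes_after
    share = nights_after / (nights_before + nights_after)
    return sum(
        comb(total, k) * share**k * (1 - share) ** (total - k)
        for k in range(crashes_after + 1)
    )


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = p_value_fewer_crashes(crashes_before=20, nights_before=40,
                              crashes_after=3, nights_after=30)
    print(f"p-value for 'no improvement': {p:.4f}")
```

A small p-value would suggest that the lower crash count is unlikely to be chance alone; with few nights after the release the test has little power, which is one reason to let statistics accumulate before drawing conclusions.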

There exist ways to improve the reliability of NIRSPEC which are currently not within the purview of this project:

Attempt to fix rotator instability via software. Early on, it was decided that the project would focus on server crashes but that, pending a solution to these, attention could turn to rotator problems.
Reduce or eliminate instances of XNIRSPEC freezing. This problem is easily solved via simple restarts but can cost some time due to the possibility that it indicates a server crash.
Make the rotator GUI resilient to DCS disconnects.

Finally, note that via this project we have used (so far) roughly 1.0 FTE to gain about 5 percent of time on sky. A NIRSPEC efficiency improvement project would gain more time on sky with similar or less effort. Currently, 20 to 25 percent of time on sky is claimed lost due to instrument inefficiency. For example:

Redundant hardwiring of observing scripts costs 1 - 2 percent of time on sky.
Non-adjustable flat lamp intensity can cost up to 3 - 4 percent of the time on sky.
Faint object set-up can be difficult due to lack of automated scripts and poor display tools. This is harder to quantify, but may be costing 5 to 10 percent of a night when at its worst.

We are faced with the question of how and when to ramp down this project. Three options are suggested:

1. End the project now.

Immediately frees up all allocated resources for other projects.
NIRSPEC is left markedly improved but with no guarantee project goals have been met.
Leaves completion of the tasks described above somewhat uncertain.
Limits flexibility in re-attacking the problem if necessary.

2. Maintain the project to the end of semester 2004A.

If resource allocation is reduced, frees up some resources immediately with the possibility of further redirection by August.
Maintains team while statistics on crash reduction accumulate.
Facilitates completion of some of the reliability related tasks described above.

3. Continue through FY04.

Further limits re-allocation of resources but still allows some reduction.
Maintains team while statistics on crash reduction accumulate.
Facilitates completion of most of the reliability related tasks mentioned above.
Provides a natural segue into an efficiency improvement project in FY05.