NIRSPEC Reliability Improvement
Progress report: 6
May 2004
Overview:
With the impending release of new server codes that (it is hoped) will
greatly reduce server crashes, the project finds itself at a crossroads.
Several options exist and this report attempts to examine some of them
in the context of the project's history, goals, budget and schedule.
Project History and Plan:
By mid-summer of 2003 it was clear that the rate of server crashes had
risen considerably and significant time was being lost on sky.
What would become the first phase of the NIRSPEC reliability improvement
project began around this time. Most of the current
team members were involved from this early point.
With the increased resources that resulted from formal project status
in October 2003, work on phase I was able to accelerate, while
phase II was planned and proceeded in parallel.
A third phase was envisioned at this point but detailed planning would
require the results of phase II.
Overall goals would be achieved via phase III and are described below.
At the outset, due to the nature of the project, the budget and schedule
were educated guesses. To date, we
have overspent the original manpower estimate of 0.79 FTE by about 25 percent.
None of the $10K set aside for procurement has been spent.
We are currently one month behind the originally envisioned schedule.
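As a rough check on the figures above, the overspend can be worked through
in a few lines of Python. This is only a sketch: the variable names are
illustrative, and the only inputs are the 0.79 FTE estimate, the "about 25
percent" overspend and the $10K procurement allocation quoted in this report.

```python
# Figures quoted in this report; names are illustrative only.
estimated_fte = 0.79         # original manpower estimate
overspend_fraction = 0.25    # "about 25 percent" over the estimate
procurement_budget = 10_000  # dollars set aside for procurement
procurement_spent = 0        # none spent to date

# Approximate effort used so far, and the excess over the estimate.
spent_fte = estimated_fte * (1 + overspend_fraction)
extra_fte = spent_fte - estimated_fte

print(f"Effort used so far: about {spent_fte:.2f} FTE")
print(f"Overspend:          about {extra_fte:.2f} FTE")
print(f"Procurement left:   ${procurement_budget - procurement_spent:,}")
```

The result, just under 1 FTE, agrees with the "roughly 1.0 FTE" figure used
later in this report.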
Phase I: Speed recovery from server crashes.
- No specific goals were set for this phase, but time lost on sky was reduced
  by a factor of 2 or 3 via four developments:
  - Improved understanding, documentation and strategy for recovery from
    server crashes.
  - Ability to remotely power cycle communications devices.
  - Faster, less ambiguous indication of a crash having occurred.
  - Faster, automated recovery via an intelligent script.
- Work on this phase was performed slightly under budget but slightly behind
  schedule.
  - Work began around mid-summer 2003 and, when incorporated within the
    project, was scheduled to be completed by January 2004.
  - The majority of work was completed near to schedule.
  - Formal completion extended to late January 2004 as testing and use
    revealed a few complications.
Phase II: Determine cause(s) of server crashes.
- The goals for this phase were straightforward. We cannot yet tell with
  certainty whether they have been met.
  - We believe that most server crashes result from an overburdening
    of the communications chain by a high frequency of motor commands.
  - We suspect that some fraction of server crashes may result from power
    fluctuations.
- Work on this phase was performed under budget and on schedule (assuming
  the goals have been met).
  - Originally this phase consisted of a number of tasks, some of which
    were to extend through January 2004.
  - By November 2003 confidence in having found a smoking gun was high
    enough that work began on a solution.
  - As confidence grew that we had met the goals of this phase, some of the
    tasks moved into the background, which helped maintain budget.
Phase III: Prevent server crashes.
- It is unknown whether the goals of this phase have been met. They
  are by and large those of the project as a whole:
  - No ticket greater than 10 minutes.
  - Average ticket about 3 minutes or less.
  - Average time lost per night comparable to HIRES.
- Work on this phase was performed over budget and behind schedule.
  - The first attempt at predicting the schedule for this phase estimated
    completion by April 30, 2004. At that point, though, this was
    little more than a guess.
  - As results from phase II started guiding subsequent iterations on the
    schedule, completion by March 2004 was estimated.
  - Slip with respect to later schedules can be attributed to work on the
    rotator server and limited engineering time.
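Once post-release statistics are available, the two ticket goals above could
be checked with something like the following sketch. The function name, the
treatment of "about 3 minutes" as a hard 3.0 minute threshold, and the sample
durations are all assumptions made for illustration; no real ticket data is
shown. The comparison with HIRES is not covered, since it needs nightly
totals for both instruments.

```python
from statistics import mean

# Goals stated in this report. Treating "about 3 minutes" as a hard
# threshold of 3.0 is an assumption made for this sketch.
MAX_TICKET_MINUTES = 10.0
MEAN_TICKET_MINUTES = 3.0


def check_ticket_goals(ticket_minutes):
    """Return (max_ok, mean_ok) for a list of time-lost tickets in minutes."""
    if not ticket_minutes:
        # No tickets means no time lost, so both goals hold trivially.
        return True, True
    max_ok = max(ticket_minutes) <= MAX_TICKET_MINUTES
    mean_ok = mean(ticket_minutes) <= MEAN_TICKET_MINUTES
    return max_ok, mean_ok


# Hypothetical ticket durations, for illustration only (not real data).
sample_tickets = [2.0, 1.5, 4.0, 3.0, 12.0]
max_ok, mean_ok = check_ticket_goals(sample_tickets)
print(f"No ticket over {MAX_TICKET_MINUTES:g} min: {max_ok}")
print(f"Average at or under {MEAN_TICKET_MINUTES:g} min: {mean_ok}")
```

With the hypothetical sample, both goals fail: one ticket exceeds 10 minutes
and the average is 4.5 minutes.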
Proceeding From Here:
As mentioned above, there is some question as to how we proceed from here.
A number of project tasks remain to be completed:
- Release the new server codes. They appear to function as well as the
  current codes, but there remains some concern with the overall look,
  feel and display of the rotator GUI.
- Gather statistics on crash frequency after the new code release to confirm
  or refute that improvement has resulted.
- Upgrade the host. This task is still on the schedule as a background task
  and is a reliability issue in that a failure has the potential to cost
  many nights on sky.
- Investigate the transputer power supply, which appears out of spec.
- Await (or create) power fluctuations and look for transients in output
  from the instrument UPS.
There exist ways to improve the reliability of NIRSPEC which are currently
not within the purview of this project:
- Attempt to fix rotator instability via software. Early on, it was decided
  that the project would focus on server crashes but that, pending a
  solution to these, attention could turn to rotator problems.
- Reduce or eliminate instances of XNIRSPEC freezing. This problem is easily
  solved via simple restarts but can cost some time due to the possibility
  that it indicates a server crash.
- Make the rotator GUI resilient to DCS disconnects.
Finally, note that via this project we have used (so far) roughly 1.0 FTE
to gain about 5 percent of time on sky. A NIRSPEC efficiency
improvement project would gain more time on sky with similar or less effort.
Currently, 20 to 25 percent of time on sky is claimed lost due to instrument
inefficiency. For example:
- Redundant hardwiring of observing scripts costs 1 - 2 percent of time on
  sky.
- Non-adjustable flat lamp intensity can cost up to 3 - 4 percent of the
  time on sky.
- Faint object set-up can be difficult due to lack of automated scripts and
  poor display tools. This is harder to quantify, but may
  be costing 5 to 10 percent of a night when at its worst.
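To put these examples in proportion, the quoted ranges can be tallied against
the claimed total and the gain from the reliability work. This is a sketch
only: the dictionary keys are shorthand labels, and simply adding the ranges
assumes the losses are independent and additive. Since two of the figures are
"up to" or "at its worst" values, the upper sum is closer to a worst case than
to an average.

```python
# (low, high) percent of time on sky, as quoted in the list above.
# Labels are shorthand; summing assumes the losses are independent.
itemized_losses = {
    "redundant hardwiring of observing scripts": (1, 2),
    "non-adjustable flat lamp intensity": (3, 4),
    "faint object set-up (at its worst)": (5, 10),
}
claimed_total = (20, 25)  # percent claimed lost to instrument inefficiency
reliability_gain = 5      # percent gained so far by this project

low = sum(lo for lo, _ in itemized_losses.values())
high = sum(hi for _, hi in itemized_losses.values())

print(f"Itemized examples:   {low} to {high} percent")
print(f"Claimed total:       {claimed_total[0]} to {claimed_total[1]} percent")
print(f"Reliability project: about {reliability_gain} percent")
```

The three examples alone sum to 9 to 16 percent, which is larger than the
roughly 5 percent gained so far by the reliability work.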
We are faced with the question of how and when to ramp down this project.
Three options are suggested:
1. End the project now.
   - Immediately frees up all allocated resources for other projects.
   - NIRSPEC is left markedly improved but with no guarantee project goals
     have been met.
   - Leaves completion of the tasks described above somewhat uncertain.
   - Limits flexibility in re-attacking the problem if necessary.
2. Maintain the project to the end of semester 2004A.
   - If resource allocation is reduced, frees up some resources immediately
     with the possibility of further redirection by August.
   - Maintains team while statistics on crash reduction accumulate.
   - Facilitates completion of some of the reliability-related tasks
     described above.
3. Continue through FY04.
   - Further limits re-allocation of resources but still allows some
     reduction.
   - Maintains team while statistics on crash reduction accumulate.
   - Facilitates completion of most of the reliability-related tasks
     mentioned above.
   - Provides a natural segue into an efficiency improvement project in FY05.