NIRSPEC is possibly our worst instrument in terms of time lost on sky.
Until now, work to remedy this situation has proceeded on an "as time allows"
basis, with predictable results.
With the backing and encouragement of upper management, we now have the resources
at our disposal to take a more serious approach.
A core team consisting of Hill, Honey, Lyke, Nordin and Rodriguez will
be responsible for producing results.
The core team will report once a month to Beletic, Conrad, Goodrich, Lewis
and Matsuda, who will also work in an advisory capacity.
Project description:
We will take a project-style approach. We have project and
task numbers for effort reporting. We have specified fractions
of people's time allocated. For each member of the core
team, the project has been assigned a specified priority with respect to
all their other tasks.
Initial focus will be solely on server crashes. Only after
time lost to these has been eliminated, or significantly reduced, will
attention turn to other problems.
The figure of merit used to measure success will be time lost on sky.
Goals are:
no tickets greater than 10 minutes
average ticket about 3 minutes or less
average time lost per night comparable to HIRES
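The three goals above can be checked mechanically once time-lost tickets are tabulated per night. The following is a minimal sketch only: the input layout (a dictionary of night label to list of minutes lost per ticket), the function name, and the `tolerance` factor are all assumptions made for illustration. In particular, the plan does not define "comparable to HIRES", so the factor of 1.25 is a placeholder to be agreed on, and the HIRES figure must be supplied by the caller.

```python
from statistics import mean

MAX_TICKET_MIN = 10.0   # goal: no tickets greater than 10 minutes
MEAN_TICKET_MIN = 3.0   # goal: average ticket about 3 minutes or less


def evaluate_goals(tickets_by_night, hires_mean_per_night, tolerance=1.25):
    """Check the three goals against tabulated tickets.

    tickets_by_night: dict of night label -> list of minutes lost per ticket
    hires_mean_per_night: average minutes lost per night on HIRES (caller supplies)
    tolerance: assumed meaning of "comparable"; not defined in the plan
    """
    all_tickets = [t for night in tickets_by_night.values() for t in night]
    per_night = [sum(night) for night in tickets_by_night.values()]
    mean_ticket = mean(all_tickets) if all_tickets else 0.0
    mean_night = mean(per_night) if per_night else 0.0
    return {
        "no_ticket_over_10_min": all(t <= MAX_TICKET_MIN for t in all_tickets),
        "mean_ticket_at_most_3_min": mean_ticket <= MEAN_TICKET_MIN,
        "comparable_to_hires": mean_night <= tolerance * hires_mean_per_night,
    }
```

Nights with no tickets should still appear in the input with an empty list, otherwise the per-night average will be overstated.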
The core team will meet weekly to review status and plan.
The full team will meet once a month.
Personnel resources to be called upon include Brims and Spencer at UCLA
and Chock, Nance, and Mason at CARA.
Current plan of attack includes three phases which are defined by focus
rather than date.
Phase I (near term):
Emphasis will be on speeding recovery from server crashes.
Offer observers and/or OAs and/or SAs the ability to remotely power-cycle
both black boxes and the match box.
First milestone is to have iBoot devices operational by Oct 20.
NIRSPEC goes back on sky Nov 1, allowing 1.5 weeks for testing and script
writing/testing.
Provide automated tklogger style popup to speed diagnosis.
Work to start on Nov 1. Preliminary version available by Dec 1.
Refinement to proceed afterward with most of work completed by Jan 1.
Tool will warn observers a crash has occurred (or perhaps is impending).
Advice will be offered regarding how to recover.
Final version will actually prompt for appropriate recovery action and
perform it.
Investigate whether server restarts without motor re-inits are possible.
Some detailed research into the state of the transputers after a crash
is needed.
If their state allows a restart without a programme reload then scripts
and procedures to do so will be constructed.
Phase II (mid term):
Primary goal will be gaining a better understanding of cause(s) of server
crashes.
Cloning the instrument host will allow us to put a waimea on the right nas, thus
bypassing much of the comms chain. Its performance should indicate whether
a communications hardware limitation exists and help narrow down its location.
There have been extended periods where NIRSPEC was relatively reliable.
What (if any) changes were made to the system near the beginnings and ends
of these periods? Mining log files for this information
will be tedious and labour-intensive but may turn up insights.
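One way to locate the reliable periods before looking for system changes is to extract crash timestamps from the logs and find the long gaps between consecutive crashes. The sketch below assumes the timestamps have already been parsed into `datetime` objects; the function name and the 14-day threshold are illustrative choices, not part of the plan.

```python
from datetime import timedelta


def reliable_periods(crash_times, min_gap=timedelta(days=14)):
    """Return (start, end) pairs of consecutive crashes separated by at least min_gap.

    crash_times: iterable of datetime objects, in any order
    min_gap: assumed threshold for calling a stretch "relatively reliable"
    """
    ordered = sorted(crash_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a >= min_gap]
```

Each returned pair marks the boundaries of a quiet stretch, which are the dates near which system changes are worth searching for. Stretches when NIRSPEC was simply not scheduled will also show up as gaps, so the results need to be checked against the observing schedule.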
The bandwidth and throughput of the comms chain are not known.
Proper characterization of these may tell us if the system is marginal
or not.
Much of what we know about server crashes is folklore, including what (if
anything) they correlate with, how best to recover from them, and
how many different flavours of them there are. Some of
this information can be recovered from log files.
Information that cannot might be acquired via a short check-list style
form to be filled out when crashes occur.
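If the check-list form is kept electronically, each crash could be stored as one structured record so that correlations, recovery methods and crash flavours can be tallied later. The sketch below is only one possible shape: every field name is hypothetical and would need to be agreed on by the core team.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class CrashReport:
    # All field names are hypothetical placeholders for the check-list items.
    when: str            # date and time the crash was noticed
    activity: str        # what the instrument was doing at the time
    symptom: str         # which "flavour" of crash was seen
    recovery: str        # what was done to recover
    minutes_lost: float  # time lost on sky


def to_csv(reports):
    """Serialize reports to CSV text with a header row, one line per crash."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([f.name for f in fields(CrashReport)])
    writer.writerows(astuple(r) for r in reports)
    return buf.getvalue()
```

A flat CSV keeps the records easy to merge with information recovered from the log files.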
The extent of communications bottlenecks and collisions is not fully
understood. In particular, rotator and temperature information
is handled differently by NIRSPEC and NIRC2. Observing
in real time, under real night-time conditions, how these interact
with each other may indicate the best path to follow for crash prevention.
A secondary goal may be attacking rotator problems, but only if the work can
be done in parallel and will not inhibit the primary goal.
Phase III (long term):
Results of Phase II will be used to work toward preventing server crashes.
If bang for buck implies rotator problems are a more worthy target,
then refocus efforts?