NIRSPEC Reliability Improvement Progress report: 10 Dec 2003
Overview:
At the time of this report, the project finds itself at something of a
crossroads. Our work to speed recovery has
been successful and, although it continues, is no longer our main focus.
Efforts to better understand server crashes are continuing and, guided by
what we have learned so far, work to reduce their frequency is starting.
One task, originally envisioned as a diagnostic test, has been de-scoped,
largely because of financial constraints.
Ia. Speeding recovery from server crashes:
This task, as previously defined, is essentially complete.
The iBoot remote power control devices and the scripts which run them are
now in routine operation. The strings chosen to
trigger tkLogger seem to correctly warn of a server crash.
Work remains on two subtasks:
Fast/smart recovery script.
The script is finished and has been tested as well as is possible without
real server crashes.
Further testing and characterization are desirable.
Server restarts possible without motor re-inits?
This question has been discussed in the past but we have not unequivocally
answered it yet.
Work will proceed on this as a background sub-task.
If such restarts are possible, the goal is to have the recovery script
decide when that (faster) path is available and follow it.
IIa. Upgrade instrument host:
Our original intentions for this task were twofold. Upgrading
the host computer for NIRSPEC to an Ultra 60 will make it more like
NIRC2, which suffers far fewer server crashes.
Additionally, we had hoped to place a host computer directly at the instrument.
This would effectively have jumpered out much of the communications chain
and, in theory, might have shown whether a problem existed in hardware.
During this report period, the cost of an enclosure for the host
was found to be much higher than anticipated. Combined
with some sense of lingering opposition to this test, the decision was
made to de-scope this task to just an upgrade of the host.
Network configuration, operating system installation, and licensing work
are complete.
Low-level software has been installed and tested.
Network problems slowed readiness for the engineering night, which was
cancelled anyway.
Day testing will continue through December so that we are ready if any
part of an engineering night becomes available.
A preliminary possibility is the second half of the night on January 5.
IIb. Correlation research:
Although still at an early stage, this work may have started to pay dividends.
A possible correlation of server crashes with the frequency of motor move
commands has been noticed. An experimental version
of the server code has been created which limits the frequency of motor
commands to one per second. This version of the
server awaits testing.
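
The kind of limit described above (no more than one motor command per
second) can be illustrated with a minimum-interval throttle. This is a
minimal sketch only and does not represent the actual server code: the
class MinIntervalThrottle, the function send_motor_command, and the
"transport" callable are hypothetical names introduced for illustration.

```python
import time


class MinIntervalThrottle:
    """Allow at most one command per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last_sent = None       # time of the most recent command, if any

    def wait_turn(self):
        """Block until a command may be sent; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._last_sent is not None:
            remaining = self.min_interval - (now - self._last_sent)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
                now = self._clock()
        self._last_sent = now
        return waited


def send_motor_command(throttle, command, transport):
    """Pass `command` to `transport` (any callable) no faster than the throttle allows."""
    throttle.wait_turn()
    transport(command)
```

A throttle of this form delays commands rather than dropping them, so a
burst of requests is spread out in time. Whether the experimental server
delays, merges, or discards excess commands is not stated in this report.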
IIc. Crash free periods:
A plot of the frequency of server crashes over
the past few years has been created. The top panel
shows the number of server crashes per hour of clear dark sky.
The points are month by month and are not weighted in any way by the number
of hours NIRSPEC was on sky that month. The line, however,
is a three-month smoothing of the data in which each three-month average is
weighted according to the number of hours on sky. This
plot has allowed us to identify a software change that may have coincided
with the increase in server crashes last spring. A version
of the server which corrects this change has been created and tested, and is
now the default. Work continues, looking for hardware
changes which may have occurred as well.
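
The difference between the unweighted monthly points and the weighted
three-month line can be sketched as follows. This is an illustration, not
the code used to make the plot: the function names are hypothetical, and
the centered window and the truncated windows at either end of the series
are assumptions. Weighting each month's rate by its hours on sky is the
same as dividing total crashes by total hours over the window.

```python
def monthly_rates(crashes, hours):
    """Unweighted crashes per hour for each month (None when no hours on sky)."""
    return [c / h if h > 0 else None for c, h in zip(crashes, hours)]


def weighted_smooth(crashes, hours, window=3):
    """Moving average of the crash rate, weighted by hours on sky.

    The hours-weighted mean of the monthly rates c/h reduces to
    sum(crashes) / sum(hours) taken over the window.
    Assumption: the window is centered and is truncated at the ends.
    """
    half = window // 2
    smoothed = []
    for i in range(len(crashes)):
        lo = max(0, i - half)
        hi = min(len(crashes), i + half + 1)
        total_hours = sum(hours[lo:hi])
        total_crashes = sum(crashes[lo:hi])
        smoothed.append(total_crashes / total_hours if total_hours > 0 else None)
    return smoothed
```

With made-up inputs of crashes [2, 0, 6] and hours [10, 40, 10], the
monthly rates are 0.2, 0.0 and 0.6 crashes per hour, but the weighted
three-month value at the middle month is 8/60, or about 0.13, rather than
the unweighted mean of about 0.27. Months with little time on sky therefore
pull the line far less than they pull the individual points.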
IId. Characterize communications chain:
This task, designed to investigate the tolerances and capabilities of the
communications chain, is planned to get underway in the coming weeks.
The attenuator capable of correctly measuring the tolerances of the fibers
to signal loss is expected to be delivered around the time of this report.
Borrowing or purchasing a SCSI analyser is being investigated.
Gaps in the December NIRSPEC observing schedule will provide an opportunity
for fiber measurement.
IIIa. Reduce communications traffic:
A long-held suspicion that server crashes correlate with excessive communications
traffic appears increasingly likely to be correct. We have
started working on a reduction in traffic:
The new version of the server mentioned above which reduces the frequency
of motor commands is finished and ready for testing.
It appears that a new version of the rotator server code is desirable:
An update will enable consistency with changes to the keyword server.
The code needs to be cleaned up for consistency with other Keck rotator
codes.
Some reduction in communications rate and volume is achievable via an update
to the rotator server code.
Removing the handling of temperature data by the transputers could allow
an additional reduction in communications traffic.
This work will require the help of UCLA since changes to the OCCAM code
are likely necessary.
The path to achieving this goal is not yet clear and some preliminary
research will be necessary.
Issues and Concerns:
The engineering night on November 26 was lost to weather.
LGS-AO engineering is coming up on the nights of December 15, 16, and 17.
Because this engineering is vulnerable to weather, we should investigate
moving NIRSPEC back from AO in time to take advantage of any time that
becomes available.
The second half of January 5 is currently NIRC2 but may be available.
LGS-AO has engineering time at the beginning of February and March,
and we should plan so that we can take advantage of any time that becomes
available due to weather.
The task to jumper out part of the communications chain was de-scoped, but
this remains a desirable test.
We need to investigate other methods of selectively removing elements
of the communications chain.
Although correlation research is now started and may even have produced
some results, we are still not where we should be on this work.