NIRSPEC Reliability Improvement 
Progress report: 10 Dec 2003

Overview:

At the time of this report, the project finds itself at something of a crossroads. Our work to speed recovery has been successful and, although continuing, is no longer our main focus. Efforts to better understand server crashes are continuing and, guided by what we have learned so far, work to reduce their frequency is starting. One task, originally envisioned as a diagnostic test, has been de-scoped, largely because of financial constraints.

Ia.  Speeding recovery from server crashes:

This task, as previously defined, is essentially complete. The iBoot remote power control devices and the scripts that run them are now in routine operation. The strings chosen to trigger tkLogger appear to warn correctly of a server crash. Work remains on two subtasks:

IIa.  Upgrade instrument host:

Our original intentions for this task were twofold. Upgrading the host computer for NIRSPEC to an Ultra 60 will make it more like NIRC2, which suffers far fewer server crashes. Additionally, we had hoped to place a host computer directly at the instrument. This would effectively have jumpered out much of the communications chain and, in theory, might have shown whether a problem existed in hardware. During this report period, the cost of an enclosure for the host was found to be much higher than anticipated. That cost, combined with some sense of lingering opposition to this test, led to the decision to de-scope this task to just an upgrade of the host.

IIb.  Correlation research:

Although still at an early stage, this work may have started to pay dividends. A possible correlation between server crashes and the frequency of motor move commands has been noticed. An experimental version of the server code has been created that limits the frequency of motor commands to one per second. This version of the server awaits testing.
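
For illustration only, the one-command-per-second limit described above could be sketched as follows. This is not the experimental server code; the class name MotorCommandThrottle, the transmit callback and the command strings are hypothetical and stand in for whatever the server actually uses to send a motor move.

```python
import time


class MotorCommandThrottle:
    """Allow at most one motor command per `min_interval` seconds.

    Hypothetical helper: the real server's interface is not shown in
    this report, so `transmit` is just a callable supplied by the caller.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock          # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_sent = None       # time the previous command went out

    def wait_time(self):
        """Seconds that must still elapse before the next command may go."""
        if self._last_sent is None:
            return 0.0
        elapsed = self._clock() - self._last_sent
        return max(0.0, self.min_interval - elapsed)

    def send(self, command, transmit):
        """Delay if needed, then hand the command to `transmit`."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self._last_sent = self._clock()
        return transmit(command)


if __name__ == "__main__":
    throttle = MotorCommandThrottle(min_interval=1.0)
    for cmd in ("move filter 1", "move slit 2", "move rotator 3"):
        # print() stands in for the real transmit routine
        throttle.send(cmd, print)
```

A blocking delay is the simplest choice and preserves command order; a queue that drops or merges redundant moves would reduce traffic further but would change server behavior, so it is not assumed here.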

IIc.  Crash free periods:

A plot of the frequency of server crashes over the past few years has been created. The top panel shows the number of server crashes per hour of clear dark sky. The points are monthly values and are not weighted in any way by the number of hours NIRSPEC was on sky that month. The line, however, is a three-month smoothing of the data in which each three-month average is weighted according to the number of hours on sky. This plot has allowed us to identify a software change that may have coincided with the increase in server crashes last spring. A version of the server that corrects this change has been created and tested, and is now the default. Work continues to look for hardware changes that may have occurred as well.
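
As a sketch of the difference between the plotted points and the smoothed line, the two quantities could be computed as below. The function names and the sample numbers are made up for illustration, and a centered three-month window is an assumption; the report does not say how the window is positioned. Weighting each monthly rate by its hours on sky is equivalent to dividing total crashes by total hours over the window.

```python
def monthly_rates(crashes, hours):
    """Unweighted crashes per hour for each month (None if no hours on sky)."""
    return [c / h if h > 0 else None for c, h in zip(crashes, hours)]


def weighted_three_month(crashes, hours):
    """Three-month average rate, weighted by hours on sky.

    Assumes a centered window (previous, current, next month), shortened
    at the ends of the series. Because
        sum(h_i * (c_i / h_i)) / sum(h_i) == sum(c_i) / sum(h_i),
    the weighted average is total crashes over total hours.
    """
    n = len(crashes)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n, i + 2)
        total_hours = sum(hours[lo:hi])
        total_crashes = sum(crashes[lo:hi])
        smoothed.append(total_crashes / total_hours if total_hours > 0 else None)
    return smoothed


if __name__ == "__main__":
    # Made-up numbers, not NIRSPEC data.
    crashes = [1, 4, 1]
    hours = [10, 10, 30]
    print(monthly_rates(crashes, hours))
    print(weighted_three_month(crashes, hours))
```

With these made-up numbers the middle month has a high rate from only 10 hours on sky, so the weighted line sits well below the plain mean of the three monthly points.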

IId.  Characterize communications chain:

This task, designed to investigate the tolerances and capabilities of the communications chain, is planned to get underway in the coming weeks.

IIIa.  Reduce communications traffic:

A long-held suspicion that server crashes correlate with excessive communications traffic is appearing more and more likely to be correct. We have started working on a reduction in traffic:

Issues and Concerns: