NIRSPEC Reliability Improvement
Progress report: 11 February 2005

Overview:

This update covers the 3-month statistics-gathering period from November 2004 through January 2005. During this time, no changes were made to the keyword server or to any rotator software.

Fault Statistics:

In general, NIRSPEC has performed well, with time lost to faults decreasing every month. The numbers from METRICS and the Nightlogs differ, but they show the same qualitative trend. The Nightlog fault numbers are lower than those from METRICS.

Instrument Comparison

HIRES

NIRSPEC

The December 2004 fault percentage is actually 2.8% when duplicate tickets are removed.


Current Problems

There appear to be four distinct problems, none of which is completely understood.

Cannot abort SPEC exposures

Aborting an exposure usually (but not always) results in the system eventually becoming dysfunctional. This problem also plagues NIRC2. We believe the problem lies in the coding of the Transputers themselves.

Dysfunctional rotator server

On rare occasions the rotator server becomes dysfunctional. This does not appear to be related to DCS disconnects: that situation has been forced to occur, and on other occasions it has occurred spontaneously, yet in both cases the system recovered as expected. When the server does become unresponsive, there is insufficient feedback to glean anything useful. Additional logging was added in the most recent software modifications, but that logging has never produced any output.

Unresponsive keyword server--transputers OK

The keyword server becomes unresponsive. When this problem occurs, the thermal data continues to be transmitted from the transputers, so the problem appears to be in the keyword server itself. This problem may have existed for a long time, masked by the other problems so that we never addressed it, or recent modifications may have created it. We have a procedure in place to force 'core dumps' when this problem occurs. So far that has been successful only once, during the day. The indications are that the server is simply blocked waiting for data, as though a command to the transputers was lost and so no response is ever sent back.
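The blocked-wait symptom described above can be illustrated with a minimal sketch. It assumes a simple command/reply exchange and uses Python queues as hypothetical stand-ins for the link to the transputers. It is not the actual keyword server code, and all names here (send_command, fake_transputer, ReplyTimeout) are invented for illustration. The point is only that a bounded wait turns a lost reply into a reportable error rather than an indefinite hang.

```python
import queue
import threading


class ReplyTimeout(Exception):
    """Raised when no reply arrives within the allowed time."""


def send_command(link_out, link_in, command, timeout_s=2.0):
    """Send a command and wait a bounded time for its reply.

    link_out and link_in are hypothetical stand-ins for the
    server-to-transputer and transputer-to-server channels.
    """
    link_out.put(command)
    try:
        # A plain link_in.get() would block forever if the reply is lost.
        return link_in.get(timeout=timeout_s)
    except queue.Empty:
        raise ReplyTimeout(
            f"no reply to {command!r} after {timeout_s} s"
        ) from None


def fake_transputer(link_out, link_in, drop):
    """Answer each command unless it is in `drop` (simulated loss)."""
    while True:
        command = link_out.get()
        if command is None:  # sentinel: stop the simulated transputer
            break
        if command not in drop:
            link_in.put(f"ok:{command}")
```

In this sketch a caller could catch ReplyTimeout, log the command that went unanswered, and decide whether to retry or report a fault, which would also give the kind of feedback that is missing when the server hangs silently.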

Unresponsive keyword server--transputers not OK

The keyword server becomes unresponsive and the transputers do not transmit thermal data. This problem most likely resides in the Transputers. We do not think it is in the communication chain, as power cycling the various communication components does not bring the system back to life.

Note that the two "Unresponsive keyword server" problems together occur about once every 3 nights of observing. Each of these "crashes" costs about 20 minutes of observing time for recovery and target re-acquisition.
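The crash rate and recovery time quoted above can be turned into an average cost per night. The short calculation below is a sketch only: the 10-hour night length is an assumption chosen for illustration, not a figure from this report.

```python
def expected_loss(crashes_per_night, minutes_per_crash, night_hours):
    """Return (average minutes lost per night, percent of the night lost)."""
    minutes = crashes_per_night * minutes_per_crash
    percent = 100.0 * minutes / (night_hours * 60.0)
    return minutes, percent


# About one crash every 3 nights, about 20 minutes per crash (from the text).
# night_hours=10 is an assumed night length, not a measured value.
minutes, percent = expected_loss(1 / 3, 20, night_hours=10)
print(f"{minutes:.1f} minutes per night, {percent:.1f}% of the night")
# prints "6.7 minutes per night, 1.1% of the night"
```

Under that assumption these crashes alone cost roughly 7 minutes per night on average.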


Next Steps

There are essentially 3 different paths we can choose:

  1. Do Nothing

    We assume the current fault trend will continue and NIRSPEC will not develop any other problems.

    Pros:
    little to no additional effort
    Cons:
    this is as good as it gets


  2. Continue as before...tracking down problems in the server

    We look for the cause of the "unresponsive keyword server", but do not improve instrument infrastructure.

    Pros:
    We may drop the time lost to faults even further.
    Cons:
    We have not found the cause in the previous 3 months so we may not find it.
    Little bang for the buck.


  3. Begin a new project with a different focus

    We have greatly decreased the crash recovery time but we still see crashes. Perhaps there are other areas for improvement.

    The new project will have 2 areas of focus:

    1. Efficiency

      • Adjustable flat lamp intensity
      • faster execution of scripts from EFS

    2. Long-term NIRSPEC improvements

      • Make NIRSPEC more like NIRC2
        • Remove temperature and motor control from server
        • Upgrade the host
      • Prepare for life after transputers

    Pros:
    NIRSPEC becomes more reliable
    Can use OSIRIS s/w infrastructure
    Transputers were obsolete when NIRSPEC was installed
    Cons:
    Costly in FTEs