![](../../logo.jpg)
NIRSPEC Reliability Improvement
Progress report: 11
February 2005
Overview:
This update covers the 3-month statistics-gathering period from
November 2004 through January 2005. During this time, no changes were made
to the keyword server or to any rotator software.
Fault Statistics:
In general, NIRSPEC has performed well, with time lost to faults
decreasing every month. The actual numbers differ between
METRICS and the Nightlogs, but the trends are qualitatively the same. The
Nightlog fault numbers are lower than those from METRICS.
Instrument Comparison
HIRES
![](../../images/monthChart_HIRES.png)
NIRSPEC
![](../../images/monthChart_NIRSPEC.png)
The December 2004 fault percentage is actually
2.8% when duplicate tickets are removed.
Current Problems
There appear to be four distinct problems, none of which is
completely understood.
Cannot abort SPEC exposures
- Aborting an exposure usually (though not always) results in the
system eventually becoming dysfunctional. This problem also plagues
NIRC2. We believe the problem lies in the coding of the Transputers
themselves.
Dysfunctional rotator server
- On rare occasions the rotator server becomes dysfunctional. This does
not appear to be related to DCS disconnects: that situation has been forced
to occur, and on other occasions it has occurred spontaneously, yet in both
cases the system recovered as expected. When the server does become
unresponsive, there is insufficient feedback to glean anything useful.
Additional logging was added in the most recent software modifications, but
that logging has never produced any output.
Unresponsive keyword server--transputers OK
- The keyword server becomes unresponsive. When this problem occurs, the
thermal data continues to be transmitted from the transputers, so the
problem appears to be in the keyword server itself. It may be that this
problem has existed for a long time and was masked by the other problems,
so we never previously addressed it, or perhaps recent modifications have
created it.
We have a procedure in place to force 'core dumps' when this problem
occurs. So far that has only been successful once, during the day.
The indications are that the server is simply blocked waiting for
data, as though a command to the transputers was lost and so no response
is ever sent back.
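The blocked-read behavior described above can be illustrated with a minimal Python sketch. It is not the NIRSPEC keyword server code: the function name `request_with_timeout`, the `STATUS?` command, and the local socket pair standing in for the server-to-transputer link are all assumptions made for illustration. It shows how a reply that never arrives leaves a plain blocking read waiting indefinitely, and how a read timeout would turn that into a detectable condition.

```python
import socket


def request_with_timeout(sock, command, timeout_s):
    """Send a command and wait for a reply; return None if none arrives in time.

    Hypothetical helper for illustration only. Without the timeout, recv()
    would block forever when the reply is never sent.
    """
    sock.settimeout(timeout_s)
    sock.sendall(command)
    try:
        return sock.recv(1024)
    except socket.timeout:
        return None


def demo():
    # A connected local socket pair stands in for the server <-> device link.
    server_end, device_end = socket.socketpair()
    try:
        # Case 1: the device answers, so the request returns the reply.
        device_end.sendall(b"OK")
        answered = request_with_timeout(server_end, b"STATUS?", 0.2)

        # Case 2: the command is effectively lost (no reply is ever sent),
        # so the request gives up after the timeout instead of hanging.
        lost = request_with_timeout(server_end, b"STATUS?", 0.2)
        return answered, lost
    finally:
        server_end.close()
        device_end.close()


answered, lost = demo()
```

In this sketch a `None` result marks the lost-command case, which a server could log or retry rather than remaining blocked.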
Unresponsive keyword server--transputers not OK
- The keyword server becomes unresponsive and the transputers do not
transmit thermal data. This problem most likely resides in
the Transputers. We do not think it is in the communication chain,
as power cycling the various communication components does not
bring the system back to life.
Note that the two "Unresponsive keyword server" problems together
occur about once every 3 nights of observing. Each of these "crashes"
costs about 20 minutes of observing time for recovery and target re-acquisition.
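As a rough worked example, the crash rate and recovery time quoted above can be turned into an average loss per night. The crash rate (one per 3 nights) and the 20-minute cost come from this report. The 10-hour night used to express the loss as a fraction is an assumption for illustration only, and the function name is hypothetical.

```python
def expected_loss_per_night(crashes_per_night, minutes_per_crash):
    """Average observing minutes lost per night to crashes."""
    return crashes_per_night * minutes_per_crash


# From the report: about one crash every 3 nights, about 20 minutes each.
loss_min = expected_loss_per_night(1 / 3, 20)

# ASSUMPTION: a 10-hour observing night, used only to express a fraction.
NIGHT_MINUTES = 10 * 60
loss_fraction = loss_min / NIGHT_MINUTES

print(f"{loss_min:.1f} min/night, {100 * loss_fraction:.1f}% of an assumed 10 h night")
```

Under these numbers the average loss is about 6.7 minutes per night, or roughly 1.1% of the assumed 10-hour night.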
Next Steps
There are essentially 3 different paths we can choose:
Do Nothing
We assume the current fault trend will continue and NIRSPEC will
not develop any other problems.
- Pros:
- little to no additional effort
- Cons:
- this is as good as it gets
Continue as before...tracking down problems in the server
We look for the cause of the "unresponsive keyword server", but do
not improve instrument infrastructure.
- Pros:
- We may drop the time lost to faults even further.
- Cons:
- We have not found the cause in the previous 3 months so we may not
find it.
- Little bang for the buck.
Begin a new project with a different focus
We have greatly decreased the crash recovery time but we still see
crashes. Perhaps there are other areas for improvement.
The new project will have 2 areas of focus:
Efficiency
- Adjustable flat lamp intensity
- Faster execution of scripts from EFS
Long-term NIRSPEC improvements
- Make NIRSPEC more like NIRC2
- Remove temperature and motor control from server
- Upgrade the host
- Prepare for life after transputers
- Pros:
- NIRSPEC becomes more reliable
- Can use OSIRIS software infrastructure
- Transputers were obsolete when NIRSPEC was installed
- Cons:
- Costly in FTEs