NIRSPEC Reliability Improvement
Progress report: 12
April 2004
Overview:
Steady progress continues to be made toward a release version of the new
keyword and rotator server codes. Development and
testing of the new codes accounts for much of our effort over the past
month. The target date for release has slipped to the end of April.
Evidence continues to accumulate that power supplies and/or power quality
may cause some server crashes. Pursuit of this
possibility represents our second major area of resource dedication.
Time lost on sky continues to decline despite no significant reduction
in the rate of server crashes.
Speeding recovery from server crashes:
We are now confident that the crash recovery procedures this project has
developed are leading to a reduction in time lost on sky.
During this reporting period, frozen GUIs or genuine server
crashes prompted the recovery script to be run on seven occasions.
For the three of these that were genuine crashes, the time to run the script
and recover averaged less than ten minutes. On the other four occasions,
the script correctly diagnosed the situation and restarted the problematic
GUIs, and no time was lost.
Prevention of server crashes: reduction of communications traffic
Parts of four engineering half-nights have now been used for testing of
new keyword and rotator server codes. These codes should prevent
those server crashes that result when the transputers are swamped by high
rates of motor commands. Our most recent chance to test on
sky was two hours on the night of March 31.
-
We tested four variations of instrument operation:
-
PA mode with SCAM guiding at an elevation of 70 degrees.
-
PA mode with SCAM guiding at an elevation of 88 degrees.
-
PA mode with offset guiding at an elevation of 84 degrees.
-
Stationary mode with SCAM guiding at an elevation of 87 degrees.
-
All four modes worked successfully, including those that previously failed:
PA modes above 80 degrees.
-
Over the course of the test at 88 degrees, the rotator moved through about
50 degrees of physical angle and maintained a constant PA on sky as indicated
by the double star observed.
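To illustrate why PA mode is most demanding near the zenith, the following is a minimal sketch of the standard alt-az field rotation rate, rate = W * cos(latitude) * cos(azimuth) / cos(elevation), where W is the Earth's rotation rate. It is not taken from the rotator server code. The function name, the observatory latitude of 19.83 degrees, and the choice of azimuth 180 degrees (on the meridian, where the rate peaks) are assumptions made for illustration, and any extra term that depends on the focal station is ignored.

```python
import math

# Earth's rotation rate in degrees per second (one turn per sidereal day).
SIDEREAL_RATE_DEG_PER_S = 360.0 / 86164.0905


def field_rotation_rate(elevation_deg, azimuth_deg, latitude_deg):
    """Approximate field rotation rate (deg/s) for an alt-az telescope.

    Hypothetical helper for illustration only. The rate diverges as the
    elevation approaches 90 degrees because of the 1/cos(elevation) factor.
    """
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    lat = math.radians(latitude_deg)
    return SIDEREAL_RATE_DEG_PER_S * math.cos(lat) * math.cos(az) / math.cos(el)


if __name__ == "__main__":
    ASSUMED_LATITUDE_DEG = 19.83  # assumption, not stated in this report
    # Elevations used in the March 31 test, evaluated on the meridian.
    for elevation in (70, 84, 87, 88):
        rate = abs(field_rotation_rate(elevation, 180.0, ASSUMED_LATITUDE_DEG))
        print(f"elevation {elevation:2d} deg: {rate:.4f} deg/s")
```

Under these assumptions the required rate at 88 degrees is roughly ten times that at 70 degrees, since the two differ only by the factor cos(70 deg) / cos(88 deg).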
Although we got past our previous failure point, the testing revealed two
problems with the rotator code:
-
Switches back and forth between stationary and PA mode made via the rotator
GUI were not being detected by the code.
-
Crashes of the rotator server code, which can happen anywhere on the sky,
are handled less gracefully at high elevations:
-
The current solution to this problem involves a wrapper which automatically
restarts the rotator server.
-
At high elevations, demands on the rotator can be so high that the
restart time compromises performance.
-
A fix is thought to be in hand and is being tested this week during
the day.
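The restart-wrapper approach described above can be sketched roughly as follows. This is a generic illustration and not the project's actual wrapper: the function name `supervise`, its parameters, and the stand-in child command are all hypothetical.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=5, restart_delay_s=1.0):
    """Run `command` and restart it whenever it exits with an error.

    Hypothetical sketch. Returns the number of restarts performed once the
    command exits cleanly, and raises RuntimeError if it keeps failing.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)  # blocks until the child exits
        if exit_code == 0:
            return restarts  # clean shutdown, nothing more to do
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts (last exit code {exit_code})"
            )
        restarts += 1
        time.sleep(restart_delay_s)  # time during which the server is down


if __name__ == "__main__":
    # Stand-in for a server process: a child that always exits with an error.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    try:
        supervise(crashing, max_restarts=2, restart_delay_s=0.1)
    except RuntimeError as err:
        print(err)
```

In a sketch like this, the delay plus the server's own startup time is the interval during which the rotator is not being commanded, which is why restart time matters most where the required rotation rate is highest.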
The next unclaimed engineering time is a half-night on May 30.
Pending the results of day testing starting April 12, we may approach an
observer with time at the end of April offering to trade time on May 30.
Prevention of server crashes: power quality?
Evidence continues to accumulate that some fraction of server crashes may
result from problems with power quality. The correlation
between power interruptions and server crashes continued with the most recent
brownouts on April 2.
-
A monitor on the computer room UPS indicated no problem with the power
it supplied during the previous outage.
-
During the current period in which NIRSPEC is off sky we plan to:
-
put a monitor on the instrument UPS
-
further investigate the power supply previously found to be anomalous
(supplying 3 A versus a specification of 1 A)
Issues and Concerns:
-
If we are unable to arrange a trade to get some on-sky time at the end of
April, we are faced with three options:
-
Claim, with no compensation, some time from observers for further testing.
-
Release the code without complete testing.
-
Wait until May 30 for further testing.