June 30, 1987
Dr. Robert Moon
Assistant to the President
Dear Dr. Moon,
After assisting a site experiencing over two weeks of unsched-
uled computer down-time, I felt it best to summarize some
aspects for your review.
The site was NASA Goddard in Greenbelt, Maryland. Evidently
their Sigma 9 system was affected by a "power hit" about June
5. They attempted a reboot from PO tape, but had 96 GENMD
errors in the process. The resulting operating system did
not run properly. All available hardware diagnostics were
run, but they did not show any errors. The NASA site is currently
maintained by Western Computer Systems, but was maintained
by Honeywell through December 31, 1986. I was contacted
the evening of June 18, and was on-site early the morning of
June 20. Because of the TeleXchange conference in Washing-
ton D.C. it was convenient for me to go to NASA. I found the
problem and suggested a correct repair after four and a half
hours. I was on-site less than six hours.
Following are some of my observations and further comments:
First, since I was not helping them as part of my work for
Andrews University or for Telefile, I will cover the travel
expenses to the area, and several phone calls, if you see fit.
I realize that, whereas it had been Telefile's policy to pro-
vide my travel expenses to attend TeleXchange, and whereas
this policy was being "bent" concerning this meeting, that
portion of the expense of attending this conference was
unplanned for Andrews University. You are aware that I feel
very strongly about our other staff having the opportunity
to occasionally attend such functions and am willing to help
make certain this is a reality.
It is not yet clear how much I will be paid for this assist-
ance. Since they had been down for so long, I did not feel it
was right to expect payment if I could not help them, especial-
ly since little money was required to cover expenses. Jimmy
Nishry of Telefile had made contact with Michael Sutch of
Lockheed (primary contractor with NASA for this project) to
assure that if I did help them, I would be paid "what it was
worth".
Second, although the timing was quite convenient as far as my
trip was concerned, the fact that I arrived in the area
Friday evening and they were still down presented a dilemma.
Applying the statement in Luke 14:5 about "an ox fallen into a
pit", I felt it necessary to assist them even on the Sabbath
day. In so doing I was careful to not misrepresent my
church's and primary employer's fundamental beliefs concerning
proper keeping of the seventh-day Sabbath.
Even at our site, some computerized functions must now be
maintained on the Sabbath day. Although manual backup facili-
ties have been provided, if a simple repair can restore ser-
vice we have been performing it. In my search for Luke 14:5
I came across Luke 13:15 concerning watering cattle on the
Sabbath and felt much better about occasionally stopping in
the computer room on the Sabbath to look, feel, smell, and
listen for any abnormalities. Although I personally had no
problems with this, I really did not have any good answers if
someone questioned this activity.
Third, anytime a computer site has third-party, or even vendor
maintenance, they are very vulnerable to the finger-pointing
syndrome. With vendor maintenance the vendor must keep the
customer happy if they expect a continued relationship. The
original vendor will have not only hardware experts, but also
software experts. They probably also have the best contacts
with experts in the field. In their situation, Honeywell was
really also a third-party maintenance firm since Xerox took
the plunge in 1975. Honeywell also seems to have chosen to
let these sites fade away. For such Xerox/Sigma users, the
only alternatives to third-party maintenance that avoid a
costly software conversion are new, compatible equipment or
self-maintenance.
With third-party maintenance, you are much less likely to have
software expertise available, and there is little incentive
to bring in costly expert trouble-shooters. It is not clear
to me how much you are at the vendor's mercy for maintenance,
but it seems that with most standard contracts, if they can
demonstrate that they are trying to solve the problem, it could
take forever. Although it was before you were director here,
you may recall the situation we were in with Honeywell. They
would not provide experienced customer engineers, so many
problems were resolved after they left for the day. We were not
supposed to touch the hardware, so the failing module would be
reinstalled in the morning and we would "help" them find it.
Self-maintenance may have its own set of problems, but finger-
pointing really isn't one of them. A little in jest does occur,
but it is a healthy way to discuss the situation and
resolve the problem. It usually provides learning for all
involved as reasons must be given and hypotheses tested.
One area in which self-maintenance shines, above all other types,
is dealing with intermittent problems. This, of course, is
only true when detailed logs and continuity of personnel are
maintained.
Fourth, at our site we try to use multiple equipment of the
same type. This has not always been the case, but risk analy-
sis has been performed, sparing levels evaluated, and contin-
gencies planned when we deviate from this rule. Fault isola-
tion is greatly aided by this arrangement. Doing our own
maintenance has greatly assisted in this since we have been
able to buy used equipment at low prices and hence upgrade
our capacity while maintaining multiplicity.
To assist in such fault isolation, you will recall I sought
and obtained your permission to take a disk pack with a disk-
swapping, single-pack operating system (CP-V version C00). Having
tested it at Andrews University, I knew that it could help
eliminate the 7212 RADs, Honeywell MPC tape drives, PO tape
and CP-V version F00 as possible problems. This pack was
not used in the fault isolation process.
I also took with me a copy of our Telefile diagnostic tape
with Sigma 9 snap-data. Since they had already had a Telefile
field service engineer help the weekend before, I expected
no problem securing permission from Telefile for its use,
perhaps after the fact. The tape was unused and remained in
the car. However, it seemed that no one had even thought to
run the CPU diagnostics forcing a comparison with the snap-data
(a standard option). In fact, it seems that they did not have
the data available. Although this is very helpful in finding
problems where the tests fail, I think I have seen this com-
pare fail even when the answers agreed! (The diagnostic tape
was written at 800 bpi since the MPC tape controllers do not
need their firmware loaded to read such tapes.)
Fifth, we maintain multiple copies of critical tapes such as
PO tapes, source tapes, diagnostic tapes, backup tapes, etc.
These copies are not only maintained on-site, but also at
various locations off-site. Since NASA only had one copy
of their PO tape, much time had been spent before it was
checked elsewhere to determine that it was not the problem.
Since few, if any, similar hardware configurations exist,
it was not tested completely. A system source tape was also
located just prior to my arrival (we have a copy as well).
After they had been down a week, they called in their former
systems programmer. He had been gone for four years and thus
was very rusty. It is also not clear to me that he ever got
very involved with many operating system internals.
Sixth, we maintain operating system listings on paper and on
microfiche. In addition we have several copies of technical
manuals which describe the internals of our operating system's
predecessor (UTS). This documentation is invaluable in a
situation like this, and appropriate protection is used to be
certain that it remains usable (i.e. does not wander off and
is not otherwise destroyed). I took our microfiche copy of the
operating system version they are running, but it was not
needed since they had already obtained a copy from elsewhere
the day before.
For all my preparation, what I used to solve their problem
I carried there in my head. We are still very dependent on a
few individuals who understand how everything fits together and
works. This is much more than just systems programmers who
can code in assembly, or board-swapping customer engineers.
Since most of our routine maintenance procedures are not docu-
mented, this information is "fragile".
These individuals cannot learn this in classrooms or by doing
only routine day-to-day maintenance. It must be learned by
trying to expand features and develop new capabilities. This
takes years. George was ready to accept the responsibility
of self-maintenance after six long, hard years of preparation.
I had to accept the responsibility after five such years, but
during much of this time we were experiencing rapid expansion
and had a critical mass of development talent. Our current
maintenance staff all have three or fewer years of full-time
experience.
This means that development and documentation are essential to
our continued internal training. With the current amount of
computer and communications hardware as well as software to be
maintained and the current staffing, this is often and
continually squeezed out. Everything seems to run at a fever
pitch and the pressure means that everything must succeed.
There is little, if any, time to develop our diagnostic tools
or to try experiments which may fail but whose gains outweigh
such risks. It is
also hard to schedule time to take classes since we only have
single coverage of our major areas. Unless an environment
can be created where regular hours and a normal workload
exist, we will continue to lose valuable personnel.
Well, last I knew NASA's Sigma 9 was busily processing data
from the Dynamics Explorer satellite. Although it has been
a while since we had any significant downtime, one really
never knows when a really nasty gremlin will strike. Meanwhile,
we will continue to solve the problems as they come along and
hopefully we will be ready for the hard ones!
A transcription (with minor editing) of my sitelog entry is
also enclosed.
Best regards,
s/Keith G. Calkins
Keith G. Calkins
AU Computer Systems Manager
and XDX Consultant
cc: Jimmy Nishry
Michael Sutch
6/20/87
08:15 Back on System
Checked memory after PCL read of tape
Contents intact.
Checking transfer to user area.
11:45 Move Byte String failure (with certain conditions)
[Greg Molenaar]
12:14 Replaced module (fixed) - Jack Muff
6/20/87 NASA-Goddard sitelog entry by Keith G. Calkins.
I arrived on-site with Jack Muff (of Western) about 7:45 a.m.
We talked about the problem and I understood that CP-V F00
"ran", but had 96 GENMD errors when booted from tape. The GENMD
errors were the result of "holes" in the REF/DEF record of
load modules read from tape. They (a Honeywell Bull CE had
also arrived) assured me that the tape was verified good on the
Sigma 5 "downstairs". "Holes" were 30-35 words long. I asked
if they were 32 words long, but they were not certain. They
indicated that copying the LMN's off of tape resulted in holes
when copied to UC(X). They had already discovered that it
made a difference whether or not the HEAD record was first.
With the HEAD record first, it failed. They were certain it
was coming in off tape ok, but didn't know yet where the
problem was.
My first step was to look at SBUF1 in memory to find out
if the data was ok there, since it was bad in the user area.
It was ok so we knew it was likely a CPU problem.
The next step was to try to find the actual monitor code
which moved the data into the user page. To do this we had
to find the physical page receiving the data by getting it out
of JX:CMAP. With this [and the PCP stop address] we estab-
lished that RBLK6 in the RDF module of CP-V actually moved
the data with an MBS of 252 bytes, except for the remainder.
We established that it moved the 252-byte blocks ok, but that
a remainder of 192 (.C0) bytes was still to be moved. We established
that the MBS,12 0 at .6FA9 with 12/.00025740/ and
13/.C00316E4/ only moved 64 (.40) bytes instead of 192
(.C0).
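The arithmetic behind these figures can be checked with a short
sketch. It is an illustration only, not part of the original log.
It assumes that the byte count of the MBS is carried in the top
byte of register 13 (which agrees with the .C0 quoted above) and
uses the Sigma word size of four bytes. The names decode_count and
WORD_BYTES are hypothetical.

```python
WORD_BYTES = 4  # a Sigma word is 32 bits, i.e. four 8-bit bytes


def decode_count(register_value: int) -> int:
    """Return the top byte of a 32-bit register value.

    Assumption: this byte holds the MBS byte count.
    """
    return (register_value >> 24) & 0xFF


requested = decode_count(0xC00316E4)  # register 13 as recorded in the log
moved = 0x40                          # bytes actually moved, per the log

hole_bytes = requested - moved
hole_words = hole_bytes // WORD_BYTES

print(f"requested {requested} bytes, moved {moved} bytes")
print(f"hole of {hole_bytes} bytes = {hole_words} words")
```

Under these assumptions the shortfall is 128 bytes, or 32 words,
which falls within the 30-35 word "holes" reported at the start
of the day.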
We then analyzed the MBS/CBS phase sequencing charts and
single-stepped the instruction. In PH44 the E register
had .40 instead of .C0. This came from E+1 in PH43 which
was ok at .BF. We looked at the logic equations for the
register frame and decided that the FG18 in 22K contained
most of this logic, and was thus suspect. It was swapped
with 21K which had bits 4-7 of the E register and the
floating-point diagnostic failed solid. We replaced the
FG18 from Western spares and the problem was resolved.
We did not resolve which component had failed, but I sus-
pect one of the many diodes present.
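The E register values above can be compared bit by bit with a
small sketch. It is an illustration only. It assumes an 8-bit E
register with bits numbered from the most significant end (bit 0)
as is the Sigma convention. The function name differing_bits is
hypothetical.

```python
def differing_bits(expected: int, observed: int, width: int = 8) -> list[int]:
    """Return positions of differing bits, counting the most significant bit as 0."""
    diff = expected ^ observed
    return [i for i in range(width) if diff & (1 << (width - 1 - i))]


expected = (0xBF + 1) & 0xFF  # E+1 from PH43 should give .C0
observed = 0x40               # value actually seen in PH44

bad_bits = differing_bits(expected, observed)
print(f"expected .{expected:02X}, observed .{observed:02X}, differing bits: {bad_bits}")
```

Under these assumptions only bit 0 differs, which lies outside
the bits 4-7 handled by the module in 21K.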
The problem was emphasized by the format of records on labeled
tape, and the way data is moved into the user's data area. It
thus could only be found with an extensive knowledge of CP-V.