Study of Warner's Intermittent
Sigma Hardware Failures and
CP-V Software Slowdowns

 

 

For Michael Granick
Vice President, Data and Communications Services

Warner Computer Systems, Inc.
17-01 Pollitt Drive
Fair Lawn, NJ   07410

 

 

By Keith G. Calkins
Sigma Specialist

XDX
610 N. Main St., Berrien Springs, MI 49103

 

 

April 28, 1989

 




 

Purpose and Scope

The purpose of this study was to 1) debug the Warner Computer Systems, Inc. (WCS) W9 computer system and 2) analyze the performance of the WCS C9 computer system. In addition, intermittent failures of the WCS K9 computer system were resolved. The remainder of this report is divided into three sections: Executive Summary, Recommendations, and Detailed Discussion.

Executive Summary

The WCS C9 computer system was found to be operating as I would expect with the current hardware configuration, operating system software, application software, and data base, under the existing user load.

The WCS C9 computer system appears to be operating close to the non-linear region of the system response curve, hence minor perturbations periodically cause it to process inefficiently ("thrash") under the day-time user load.

Several factors contributing to these slowdowns were identified. Among these were: 1) the recent migration of the Hanover files to the Amgro files, 2) frequent searches for invalid policy numbers, 3) ADJS private pack I/O rate approaching physical limits, 4) low slave CPU utilization, and 5) high swap rates.

Use of the WCS K9 computer system was required for analysis of the WCS C9 system and for production normally run on the WCS W9 system. Unreliable memory components contributing to K9 system failures were isolated. Since this system was only recently assembled, further "infant mortality" and intermittent failures should be expected; normal error-checking techniques should detect them.

Examination of the WCS W9 computer system revealed two interleaved, intermittent failures occurring with W1 as a mono-CPU system. These were: 1) map parity errors and 2) data bit 0 set incorrectly with no associated parity errors. During one CVA failure, bit 4 was also observed reset incorrectly.

Problem 1 is clearly in the CPU; one failing map module was isolated by cycling the map modules through the decimal unit and running the decimal diagnostic.

Problem 2 appears to be memory, since it persisted for the duration of a crash. There was insufficient information to determine unit or bank.

The problem running WCS W9 with dual CPUs appears to have been in the "near" memory unit. After the PBA boards were exchanged between the memory units, the problem had not recurred as of this writing.

I have enjoyed working with your highly competent staff, particularly Dennis German, Jim Holland, Tony Chapman, John Deschner, and Dick Ingulli. I have shared in their frustrations regarding prioritizing frequent interruptions and attempting to simultaneously solve multiple, intermittent failures.


 

Recommendations

Improve operating system performance and utilization by:

Repairing the operating system read-ahead problem,
Further study of master CPU utilization,
Debugging the SUMMARY system statistics processor, and
Consideration of equitable allocation of space on multivolume private pack sets.

Improve data base utilization by:

Analyzing OK files to understand current application usage,
Further study and possible redistribution of disk accesses,
Reducing searches for invalid policy numbers,
Eliminating the need for system page cleaning, and
Considering the use of padded records to reduce fragmentation.

Improve system failure analysis by:

Merging appropriate diagnostics onto one diagnostic tape,
Correcting CP-V multiprocessor M:STIMER scheduling bug,
Training Hardware Support personnel to function with more independence when using software clues, or
Obtaining additional personnel to assist Dennis in assisting Hardware Support personnel.

Improve intermittent failure trouble-shooting techniques by:

Chronologically logging intermittent failures by system,
Collecting additional pertinent information at time of failure,
Improving utilization and understanding of the logic analyzer, and
Documenting expected diagnostic results (failures).

Detailed Discussion

Although the C9 computer system is currently running at or near the "knee" of the system response curve, some options are open to accommodate further expansion. These are summarized below under Hardware, Operating System, Application Software, and Data Base. In addition, comments regarding techniques to improve the trouble-shooting of intermittent failures and other pertinent observations are provided.

Definition

ETMF      Estimated Time Multiplication Factor.
This is usually calculated as the number of computable users on the system.
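For illustration only, the following minimal sketch (written in modern Python purely as notation; the state names are my assumptions, not actual CP-V state codes) shows the sense in which ETMF is taken as the count of computable users:

    # Illustrative only: estimate ETMF as the number of computable users.
    # The state names below are assumptions, not actual CP-V state codes.
    def estimate_etmf(user_states):
        computable = ("RUNNING", "READY")
        return max(1, sum(1 for s in user_states if s in computable))

    # Example: 5 computable users among 150 logged on gives an ETMF of
    # about 5, i.e. a request takes roughly 5 times its stand-alone time.
    snapshot = ["RUNNING"] + ["READY"] * 4 + ["IDLE"] * 145
    print(estimate_etmf(snapshot))   # -> 5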

Hardware

Even at times of high day-time ETMF, the slave CPU is only 25% utilized; hence a third CPU would not significantly improve system response (excepting redundancy during a CPU failure). However, high slave CPU utilization during evening processing indicates CPU intensive activities in spite of the heavy I/O activity.

The current disk drives have a theoretical average access time of 40 milliseconds. This is equivalent to 1500 reads or writes per minute. Preliminary evidence suggests that this is being exceeded on at least the ADJS private pack during the day-time processing when accesses should be approximately random.
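The arithmetic behind the 1,500 figure is shown below as a minimal sketch (Python is used only as convenient notation; the 40 millisecond value is the theoretical average quoted above):

    # The arithmetic behind the 1,500 reads-or-writes per minute ceiling.
    AVG_ACCESS_MS = 40            # theoretical average access time per I/O
    MS_PER_MINUTE = 60 * 1000

    max_ops_per_minute = MS_PER_MINUTE / AVG_ACCESS_MS
    print(max_ops_per_minute)     # -> 1500.0
    # A pack sustaining random accesses near this ceiling (as ADJS appears
    # to during day-time processing) has essentially no headroom left.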

Evening processing generates I/O rates far exceeding the day-time rates, but this evening processing is primarily sequential and thus access times substantially less than average can be expected. However, the evening data may be used to show that the two channels (dual access) are not a limiting factor during day-time processing. Obviously, disks with faster access times would relieve this bottleneck.

Abnormally high swap rates at times of heavy usage indicate that the current 512 K-words of memory may become a limiting factor as the number of users grows.

Operating System

All system generation parameters were examined and many were discussed in detail with Dennis. It was verified that the following are not currently performance-limiting factors:

  1. Swap space (1500 granules free with 150 users)
  2. IOQ entries (runs out very infrequently)
  3. Read-ahead table entries (but see below)
  4. Users not queuing up in state SQR (open/close, etc.)
  5. Memory not tied up with I/O (state SIOW).

All other values looked reasonable, although time only allowed a preliminary determination that MPOOLs and COC buffers were not normally a limiting factor. Refined tools would be needed for a final determination on these two parameters.

Read-ahead is not functioning for either file directories or sequential file processing. This significantly affects evening processing. If fixed, a patch supplied by Andrews University in years past for multi-volume pack sets should also be implemented.

Some CPU activity is restricted to the master CPU. It appears that such activities currently limit performance on the WCS C9 computer system.

Two such items which could absorb a high amount of master CPU utilization are 1) scheduling (which could possibly be reduced by increasing MINQ1 and perhaps QMIN) and 2) COC output processing (which could possibly be reduced by buffering short writes). COC output is running in excess of 80,000 characters per minute during peak times. Although not particularly high in and of itself, if these are primarily short writes, then this could help explain high user service time.
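As an illustration of the buffering idea, the sketch below (modern Python notation; the 256-character threshold and flush rules are my assumptions, not CP-V parameters) shows how many short writes could be coalesced into one larger device write:

    # Assumed illustration of coalescing short terminal (COC) writes.
    class CocOutputBuffer:
        def __init__(self, device_write, limit=256):
            self.device_write = device_write   # the real, expensive write
            self.limit = limit                 # assumed flush threshold
            self.pending = []

        def write(self, text):
            self.pending.append(text)
            if sum(len(t) for t in self.pending) >= self.limit:
                self.flush()

        def flush(self):
            if self.pending:
                # One larger write replaces many short ones, reducing the
                # per-write master-CPU overhead discussed above.
                self.device_write("".join(self.pending))
                self.pending = []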

It was observed that all compute-bound users at times of high ETMF had a "special compute" priority boost. Time did not permit a full investigation of the nature of these compute requests, but half appeared to be logging on or off -- a frequent process on this system.

Some operating system tuning may optimize performance. Different parameter values may be appropriate for day-time versus evening processing. Strict configuration control and record keeping are essential when tuning system parameters. Normal variations in the day-time load will make interpretation difficult.

It was observed that when ETMF rises, the counts of users in the swap-wait and combined I/O-and-swap-wait states increase, as does the swap I/O rate. System parameter tuning may reduce this thrashing. Further analysis of system statistics was hampered by bugs in the SUMMARY processor.

Application Software

The following items will limit system productivity, if not now then in the future:

Clerks keying in obviously invalid policy numbers (such as N, NJA during a REIN command) and the system subsequently searching for them. This results in a worst-case binary search on the large keyed file and thus excessive disk activity. Simple validity checks in the application program should alleviate this overhead.
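A minimal sketch of such a validity check follows (Python notation only; the assumed policy-number format of a short letter prefix followed by digits is illustrative, not Warner's actual format):

    import re

    # Assumed policy-number format: 1-3 letters followed by 6-8 digits.
    POLICY_PATTERN = re.compile(r"^[A-Z]{1,3}[0-9]{6,8}$")

    def looks_valid(policy_number):
        # Reject obviously bad keys before any keyed-file read is issued,
        # avoiding the worst-case search and its disk activity.
        return bool(POLICY_PATTERN.match(policy_number.strip().upper()))

    print(looks_valid("NJA"))         # False -- rejected with no disk I/O
    print(looks_valid("AB1234567"))   # True  -- proceed with the keyed read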

Rewritten records become repeatedly fragmented, often with only a slight increase in size. These fragments then require additional disk access. The actual extent of this problem was not determined.
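The padded-record idea mentioned under Recommendations could work roughly as sketched below (Python notation; the ten percent pad factor is an assumption chosen only for illustration):

    # Pad each record so that modest growth on rewrite still fits in place.
    PAD_FACTOR = 1.10        # assumed 10% pad, for illustration only

    def padded_length(data_length):
        return int(data_length * PAD_FACTOR) + 1

    record = b"policy data ..."
    allocated = padded_length(len(record))
    # A later rewrite of up to `allocated` bytes reuses the same extent
    # instead of creating a new fragment that costs another disk access.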

Operating system page cleaning, due to poor programming (uninitialized data, aggravated when used as a shared processor), is additional unwarranted overhead.


Brokers and/or other users may be accessing the data in inefficient ways, thus reducing system responsiveness. Examples may include excessive alpha searches and data retrieval at microcomputer instead of human speeds.

Data Base

I/O accesses are not distributed uniformly over the EVE3/4 (EVE1/2), ODD3/4 (ODD1/2), and ADJS private packs. Although evening disk activity is significantly higher than day-time activity, day-time responsiveness may be affected by this non-uniform distribution, without an increase in ETMF. In addition, as the Hanover data base migrates into the Amgro data base, a new disk access pattern is being established, with additional "hot-spots".

This access imbalance is due to several factors such as: 1) space is completely allocated on one pack before the other, 2) the minor and broker files are both on the same pack set, and 3) although certain "hot spots" (today, fragments, and level 1 keys) are on one pack, the bulk of the data is on the other.
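If per-pack access counts can be extracted from the system statistics, a simple tally along the lines sketched below (Python notation; the log format is assumed) would quantify the imbalance:

    from collections import Counter

    # Assumed access log of (pack_name, ...) entries from system statistics.
    def pack_distribution(access_log):
        counts = Counter(pack for pack, *_ in access_log)
        total = sum(counts.values())
        return {pack: n / total for pack, n in counts.items()}

    log = [("ADJS",), ("ADJS",), ("EVE3/4",), ("ODD3/4",), ("ADJS",)]
    print(pack_distribution(log))
    # {'ADJS': 0.6, 'EVE3/4': 0.2, 'ODD3/4': 0.2} -- a share this lopsided
    # would flag the kind of hot spot described above.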

Staff

The hardware technical support personnel have well-developed hardware trouble-shooting techniques through years of Sigma experience; however, they rely far too heavily on Dennis for assistance in analyzing available software clues, especially while solving intermittent failures.

When miscellaneous failures occur with diagnostics and exercisers in various test configurations, incorrect conclusions are often drawn. For example, there appears to be a CP-V multiprocessing scheduling problem which causes M:STIMER exerciser aborts. Some personnel knew to ignore these; others did not.

Additionally, when stalking intermittent failures, it becomes very important to keep a chronological log, by system, of the failures. It appears that Dennis is the only one making an effort to do this (but not as a permanent record). Both he and the maintenance personnel believe that all prior information becomes invalid when a hardware correction or adjustment is made. It is my experience that these intermittent problems will manifest themselves in many different ways, and only a concerted effort to gather and organize the associated information will lead to their timely resolution.
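A permanent chronological log need not be elaborate; the sketch below (Python notation; the field names are my suggestion of what "pertinent information" to capture) shows one possible form:

    import csv, datetime

    # Field names are a suggestion of what "pertinent" information to keep.
    def log_failure(path, system, symptom, diagnostic_run="",
                    hardware_changed="", notes=""):
        with open(path, "a", newline="") as f:
            csv.writer(f).writerow([
                datetime.datetime.now().isoformat(), system, symptom,
                diagnostic_run, hardware_changed, notes])

    # Example: record a W9 map parity error so later failures can still be
    # correlated with it even after boards have been swapped.
    log_failure("w9_failures.csv", "W9", "map parity error, mono-CPU",
                diagnostic_run="decimal diagnostic", hardware_changed="none")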

Examples of incomplete data collection or testing include: 1) a ZAP and boot after the FORTRAN shared library was corrupted (instead of an operator recover so that its location in memory could be determined), 2) the SNAP mode of the decimal diagnostic was not run after a failing configuration was discovered (board swapping is fine only for solid failures), and 3) the remaining map modules were not cycled through the decimal unit after a failure was located.