Checking a defective memory module with QBone

This post describes my effort to fix a broken M8044 PDP11 RAM module. To track down the fault, I used QBone.

I own a defective M8044 RAM board. It came as part of my OBA-11 PDP11. DEC had evidently checked it and found it to be broken, because it carries a red "defective" badge.

Setting up the test bed

I built a test bed consisting of:

  • QBone
  • An unused backplane H9278-A with 8 slots
  • The device under test, the M8044
  • A 5 V / 4 A power supply from my parts bin for the +5V rail
  • A lab power supply for the +12V rail (used by the M8044)

First test with QBone: Check if the board responds to bus requests

QBone bundles its tests and many other features in a single executable. It is currently called demo, a name that hints at the early state of the software.

This software can check whether physical memory is present. First I removed the RAM module and checked what QBone detects:

root@qbone:~# ./demo.sh -aw 18
iarg1=8, iarg2=15
[16:12:03.074794 Inf    APP] Printing verbose output.
demo  - QUniBone QBUS test application.
    Version DBG v1.5.0, compile Jul 13 2022 14:58:03.
...

>>>dc
[16:12:34.140390 Inf    APP] Connecting to PRU.
[16:12:34.143141 Inf DDRMEM] Shared DDR memory: 4194304 bytes available, 4194304 bytes needed.
[16:12:34.144631 Inf DDRMEM]   Virtual (ARM Linux-side) address: 0xb53f9000
[16:12:34.145640 Inf DDRMEM]   Physical (PRU-side) address:9d100000
...

>>>m i 10000
Disable memory emulation, size physical memory ...
Now emulating QBone memory in range 000000..010000 with DDR memory.

The last line shows that QBone has not found any physical memory at addresses 0..10000 and is now emulating that memory with RAM from the BBB. This is what I expected.

Next, I inserted my RAM module and ran the same command again.

When asked to emulate RAM at 0..10000, QBone now reports that physical RAM is present and nothing needs to be emulated:

DC>>>m i 10000

Disable memory emulation, size physical memory ...
Found physical memory in range 0..010000, no emulation necessary!

I then increased the end address in the command m i <endaddress>. QBone finds physical memory up to 177777 octal. The last command goes beyond the physical memory, so QBone emulates the part that the RAM module does not cover.

DC>>>m i 270000

Disable memory emulation, size physical memory ...
Now emulating QBone memory in range 200000..270000 with DDR memory.
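
The octal addresses are easier to judge once converted into sizes. The following is a minimal sketch in C. The helper names are hypothetical and not part of the QBone software, and it assumes the end address of a range is exclusive. It shows that physical memory in 000000..177777 corresponds to 65536 bytes (32768 words of 16 bit), and that the emulated range 200000..270000 adds another 28672 bytes.

```c
#include <stdint.h>

/* Bytes covered by the byte-address range [start, end).
   Assumption: the end address is exclusive. */
uint32_t range_bytes(uint32_t start, uint32_t end) {
    return end - start;
}

/* PDP-11 words are 16 bit wide, so each word occupies two byte addresses. */
uint32_t range_words(uint32_t start, uint32_t end) {
    return (end - start) / 2;
}
```

In C, a leading 0 marks an octal literal, so range_bytes(0, 0200000) takes the address exactly as it appears in the log.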

Result so far

To summarize so far: the RAM module responds to read accesses via QBus across its expected address range. This means the module is not completely dead. The defect must be more subtle, and may show up in read/write tests.

Second test with QBone: Check if there are Write/Read errors

Write/read errors confined to a specific address range would point to broken RAM chips.

My initial idea was to write a write/read test that fills each word of the module with several bit patterns and reads them back. The test code would look like this:

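#include <stdio.h>

// Placeholders for the bus access routines, to be provided by the test environment
void write_to_memory(int address, int value);
int read_from_memory(int address);

void readWriteTest(int address, int value) {
    write_to_memory(address, value);
    int read = read_from_memory(address);
    if (read != value) {
        printf("Write/Read failure on address %o (wrote 0x%x, read 0x%x)\n",
            address, value, read);
    }
}

// Addresses are byte addresses, so step by 2 to visit each 16-bit word once
void testAllWords(int maxadr) {
    for (int i = 0; i < maxadr; i += 2) {
        readWriteTest(i, 0x5555); // 0101.0101.0101.0101
        readWriteTest(i, 0xaaaa); // 1010.1010.1010.1010
    }
}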

Because the first test showed that the RAM module basically works, the test could also run on my working PDP11 as a simple PDP11 assembler program.

But to get better acquainted with QBone, I wanted to write the test program on QBone itself, which means in C++ on the BeagleBone board.

In the end, I found that the demo tool already contains several memory tests. There is a linear test (writing a value and then reading it back again and again) and a random test (writing words with random values to all addresses and reading them back).
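
To illustrate the idea behind a random write/read test, here is a minimal sketch in C that runs against a simulated memory instead of the real bus. Everything in it is an assumption made for illustration: the names sim_memory, sim_write, sim_read and random_test, the small xorshift generator, and the fault model (a mask XORed into every word read back). It is not how the demo tool implements its tests.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_WORDS 1024u  /* size of the simulated memory in 16-bit words */

/* Simulated memory. read_fault_mask is XORed into every word read back,
   so a mask of 0 models a healthy board (hypothetical fault model). */
struct sim_memory {
    uint16_t cells[SIM_WORDS];
    uint16_t read_fault_mask;
};

static void sim_write(struct sim_memory *m, uint32_t addr, uint16_t value) {
    m->cells[(addr / 2) % SIM_WORDS] = value;  /* byte address -> word index */
}

static uint16_t sim_read(const struct sim_memory *m, uint32_t addr) {
    return (uint16_t)(m->cells[(addr / 2) % SIM_WORDS] ^ m->read_fault_mask);
}

/* Small 16-bit xorshift generator. Re-seeding it replays the same sequence,
   so the read pass can regenerate the values of the write pass. */
static uint16_t next_random(uint16_t *state) {
    uint16_t x = *state;
    x ^= (uint16_t)(x << 7);
    x ^= (uint16_t)(x >> 9);
    x ^= (uint16_t)(x << 8);
    *state = x;
    return x;
}

/* Write pass over [start, end), then read pass; returns the mismatch count. */
size_t random_test(struct sim_memory *m, uint32_t start, uint32_t end,
                   uint16_t seed) {
    uint16_t state = seed;
    for (uint32_t addr = start; addr < end; addr += 2) {
        sim_write(m, addr, next_random(&state));
    }

    state = seed;  /* replay the same pseudo-random sequence */
    size_t mismatches = 0;
    for (uint32_t addr = start; addr < end; addr += 2) {
        if (sim_read(m, addr) != next_random(&state)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

With read_fault_mask set to 0, a pass over the whole simulated memory reports no mismatches. With a single bit set in the mask, every word mismatches, because the fault model flips that bit on every read.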

Next I ran these tests on my RAM module. They reported read/write errors.

TM>>>tr 0 200000

Testing 000000..177776 randomly (stop with ^C) ...
WR
Memory mismatch #1 at 000000: expected 175544, found 177777, diff mask = 002233.  
 Test value 175544 was never written in this pass.
Memory mismatch #2 at 000002: expected 047332, found 177777, diff mask = 130445.  
 Test value 047332 was never written in this pass.
Memory mismatch #3 at 000004: expected 061676, found 177777, diff mask = 116101.  
 Test value 061676 was never written in this pass.
Memory mismatch #4 at 000006: expected 125774, found 177777, diff mask = 052003.  
 Test value 125774 was never written in this pass.
Memory mismatch #5 at 000010: expected 010605, found 177777, diff mask = 167172.  
 Test value 010605 was never written in this pass.
Memory mismatch #6 at 000012: expected 145252, found 177777, diff mask = 032525.  
 Test value 145252 was never written in this pass.
Memory mismatch #7 at 000014: expected 033167, found 177777, diff mask = 144610.  
 Test value 033167 was never written in this pass.
Memory mismatch #8 at 000016: expected 062132, found 177777, diff mask = 115645.  
 Test value 062132 was never written in this pass.
Stopped by error: no timeout, 32768 mismatches
Current EXAM/DEPOSIT address is 000000

TM>>>ta 0 200000

Testing 000000..177776 linear with "address" data pattern (stop with ^C) ...
WRRRRRRRRR 10 RRRRRRRRRR 20 RRRRRRRRRR 30 RRRRRRRRRR 40 RRRRRRRRRR 50 RRRRRRRRR
R 60 RRRRRRRRRR 70 RRRRRRRRRR 80 RRRRRRRRR
Memory mismatch #1 at 147216: expected 063507, found 023507, diff mask = 040000.  
  Found mem value 023507 was written to addresses: 047216
Memory mismatch #2 at 147220: expected 063510, found 023510, diff mask = 040000.  
  Found mem value 023510 was written to addresses: 047220
Memory mismatch #3 at 147222: expected 063511, found 023511, diff mask = 040000.  
  Found mem value 023511 was written to addresses: 047222
Memory mismatch #4 at 147224: expected 063512, found 023512, diff mask = 040000.  
  Found mem value 023512 was written to addresses: 047224
Memory mismatch #5 at 147226: expected 063513, found 023513, diff mask = 040000.  
  Found mem value 023513 was written to addresses: 047226
Memory mismatch #6 at 147230: expected 063514, found 023514, diff mask = 040000.  
  Found mem value 023514 was written to addresses: 047230
Memory mismatch #7 at 147232: expected 063515, found 023515, diff mask = 040000.  
  Found mem value 023515 was written to addresses: 047232
Memory mismatch #8 at 147234: expected 063516, found 023516, diff mask = 040000.  
  Found mem value 023516 was written to addresses: 047234
Stopped by error: no timeout, 6329 mismatches
Current EXAM/DEPOSIT address is 000000
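
The numbers in these logs can be checked by hand. The diff mask is consistent with a bitwise XOR of the expected and the found value, and in the linear test the expected value matches the byte address divided by two. Both are inferences from the log output, not taken from the QBone sources, and the helper names below are hypothetical. A minimal sketch in C:

```c
#include <stdint.h>

/* Bits that differ between the expected and the read-back word. */
uint16_t diff_mask(uint16_t expected, uint16_t found) {
    return (uint16_t)(expected ^ found);
}

/* Number of differing bits in a diff mask. */
int differing_bits(uint16_t mask) {
    int count = 0;
    while (mask) {
        count += mask & 1;
        mask >>= 1;
    }
    return count;
}

/* Index of the highest set bit, or -1 for an empty mask. */
int highest_bit(uint16_t mask) {
    int bit = -1;
    while (mask) {
        bit++;
        mask >>= 1;
    }
    return bit;
}

/* "address" data pattern as inferred from the log: byte address / 2. */
uint16_t address_pattern(uint32_t addr) {
    return (uint16_t)(addr >> 1);
}
```

For the linear test, diff_mask(0063507, 0023507) gives 040000, a single bit (bit 14) that reads back as 0 instead of 1. In the random test, every read returned 177777, so the diff mask there is simply the complement of the expected value.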

After a few initial runs, the defective memory board no longer produced any errors.

>>>tm
[18:15:15.542085 Inf    APP] Connecting to PRU.
[18:15:15.544686 Inf DDRMEM] Shared DDR memory: 4194304 bytes available, 4194304 bytes needed.
...

TM>>>tr 0 200000

Testing 000000..177776 randomly (stop with ^C) ...
WRWRWRWRWRWRWRWRWR 10 WRWRWRWRWRWRWRWRWRWR 20 WRWRWRWRWRWRWRWRWRWR 30 WRWRWRWRW
RWRWRWRWRWR 40 WRWRWRWRWRWRWRWRWRWR 50 WRWRWRWRWRWRWRWRWRWR 60 WRWRWRWRWRWRWRWR
WRWR 70 WRWRWRWRWRWRWRWRWRWR 80 WRWRWRWRWRWRWRWRWRWR 90 WRWRWRWRWRWRWRWRWRWR
 100 WRWRWRWRWRWRWRWRWRWR 110 WRWRWRWRWRWRWRWRWRWR 120 WRWRWRWRWRWRWRWRWRWR
 130 WRWRWRWRWRWRWRWRWRWR 140 WRWRWRWRWRWRWRWRWRWR 150 WRWRWRWRWRWRWRWRWRWR
 160 WRWRWRWRWRWRWRWRWRWR 170 WRWRWRWRWRWRWRWRWRWR 180 WRWRWRWRWRWRWRWRWRWR
 190 WRWRWRWRWRWRWRWRWRWR 200 WRWRWRWRWRWRWRWRWRWR 210 WRWRWRWRWRWRWRWRWRWR
 220 WRWRWRWRWRWRWRWRWRWR 230 WRWRWRWRWRWRWRWRWRWR 240 WRWRWRWRWRWRWRWRWRWR
 250 WRWRWRWRWRWRWRWRWRWR 260 WRWRWRWRWRWRWRWRWRWR 270 WRWRWRWRWRWRWRWRWRWR
 280 WRWRWRWRWRWRWRWRWRWR 290 WRWRWRWRWRWRWRWRWRWR 300 WRWRWRWRWRWRWRWRWRWR
 310 WRWRWRWRWRWRWRWRWRWR 320 WRWRWRWRWRWRWRWRWRWR 330 WRWRWRWRWRWRWRWRWRWR
...

Result so far

After running the test for about 15 hours in 4 sessions, I saw no errors. To summarize, the board seems to have problems directly after power-on, but I only saw them at the first two or three power-on events. Since then, there have been no issues.

What could be responsible for that kind of error?

There are only two things on the board: the RAM chips and the charge pump. Because the write/read errors always showed up at different locations, I guess the RAM chips are OK. That puts the charge pump in focus.
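
The reasoning rests on comparing where the errors occur from run to run. A broken chip should fail at the same addresses every time, while a supply problem can hit anywhere. A minimal sketch of such a comparison in C, with a hypothetical function name and assuming both address lists are sorted in ascending order:

```c
#include <stddef.h>
#include <stdint.h>

/* Count the addresses that appear in both sorted lists of failing addresses.
   A high count suggests a location-bound fault such as a broken chip;
   a count near zero suggests the fault is not tied to specific cells. */
size_t common_addresses(const uint32_t *run_a, size_t count_a,
                        const uint32_t *run_b, size_t count_b) {
    size_t i = 0, j = 0, common = 0;
    while (i < count_a && j < count_b) {
        if (run_a[i] < run_b[j]) {
            i++;
        } else if (run_a[i] > run_b[j]) {
            j++;
        } else {
            common++;
            i++;
            j++;
        }
    }
    return common;
}
```

Fed with the first eight failing addresses printed by the random test (000000..000016) and by the linear test (147216..147234), the function returns 0. This is only an illustration, since the two logs come from different test modes and show only the first eight mismatches each.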

The next image shows the charge pump schematic. A 555 chip generates an AC signal that goes through a voltage doubler (C5, D1, D2, C6). Q1 is a 79M05 voltage regulator that derives a stable -5V from the unregulated output of the voltage doubler.

Here is some more of the schematic, covering the power supply (+5V, +12V):

I include it because it also contains polarized capacitors that may cause trouble.

The fault could come from many of the parts in the schematic, but polarized capacitors are a frequent source of trouble. They age, which leads to all sorts of problems. Very often they simply lose capacitance.