Occasional bus error during PCI transfers

Hedeon · 14 July 2017, 17:10

Hi all,

I'm currently working on a project where a PPC CPU on the PCI bus (Mediator) reads and writes the config registers of a graphics card on the same PCI bus. If I issue a lot of commands through the CPU to the config area of the graphics board, the graphics card locks up and after a while a bus error pops up.

I was wondering: could there be a conflict between the PPC CPU and the 68K CPU both trying to drive the PCI bus, resulting in a lock-up?

I know the Mediator has a bus master jumper. Could that help?

The code already checks whether the gfx card is idle before issuing the next commands, so it shouldn't be a command overflow. If I add delays between the commands, the system becomes stable (but slower...).

Can any experts point me in the right direction?

Daedalus · 14 July 2017, 23:31

The Mediator uses its own little DMA system with the graphics card memory, so some transfers are probably cached and queued before being transferred to the Amiga side. Perhaps you're running afoul of that cache between the two CPUs? Just speculation though; I've never coded for that sort of hardware combo...

Stedy · 15 July 2017, 00:25

Hi,

What values do the Base Address Registers (BARs) of the GPU report?
The low nibble indicates whether the region is I/O or memory space, whether it is 32- or 64-bit and, importantly, whether it is prefetchable (roughly, whether it can be cached). Try an eieio instruction or msync(0), depending on the CPU.

What PPC chip and northbridge are you using, or is it a custom chip?

The PCI arbiter should prevent the PPC and 68K from both driving the bus at the same time.
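For reference, a minimal C sketch of decoding the low bits of a BAR, following the layout in the PCI Local Bus Specification: bit 0 selects I/O vs memory space, bits 2:1 give the memory type, and bit 3 is the prefetchable flag (the nearest thing the spec has to a "cacheable" bit). The raw values in the checks are made up for illustration; the real ones live in the GPU's config space at offsets 0x10-0x24.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of the low bits of a PCI Base Address Register. */
typedef struct {
    bool     is_io;         /* bit 0: 1 = I/O space, 0 = memory space   */
    bool     is_64bit;      /* bits 2:1 == 10b: 64-bit memory BAR       */
    bool     prefetchable;  /* bit 3: reads have no side effects        */
    uint32_t base;          /* address with the flag bits masked off    */
} bar_info;

static bar_info decode_bar(uint32_t bar)
{
    bar_info info = {0};
    info.is_io = (bar & 0x1u) != 0;
    if (info.is_io) {
        info.base = bar & ~0x3u;          /* I/O BARs use bits 1:0 as flags    */
    } else {
        info.is_64bit     = ((bar >> 1) & 0x3u) == 0x2u;
        info.prefetchable = (bar & 0x8u) != 0;
        info.base         = bar & ~0xFu;  /* memory BARs use bits 3:0 as flags */
    }
    return info;
}
```

For example, a hypothetical frame buffer BAR reading 0x40000008 decodes as a 32-bit, prefetchable memory region at 0x40000000. Register-type regions should normally report non-prefetchable and be mapped cache-inhibited.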

When the error occurs, what status is returned by the various PCI devices?

Sorry for so many questions but I don't know enough about your system to go into specifics.

Hedeon · 15 July 2017, 05:48

PPC is MPC750. Northbridge is MPC107.

When the error occurs, the whole Amiga halts so badly that no status can be retrieved. I need to reset the machine, and even the reset takes a while to get through...

Not sure if eieio is fully supported on the MPC750, by the way. I think it is treated as a no-op in most cases.

For now, most of the errors are gone because, you guessed it, I made an error setting up the memory region containing the GPU's config registers. I set it up as cache-inhibited using a page table, which is correct. But I had also set up a BAT that was write-through and overlapped the page-table mapping, and the BAT is checked before the page table.
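To make the precedence problem concrete, here is a simplified C model of the lookup. The structures are hypothetical, not the real BAT/PTE encodings; only the WIMG bit values follow the PowerPC convention, and the address ranges in the checks are invented. BATs are searched first, so an overlapping write-through BAT silently overrides a cache-inhibited page-table entry.

```c
#include <stddef.h>
#include <stdint.h>

/* WIMG storage attributes, PowerPC bit values. */
enum {
    WIMG_G = 0x1,  /* guarded            */
    WIMG_M = 0x2,  /* memory coherence   */
    WIMG_I = 0x4,  /* caching inhibited  */
    WIMG_W = 0x8   /* write-through      */
};

/* Hypothetical simplified mapping: one address range, one WIMG setting. */
typedef struct {
    uint32_t base;
    uint32_t size;
    unsigned wimg;
} mapping;

static int in_range(const mapping *m, uint32_t ea)
{
    return ea - m->base < m->size;   /* unsigned wrap covers ea < base */
}

/* BATs are consulted first and a BAT hit wins, even if a page-table entry
 * also covers the address. Returns -1 if nothing maps the address (for a
 * data access, real hardware would take a DSI exception). */
static int resolve_wimg(const mapping *bats, size_t nbats,
                        const mapping *ptes, size_t nptes, uint32_t ea)
{
    for (size_t i = 0; i < nbats; i++)
        if (in_range(&bats[i], ea))
            return (int)bats[i].wimg;
    for (size_t i = 0; i < nptes; i++)
        if (in_range(&ptes[i], ea))
            return (int)ptes[i].wimg;
    return -1;
}
```

With a write-through BAT covering 0x48000000-0x48FFFFFF, resolve_wimg() returns WIMG_W for a register at 0x48060010, even though the page table marks that range cache-inhibited and guarded. Removing the BAT (or not letting it overlap) lets the page-table attributes take effect.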

You always find that kind of stuff out moments after you post a question...

So the idle check was actually reading a cached value... The idle check works correctly now, I have got rid of the delays, and the test programs seem to work. I can now draw triangles and texture them. I'm actually using Kas1e's Warp3D examples from the os4coding forum, compiled for WarpOS.

The more complex programs, like gearsppc, still don't work (bus error). I have to dig deeper for those; in this case it could also be an error in the driver itself.
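A sketch of a bounded idle poll in C, assuming a hypothetical status register where a set bit means "idle". The important part is outside the code: the pointer has to refer to a cache-inhibited, guarded mapping, because `volatile` only stops the compiler from reusing a stale value; it does nothing about the CPU data cache.

```c
#include <stdbool.h>
#include <stdint.h>

/* Poll a (hypothetical) engine status register until all bits in idle_mask
 * are set, or give up after max_polls reads. Returning false lets the
 * caller report a timeout instead of spinning forever on a dead card. */
static bool wait_for_idle(const volatile uint32_t *status,
                          uint32_t idle_mask, unsigned max_polls)
{
    for (unsigned i = 0; i < max_polls; i++) {
        if ((*status & idle_mask) == idle_mask)
            return true;
    }
    return false;
}
```

A bounded poll also makes the old symptom easier to spot: if the flag never changes, the function times out rather than the system hanging.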

Stedy · 16 July 2017, 00:08

Hi,

@Hedeon

Had not realised you were the guy working on the Sonnet 7200 software.

Have you got the Errata document for the TSI107?

One thing that caught me out on one design using the MPC8245 (PPC603 + TSI107 in one chip) was issues with DMAs and back-to-back transfers. Try clearing bit 9 (Fast Back-to-Back Enable) of the command register of the target device, in your case the GPU.

Also try setting bit 0 of the AMBOR register at offset 0xE0; this works around an issue with speculative reads of local memory.
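Both suggestions are read-modify-write operations on config space. A minimal C sketch, operating on a 256-byte image of a device's little-endian config space so it runs anywhere; on the real system, cfg_read16/cfg_write16 (hypothetical names) would be replaced by the platform's config-cycle accessors. It assumes "bit 0" of AMBOR means the least-significant bit of the byte at offset 0xE0.

```c
#include <stdint.h>

#define PCI_COMMAND          0x04       /* 16-bit command register        */
#define PCI_COMMAND_FAST_B2B (1u << 9)  /* Fast Back-to-Back Enable       */
#define MPC107_AMBOR         0xE0       /* AMBOR offset, per the post     */

/* PCI config space is little-endian regardless of the CPU's byte order. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned off, uint16_t val)
{
    cfg[off]     = (uint8_t)(val & 0xFF);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Clear only bit 9; I/O, memory and bus-master enables are preserved. */
static void disable_fast_b2b(uint8_t *gpu_cfg)
{
    uint16_t cmd = cfg_read16(gpu_cfg, PCI_COMMAND);
    cfg_write16(gpu_cfg, PCI_COMMAND, (uint16_t)(cmd & ~PCI_COMMAND_FAST_B2B));
}

/* Set bit 0 of AMBOR on the host bridge, leaving the other bits alone. */
static void set_ambor_bit0(uint8_t *bridge_cfg)
{
    bridge_cfg[MPC107_AMBOR] |= 0x01;
}
```

Starting from a command value of $0207, disable_fast_b2b() leaves $0007.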

The MPC8245, which has identical registers to the TSI107, has an errata document freely available here:
http://www.nxp.com/docs/en/errata/MPC8245CE.pdf

Good luck.

Hedeon · 01 July 2020, 11:47

In the end I worked around it. It was bad code haha.

But over the years... I have seen it pop up occasionally again. Is there also such a thing as a bus timeout? I read somewhere that PCI solutions for the Amiga give a bus timeout (in the form of a bus error) when addressing slow devices on the bus (e.g. the ROM of, say, a Voodoo or Radeon).

I think the readme of the new FireStorm firmware says something along those lines.

Hedeon · 27 January 2021, 17:43

@Stedy

I want to revisit this. In this case it is the Prometheus/Firestorm that is the culprit, in combination with all the different northbridges.

Are you available? :-)

grelbfarlk · 28 January 2021, 02:49

https://forum.amiga.org/index.php?topic=33092.15

http://www.e3b.de/prometheus/prometheus_V05.txt

Quote:

RETRY mechanism
===============
The new Fire Storm upgrade supports a simple RETRY mechanism for accesses from
Zorro III to PCI. Due to timing constraints on the Zorro III bus it is advisable
to access known slow devices only with PCI-PCI DMA disabled.
For the software side no changes are needed; in case a PCI device does issue a
RETRY situation, the Prometheus CPLDs will repeat the bus access immediately.
If a RETRY fails within the timeout of Zorro III, a Bus Error will occur on
Zorro III.

Cards known to produce RETRYs are:
- slow gfx cards when accessing onboard BIOS ROM
- PCI-PCI bridges, especially on CFG cycles on the PCI bus behind the bridge
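The behaviour described in the readme can be modelled in a few lines of C. This is only a sketch: the real retry logic lives in the Prometheus CPLDs, the Zorro III timeout is approximated as a fixed number of attempts, and slow_rom is a made-up target that answers RETRY a set number of times.

```c
/* Outcome of one PCI attempt as seen by the bridge. */
typedef enum { PCI_OK, PCI_RETRY } pci_result;

/* Final outcome of the Zorro III access. */
typedef enum { ACCESS_DONE, ACCESS_BUS_ERROR } access_outcome;

typedef pci_result (*pci_target_fn)(void *ctx);

/* On RETRY, re-issue the access immediately; if the target is still
 * retrying when the time budget (max_attempts) runs out, the Zorro III
 * side sees a bus error. */
static access_outcome zorro_to_pci_access(pci_target_fn target, void *ctx,
                                          unsigned max_attempts)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        if (target(ctx) == PCI_OK)
            return ACCESS_DONE;
    }
    return ACCESS_BUS_ERROR;
}

/* Made-up slow target, e.g. an onboard BIOS ROM that needs time. */
typedef struct { unsigned retries_left; } slow_rom;

static pci_result slow_rom_access(void *ctx)
{
    slow_rom *rom = ctx;
    if (rom->retries_left > 0) {
        rom->retries_left--;
        return PCI_RETRY;
    }
    return PCI_OK;
}
```

A target that retries a few times completes normally; one that keeps retrying past the budget produces exactly the bus error the readme warns about. No software change can avoid that other than not touching such devices while the timeout is tight.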

Stedy · 03 February 2021, 13:13

Quote:

Originally Posted by Hedeon

@Stedy

I want to revisit this. In this case it is the Prometheus/Firestorm that is the culprit, in combination with all the different northbridges.

Are you available? :-)

Hi,

I'll try my best. Do you have something equivalent to "lspci" on the Amiga?
It's useful because it pretty-prints the PCI config registers of every device, which makes it a good starting point.

Hedeon · 03 February 2021, 15:54

Not really. But are there fields you are especially interested in? Prmscan does not show everything, but maybe I can add them.

grelbfarlk · 03 February 2021, 22:18

OpenPCIInfo dumps a lot of the config space too.

Stedy · 03 February 2021, 22:54

Hi,

Interested in the PCI Command and Status registers, latency timer, cache line, interrupt line/pin, MIN grant, MAX LAT and the base address registers.
It's also useful to know what memory regions are cacheable, from either CPU.
When the PowerPC system hangs, can you still perform PCI transactions from the 68K processor/Zorro bus?

Hedeon · 04 February 2021, 01:57

Quote:

Originally Posted by Stedy

Hi,

Interested in the PCI Command and Status registers, latency timer, cache line, interrupt line/pin, MIN grant, MAX LAT and the base address registers.
It's also useful to know what memory regions are cacheable, from either CPU.
When the PowerPC system hangs, can you still perform PCI transactions from the 68K processor/Zorro bus?

Prometheus range = 0x40000000-0x60000000
The whole range is cache inhibited regarding the 68K.

PrmScan:

Code:

Prmscan 1.6 by Grzegorz Kraszewski.
PCI cards listing:
-------------------------------------------------
Board in slot 0, function 0
Vendor: Realtek Audio/Lan?Maker
Device: RTL8028 PCI Full-Duplex Ethernet Controller with PnP Function
Revision: 0.
Device class 02, subclass 00.
Address range: 5FE01100 - 5FE0111F (32 B).
Board driver: prm-rtl8029.device.
-------------------------------------------------
Board in slot 1, function 0
Vendor: ATI Technologies Inc. / Advanced Micro Devices, Inc.
Device: unknown ($5960)
Revision: 1.
Device class 03, subclass 00.
Address range: 40000000 - 47FFFFFF (128 MB).
Address range: 5FE01000 - 5FE010FF (256 B).
Address range: 48060000 - 4806FFFF (64 kB).
128 kB of ROM at 48040000 - 4805FFFF.
Board driver: NONE.
-------------------------------------------------
Board in slot 2, function 0
Vendor: Motorola
Device: unknown ($480B)
Revision: 2.
Device class 06, subclass 00.
Address range: 48070000 - 48070FFF (4 kB).
Address range: 48071000 - 48071FFF (4 kB).
Address range: 50000000 - 57FFFFFF (128 MB).
Address range: 48000000 - 4803FFFF (256 kB).
Board driver: NONE.
-------------------------------------------------

Regarding PPC cache:

The frame buffer (in this case 0x40000000-0x48000000) is write-through.
The VGA config (in this case 0x48060000-0x48070000) is cache-inhibited/guarded.
The PPC memory (in this case 0x50000000-0x58000000) is mostly copy-back.
The PPC configs (in this case 0x48070000-0x48072000) are cache-inhibited/guarded.
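Since an overlapping mapping caused the earlier cache problem, a map like this can be checked mechanically. A C sketch using the four regions listed above (end addresses taken as exclusive, PPC memory treated as copy-back); the is_registers flag and the check itself are illustrative assumptions, not part of any existing tool.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ATTR_WRITE_THROUGH, ATTR_COPY_BACK, ATTR_INHIBITED_GUARDED } cache_attr;

typedef struct {
    const char *name;
    uint32_t    start;         /* inclusive */
    uint32_t    end;           /* exclusive */
    cache_attr  attr;
    bool        is_registers;  /* hardware registers rather than memory */
} region;

/* The PPC-side map from the post above. */
static const region ppc_map[] = {
    { "frame buffer", 0x40000000u, 0x48000000u, ATTR_WRITE_THROUGH,     false },
    { "VGA config",   0x48060000u, 0x48070000u, ATTR_INHIBITED_GUARDED, true  },
    { "PPC configs",  0x48070000u, 0x48072000u, ATTR_INHIBITED_GUARDED, true  },
    { "PPC memory",   0x50000000u, 0x58000000u, ATTR_COPY_BACK,         false },
};

/* True if no two regions overlap and every register region is
 * cache-inhibited and guarded. */
static bool map_is_sane(const region *map, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (map[i].is_registers && map[i].attr != ATTR_INHIBITED_GUARDED)
            return false;
        for (size_t j = i + 1; j < n; j++)
            if (map[i].start < map[j].end && map[j].start < map[i].end)
                return false;
    }
    return true;
}
```

The listed map passes: the VGA config and PPC config ranges touch at 0x48070000 but do not overlap.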

The rest is less important, I think.

I can still access the PPC card memory and config ranges from the 68K debugger when the PPC hang happens. Access to the frame buffer or VGA config registers by the 68K debugger results in a bus error.

I'll look up the other values for the cards soon. The only extra thing OpenPCIscan shows is some status/command information, but not the rest.

I do expect a timeout of some kind (see grelbfarlk's reference).

Stedy · 05 February 2021, 12:57

Hi,

From what you describe, the PowerPC processor has had a critical fault; on some PPC devices this is a machine check exception. I have seen this in my day job: the CPU core would hang, but another processor could still access RAM on the 'dead' card. Everything was restored on a reboot.

What processor and North bridge are you using?

Hedeon · 08 March 2021, 20:52

The processor is an MPC7410. The northbridge is the 1057:480B (PCI vendor:device ID).

The values are little-endian. Originally, the Latency Timer was $00 by default; $80 gives fewer errors. This is with the Prometheus.

VendorID, DeviceID,
Command, Status
RevID, ProgIF, Subclass, Classcode
CacheLineSize, Latency Timer, Header Type, BIST
Interrupt Line, Interrupt Pin, Min Grant, Max Latency

gfx card:
$0210, $6059
$0702, $9002
$01, $00, $00, $03
$00, $80, $80, $00
$FF, $01, $08, $00

G4 card:
$5710, $0B48
$0700, $A0A2
$02, $00, $00, $06
$00, $80, $00, $00
$00, $01, $00, $00
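Since the dump is little-endian, each 16-bit field needs its bytes swapped before decoding (e.g. $0210 becomes $1002, ATI's vendor ID). A small C sketch that does the swap and checks the status and command bits defined in the PCI Local Bus Specification:

```c
#include <stdint.h>
#include <stdio.h>

/* Each 16-bit field in the dump is stored little-endian: swap its bytes. */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v >> 8) | (v << 8));
}

/* PCI status register error bits. */
#define ST_MASTER_DATA_PARITY  (1u << 8)
#define ST_SIGNALED_TGT_ABORT  (1u << 11)
#define ST_RECEIVED_TGT_ABORT  (1u << 12)
#define ST_RECEIVED_MST_ABORT  (1u << 13)
#define ST_SIGNALED_SYS_ERROR  (1u << 14)
#define ST_DETECTED_PARITY     (1u << 15)
#define ST_ERROR_BITS (ST_MASTER_DATA_PARITY | ST_SIGNALED_TGT_ABORT | \
                       ST_RECEIVED_TGT_ABORT | ST_RECEIVED_MST_ABORT | \
                       ST_SIGNALED_SYS_ERROR | ST_DETECTED_PARITY)

#define CMD_FAST_B2B_ENABLE    (1u << 9)

/* Error bits set in a raw (little-endian) status value from the dump. */
static uint16_t status_errors(uint16_t raw_status)
{
    return (uint16_t)(swap16(raw_status) & ST_ERROR_BITS);
}

/* Is Fast Back-to-Back Enable set in a raw (little-endian) command value? */
static int fast_b2b_enabled(uint16_t raw_command)
{
    return (swap16(raw_command) & CMD_FAST_B2B_ENABLE) != 0;
}

static void print_status(const char *dev, uint16_t raw_status)
{
    static const struct { unsigned mask; const char *name; } names[] = {
        { ST_DETECTED_PARITY,    "Detected Parity Error" },
        { ST_SIGNALED_SYS_ERROR, "Signaled System Error" },
        { ST_RECEIVED_MST_ABORT, "Received Master Abort" },
        { ST_RECEIVED_TGT_ABORT, "Received Target Abort" },
        { ST_SIGNALED_TGT_ABORT, "Signaled Target Abort" },
        { ST_MASTER_DATA_PARITY, "Master Data Parity Error" },
    };
    uint16_t err = status_errors(raw_status);
    printf("%s: status $%04X\n", dev, (unsigned)swap16(raw_status));
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (err & names[i].mask)
            printf("  %s\n", names[i].name);
}
```

Run on the values above (print_status("gfx card", 0x9002) and print_status("G4 card", 0xA0A2)), the gfx card's status ($0290 after swapping) has no error bits set, while the G4 bridge's status ($A2A0) shows Detected Parity Error and Received Master Abort. The gfx card's command register ($0207) has Fast Back-to-Back Enable set; the G4 card's ($0007) does not.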

What I have found so far is that the gfx card has crashed, effectively removing it from PCI space. If the 68K then tries to access it, it gets a bus error; if the PPC tries to access it, it just halts.

The gfx card crashes because its command FIFO has overflowed. That happens when the command processor stops processing commands, which often happens after it receives an invalid command packet.

So why is it getting wrong data? With the Mediator this does not happen, and it is the same code. Maybe something gets corrupted when the PPC is pushed off the bus as bus master in the middle of a transfer.
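One way to make an overflow visible instead of fatal is to track free FIFO entries and only queue whole packets. A C sketch with a simulated FIFO; the depth and structure are hypothetical, not the Radeon's actual interface. It only turns a stalled command processor into a detectable failure and does not fix the invalid packet that stalled it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 64   /* hypothetical FIFO depth in dwords */

typedef struct {
    uint32_t entries[FIFO_CAPACITY];
    size_t   used;          /* dwords queued but not yet consumed */
} cmd_fifo;

static size_t fifo_free(const cmd_fifo *f)
{
    return FIFO_CAPACITY - f->used;
}

/* Queue a whole packet only if it fits; never write a partial packet.
 * Returning false lets the caller wait, retry or report a hang. */
static bool submit_packet(cmd_fifo *f, const uint32_t *packet, size_t ndwords)
{
    if (ndwords == 0 || ndwords > fifo_free(f))
        return false;
    for (size_t i = 0; i < ndwords; i++)
        f->entries[f->used++] = packet[i];
    return true;
}

/* Stand-in for the command processor consuming n dwords. A stalled
 * processor never calls this, so submit_packet() starts failing. */
static void fifo_consume(cmd_fifo *f, size_t n)
{
    if (n > f->used)
        n = f->used;
    for (size_t i = n; i < f->used; i++)   /* shift remaining entries down */
        f->entries[i - n] = f->entries[i];
    f->used -= n;
}
```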

Stedy · 16 March 2021, 01:05

Hi,

Assuming I've byte-reversed correctly, the status registers indicate that the graphics card had a parity error on a transfer and that the processor detected it and set a master abort.

Looking at the command register, you have fast back-to-back transfers enabled on the graphics card, but the CPU cannot support this. It would be worth disabling this on the graphics card.

Do you have any exception handlers for the PCI bus or do you use the Machine check as a catch all handler?

I have seen PCI errors cause a machine check on e300/PPC603 cores in the past. I can go into more detail; I guess I should look at the SonnetPCI libraries first?

grelbfarlk · 16 March 2021, 03:56

Relevant?

Quote:

Originally Posted by Timtheloon

Hi all

You lot probably know about scanPCI 0.9

Why don't we use this more often? It gives lots of info, most of which goes way over my head.

Which leads to my next question.

I notice ScanPCI states "Detected Parity Error". What does this mean? It shows it for all the PCI cards except the sound card.

Hedeon · 06 May 2021, 03:16

Most of the errors were due to bugs in the driver (who knew!). At least on the MPC107 and Harrier most things are now working. For the remaining errors I am looking hard at whether they are software or hardware related. The M1/K1 bridge is much more troublesome, however, and cannot run for more than a few seconds before a bus error.

I am not sure whether the PPC takes an exception, as the 68K crashing takes the whole system down with it. The 68K tries to read from PCI during vblank and then gets a bus error (the 2D VGA driver runs on the 68K).

Looking into the fast back-to-back stuff.

Quote:

Originally Posted by Stedy

Hi,

Assuming I've byte-reversed correctly, the status registers indicate that the graphics card had a parity error on a transfer and that the processor detected it and set a master abort.

Looking at the command register, you have fast back-to-back transfers enabled on the graphics card, but the CPU cannot support this. It would be worth disabling this on the graphics card.

Do you have any exception handlers for the PCI bus or do you use the Machine check as a catch all handler?

I have seen PCI errors cause a machine check on e300/PPC603 cores in the past. I can go into more detail; I guess I should look at the SonnetPCI libraries first?
