Gentoo Forums
Method to test crashing system?

Tony0945
Advocate

Joined: 25 Jul 2006
Posts: 4146
Location: Illinois, USA

Posted: Mon Sep 21, 2020 8:13 pm

RayDude wrote:

Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...

I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...

*crosses fingers*

I used an Athlon X4 950 for a year to bypass those early Ryzen problems. I see them cheap on the internet. They are Bulldozer based so you would have to reinstall gcc, binutils, libtool and glibc from a stage3 and rebuild the system. A quick search shows even dual core Ryzen 3's going for ridiculous prices. I do see used 2700X's from $100 to $200 and new ones at $200. I have a 2700X and I like it a lot.
What mobo? Also, these boards are notoriously fussy about memory. Good luck with the warranty but my guess is that first generation Ryzen problems are not confined to a narrow range of production dates.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 10:57 pm

Tony0945 wrote:
RayDude wrote:

Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...

I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...

*crosses fingers*

I used an Athlon X4 950 for a year to bypass those early Ryzen problems. I see them cheap on the internet. They are Bulldozer based so you would have to reinstall gcc, binutils, libtool and glibc from a stage3 and rebuild the system. A quick search shows even dual core Ryzen 3's going for ridiculous prices. I do see used 2700X's from $100 to $200 and new ones at $200. I have a 2700X and I like it a lot.
What mobo? Also, these boards are notoriously fussy about memory. Good luck with the warranty but my guess is that first generation Ryzen problems are not confined to a narrow range of production dates.



I have a Gigabyte AB350M-D3H. I got really lucky during that first year. Gigabyte released a BIOS that overvolted the parts and actually damaged people's CPUs. I'm so glad I missed that update.

The reason I don't think it's bad DRAM is simply that it fails at 3200 MHz DDR4 in exactly the same way it fails at 2133 MHz. If it were flaky RAM, you would expect the failure to happen less often at slower speeds.

I've been watching the prices on the 2700 and 2700X, but I'd rather not dump more than $100.00 on an older processor, when I could apply it to a newer one.

Thanks again.
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 11:03 pm

NeddySeagoon wrote:
RayDude,

You will lose your mind :)

At least you have a testbed, that makes life easier.

Here's a nasty thought, or a series of related nasty thoughts ...
The low voltages (below 3.3v) required to operate the RAM and CPU are derived from the Auxiliary 12v connector to the motherboard.
It's 4, 6 or 8 pins, whatever, often not enough pins. As a result, it gets hot and goes high resistance. It's always the sockets on the cable that suffer, as the pins on the motherboard are soldered to 2oz copper power planes.
All is well until there is a big gulp of power and one or more low voltage supplies goes out of tolerance.
A lot of the 12v is dropped across the connector ...

A worked example may help. To keep the arithmetic simple, let's say that the CPU needs 120W flat out.
That's 10A through that connector. That's OK when it's nice and shiny.

Now suppose that the contact resistance increases from very little to 0.1 Ohm. That's still not a lot but it costs 1v loss at the connector.
10A and 1v is 10W ... OK, so the connector fails well before the contact resistance gets to 0.1 Ohm. :)
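
The arithmetic in that worked example can be checked in a few lines of Python. This is only a rough sketch: the 12 V rail, the 120 W load and the contact resistances are the illustrative figures from the post, the function name is made up here, and the load current is treated as fixed.

```python
def connector_loss(load_watts, rail_volts, contact_ohms):
    """Return (current in A, volts dropped, watts dissipated) at the connector."""
    current = load_watts / rail_volts        # I = P / V
    drop = current * contact_ohms            # V = I * R
    heat = current ** 2 * contact_ohms       # P = I^2 * R
    return current, drop, heat


# Illustrative contact resistances, from "nice and shiny" to badly oxidised.
for ohms in (0.001, 0.01, 0.1):
    amps, volts, watts = connector_loss(120.0, 12.0, ohms)
    print(f"{ohms:5.3f} ohm: {amps:.0f} A, {volts:.2f} V dropped, "
          f"{watts:.1f} W heating the connector")
```

With 0.1 Ohm of contact resistance the 10 A load drops about 1 V and puts roughly 10 W of heat into the plug (P = V x I), which is far more than a small connector can get rid of.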

Long story short. I've had a few instances like this, it's worth 'wiping' the contacts by unplugging and replugging the connector two or three times.
Make sure the gold is still there while it's apart.
'Wiping' the pins will fix it for 6 to 9 months.

Further down in the converter, the input and output capacitors get a very hard life. They used to be aluminium electrolytics and failures were easy to spot.
The tops would dome, the rubber bungs would push out of the bottom and allow the electrolyte to leak out.
I don't know what you have on your motherboard, but a failure of one or more of these has the same effect.

Here's a test. The idea is to reduce the load on the 12v Aux.
In the BIOS, turn off as many real CPU cores as you dare, then run memtest86.
If this works, it points the finger at the motherboard dynamic voltage regulation, but the main PSU providing the 12v is not in the clear yet either.



Cool! And thanks so much!

I'm an EE and I totally get what you are saying. I had thought about power supply issues, but I hadn't thought about the 12V connector oxidizing. I'll check it out.

I have already tested memory with only two cores active / no hyperthreading. It hangs just as fast as it did with all 6 cores (12 threads) active.

I've already informed my son that I'm borrowing his ram for a quick test. He's not happy, but hopefully I'll do the test today.

I predict his memory will fail just as fast as mine did and show that it's not memory.

The only way to prove it's the motherboard is to buy a new CPU... I hope to hold that off.
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 11:38 pm

Update: My son's DDR4 fails memtest86+ in exactly the same way as my DDR4.

I left the memory swapped to see if it changes system stability.

I'm still disabling core1 for the heck of it.

I'll do emerge -DNuq @world later this week to see if it crashes again.

I'm sure it will.
_________________
Some day there will only be free software.
Tony0945
Advocate

Joined: 25 Jul 2006
Posts: 4146
Location: Illinois, USA

Posted: Tue Sep 22, 2020 2:32 am

I've heard Gigabyte mobo's are not the best for Ryzen. I have an MSI Tomahawk Arctic and everyone praises ASUS.
So. Motherboard (especially with BIOS update), CPU or memory. If the memory works in another machine, it should be good. Another Zen machine that is. I have read that memory that works in Intel machines may not work in Ryzen machines. I bought my memory direct from Crucial, guaranteed by them to work in my particular motherboard. The machine I'm writing this on has a Gigabyte motherboard and a k10, Phenom II CPU. Different generation.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 22, 2020 3:22 am

Tony0945 wrote:
I've heard Gigabyte mobo's are not the best for Ryzen. I have an MSI Tomahawk Arctic and everyone praises ASUS.
So. Motherboard (especially with BIOS update), CPU or memory. If the memory works in another machine, it should be good. Another Zen machine that is. I have read that memory that works in Intel machines may not work in Ryzen machines. I bought my memory direct from Crucial, guaranteed by them to work in my particular motherboard. The machine I'm writing this on has a Gigabyte motherboard and a k10, Phenom II CPU. Different generation.


Thanks.

I won't buy gigabyte again.

We got an MSI for my son. It's been great. He's running 3600 MHz memory, no problems.

I'll have to do research again for my new board. I'm still trying to understand how the B550 boards are more expensive than the X570s...
_________________
Some day there will only be free software.
Tony0945
Advocate

Joined: 25 Jul 2006
Posts: 4146
Location: Illinois, USA

Posted: Tue Sep 22, 2020 2:31 pm

RayDude wrote:
I'll have to do research again for my new board. I'm still trying to understand how the B550 boards are more expensive than the X570s...

I bought an Asus (mainly on availability) and a 3900X for a new build but haven't assembled it yet. It's an X570. NeddySeagoon convinced me that I might need the extra lanes someday. It's a new expensive build. I'm hoping it will last a dozen years like this one. And this one may live on as a Gentoo router yet.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46364
Location: 56N 3W

Posted: Tue Sep 22, 2020 7:28 pm

RayDude,

Can you try your CPU in your son's system?
Or even his CPU in your system?
Don't even think of a temporary swap without doing the CPU repaste job properly.

If turning off cores hasn't helped, it's unlikely to be the PSU.

It may be the DRAM controller, which is a corner of the CPU these days.

Your son must still be at an age where you can tell him these things.
You will soon need to negotiate :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Tony0945
Advocate

Joined: 25 Jul 2006
Posts: 4146
Location: Illinois, USA

Posted: Tue Sep 22, 2020 10:04 pm

NeddySeagoon wrote:
You will soon need to negotiate :)

OH YES!!!
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Thu Sep 24, 2020 5:05 pm

I tested my son's memory. In fact, I'm running his memory at the moment. Mine is labeled CAS17 but I think it runs CAS16; his is labeled CAS18 but runs CAS17 according to the XMP profile.

His computer is running with my memory, no issues (although it's Windows so, yeah, not so tough).

I can't see running my CPU in his PC as being an option, that's too much of a tear down of both machines. I don't want to make things worse. His is working, I'm going to leave it that way.

From what I can determine, X570 motherboards don't support Zen 1. They all specifically mention 2XXX and 3XXX, but not 1XXX Ryzen processors. That sucks.

That means that if I buy a new X570 MOBO to replace the Gigabyte, then I have to get a new processor to boot. That's what I'm trying to avoid.

So that leaves B450 as the only option and that's not good because it won't support 4XXX or 5XXX new features, although they will -- theoretically -- work.

I have a case opened with AMD, once I hear from them I'll have to figure out what to do.
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Thu Sep 24, 2020 5:10 pm

Oh. Have you guys done emerge -j n (where n > 3) while MAKEOPTS has "-j m", where m is the number of threads?

I did that on my server and the build crashed. I seem to remember it going downhill quickly after that. I wonder if that isn't what toasted it... I know it's impossible for software to kill hardware (except in the case of the 6502 'halt and catch fire' instruction), but still it makes me wonder.
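
To put a number on that combination: emerge -j n lets Portage build up to n packages at once, and each of those builds runs make with -j m, so the worst case is about n x m compile jobs in flight. The sketch below just does that multiplication; the function name is made up for illustration and the thread count is read from whatever machine runs the script.

```python
import os


def worst_case_jobs(emerge_jobs, make_jobs):
    """Upper bound on simultaneous compile jobs: packages in flight times make -j."""
    return emerge_jobs * make_jobs


threads = os.cpu_count() or 1   # logical CPUs on the machine running this
for emerge_jobs in (1, 4, 6):
    total = worst_case_jobs(emerge_jobs, threads)
    print(f"emerge --jobs={emerge_jobs} with -j{threads}: up to {total} jobs, "
          f"{total / threads:.0f}x the thread count")
```

That sort of oversubscription loads the CPU, RAM and power delivery heavily, but on healthy hardware it should only make the box slow, not crash it.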
_________________
Some day there will only be free software.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46364
Location: 56N 3W

Posted: Thu Sep 24, 2020 6:49 pm

RayDude,

I run MAKEOPTS="-j100" emerge --jobs=6 ... on a 96 core arm64 system.
It does get a bit sluggish when firefox, thunderbird, libreoffice and chromium decide to build concurrently but it does not crash or lock up.
It's just unlucky when that happens. :)
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Fri Sep 25, 2020 11:12 pm

NeddySeagoon wrote:
RayDude,

I run MAKEOPTS="-j100" emerge --jobs=6 ... on a 96 core arm64 system.
It does get a bit sluggish when firefox, thunderbird, libreoffice and chromium decide to build concurrently but it does not crash or lock up.
It's just unlucky when that happens. :)


Thanks. I think it failed on my new work laptop. Gosh I hope not. I'll try that next.

By the way I heard back from AMD and the Warranty is 3 years. They sent me that replacement CPU a couple months + three years ago.

But the tech support guy confirmed that CPU core 0 dying on the Address test of memtest86+ is a CPU problem and suggested I submit it for replacement.

I hope they'll throw me a bone.

I requested the RMA last night, I'll let you guys know what they say.

I'll update the thread
_________________
Some day there will only be free software.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46364
Location: 56N 3W

Posted: Sat Sep 26, 2020 8:55 am

RayDude,

If it's a known systematic failure caused by AMD, like the packaging issue on very early Ryzen, they will probably give you a new CPU.
If it's a random hardware failure, they probably won't.

Good luck.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 29, 2020 7:28 am

AMD says it's out of warranty.

I'm out of luck.

Now what do I do? Limp along until Zen 3 ships and hope for a cheap Zen 2? Or get a used Zen+?

Ugh.
_________________
Some day there will only be free software.
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46364
Location: 56N 3W

Posted: Tue Sep 29, 2020 8:20 am

RayDude,

Don't pin your hopes on Zen3 ... Remember Zen1 ... don't buy version 1 of anything.
As you have not done a motherboard or CPU swap, you don't know which it is.

That's really the next step.

Very long shot. A long time ago the kernel had a badram command line option. Maybe it was a patch.
The idea was to prevent the kernel allocating the badram, so everything worked as expected.

It won't matter to the kernel if the RAM is bad, or the RAM controller in the CPU has problems with some addresses.

You can also play with maxram= kernel command line option, to see if you can find a memory size that always works.
The idea remains to avoid triggering the problem.

One more thing.
Remove your CPU from the motherboard and reseat it. It just might be a CPU pin to socket contact gone high resistance.

The address is applied to DRAMs in two pieces, called the column and row addresses. The DRAM and memory controller therefore have a property called 'geometry' which limits the size of DRAM that can be addressed.
The high part of the address is the column and the low part the row. The exact split is determined by the DRAM geometry.
If the row part was in a mess, almost nothing would work.
If the column address has a stuck bit (just one) then effects vary, from mapping two addresses to the same physical address, or mapping a real physical address to empty space.
That would generate a bus error, when the non existent RAM failed to respond.

The more I write, the more I think it's CPU related as the row and column addresses travel over the same PCB tracks on the motherboard.
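
A toy model of the stuck address bit may make the aliasing clearer. It is only a sketch under loud assumptions: a flat address with a single bit forced to 0 or 1, no real row/column geometry, and class and function names invented for illustration.

```python
def stuck_bit_address(addr, bit, stuck_value):
    """Address the RAM actually sees when one address line is stuck at 0 or 1."""
    mask = 1 << bit
    return (addr | mask) if stuck_value else (addr & ~mask)


class FaultyMemory:
    """Toy memory whose address bus has one stuck bit (hypothetical model)."""

    def __init__(self, bit, stuck_value):
        self.bit = bit
        self.stuck_value = stuck_value
        self.cells = {}

    def write(self, addr, value):
        self.cells[stuck_bit_address(addr, self.bit, self.stuck_value)] = value

    def read(self, addr):
        return self.cells.get(stuck_bit_address(addr, self.bit, self.stuck_value))


mem = FaultyMemory(bit=5, stuck_value=0)
mem.write(0x1000, 0xAA)
mem.write(0x1020, 0x55)   # differs from 0x1000 only in the stuck bit
print(hex(mem.read(0x1000)))   # the first value has been overwritten
```

Two addresses that differ only in the stuck bit land on the same cell, so the second write clobbers the first. That is the kind of fault an address test is designed to catch.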

The upshot of this is that you can fit all your RAM and do a binary search with maxram=
Correction: it's not maxram, it's mem=
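
The binary search could be organised roughly like this. It is a sketch, not a tool: is_stable is a hypothetical callback standing in for "reboot with mem=<N>M, run the stress test, note pass or fail by hand", and it assumes failures are monotonic, i.e. the box is stable below some limit and unstable above it.

```python
def find_largest_stable(total_mb, is_stable, step_mb=256):
    """Binary-search the largest mem= value (in MB) for which is_stable() is True.

    Returns 0 if even the smallest step fails.
    """
    lo, hi = 0, total_mb // step_mb          # search in units of step_mb
    while lo < hi:
        mid = (lo + hi + 1) // 2             # round up so the loop always shrinks
        if is_stable(mid * step_mb):
            lo = mid                         # stable: try more RAM
        else:
            hi = mid - 1                     # unstable: try less RAM
    return lo * step_mb


# Stand-in for real test runs: pretend anything above 9000 MB crashes.
def fake_result(mem_mb):
    return mem_mb <= 9000


limit = find_largest_stable(16384, fake_result)
print(f"largest stable setting in this simulation: mem={limit}M")
```

For 16 GB in 256 MB steps that is about six or seven reboots. If the results are not monotonic, the fault probably isn't tied to one address range and this approach won't isolate it.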
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 29, 2020 2:40 pm

Thanks Neddy!

I posted a lament to reddit.com/r/pcmasterrace and received a suggestion from a linux guy...

I added this to the grub boot cmdline:

isolcpus=0,1

This keeps core0 and core1 (second thread of core 0) from receiving any programs.

It is strange though. Core0 still has kernel functions running on it periodically, but it only gets to 0.7% occupancy in htop, even when I did an emerge libreoffice.

This does appear to have cut down the amount of activity on core0. I'm hoping it will keep the system stable enough to wait until Zen3 drops, and make Zen+ or Zen2 cheaper so I can get 8 cores for $150.00 or less.

If I still get a crash, then I'll see if I can implement your suggestion. Figuring out where the bad ram is will take a bit of work. Hopefully dmesg will provide the addresses that fail and I'll be able to build a list of what is bad. If enough time passes I might be able to figure out which address bit / data bit is failing.

I'm really hoping it is a core0 issue though. That would be the best since this work around might mitigate that.
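
A quick way to confirm the option took effect after a reboot is to read the CPU lists the kernel exposes under /sys. The sketch below assumes a Linux box where /sys/devices/system/cpu/isolated and the per-CPU topology/thread_siblings_list files are present; the helper names are made up, and on other systems it simply reports empty sets.

```python
from pathlib import Path


def parse_cpulist(text):
    """Turn a kernel CPU list such as '0-1,4' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def read_cpulist(path):
    """Read a CPU list file, returning an empty set if it is missing."""
    try:
        return parse_cpulist(Path(path).read_text())
    except OSError:
        return set()


isolated = read_cpulist("/sys/devices/system/cpu/isolated")
siblings = read_cpulist("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list")
print("isolated CPUs:", sorted(isolated))
print("SMT siblings of cpu0:", sorted(siblings))
print("whole of core 0 isolated:", bool(siblings) and siblings <= isolated)
```

isolcpus only keeps the scheduler from placing ordinary tasks on those CPUs. Interrupts and per-CPU kernel threads can still run there, which would fit the small residual load seen in htop.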
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 29, 2020 6:32 pm

I'm running emerge -DNuvq @world and this is what htop looks like:

https://i.imgur.com/wIQrpCU.png

It seems to be stable. I'll know as soon as it's done building the 110 packages.
_________________
Some day there will only be free software.


Last edited by RayDude on Tue Sep 29, 2020 9:38 pm; edited 1 time in total
NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46364
Location: 56N 3W

Posted: Tue Sep 29, 2020 6:55 pm

RayDude,

That will discriminate between a core 0 and RAM problem.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 29, 2020 9:42 pm

I have come to the conclusion that core0 is used to talk to the hardware and that's why core0 needs to be active all the time. If that's true, then it explains why video was affected during the core0 crash.

I built llvm, qtwebengine, gimp and I'm building wine at the moment. It seems much more stable. Typically it would only get about half way through before crashing.

Although I have killed some hardware related background tasks, I am also building in an shm partition which has got to hit ram really hard.

I think this is a core0 problem... And that means, I'm safe for a while.

I wonder if the "disease" will travel from core0 to other cores.

Thanks again for your help!
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Sep 29, 2020 10:31 pm

Update: it finished building without a problem. Looks like I'm set.
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Fri Oct 02, 2020 1:28 am

It died. Video hang again...

It looks like a different CPU is on my purchase list... Bummer...

Code:
[115504.102790] usb 3-3: new full-speed USB device number 5 using xhci_hcd
[115504.237097] usb 3-3: New USB device found, idVendor=0a5c, idProduct=21e8, bcdDevice= 1.12
[115504.237101] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[115504.237103] usb 3-3: Product: BCM20702A0
[115504.237105] usb 3-3: Manufacturer: Broadcom Corp
[115504.237107] usb 3-3: SerialNumber: 001986002CBC
[115504.362033] Bluetooth: hci0: BCM: chip id 63
[115504.363032] Bluetooth: hci0: BCM: features 0x07
[115504.379042] Bluetooth: hci0: BCM20702A
[115504.379048] Bluetooth: hci0: BCM20702A1 (001.002.014) build 0000
[115504.380992] Bluetooth: hci0: BCM: firmware Patch file not found, tried:
[115504.380994] Bluetooth: hci0: BCM: 'brcm/BCM20702A1-0a5c-21e8.hcd'
[115504.380995] Bluetooth: hci0: BCM: 'brcm/BCM-0a5c-21e8.hcd'
[130116.593009] elogind-daemon[1990]: New session c18 of user man.
[130129.705431] elogind-daemon[1990]: Removed session c18.
[148094.899392] usb 3-1: USB disconnect, device number 2
[182577.681757] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[182577.681764] CPU: 3 PID: 3483 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1
[182577.681765] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[182577.681766] Call Trace:
[182577.681773]  dump_stack+0x6d/0x90
[182577.681776]  warn_alloc.cold+0x74/0xd8
[182577.681779]  ? __alloc_pages_direct_compact+0x10f/0x130
[182577.681781]  __alloc_pages_slowpath.constprop.0+0xb69/0xba0
[182577.681783]  ? prep_new_page+0xbb/0xc0
[182577.681927]  ? _nv037032rm+0x26e/0x370 [nvidia]
[182577.681929]  __alloc_pages_nodemask+0x214/0x240
[182577.681932]  kmalloc_order+0x18/0x60
[182577.681943]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]
[182577.681957]  _nv002653kms+0x16/0x30 [nvidia_modeset]
[182577.681969]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[182577.682159]  ? _nv033594rm+0x40/0x40 [nvidia]
[182577.682351]  ? _nv000586rm+0xa08/0xde0 [nvidia]
[182577.682360]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[182577.682362]  ? __alloc_pages_nodemask+0x11f/0x240
[182577.682372]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[182577.682373]  ? kmalloc_order+0x54/0x60
[182577.682382]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[182577.682392]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[182577.682401]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[182577.682410]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[182577.682529]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]
[182577.682531]  ? ksys_ioctl+0x82/0xc0
[182577.682532]  ? __x64_sys_ioctl+0x11/0x20
[182577.682534]  ? do_syscall_64+0x3e/0xb0
[182577.682537]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[182577.682547] Mem-Info:
[182577.682552] active_anon:2760986 inactive_anon:294392 isolated_anon:0
                 active_file:424716 inactive_file:237483 isolated_file:0
                 unevictable:24 dirty:222 writeback:0
                 slab_reclaimable:141336 slab_unreclaimable:32381
                 mapped:1741344 shmem:1626059 pagetables:20730 bounce:0
                 free:109346 free_pcp:344 free_cma:0
[182577.682555] Node 0 active_anon:11043944kB inactive_anon:1177568kB active_file:1698864kB inactive_file:949932kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6965376kB dirty:888kB writeback:0kB shmem:6504236kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
[182577.682558] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[182577.682559] lowmem_reserve[]: 0 3468 15940 15940
[182577.682563] DMA32 free:353412kB min:14688kB low:18360kB high:22032kB reserved_highatomic:0KB active_anon:1308636kB inactive_anon:575740kB active_file:297760kB inactive_file:545704kB unevictable:0kB writepending:360kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:5440kB pagetables:15372kB bounce:0kB free_pcp:52kB local_pcp:52kB free_cma:0kB
[182577.682564] lowmem_reserve[]: 0 0 12472 12472
[182577.682568] Normal free:68084kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:9735308kB inactive_anon:601828kB active_file:1401372kB inactive_file:404140kB unevictable:96kB writepending:528kB present:13094400kB managed:12776572kB mlocked:96kB kernel_stack:14768kB pagetables:67548kB bounce:0kB free_pcp:1324kB local_pcp:204kB free_cma:0kB
[182577.682568] lowmem_reserve[]: 0 0 0 0
[182577.682570] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[182577.682577] DMA32: 12856*4kB (UME) 9283*8kB (UME) 10180*16kB (UME) 1797*32kB (UME) 132*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 354520kB
[182577.682585] Normal: 1003*4kB (UMEH) 1445*8kB (UMEH) 2458*16kB (UMEH) 337*32kB (UMEH) 32*64kB (UMEH) 3*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 68372kB
[182577.682592] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[182577.682592] 2324105 total pagecache pages
[182577.682595] 35776 pages in swap cache
[182577.682596] Swap cache stats: add 1432955, delete 1397288, find 349174/573369
[182577.682597] Free swap  = 6231548kB
[182577.682597] Total swap = 8388604kB
[182577.682598] 4181834 pages RAM
[182577.682598] 0 pages HighMem/MovableOnly
[182577.682599] 79478 pages reserved
[182577.682605] BUG: unable to handle page fault for address: 0000000000007980
[182577.682608] #PF: supervisor read access in kernel mode
[182577.682609] #PF: error_code(0x0000) - not-present page
[182577.682610] PGD 0 P4D 0
[182577.682613] Oops: 0000 [#1] PREEMPT SMP NOPTI
[182577.682615] CPU: 3 PID: 3483 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1
[182577.682616] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[182577.682632] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[182577.682635] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[182577.682636] RSP: 0018:ffffaf0f81a63ce8 EFLAGS: 00010202
[182577.682639] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[182577.682640] RDX: ffff9adc238fe348 RSI: 0000000000007980 RDI: ffff9adc238f9008
[182577.682641] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[182577.682643] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[182577.682644] R13: 0000000000007980 R14: ffff9adc238f9008 R15: 0000000000000001
[182577.682646] FS:  00007f7a5de388c0(0000) GS:ffff9adc4e980000(0000) knlGS:0000000000000000
[182577.682648] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[182577.682649] CR2: 0000000000007980 CR3: 00000003e6498000 CR4: 00000000003406e0
[182577.682651] Call Trace:
[182577.682664]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]
[182577.682856]  ? _nv000586rm+0x970/0xde0 [nvidia]
[182577.682867]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[182577.682870]  ? __alloc_pages_nodemask+0x11f/0x240
[182577.682880]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[182577.682883]  ? kmalloc_order+0x54/0x60
[182577.682893]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[182577.682902]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[182577.682912]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[182577.682922]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[182577.683041]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]
[182577.683044]  ? ksys_ioctl+0x82/0xc0
[182577.683046]  ? __x64_sys_ioctl+0x11/0x20
[182577.683048]  ? do_syscall_64+0x3e/0xb0
[182577.683050]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[182577.683051] Modules linked in: fuse nvidia_drm(PO) hid_logitech_hidpp nvidia_modeset(PO) nvidia(PO) hid_logitech_dj input_leds r8169 realtek libphy
[182577.683060] CR2: 0000000000007980
[182577.683062] ---[ end trace eac861e1a55d63fd ]---
[182577.683077] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[182577.683080] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[182577.683081] RSP: 0018:ffffaf0f81a63ce8 EFLAGS: 00010202
[182577.683083] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[182577.683084] RDX: ffff9adc238fe348 RSI: 0000000000007980 RDI: ffff9adc238f9008
[182577.683085] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[182577.683086] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[182577.683087] R13: 0000000000007980 R14: ffff9adc238f9008 R15: 0000000000000001
[182577.683090] FS:  00007f7a5de388c0(0000) GS:ffff9adc4e980000(0000) knlGS:0000000000000000
[182577.683091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[182577.683092] CR2: 0000000000007980 CR3: 00000003e6498000 CR4: 00000000003406e0

_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Fri Oct 02, 2020 3:28 am

Googled dmesg's crash message and found this:

https://forums.developer.nvidia.com/t/440-48-02-random-x-org-lock-ups-due-to-kernel-module-crash/110995/8

This might be a bug in the nvidia driver.
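
The dmesg output itself gives a hint in that direction. The trace starts with a failed order:5 allocation, which with 4 kB pages is one contiguous 128 kB block. The sketch below parses the "Normal:" free-list line from that log and counts blocks big enough for the request. It assumes 4 kB pages and, as far as I understand the flags, that blocks marked only (H) sit in the high-atomic reserve and are not handed out to an ordinary GFP_KERNEL request. The helper names are made up.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as on x86-64

# The "Normal:" free-list line copied from the dmesg output in this thread.
NORMAL = ("Normal: 1003*4kB (UMEH) 1445*8kB (UMEH) 2458*16kB (UMEH) "
          "337*32kB (UMEH) 32*64kB (UMEH) 3*128kB (H) 1*256kB (H) "
          "0*512kB 0*1024kB 0*2048kB 0*4096kB = 68372kB")


def order_to_kb(order):
    """Size in kB of one contiguous block of the given allocation order."""
    return PAGE_KB << order


def parse_buddy_line(line):
    """Return {block_kb: (count, flags)} from a 'Zone: N*SkB (FLAGS) ...' line."""
    blocks = {}
    for count, size_kb, flags in re.findall(r"(\d+)\*(\d+)kB(?: \((\w+)\))?", line):
        blocks[int(size_kb)] = (int(count), flags)
    return blocks


def usable_blocks(blocks, min_kb):
    """Count free blocks of at least min_kb, skipping those flagged only (H)."""
    return sum(count for size, (count, flags) in blocks.items()
               if size >= min_kb and flags != "H")


blocks = parse_buddy_line(NORMAL)
need_kb = order_to_kb(5)
free_kb = sum(size * count for size, (count, _) in blocks.items())
print(f"order 5 needs {need_kb} kB contiguous; zone has {free_kb} kB free, "
      f"{usable_blocks(blocks, need_kb)} usable blocks that large")
```

So there was plenty of free memory, just nothing contiguous enough, and the page fault inside nvidia_modeset right afterwards looks like the driver not coping with the failed allocation. That would point at fragmentation plus a driver bug rather than at the CPU, though it doesn't prove it.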

I put in his workaround and will try it for the next week or so. If it lasts, I'll re-enable my core0. Then if that works, I'll boost my memory clock up from 2133MHz.

*crosses fingers*
_________________
Some day there will only be free software.
RayDude
Veteran

Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Tue Oct 06, 2020 5:59 am

Still crashing even with harddpms false in xorg.conf.

Code:
[361372.655844] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[361372.655854] CPU: 8 PID: 19219 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1
[361372.655856] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[361372.655857] Call Trace:
[361372.655866]  dump_stack+0x6d/0x90
[361372.655871]  warn_alloc.cold+0x74/0xd8
[361372.655875]  ? __alloc_pages_direct_compact+0x10f/0x130
[361372.655879]  __alloc_pages_slowpath.constprop.0+0xb69/0xba0
[361372.655882]  ? prep_new_page+0xbb/0xc0
[361372.656141]  ? _nv037032rm+0x26e/0x370 [nvidia]
[361372.656144]  __alloc_pages_nodemask+0x214/0x240
[361372.656148]  kmalloc_order+0x18/0x60
[361372.656166]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]
[361372.656188]  _nv002653kms+0x16/0x30 [nvidia_modeset]
[361372.656208]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[361372.656519]  ? _nv033594rm+0x40/0x40 [nvidia]
[361372.656829]  ? _nv000586rm+0xa08/0xde0 [nvidia]
[361372.656846]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[361372.656849]  ? __alloc_pages_nodemask+0x11f/0x240
[361372.656866]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[361372.656868]  ? kmalloc_order+0x54/0x60
[361372.656885]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[361372.656901]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[361372.656904]  ? __fget_files+0x6c/0xa0
[361372.656922]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[361372.656947]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[361372.657193]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]
[361372.657197]  ? ksys_ioctl+0x82/0xc0
[361372.657199]  ? __x64_sys_ioctl+0x11/0x20
[361372.657202]  ? do_syscall_64+0x3e/0xb0
[361372.657206]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361372.657226] Mem-Info:
[361372.657235] active_anon:2237041 inactive_anon:534370 isolated_anon:0
                 active_file:257084 inactive_file:442193 isolated_file:0
                 unevictable:24 dirty:168 writeback:506
                 slab_reclaimable:156242 slab_unreclaimable:29802
                 mapped:1703731 shmem:1591765 pagetables:19447 bounce:0
                 free:338412 free_pcp:1 free_cma:0
[361372.657239] Node 0 active_anon:8948164kB inactive_anon:2137480kB active_file:1028336kB inactive_file:1768772kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6814924kB dirty:672kB writeback:2024kB shmem:6367060kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
[361372.657246] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[361372.657247] lowmem_reserve[]: 0 3468 15940 15940
[361372.657256] DMA32 free:1259540kB min:14688kB low:18360kB high:22032kB reserved_highatomic:2048KB active_anon:1315496kB inactive_anon:275800kB active_file:75188kB inactive_file:404068kB unevictable:0kB writepending:2384kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:3428kB pagetables:9328kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[361372.657257] lowmem_reserve[]: 0 0 12472 12472
[361372.657265] Normal free:78220kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:7632668kB inactive_anon:1861680kB active_file:953148kB inactive_file:1364704kB unevictable:96kB writepending:312kB present:13094400kB managed:12776572kB mlocked:96kB kernel_stack:16540kB pagetables:68460kB bounce:0kB free_pcp:4kB local_pcp:0kB free_cma:0kB
[361372.657265] lowmem_reserve[]: 0 0 0 0
[361372.657268] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[361372.657280] DMA32: 72588*4kB (UMEH) 66966*8kB (UMEH) 25917*16kB (UMEH) 499*32kB (UMH) 20*64kB (UMH) 1*128kB (H) 1*256kB (H) 1*512kB (H) 1*1024kB (H) 0*2048kB 0*4096kB = 1259920kB
[361372.657293] Normal: 1225*4kB (UMEH) 2003*8kB (UMEH) 3506*16kB (UMEH) 11*32kB (UEH) 1*64kB (H) 2*128kB (H) 2*256kB (H) 2*512kB (H) 0*1024kB 0*2048kB 0*4096kB = 79228kB
[361372.657307] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[361372.657308] 2331041 total pagecache pages
[361372.657312] 39972 pages in swap cache
[361372.657313] Swap cache stats: add 2999527, delete 2959784, find 1401773/1901494
[361372.657314] Free swap  = 6158844kB
[361372.657316] Total swap = 8388604kB
[361372.657317] 4181834 pages RAM
[361372.657317] 0 pages HighMem/MovableOnly
[361372.657318] 79478 pages reserved
[361372.657329] BUG: unable to handle page fault for address: 0000000000007980
[361372.657334] #PF: supervisor read access in kernel mode
[361372.657337] #PF: error_code(0x0000) - not-present page
[361372.657339] PGD 0 P4D 0
[361372.657344] Oops: 0000 [#1] PREEMPT SMP NOPTI
[361372.657348] CPU: 8 PID: 19219 Comm: Xorg Tainted: P           O    T 5.8.12-gentoo #1
[361372.657356] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[361372.657388] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[361372.657394] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[361372.657397] RSP: 0018:ffffb08680bbfce8 EFLAGS: 00010202
[361372.657400] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[361372.657403] RDX: ffff9e79cab44348 RSI: 0000000000007980 RDI: ffff9e79cab41008
[361372.657405] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[361372.657408] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[361372.657411] R13: 0000000000007980 R14: ffff9e79cab41008 R15: 0000000000000001
[361372.657414] FS:  00007f032f0158c0(0000) GS:ffff9e79cec00000(0000) knlGS:0000000000000000
[361372.657417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[361372.657419] CR2: 0000000000007980 CR3: 000000024565e000 CR4: 00000000003406e0
[361372.657422] Call Trace:
[361372.657450]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]
[361372.657770]  ? _nv000586rm+0x970/0xde0 [nvidia]
[361372.657788]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[361372.657794]  ? __alloc_pages_nodemask+0x11f/0x240
[361372.657811]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[361372.657815]  ? kmalloc_order+0x54/0x60
[361372.657832]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[361372.657848]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[361372.657852]  ? __fget_files+0x6c/0xa0
[361372.657869]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[361372.657885]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[361372.658097]  ? nvidia_frontend_unlocked_ioctl+0x31/0x40 [nvidia]
[361372.658104]  ? ksys_ioctl+0x82/0xc0
[361372.658106]  ? __x64_sys_ioctl+0x11/0x20
[361372.658110]  ? do_syscall_64+0x3e/0xb0
[361372.658114]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[361372.658118] Modules linked in: fuse nvidia_drm(PO) nvidia_modeset(PO) hid_logitech_hidpp nvidia(PO) hid_logitech_dj input_leds r8169 realtek libphy
[361372.658132] CR2: 0000000000007980
[361372.658136] ---[ end trace 727de6fe850b9bc6 ]---
[361372.658161] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[361372.658166] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[361372.658168] RSP: 0018:ffffb08680bbfce8 EFLAGS: 00010202
[361372.658170] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[361372.658172] RDX: ffff9e79cab44348 RSI: 0000000000007980 RDI: ffff9e79cab41008
[361372.658175] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[361372.658177] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[361372.658178] R13: 0000000000007980 R14: ffff9e79cab41008 R15: 0000000000000001
[361372.658181] FS:  00007f032f0158c0(0000) GS:ffff9e79cec00000(0000) knlGS:0000000000000000
[361372.658183] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[361372.658185] CR2: 0000000000007980 CR3: 000000024565e000 CR4: 00000000003406e0


Maybe the CPU is really dead...
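
Before blaming the CPU, it may be worth reading the free-list lines in the dump above. The failed request was order:5, which with 4kB pages means 32 contiguous pages (128kB). Below is a rough Python sketch, not an authoritative tool, that counts how many free blocks of at least that size each zone had. It assumes the letters in parentheses are the kernel's migrate-type tags (U unmovable, M movable, E reclaimable, H highatomic reserve) and that a plain GFP_KERNEL request normally cannot take H-only blocks. The function name usable_blocks is made up for this example, and the timestamps have been trimmed from the copied lines. It ignores watermarks and lowmem_reserve entirely.

```python
import re

# One "count*sizekB (flags)" entry from a buddy free-list line in the dump.
BLOCK_RE = re.compile(r"(\d+)\*(\d+)kB(?: \(([A-Z]+)\))?")


def usable_blocks(line, order, page_kb=4):
    """Return (zone, count) of free blocks >= the given order that are not
    tagged highatomic-only ("H"). Hypothetical helper for illustration."""
    zone, _, rest = line.partition(":")
    rest = rest.split("=")[0]          # drop the trailing "= NNNNkB" total
    need_kb = page_kb << order         # order 5 with 4kB pages -> 128kB
    total = 0
    for count, size_kb, flags in BLOCK_RE.findall(rest):
        if int(size_kb) < need_kb:
            continue                   # block too small for this request
        if flags == "H":
            continue                   # assumed off-limits to GFP_KERNEL
        total += int(count)
    return zone.strip(), total


# Free-list lines copied from the dmesg output above (timestamps removed).
DMA = ("DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) "
       "1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB")
DMA32 = ("DMA32: 72588*4kB (UMEH) 66966*8kB (UMEH) 25917*16kB (UMEH) "
         "499*32kB (UMH) 20*64kB (UMH) 1*128kB (H) 1*256kB (H) 1*512kB (H) "
         "1*1024kB (H) 0*2048kB 0*4096kB = 1259920kB")
NORMAL = ("Normal: 1225*4kB (UMEH) 2003*8kB (UMEH) 3506*16kB (UMEH) "
          "11*32kB (UEH) 1*64kB (H) 2*128kB (H) 2*256kB (H) 2*512kB (H) "
          "0*1024kB 0*2048kB 0*4096kB = 79228kB")

if __name__ == "__main__":
    for line in (DMA, DMA32, NORMAL):
        print(usable_blocks(line, order=5))
```

If those assumptions hold, DMA32 and Normal come out with no usable 128kB block at all despite more than 1GB free in DMA32, and the only large blocks sit in the tiny DMA zone, which the lowmem_reserve line suggests is held back from ordinary allocations. That would point at memory fragmentation plus the driver not coping with a failed allocation, rather than at a dying CPU.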
_________________
Some day there will only be free software.
Ionen
Veteran


Joined: 06 Dec 2018
Posts: 1275

Posted: Tue Oct 06, 2020 7:34 am

Are you using nvidia-drivers-455.23.04 now?

I've recently run into page allocation failures as well. I could easily trigger them on purpose with heavy tmpfs usage, but other things can trigger them at random. There's a newer, similar report on nvidia's forums.

It also didn't hang the system: I could press ctrl+alt+F1 to return to my efifb console (no need for ssh, since xorg was still taking input), and everything carried on as normal except that the xorg display was frozen.

Returning to nvidia-drivers-450.80.02 solved the issue.

If it's happening to you with stable 450.66 then I don't know :(
Edit: although I did see one other user with similar problems on 450.66, I'm not convinced it's failing hardware either way. 440.100 is probably the most stable if you want to try it.
Edit2: personally I still get page allocation failures with the new 455.28 (it takes some unnatural abuse to make it happen quickly, and it mostly works otherwise), so I went back to 450.80.02 again.
All times are GMT
Page 2 of 3