Gentoo Forums
Method to test crashing system?
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Thu Sep 10, 2020 7:56 pm    Post subject: Method to test crashing system?

Final Update:

Core 0 has issues. I have found a way (via a user at reddit) to disable CPU1 and CPU2 (both threads of core0) for all tasks, except hardware related ones. This seems to have brought back stability.
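
The post does not say which mechanism was used. One generic way to keep an ordinary task off particular logical CPUs on Linux is to restrict its CPU affinity, and the Python sketch below illustrates that idea only. It assumes the two threads of core 0 are kernel CPUs 0 and 1 (htop would label them 1 and 2), and the helper names are made up for illustration.

```python
import os


def allowed_cpus(available, excluded):
    """Return the CPUs a task may still use once `excluded` is removed."""
    return set(available) - set(excluded)


def keep_off(excluded, pid=0):
    """Restrict `pid` (0 = the calling process) to every CPU except `excluded`.

    Uses os.sched_getaffinity / os.sched_setaffinity, which are Linux-only.
    """
    target = allowed_cpus(os.sched_getaffinity(pid), excluded)
    if not target:
        raise ValueError("refusing to leave the process with no CPUs")
    os.sched_setaffinity(pid, target)
    return target


# Example (assumption: kernel CPUs 0 and 1 are the two threads of core 0):
#   keep_off({0, 1})        # this process
#   keep_off({0, 1}, 1234)  # some other PID, needs suitable privileges
```

Affinity set this way is inherited by child processes, so applying it early (for example from an init script) covers most userspace tasks, while kernel threads and interrupt handling are not affected by it.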

****************************************************************************************


I have a three-year-old server that hangs every few days overnight.

It's done it three times in the last week or so.

The strange thing is: it's not totally dead. The screen freezes and I can't switch to console, but I can log in remotely. When I do, I can't kill any application that's hung which includes X, kde, plasma. I can't even get it to reboot.

I have to hard reset or power cycle it.

This feels like a hardware issue to me. Does anyone have any tricks to figuring out if it's hardware or if I somehow botched the software so badly that X is hanging beyond kill -9?

I was hoping to put off upgrading this system until next year... I'm going to start looking for motherboard and CPU deals...

Thanks in advance.
_________________
Some day there will only be free software.


Last edited by RayDude on Fri Oct 02, 2020 1:27 am; edited 3 times in total
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 46349
Location: 56N 3W

Posted: Thu Sep 10, 2020 8:20 pm

RayDude,

Can you read logs or get logs off it?
dmesg would be good.

What does smartctl -x say about the HDD?

Boot into a few cycles of memtest86
A fail does not always mean a RAM fail.

Take out half the RAM. Does it still hang?
Now try with only the half of the RAM that was out.

Put all the RAM back ... what happens now?
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Fri Sep 11, 2020 12:40 am

Thanks Neddy!

NeddySeagoon wrote:
RayDude,

Can you read logs or get logs off it?
dmesg would be good.



I did not think of this. So obvious. Next time it happens I will definitely check both.


NeddySeagoon wrote:
What does smartctl -x say about the HDD?


I know the hard drives / ssd are okay. I keep tabs on them.


NeddySeagoon wrote:

Boot into a few cycles of memtest86
A fail does not always mean a RAM fail.

Take out half the RAM. Does it still hang?
Now try with only the half of the RAM that was out.

Put all the RAM back ... what happens now?


I will consider this.

The first failures happened after I installed a new BIOS and attempted to run the memory faster. It worked great, until it didn't.

But then I slowed everything back down to below stock (this is a Ryzen 5 1600): I set the memory to 2133, and I have never changed the CPU frequency or voltage from stock. It is still happening.

It makes me wonder if the BIOS upgrade broke something. I'll check whether Gigabyte has released another BIOS to fix this one...

Thanks again. I really appreciate you taking the time to respond.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Fri Sep 11, 2020 2:11 am

Update: a new BIOS was released last month. It contains the AGESA 1.0.0.6 update. I'm hoping that helps.

I'm leaving everything stock to see if it fails again.

I'm crossing my fingers...
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 13, 2020 12:31 am

I updated the BIOS, which reset everything to BIOS defaults, and I left it that way.

I did an emerge -DNuq @world yesterday and things went south again. Again, X died. Black screen, no activity.

But I was able to log in remotely and check the system. Here is the end of dmesg. Keep in mind that some of the nvidia driver messages come from me trying, and failing, to get X to restart.

Code:
[29781.707378] elogind-daemon[2003]: New session c17 of user man.
[29782.656892] elogind-daemon[2003]: Removed session c17.
[54302.920055] elogind-daemon[2003]: New session 6 of user XXXX.
[54311.100668] elogind-daemon[2003]: Removed session 6.
[54319.280085] elogind-daemon[2003]: New session 7 of user XXXX.
[54441.427526] TCP: request_sock_TCP: Possible SYN flooding on port 56190. Sending cookies.  Check SNMP counters.
[76583.096115] fuse: init (API version 7.31)
[84602.043020] elogind-daemon[2003]: New session 8 of user XXXX.
[86195.821445] udevd[805]: invalid key/value pair in file /lib/udev/rules.d/60-steam-input.rules on line 42, starting at character 82 ('u')
[86723.913499] elogind-daemon[2003]: Removed session 7.
[87027.491302] traps: ThreadPoolSingl[4325] trap int3 ip:563acaf0f594 sp:7fd05b513f50 error:0 in chrome (deleted)[563ac804b000+7bf1000]
[87028.245102] elogind-daemon[2003]: Removed session 3.
[87092.876276] elogind-daemon[2003]: New session 9 of user root.
[87094.733875] elogind-daemon[2003]: Removed session 9.
[87103.959788] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[87103.959886] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[87104.611948] elogind-daemon[2003]: New session 10 of user mythtv.
[87158.604771] elogind-daemon[2003]: Removed session 10.
[87171.388304] elogind-daemon[2003]: New session 11 of user root.
[87780.826265] [drm] [nvidia-drm] [GPU ID 0x00000800] Unloading driver
[87780.836313] nvidia-modeset: Unloading
[87780.845269] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246
[87823.682060] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
[87823.682484] nvidia 0000:08:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=io+mem
[87823.882322] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  450.66  Wed Aug 12 19:42:48 UTC 2020
[87824.142123] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
[87824.142224] caller _nv000745rm+0x1af/0x200 [nvidia] mapping multiple BARs
[93754.509148] nvidia-nvlink: Unregistered the Nvlink Core, major device number 246


I ran /etc/init.d/xdm stop and it says sddm stopped, but X didn't.

I tried to remove the nvidia drivers and the system wouldn't do it, even with modprobe -r -f because "module in use".

ps -ef | grep plasma showed that plasma was still running. I killed it.

ps -ef | grep kde (I think) showed that something of kde was still running and I killed it.

Then I could unload the nvidia modules.

Then I tried to start sddm (xdm) again and nothing. The nvidia driver didn't even load. I loaded it by hand and drm didn't load, only the nvidia module loaded and some of the output in dmesg is what it had to say...

Nothing I did could get video to recover.

But CTRL-ALT-F1 did get me to a working console.

I'm starting to think the video card or motherboard is going bad. I wonder if there's dust caked on the video card. I didn't study it closely the last time I had the case open. I should probably check. I wonder if things are overheating. Ever since the thermal monitor for KDE stopped working I haven't been paying attention to the temps. I should set up a script and watch the next time I emerge @world...
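
A minimal sketch of such a script, assuming the board's sensors are exposed through the kernel's hwmon sysfs interface (files named temp*_input that hold millidegrees Celsius). Which sensors show up depends on the drivers loaded, and the function names and log file name are made up for illustration.

```python
import glob
import time


def read_millidegrees(path):
    """Read one hwmon temp*_input file and return degrees Celsius."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def snapshot(pattern="/sys/class/hwmon/hwmon*/temp*_input"):
    """Return {sensor_path: degrees_celsius} for every readable sensor."""
    temps = {}
    for path in sorted(glob.glob(pattern)):
        try:
            temps[path] = read_millidegrees(path)
        except (OSError, ValueError):
            continue  # some sensors exist but refuse to be read
    return temps


def log_temps(logfile, interval_s=10, samples=None):
    """Append timestamped readings to `logfile`; samples=None runs until killed."""
    taken = 0
    while samples is None or taken < samples:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(logfile, "a") as out:
            for path, celsius in snapshot().items():
                out.write(f"{stamp} {path} {celsius:.1f}\n")
        taken += 1
        time.sleep(interval_s)


# Example: start log_temps("emerge-temps.log") in another terminal,
# then run the emerge and check the log afterwards.
```

Writing to a file rather than the screen means the last readings survive if the display hangs again.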

I did something wacky in BIOS for the next experiment. I turned the PCIe ports down to PCIe Gen 1 to see if it makes a difference. I'll do an emerge @world next Friday and see what happens.

It's funny, I had a very similar problem in the system this one replaced a couple of years ago, and I'm pretty sure the video card died in a very similar way. I still have it (can't throw out a GeForce, I might need it in a pinch), but it has just been cooking in the garage summer heat for the last several years.

Man I want this thing to survive until Zen 3 comes out and doesn't bust a wallet.

Thanks for listening. I'll keep posting status because it helps me organize my thoughts.

PS. I wonder if the syn flooding is a symptom of the crash...

Edit: I have a huge SHM for zoneminder on this machine. This feels like it might be a memory management issue that affects X and plasma. I've been thinking about doubling my RAM to 32 GB. DRAM prices are in freefall at the moment, should bottom out by the end of the year, maybe first quarter as manufacturers scale production back. But for now DRAM and SSDs are getting cheaper by the week. But I hesitate to buy new RAM when a new system might want DDR5. I'll have to check to see if Zen 3 supports DDR5... I suspect not...
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Wed Sep 16, 2020 10:36 pm

I don't know if anyone can help me, but I got an oops.

Code:

[408991.171083] Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[408991.171092] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1
[408991.171094] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[408991.171095] Call Trace:
[408991.171104]  dump_stack+0x6d/0x90
[408991.171109]  warn_alloc.cold+0x74/0xdb
[408991.171113]  ? __alloc_pages_direct_compact+0x11d/0x140
[408991.171117]  __alloc_pages_slowpath.constprop.0+0xb53/0xb90
[408991.171121]  ? wake_up_q+0x90/0x90
[408991.171124]  ? prep_new_page+0xbd/0xc0
[408991.171127]  __alloc_pages_nodemask+0x210/0x240
[408991.171131]  kmalloc_order+0x1b/0x60
[408991.171148]  nvkms_alloc+0x1b/0xd0 [nvidia_modeset]
[408991.171168]  _nv002653kms+0x16/0x30 [nvidia_modeset]
[408991.171185]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[408991.171200]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171202]  ? __alloc_pages_nodemask+0x11b/0x240
[408991.171216]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[408991.171219]  ? kmalloc_order+0x57/0x60
[408991.171232]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171245]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[408991.171259]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[408991.171273]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[408991.171449]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[408991.171452]  ? ksys_ioctl+0x82/0xc0
[408991.171454]  ? __x64_sys_ioctl+0x11/0x20
[408991.171457]  ? do_syscall_64+0x3e/0xb0
[408991.171460]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[408991.171475] Mem-Info:
[408991.171482] active_anon:2641871 inactive_anon:301721 isolated_anon:0
                 active_file:492415 inactive_file:210014 isolated_file:0
                 unevictable:24 dirty:34 writeback:0
                 slab_reclaimable:142143 slab_unreclaimable:30998
                 mapped:1689062 shmem:1608888 pagetables:19906 bounce:0
                 free:176615 free_pcp:0 free_cma:0
[408991.171486] Node 0 active_anon:10567484kB inactive_anon:1206884kB active_file:1969660kB inactive_file:840056kB unevictable:96kB isolated(anon):0kB isolated(file):0kB mapped:6756248kB dirty:136kB writeback:0kB shmem:6435552kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
[408991.171490] DMA free:15888kB min:64kB low:80kB high:96kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171491] lowmem_reserve[]: 0 3468 15940 15940
[408991.171498] DMA32 free:611724kB min:14688kB low:18360kB high:22032kB reserved_highatomic:0KB active_anon:1428252kB inactive_anon:427880kB active_file:202688kB inactive_file:452276kB unevictable:0kB writepending:16kB present:3616964kB managed:3616964kB mlocked:0kB kernel_stack:4492kB pagetables:12700kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171498] lowmem_reserve[]: 0 0 12472 12472
[408991.171505] Normal free:78848kB min:52828kB low:66032kB high:79236kB reserved_highatomic:2048KB active_anon:9139232kB inactive_anon:779004kB active_file:1766972kB inactive_file:387780kB unevictable:96kB writepending:120kB present:13094400kB managed:12776556kB mlocked:96kB kernel_stack:13828kB pagetables:66924kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[408991.171505] lowmem_reserve[]: 0 0 0 0
[408991.171507] DMA: 0*4kB 0*8kB 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15888kB
[408991.171517] DMA32: 26229*4kB (UME) 18227*8kB (UME) 17805*16kB (UME) 2039*32kB (UME) 188*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 612892kB
[408991.171525] Normal: 5695*4kB (UMEH) 2019*8kB (UMEH) 2080*16kB (UMEH) 225*32kB (UMEH) 33*64kB (MEH) 3*128kB (H) 1*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 82164kB
[408991.171536] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[408991.171537] 2376843 total pagecache pages
[408991.171540] 65526 pages in swap cache
[408991.171541] Swap cache stats: add 2813577, delete 2748166, find 1460048/1950362
[408991.171542] Free swap  = 5640700kB
[408991.171542] Total swap = 8388604kB
[408991.171543] 4181834 pages RAM
[408991.171544] 0 pages HighMem/MovableOnly
[408991.171544] 79482 pages reserved
[408991.171553] BUG: unable to handle page fault for address: 0000000000007980
[408991.171557] #PF: supervisor read access in kernel mode
[408991.171559] #PF: error_code(0x0000) - not-present page
[408991.171561] PGD 0 P4D 0
[408991.171564] Oops: 0000 [#1] PREEMPT SMP NOPTI
[408991.171568] CPU: 0 PID: 3377 Comm: Xorg Tainted: P           O    T 5.8.8-gentoo #1
[408991.171569] Hardware name: Gigabyte Technology Co., Ltd. AB350M-D3H/AB350M-D3H-CF, BIOS F51c 07/02/2020
[408991.171593] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[408991.171601] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[408991.171603] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202
[408991.171606] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[408991.171608] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008
[408991.171610] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[408991.171611] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[408991.171613] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001
[408991.171616] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000
[408991.171618] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[408991.171620] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0
[408991.171622] Call Trace:
[408991.171641]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]
[408991.171655]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171660]  ? __alloc_pages_nodemask+0x11b/0x240
[408991.171674]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[408991.171678]  ? kmalloc_order+0x57/0x60
[408991.171693]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[408991.171708]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[408991.171723]  ? nvkms_ioctl_common+0x36/0x160 [nvidia_modeset]
[408991.171738]  ? nvkms_ioctl_common+0x124/0x160 [nvidia_modeset]
[408991.171911]  ? nvidia_frontend_unlocked_ioctl+0x2f/0x40 [nvidia]
[408991.171915]  ? ksys_ioctl+0x82/0xc0
[408991.171918]  ? __x64_sys_ioctl+0x11/0x20
[408991.171921]  ? do_syscall_64+0x3e/0xb0
[408991.171925]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[408991.171929] Modules linked in: ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter fuse nvidia_drm(PO) nvidia_modeset(PO) hid_logitech_hidpp nvidia(PO) input_leds hid_logitech_dj r8169 realtek libphy
[408991.171941] CR2: 0000000000007980
[408991.171944] ---[ end trace 816cbc84fb70ef20 ]---
[408991.171966] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[408991.171970] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[408991.171972] RSP: 0018:ffffb099811c3ce8 EFLAGS: 00010202
[408991.171974] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[408991.171975] RDX: ffff98aceb7e9348 RSI: 0000000000007980 RDI: ffff98ace71d1008
[408991.171977] RBP: 0000000000000000 R08: 0000000000000200 R09: 0000000000000000
[408991.171978] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[408991.171980] R13: 0000000000007980 R14: ffff98ace71d1008 R15: 0000000000000001
[408991.171982] FS:  00007f9eaf52d8c0(0000) GS:ffff98ad0e800000(0000) knlGS:0000000000000000
[408991.171984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[408991.171986] CR2: 0000000000007980 CR3: 00000003ef2be000 CR4: 00000000003406f0
[409016.172216] GpuWatchdog[4352]: segfault at 0 ip 000055e936015a02 sp 00007f004d0e2850 error 6 in chrome[55e93177b000+7bf3000]
[409016.172225] Code: 89 de e8 c1 8e 6f ff 80 7d c7 00 79 09 48 8b 7d b0 e8 42 e9 6b fe 41 8b 84 24 e0 00 00 00 89 45 b0 48 8d 7d b0 e8 ce df 9c fb <c7> 04 25 00 00 00 00 37 13 00 00 48 83 c4 48 5b 41 5c 41 5d 41 5e


Can anyone make sense of this? I'll pore over it after I get the PC rebooted.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Wed Sep 16, 2020 11:43 pm

I found a thread on the internet that implies that the first error message:

Code:
Xorg: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP),


Is caused by the nvidia driver crashing. If that's the case, then maybe an old driver would fix it, or perhaps the video card really is dying.
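
For reference, the numbers in that message can be decoded. order:5 is a request for 2^5 physically contiguous pages, which is 128 kB with 4 kB pages, and the per-zone free lists printed in the oops show how many free blocks of each size existed at that moment. A rough Python sketch, assuming 4 kB pages; the helper names are made up for illustration:

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the usual size on x86_64


def order_to_kb(order, page_kb=PAGE_KB):
    """Size in kB of a block of 2**order contiguous pages."""
    return (1 << order) * page_kb


def parse_buddy_line(line):
    """Turn 'Zone: N*SIZEkB ...' into {block_size_kb: free_block_count}."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def blocks_at_least(free, need_kb):
    """Count free blocks that are big enough for a need_kb request."""
    return sum(count for size, count in free.items() if size >= need_kb)


# The DMA32 free-list line copied from the oops above.
dma32 = ("DMA32: 26229*4kB (UME) 18227*8kB (UME) 17805*16kB (UME) "
         "2039*32kB (UME) 188*64kB (UME) 0*128kB 0*256kB 0*512kB "
         "0*1024kB 0*2048kB 0*4096kB = 612892kB")

need = order_to_kb(5)
print(need, blocks_at_least(parse_buddy_line(dma32), need))
```

This prints 128 and 0: DMA32 had about 600 MB free, but none of it in blocks of 128 kB or larger. The Normal zone line does list 3*128kB and 1*256kB, but they are tagged (H), which as far as I know marks the high-atomic reserve that an ordinary GFP_KERNEL request cannot use. That would be consistent with memory fragmentation rather than a shortage of free memory, although it does not by itself explain the page fault that follows in the driver.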

This crash didn't even happen under stress. I had just woken up the display from blanking when this crash happened.

The fact that reloading the driver didn't fix the problem before makes me think this is a hardware failure...

Dangit.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sat Sep 19, 2020 6:27 pm

Update: While emerging world today I had multiple oopses.

But in this case, the video did not crash.

Which makes me think it's not the video card after all...

Since it seems to be related to memory and I'm using a lot of SHM for zoneminder, I decided to turn the SHM size down since I was only using about 52% of it. I adjusted it from 12 GB to 8 GB.

I'll let this run for a while and see if things continue to crash.

This may be a memory issue. It could be the motherboard...
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 46349
Location: 56N 3W

Posted: Sat Sep 19, 2020 6:41 pm

RayDude,

Memtest86 may be your friend. You must boot into it.
molletts
Tux's lil' helper
Joined: 16 Feb 2013
Posts: 75

Posted: Sat Sep 19, 2020 7:56 pm

The video hang may be completely unrelated to the oops - it sounds very much like the occasional hangs I've been having for some months on my #2 system (also AMD-based, but much older than yours - a Phenom II X6 1090T), which I described here. It has an Nvidia GTX260 with the 340.x drivers.

I've yet to experience one on my #1 system (AMD FX9590), which has a GTX460 with the 390.x drivers.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 1:44 am

Thanks guys.

Neddy, I'll try memtest86 soon.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 3:44 am

Hi Neddy,

I ran memtest86+

It hangs at 35%. Or at least I think it does. The keyboard can't do anything. I tried pressing F1 at the beginning but it doesn't seem to make a difference.

The memory bandwidth is accurately reported. I tested 2133 MHz and 3200 MHz and the bandwidth went from 14 GB/s to 18 GB/s.

I'm not sure what it's supposed to look like when it runs so I'm going to install it on my laptop and observe it...

Edit: I tried it on two older intel laptops and it goes to a blank screen and doesn't run...

Am I doing something wrong?
s|mon
Apprentice
Joined: 04 Jul 2004
Posts: 157
Location: Bayern [de]

Posted: Sun Sep 20, 2020 9:23 am

Hi RayDude,

memtest usually keeps cycling through different patterns, so it never really finishes, iirc. But if it manages to run for a day or two without errors, chances are good that nothing is broken.
It should not hang, though (later cycles may take longer to progress, but it should not hang), so that would be one more point hinting at something memory (or board/CPU) related.
You can find screenshots and a description on its website or on Wikipedia to see what it should look like.

Did you try, as Neddy suggested, with one module at a time? Maybe only one has an issue. I'd start with a single one on default settings (safe settings, no OC) and let it run for a day or at least overnight.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 46349
Location: 56N 3W

Posted: Sun Sep 20, 2020 9:31 am

RayDude,

It should either report errors and attempt to continue or there should be visible signs of progress at the top of the screen.

It works by dividing RAM into sections and testing a section at a time.
It moves itself around in RAM too, so it can test the region it was once running from, rather like playing 'core wars' with itself.
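
As a toy illustration of the section-at-a-time idea (not memtest86+'s actual code, which runs on bare hardware), here is a Python sketch of a simple "moving inversions" style pass over a buffer: write a pattern, verify it while writing the complement, then verify the complement on the way back down. The StuckBitRAM class is a made-up stand-in for a faulty cell.

```python
class StuckBitRAM(bytearray):
    """Hypothetical faulty memory: bit 0 of one cell always stores 1."""
    stuck_at = None

    def __setitem__(self, index, value):
        if index == self.stuck_at:
            value |= 0x01
        super().__setitem__(index, value)


def moving_inversions(mem, start, end, pattern=0x55):
    """Test mem[start:end]; return offsets that failed to hold a value."""
    inverse = pattern ^ 0xFF
    bad = []
    for i in range(start, end):            # fill with the pattern
        mem[i] = pattern
    for i in range(start, end):            # going up: verify, write complement
        if mem[i] != pattern:
            bad.append(i)
        mem[i] = inverse
    for i in reversed(range(start, end)):  # going down: verify complement
        if mem[i] != inverse:
            bad.append(i)
        mem[i] = pattern
    return bad


def check_in_sections(mem, section_size):
    """Split the buffer into sections and test one section at a time."""
    failures = []
    for start in range(0, len(mem), section_size):
        end = min(start + section_size, len(mem))
        failures.extend(moving_inversions(mem, start, end))
    return failures


ram = StuckBitRAM(64)
ram.stuck_at = 37
print(check_in_sections(ram, 16))  # prints [37]
```

The stuck bit only shows up on the complement pass, which is why real testers use several patterns and directions rather than a single write and read.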

Try the enhanced Fail Safe Mode option (press F1 at startup).
Blank screen at boot may be a BIOS/UEFI issue.

That it was running and stalled at 35% sounds like a problem.
Some tests take much longer than others, but even on my Phenom II there are only a few seconds between screen updates.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 4:38 pm

Thanks guys.

The machine under test is always in use. It's our home server. It serves files to the family, runs security cameras, records from live TV. It's quite busy.

I hesitate to start monkeying with it.

While I was trying to get memtest86+ to run, I changed a few BIOS settings. Then, after I booted back into Gentoo, chrome would crash if I scrolled down to the bottom of this thread.

Performing a BIOS set optimized defaults fixed that issue.

It had been stable for so many years. Sigh.

I'll try pulling a memory module out and try memtest86+ again.

I tried memtest86+ on my new work laptop and it crashes at a black screen as well.

I can't imagine what is wrong with that...
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 5:09 pm

My motherboard sucks.

I can't use a USB keyboard to control the test. It doesn't work. CTRL-ALT-DEL works, but nothing else.

I used a PS2 keyboard and realized a few things.

1. With one RAM module (tried in three slots) it hangs at 65%.
2. With two RAM modules (in either set of slots) it hangs at 35%.
3. I ran the test on all cores and the test gets to pass two and crashes at 5% on CORE0. It did this twice.

This makes me think that CORE 0 is bad. Is there a way to disable core0? I'm googling that next. 10 threads would be plenty until Zen3 comes out.

I'm also going to find out what the warranty is on this processor. I doubt it's five years, but hey, maybe.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 5:15 pm

Argh. Can't disable core0 in linux...
Tony0945
Advocate
Joined: 25 Jul 2006
Posts: 4142
Location: Illinois, USA

Posted: Sun Sep 20, 2020 5:39 pm

RayDude wrote:
1. With one RAM module (tried in three slots) it hangs at 65%.
2. With two RAM modules (in either set of slots) it hangs at 35%.

Usually one stick can only go in one slot and two sticks must go in a particular pair of slots. Check your motherboard manual. If you can't find it, they are usually available for download on the OEM's site.
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Sun Sep 20, 2020 6:04 pm

I tried all sorts of BIOS options.

I tried 4 cores with SMT off, two cores with SMT off.

I disabled advanced power control, cache, etc.

I turned the clock frequency down to 2 GHz, then 1.6 GHz.

No matter what I did, memtest86+ failed on core 0.

Core 0 is always active no matter how many cores you disable.

Assuming memtest86+ is good, the problem is core0.

What a bummer.
Tony0945
Advocate
Joined: 25 Jul 2006
Posts: 4142
Location: Illinois, USA

Posted: Sun Sep 20, 2020 10:04 pm

What socket? Good chance of getting a better CPU cheap.
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 46349
Location: 56N 3W

Posted: Sun Sep 20, 2020 11:11 pm

RayDude,

I think I've seen disabling Core 0 in very new kernels.

Code:
 
  ┌──────────────────────── Debug CPU0 hotplug ────────────────────────┐
  │ CONFIG_DEBUG_HOTPLUG_CPU0:                                         │ 
  │                                                                    │ 
  │ Enabling this option offlines CPU0 (if CPU0 can be offlined) as    │ 
  │ soon as possible and boots up userspace with CPU0 offlined. User   │ 
  │ can online CPU0 back after boot time.                              │ 
  │                                                                    │ 
  │ To debug CPU0 hotplug, you need to enable CPU0 offline/online      │ 
  │ feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during  │ 
  │ compilation or giving cpu0_hotplug kernel parameter at boot.       │ 


From that I understand that you can bring up the box normally with CPU0 offline if the CPU supports it.
CPU0 is always used to start the other CPUs, so it's not useful to turn it off in the firmware.

-- edit --

Can you test the RAM in another system?
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 3:50 pm

Tony0945 wrote:
What socket? Good chance of getting a better CPU cheap.


Socket AM4. I was hoping to hold off until Zen3 was out and a bit mature...

I received this Ryzen 5 1600 from AMD directly as the one I purchased was affected by the "linux" bug. I've contacted AMD tech support to see what kind of warranty it has...

*crosses fingers*
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 4:05 pm

NeddySeagoon wrote:
RayDude,

I think I've seen disabling Core 0 in very new kernels.

Code:
 
  ┌──────────────────────── Debug CPU0 hotplug ────────────────────────┐
  │ CONFIG_DEBUG_HOTPLUG_CPU0:                                         │ 
  │                                                                    │ 
  │ Enabling this option offlines CPU0 (if CPU0 can be offlined) as    │ 
  │ soon as possible and boots up userspace with CPU0 offlined. User   │ 
  │ can online CPU0 back after boot time.                              │ 
  │                                                                    │ 
  │ To debug CPU0 hotplug, you need to enable CPU0 offline/online      │ 
  │ feature by either turning on CONFIG_BOOTPARAM_HOTPLUG_CPU0 during  │ 
  │ compilation or giving cpu0_hotplug kernel parameter at boot.       │ 


This is really cool. Let me see if I can figure out how to enable it. If nothing else, perhaps it will make the system stable for a while.

I enabled it in my kernel and will reboot momentarily. I'll post an update as soon as it's up to let you know if it was able to offline cpu0.

NeddySeagoon wrote:
From that I understand that you can bring up the box normally with CPU0 offline if the CPU supports it.
CPU0 is always used to start the other CPUs, so it's not useful to turn it off in the firmware.

-- edit --

Can you test the RAM in another system?


My son's gaming PC is a Ryzen 5 3600+ we got him last year. He's using it for school, but perhaps tonight I'll be able to swap the RAMs and see how they do.

Both kits are 3600 MHz memory, although mine is Samsung B-die, so mine should perform better in his system than his own does.

If memtest86+ works on my system with his ram then we have at least narrowed it down.

Thanks again for the great ideas!
RayDude
Veteran
Joined: 29 May 2004
Posts: 1725
Location: San Jose, CA

Posted: Mon Sep 21, 2020 4:48 pm

Ah well. It looks like CPU0 can't be off-lined.

It is interesting to note that when I offline CPU1 (second thread of CPU0, I think), htop still shows activity on that CPU. So I wonder if offlining really works at all.

[update: htop is accurate, one core has completely disappeared. I saw the top CPU was 11 and forgot that htop numbers them from 1 to 12]

I tried the (c)onfiguration options of memtest86+ to see if there was a way to ignore cpu0, but there wasn't. I ran another test and this time a few seconds into the test the machine rebooted.

I'll try swapping my son's memory with mine this afternoon. Once he's out of school.

Oh, and if the issue is third level CCX cache, then all the CPUs on the first CCX will have problems. And it seems like the system is using all 4 of the CPUs on that CCX, so I'd need to disable them all.

If however the issue is with primary or secondary cache, then disabling a single CPU (and its second thread) might solve the problem...
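
To check which logical CPUs are the two threads of one core, the kernel exposes each CPU's siblings in sysfs as a "cpulist" string such as 0-1 or 0,6. A small Python sketch that parses that format; the function names are made up, and the htop helper just reflects the numbering mix-up mentioned above (htop counts from 1 by default, the kernel from 0).

```python
from pathlib import Path


def parse_cpulist(text):
    """Parse kernel cpulist syntax ('0-1', '0,6', '0-2,8') into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def thread_siblings(cpu, sysfs=Path("/sys/devices/system/cpu")):
    """Logical CPUs that share a physical core with `cpu` (kernel numbering)."""
    path = sysfs / f"cpu{cpu}" / "topology" / "thread_siblings_list"
    return parse_cpulist(path.read_text())


def htop_label(kernel_cpu):
    """Label htop shows by default for a kernel CPU number."""
    return kernel_cpu + 1


# Example on a running Linux system: print(thread_siblings(0))
```

An individual logical CPU can then be taken offline by writing 0 to /sys/devices/system/cpu/cpuN/online as root, for any N where that file exists.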


Last edited by RayDude on Mon Sep 21, 2020 11:05 pm; edited 1 time in total
NeddySeagoon
Administrator
Joined: 05 Jul 2003
Posts: 46349
Location: 56N 3W

Posted: Mon Sep 21, 2020 5:23 pm

RayDude,

You will lose your mind :)

At least you have a testbed, that makes life easier.

Here's a nasty thought, or a series of related nasty thoughts ...
The low voltages (below 3.3v) required to operate the RAM and CPU are derived from the Auxiliary 12v connector to the motherboard.
It's 4, 6 or 8 pins, whatever, often not enough pins. As a result, it gets hot and goes high resistance. It's always the sockets on the cable that suffer, as the pins on the motherboard are soldered to 2oz copper power planes.
All is well until there is a big gulp of power and one or more low voltage supplies goes out of tolerance.
A lot of the 12v is dropped across the connector ...

A worked example may help. To keep the arithmetic simple, lets say that the CPU needs 120W flat out.
That's 10A through that connector. That's OK when its nice and shiny.

Now suppose that the contact resistance increases from very little to 0.1 Ohm. That's still not a lot but it costs 1v loss at the connector.
10A and 1v is 10W dissipated in the connector itself ... OK, so the connector fails well before the contact resistance gets to 0.1 Ohm. :)
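
The arithmetic can be checked with Ohm's law: the drop across the contact is V = I*R and the heat dissipated in it is P = I^2*R. A small Python sketch using the 120 W at 12 V figures from the example; the resistance values other than 0.1 Ohm are arbitrary illustrations.

```python
def contact_loss(current_a, resistance_ohm):
    """Return (voltage drop in V, heat in W) for a resistive contact."""
    drop_v = current_a * resistance_ohm
    heat_w = current_a ** 2 * resistance_ohm
    return drop_v, heat_w


supply_v = 12.0
load_w = 120.0
current_a = load_w / supply_v  # 10 A through the connector

for r in (0.001, 0.01, 0.1):
    drop, heat = contact_loss(current_a, r)
    print(f"R={r} Ohm: drop {drop:.2f} V, heat {heat:.2f} W")
```

At 0.1 Ohm that is a 1 V drop and 10 W of heat in the connector, and because the heat grows with the square of the current, short bursts of heavy load hit a marginal contact hardest.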

Long story short. I've had a few instances like this, it's worth 'wiping' the contacts by unplugging and replugging the connector two or three times.
Make sure the gold is still there while its apart.
'Wiping' the pins will fix it for 6 to 9 months.

Further down in the converter, the input and output capacitors get a very hard life. They used to be aluminium electrolytics and failures were easy to spot.
The tops would dome, the rubber bungs would push out of the bottom and allow the electrolyte to leak out.
I don't know what you have on your motherboard, but a failure of one or more of these has the same effect.

Here's a test. The idea is to reduce the load on the 12v Aux.
In the BIOS, turn off as many real CPU cores as you dare, then run memtest86.
If this works, it points the finger at the motherboard dynamic voltage regulation, but the main PSU, which provides the 12v, is not in the clear yet either.
All times are GMT
Page 1 of 3