Gentoo Forums
Help Debugging Hardware Failure? [solved]

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Fri Sep 25, 2020 12:14 pm    Post subject: Help Debugging Hardware Failure? [solved]

Hello All,
I realize this isn't necessarily the ideal forum for this, but thought I'd try anyway. I just built a system that keeps hanging on me and I'm trying to debug the problem. Here's some pertinent data:

  • I ran memtest86 from a USB stick for 1 pass (about 2 days); no problems showed up.
  • I ran stress on the machine for about 10 minutes, no issues.
  • When I try to emerge world, the machine inevitably dies within 30 minutes.
  • Last time I tried, I had sensors running in a loop; CPU temps were all below 60 degrees Celsius.
  • The only possibly related message I see in /var/log/messages is:
    Quote:
    alaya kernel: [ 1219.857741] kworker/dying (203) used greatest stack depth: 12288 bytes left
    but I don't know if this is relevant.
  • The system was actually dying when I was trying to finish up the stage3 install, i.e. when compiling grub2 while the system was mounted from a minimal bootdisk. I got around this by removing all but one of the RAM sticks, and then it compiled o.k. But then I assumed the RAM wasn't the issue because of the memtest.
  • Some relevant hardware specs: Xeon W-2295 CPU, Asus WS C422 Pro/SE Mobo, 512G RAM, Nvidia 1080Ti GPU, Samsung 970 EVO SSD, Corsair 1200W PSU


So my guess is the issue is the CPU, the motherboard or the RAM, but I'm not sure how to narrow it down. Any tips? As I write this I'm thinking I should maybe remove a bunch of the RAM sticks and try emerge world again. Then maybe trying memtest86+?

Can I rule any of those 3 out? Anybody have educated guesses as to where I should focus my efforts? Should I run stress for longer?


Last edited by justin_brody on Wed Oct 07, 2020 4:39 pm; edited 1 time in total

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Fri Sep 25, 2020 4:45 pm

justin_brody,

We have a few retired hardware guys/girls here and some not yet retired too, so this will do. :)

To save us all a lot of typing, follow the advice in "Method to test hardware functionality of crashing system?". While the things you can try are already in that topic, the questions you need to ask may be different.

Follow that topic for investigation ideas but post your questions and findings here.
They may reach a different conclusion to that topic.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Fri Sep 25, 2020 5:18 pm

Thanks so much NeddySeagoon! I think you actually helped me diagnose hardware failures on the last system I built, so that's probably why I had the idea this was a good place. Plus, this feels more like home than anywhere else.

I will go through the link as you suggested and post updates; thanks again!

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 12:47 pm

Hi Mika,

Thanks for that tip! I'm hoping I can figure out my issues without having to get that sophisticated, but it might be fun to play with if I need to :)


Some updates on my debugging efforts, based on the post NeddySeagoon pointed to:

  • smartctl -x reports no errors on the SSD
  • I spent a fair amount of time swapping RAM modules in and out. With just one module, I was able to emerge chromium without any issues. I tried this with each of the 8 modules; everything seemed o.k. I read somewhere that this might indicate the DDR needed higher voltage, so I manually set it to 1.4V everywhere. This caused the machine to hang partway through the Gentoo boot-up. I set it back to [auto] and the system seemed more stable; I was able to run it (with all 8 RAM modules) for a few hours, but then it hung again.
  • As I mentioned in the (now old) original post, I ran memtest86 for a couple of days. It completed the first cycle without any trouble.
  • BIOS is updated to the latest; June of 2020 as I recall.
  • Right now, lm_sensors is only giving me CPU temps and they always look o.k. (< 60 Celsius even during heavy compilation). It is *not* giving me motherboard temps; I'll see if there's something I need to do on the motherboard to get those reported.
  • Overall, this thing produces a lot of heat. But with this CPU and the GPU I would think that's to be expected. When I originally pulled the RAM modules out they were hot to touch; is that normal? For what it's worth the modules don't have any kind of heat spreaders or anything else on them. Is there a way to tell if they're simply overheating?


My guess right now is that it's something with the RAM; but I'm not sure how to verify. Should I try running it for a few days with only half the RAM in and see if it's more stable? Any advice will be much appreciated!

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 12:55 pm

OK, running stress --vm 20 seems to hang the machine very quickly. Not usually happy to see my computer crash :)

So this sounds like a memory issue. How do I debug it?

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 1:18 pm

Took all but one module out, now I can run stress --vm 20 without issue.

Given that memtest86 reported nothing and that I was able to emerge chromium with each individual module, I'm guessing this is some kind of motherboard-related issue? Voltages? Power? Heat?

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 1:23 pm

In case it's relevant, the modules are all the same make and model but were packaged as 8 individual sticks rather than being packaged as a kit. I've seen people say that can be an issue. I'm also noticing that the response time of the machine seems better with a single stick than it did with 8.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Thu Oct 01, 2020 1:37 pm

justin_brody,

stress stresses lots of things. The memory cannot be used without the memory controller, which is a part of the CPU.
The CPU communicates with the RAM using the motherboard.
The whole thing requires power.

Do a binary search on your RAM. Remove half of it and leave half fitted to test.
Identify the sticks; you need to be able to tell them apart later.
If that fails, the defect is still present, so swap the fitted sticks with the ones you removed.
If that fails, it's probably not RAM.
If that works, you have likely removed a faulty RAM stick.
However, it could be an address bit. With half the RAM removed, you are using less of the address bus.

With 4 sticks you test 2 then 1 .. and so on.
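
For anyone who wants to keep notes while doing this, the halving procedure above can be sketched roughly as below. This is only an illustration, not a tool: the function name bisect_sticks, the verdict strings and the survives callback are all made up for the example. The callback stands in for the manual step (fit only those sticks, run the stress test, report whether the machine stayed up). Unlike the by-hand version, the sketch always tries both halves, which costs a few extra runs but makes the "probably not RAM" and "no single stick" outcomes explicit.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical verdict labels, invented for this sketch.
SINGLE_SUSPECT = "single suspect stick"
NOT_RAM = "both halves fail: probably not RAM"
NO_SINGLE_STICK = "both halves pass: no single stick to blame"


def bisect_sticks(
    sticks: Sequence[str],
    survives: Callable[[Sequence[str]], bool],
) -> Tuple[str, List[str]]:
    """Halve a failing set of sticks until one suspect is left.

    survives(fitted) represents a manual test: fit only those sticks,
    run the stress test, and return True if the machine did not hang.
    """
    suspects = list(sticks)
    while len(suspects) > 1:
        half = len(suspects) // 2
        group_a, group_b = suspects[:half], suspects[half:]
        a_ok = survives(group_a)
        b_ok = survives(group_b)
        if a_ok and b_ok:
            # The fault only appears with more sticks fitted
            # (address bit, loading, heat, ...).
            return NO_SINGLE_STICK, suspects
        if not a_ok and not b_ok:
            # The defect is present whichever half is fitted.
            return NOT_RAM, suspects
        # Exactly one half fails: keep searching inside it.
        suspects = group_b if a_ok else group_a
    return SINGLE_SUSPECT, suspects


if __name__ == "__main__":
    labels = [f"DIMM{i}" for i in range(1, 9)]
    # Pretend DIMM3 is faulty: any set containing it hangs.
    print(bisect_sticks(labels, lambda fitted: "DIMM3" not in fitted))
```

Labelling the physical sticks to match the names you pass in is the important part; the code just keeps the bookkeeping straight.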

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 1:38 pm

Oh, that makes very good sense -- thanks! I'll give it a shot and report back.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 3:11 pm

O.k., I tried what you suggested and both sets of 4 work. Tried adding in two more and I was able to run stress --vm 20 with both pairs from the other 4 (10 minute run). In the process of looking up which sockets to put the sticks in, I saw this in the manual:
Quote:
For system stability, use a more efficient memory cooling system to support a full memory load (8 DIMMS)


So I think that's my answer :)

Now I have to figure out what that looks like; but I'm optimistically thinking that none of my hardware is bad, which is very good news. I think I can somehow live on 384GB of RAM for a bit.

Thanks again for the helpful advice NeddySeagoon! Unless you tell me I'm not done I'll go ahead and mark this [solved].

Or perhaps not??? Other info online seems to indicate that RAM overheating isn't a real issue. Perhaps it's just something in motherboard/cpu on the last two sockets? But wouldn't that show up more reliably???

x90e
n00b

Joined: 30 Sep 2020
Posts: 38

Posted: Thu Oct 01, 2020 5:10 pm

RAM overheating can definitely be a thing. High speeds and full slots when you've got 8 DIMMs, plus a full RAM load, are more stressful on any memory controller. For example, with the memory controller on a 2700X CPU and a B450-F motherboard (BIOS 3013), I can run 2 sticks of DDR4 at 3600, but with all four slots full I'm lucky to get 2933 with single-rank DIMMs. With all four slots full of dual-rank DIMMs, I am looking at only being able to run at 2133 MHz. Maybe 2400 MHz, but then I start getting random errors and crashes, causing all sorts of system stability issues.

They make coolers with fans that attach over your memory sticks. You might be able to find one that fits in your case. I'm not sure how it works with 8 DIMMs, but you can't be the first person to have this problem.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Thu Oct 01, 2020 7:16 pm

justin_brody,

Temperature matters. It's one of a whole range of physical properties that affect the performance of electronics.

The hotter the silicon, the higher the leakage currents and the harder the memory controller has to work.
The more DRAMs you fit, the higher the stray (parasitic) capacitance that the RAM presents to the DRAM controller, so the slower the signals to the RAM must be.
Even though motherboard manufacturers go to great lengths to make all the traces transmission lines of the same length (yes, the wavy tracks are there for a reason), the more the stray capacitance, the slower the RAMs can be driven.

The stray capacitance even has a temperature coefficient. It gets worse with rising temperature.

Temperature is about the only thing you can usefully influence.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 10:28 pm

O.k. -- thanks so much! I will look into cooling things down!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Thu Oct 01, 2020 10:30 pm

justin_brody,

You may find that your DRAMs have a temperature sensor, so you can actually measure your success.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Thu Oct 01, 2020 10:33 pm

Oh, very interesting! Is that something I should be able to pick up with lm_sensors if it's configured correctly?

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Thu Oct 01, 2020 10:41 pm

justin_brody,

If it's there and you have kernel support, yes.

figueroa
l33t

Joined: 14 Aug 2005
Posts: 749
Location: Lower right-hand corner USA

Posted: Fri Oct 02, 2020 2:47 am

Be sure your CPU cooler is mounted properly on the chip. Thin coat of thermal paste, cooler securely mounted.

Put a digital thermometer inside your box.

Take the cover off and aim a fan at the hardware. This last trick has helped a lot of machines get through stressful processes. If it works, then you know you have a heat problem.

Are you sure about the power supply? (Swap it out as a test.)
_________________
Andy Figueroa
andy@andyfigueroa.net Working with Unix since 1983.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Fri Oct 02, 2020 10:05 am

figueroa,

All good stuff.

As the silicon temperature increases, it takes more power, which makes it get hotter ...
Luckily, silicon does not exhibit 'thermal runaway', not at useful temperatures anyway, unlike germanium.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Fri Oct 02, 2020 11:31 am

I was going to ask about blowing a fan into the case for testing this out :)

The BIOS is reporting stats (like motherboard temperature) that I'm not getting through lm_sensors, but that's probably another thread if I can't figure it out. Otherwise I'll see if I can borrow a good thermometer from someone.

Thanks for the tips!

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Fri Oct 02, 2020 11:35 am

justin_brody,

Set the fan up first and observe the effects, if any.
It will be easy enough to make temperature measurements later.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Mon Oct 05, 2020 2:42 pm

O.k., that turned out to be pretty illuminating! Ran the machine with 6 DIMMs in over the weekend, doing some moderately heavy work. No problems. Put the last two in this morning and ran with a big box fan blowing into it. It stayed cool but still died after around 18 minutes of stress --vm 20.

I was saving lm_sensors data into a file every 10 seconds; here's the last reading:
Quote:

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +51.0°C (high = +76.0°C, crit = +86.0°C)
Core 0: +40.0°C (high = +76.0°C, crit = +86.0°C)
Core 1: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 2: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 3: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 4: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 8: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 9: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 10: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 11: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 16: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 17: +45.0°C (high = +76.0°C, crit = +86.0°C)
Core 18: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 19: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 20: +48.0°C (high = +76.0°C, crit = +86.0°C)
Core 24: +51.0°C (high = +76.0°C, crit = +86.0°C)
Core 25: +49.0°C (high = +76.0°C, crit = +86.0°C)
Core 26: +50.0°C (high = +76.0°C, crit = +86.0°C)
Core 27: +47.0°C (high = +76.0°C, crit = +86.0°C)

nct6796-isa-0290
Adapter: ISA adapter
Vcore: 896.00 mV (min = +0.00 V, max = +1.74 V)
in1: 1000.00 mV (min = +0.00 V, max = +0.00 V) ALARM
AVCC: 3.39 V (min = +0.00 V, max = +0.00 V) ALARM
+3.3V: 3.26 V (min = +0.00 V, max = +0.00 V) ALARM
in4: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in5: 0.00 V (min = +0.00 V, max = +0.00 V)
in6: 592.00 mV (min = +0.00 V, max = +0.00 V) ALARM
3VSB: 3.39 V (min = +0.00 V, max = +0.00 V) ALARM
Vbat: 3.10 V (min = +0.00 V, max = +0.00 V) ALARM
in9: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in10: 600.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in11: 440.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in12: 1.01 V (min = +0.00 V, max = +0.00 V) ALARM
in13: 0.00 V (min = +0.00 V, max = +0.00 V)
in14: 512.00 mV (min = +0.00 V, max = +0.00 V) ALARM
fan1: 0 RPM (min = 0 RPM)
fan2: 0 RPM (min = 0 RPM)
fan3: 0 RPM (min = 0 RPM)
fan4: 0 RPM (min = 0 RPM)
fan5: 0 RPM (min = 0 RPM)
fan6: 0 RPM (min = 0 RPM)
fan7: 0 RPM (min = 0 RPM)
SYSTIN: +30.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
CPUTIN: +46.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
AUXTIN0: +50.5°C sensor = thermistor
AUXTIN1: +50.0°C sensor = thermistor
AUXTIN2: +63.0°C sensor = thermistor
AUXTIN3: +57.0°C sensor = thermistor
PECI Agent 0 Calibration: +11.0°C
PCH_CHIP_CPU_MAX_TEMP: +0.0°C
PCH_CHIP_TEMP: +0.0°C
PCH_CPU_TEMP: +0.0°C
intrusion0: ALARM
intrusion1: ALARM
beep_enable: disabled


From what I could tell, nothing looked significantly different from the readings with the machine having 6 DIMMs in (I can post that as well if it would be helpful).
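
In case it helps anyone doing the same comparison, here is a rough sketch of how two saved readings could be compared instead of eyeballing them. It assumes the text looks like the sensors output quoted above (a chip name line, then "label: value" lines). The names parse_temps and biggest_changes, and the 5 degree threshold, are just choices made for this example and are not part of lm_sensors.

```python
import re

# Matches lines such as "Core 3: +50.0°C (high = ...)" and captures
# the label and the first temperature on the line.
TEMP_LINE = re.compile(r"^(?P<label>[^:]+):\s*\+?(?P<value>-?\d+(?:\.\d+)?)°C")


def parse_temps(sensors_text):
    """Return {"chip/label": degrees_celsius} for every temperature line."""
    temps = {}
    chip = None
    for raw in sensors_text.splitlines():
        line = raw.strip()
        if not line:
            chip = None          # a blank line ends the current chip block
            continue
        if ":" not in line:
            chip = line          # e.g. "coretemp-isa-0000"
            continue
        match = TEMP_LINE.match(line)
        if match and chip:
            temps[f"{chip}/{match['label'].strip()}"] = float(match["value"])
    return temps


def biggest_changes(before, after, threshold=5.0):
    """Return the sensors whose reading moved by at least `threshold` degrees."""
    changes = {}
    for key in before.keys() & after.keys():
        delta = after[key] - before[key]
        if abs(delta) >= threshold:
            changes[key] = delta
    return changes


if __name__ == "__main__":
    sample = (
        "coretemp-isa-0000\n"
        "Adapter: ISA adapter\n"
        "Package id 0: +51.0°C (high = +76.0°C, crit = +86.0°C)\n"
        "\n"
        "nct6796-isa-0290\n"
        "Adapter: ISA adapter\n"
        "Vcore: 896.00 mV (min = +0.00 V, max = +1.74 V)\n"
        "AUXTIN2: +63.0°C sensor = thermistor\n"
    )
    print(parse_temps(sample))
```

Voltage and fan lines are skipped on purpose; only lines whose first value is in °C are kept.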

So is this looking like a motherboard or CPU issue?

I will note that I only stressed the RAM modules for about 10 minutes earlier; so perhaps I should try again with 20 minutes and the last two modules swapped in?

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Mon Oct 05, 2020 3:01 pm

justin_brody,

Is it OK with any 6 from 8 DIMMs or is it a particular pair?

Setting up the DIMM drives is a bit hit and miss.
The BIOS reads settings from the SPD ROM on the DIMM. Some brain dead systems only read one, rather than using the slowest of all fitted DIMMs.
That's just for starters, now the BIOS does trial and error to establish limits.
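
To illustrate the "slowest of all fitted DIMMs" point, here is a toy sketch. Everything in it is an assumption made for illustration: the DimmProfile fields, the slot names and the numbers are invented, and a real BIOS looks at far more parameters and then goes on to do its own training. The idea is only that the common setting is the lowest rated speed and the longest latency across the whole set, not whatever the first stick reports.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class DimmProfile:
    """Hypothetical subset of what a DIMM reports about itself."""
    slot: str
    max_mts: int    # highest rated transfer rate, in MT/s
    cas_ns: float   # minimum CAS latency, in nanoseconds


def common_settings(profiles: Iterable[DimmProfile]) -> Tuple[int, float]:
    """Pick a speed and latency that every fitted DIMM can meet."""
    fitted = list(profiles)
    if not fitted:
        raise ValueError("no DIMMs fitted")
    speed = min(p.max_mts for p in fitted)     # slowest stick sets the speed
    latency = max(p.cas_ns for p in fitted)    # longest latency sets the timing
    return speed, latency


if __name__ == "__main__":
    # Invented example values, not taken from any real module.
    fitted = [
        DimmProfile("A1", 2933, 13.75),
        DimmProfile("B1", 2666, 13.50),
    ]
    print(common_settings(fitted))
```

A system that only reads one DIMM would, in this toy example, try to run the B1 stick faster than it is rated for.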

It's not just timings, it's signal drive strength too.
Not strong enough, and it won't work because it's too slow. Too strong, and despite the transmission line traces on the PCB, you get overshoot and again it won't work.
It's all very much like Baby Bear's porridge[1].

[1]Search for Goldilocks and the Three Bears if you don't know the story.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Mon Oct 05, 2020 4:14 pm

Hi NeddySeagoon,

So I just swapped the 2 DIMMs that had been left out back in and was able to run stress for over an hour without any issues.

It sounds like your suggestion is that I play around with the voltage and speed settings in the BIOS? Since it's DDR4, can I just set all the voltages manually to 1.2V?

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 46272
Location: 56N 3W

Posted: Mon Oct 05, 2020 8:25 pm

justin_brody,

It's much more complex than the DRAM power supply voltage.
The DRAM timings and signal drive strengths matter too. That's a lot of things to play with.

Personally, I would just leave two sticks of RAM out and call it a day.
Overvolting can destroy your RAM. If you are lucky, the damage will stop there.

justin_brody
Apprentice

Joined: 26 Jan 2005
Posts: 261

Posted: Wed Oct 07, 2020 3:07 pm

Understood -- thanks again NeddySeagoon. I think I'm going to RMA the motherboard; very much appreciate the help!!!