Gentoo Forums
AMD Zen/Ryzen thread

RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Thu Jun 08, 2017 9:38 pm

I'm curious about the Gentooers who are getting random segfaults with their Ryzen.

First, are there any Gentooers getting random segfaults with their Ryzen?

If you stop overclocking and run your DDR at 2133, do the segfaults go away?

I'm curious if anyone has made any attempts to isolate the failures.

If you are running stock CPU and memory frequencies and voltages and still get segfaults, can you please post your system profile?

CPU; memory size, make, and model; motherboard make and model; PSU wattage, make, and model; GPU make and model; boot drive make, model, and interface (i.e. SATA, NVMe, etc.).

I'd also be interested in any data you have on the segfaults. I'm curious where they're happening: the core number, the thread number, the address, the software that crashed, etc.
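(For reference, the kernel already logs most of those details when a process segfaults - the faulting address, instruction pointer, and offending binary - assuming the default exception-trace behaviour, so something like this collects them:)

Code:
# A typical kernel log line looks like (exact format varies by kernel version):
#   cc1[12345]: segfault at 7f1c2a3b4000 ip 00007f1c29e0f000 sp 00007ffd2e1c0000 error 4 in cc1[...]
dmesg | grep -i segfault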

I'm a hardware engineer and this daunting task seems really interesting. There might be enough of us to see a pattern emerge. I'm sure AMD is looking at this, but they don't seem to have anything outward-facing, so I'm wondering if we might be of some help.

My Ryzen 5 1600 is absolutely stable. I'm beating the crap out of it and it has never had random segfaults (at 3800 MHz or below). I think it would be really interesting to be able to isolate the issue to a single core that happens to be disabled on my CPU, which is why I decided to write this post.

I'm also wondering if it would be useful to use a single benchmark, say building the Linux kernel, as the way to test the issue. We should be able to overclock the memory/CPU, build the kernel, and record the segfault debug information to figure out what kinds of segfaults happen during overclocks. This may help us isolate the failure modes that are recorded on the stock parts.
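(A minimal sketch of that idea; the tree path, job count, and log names are assumptions, and the dmesg grep relies on the kernel's default segfault logging:)

Code:
#!/bin/bash
# Repeatedly build a configured kernel tree, logging each failure together
# with the kernel's segfault messages. Assumes a ready tree in /usr/src/linux.
cd /usr/src/linux || exit 1
for run in $(seq 1 10); do
  make -s clean
  if ! make -j12 > /dev/null 2> "build-${run}.err"; then
    echo "run ${run} failed at $(date)" >> failures.log
    dmesg | grep -i segfault >> failures.log
  fi
done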

Anyway, anyone interested in doing a little research for science and AMD?
_________________
Some day there will only be free software.
Chewi
Developer


Joined: 01 Sep 2003
Posts: 881
Location: Edinburgh, Scotland

Posted: Thu Jun 08, 2017 10:39 pm

RayDude, if you've been following recent developments, you'll see that everyone who reported segfaults and tried disabling ASLR has found that it fixes the problem. This follows from a discovery by Matt Dillon of DragonflyBSD that the hardware has some kind of bug.

Until yesterday, I hadn't observed any segfaults, just freezes. I'm not 100% sure yet, but it looks like the kernel option that makes the freezes magically disappear is NUMA. I have never enabled this in my own kernels, but Fedora enables it by default. I'm now on a record-breaking 34-hour uptime with my own configuration plus NUMA. I'm building a Fedora kernel without NUMA as I type, to try and confirm this finding.

What's interesting is that once I enabled NUMA, I experienced a segfault as soon as I tried to compile something (Mesa) and that went away after disabling ASLR. Despite the freezes, compiling worked fine before. I will also investigate this finding further.
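(For anyone who wants to test the ASLR angle themselves: it can be toggled at runtime, no rebuild needed; 0 disables it and 2 is the usual default:)

Code:
# Check the current setting:
cat /proc/sys/kernel/randomize_va_space
# Disable ASLR until the next write or reboot:
echo 0 > /proc/sys/kernel/randomize_va_space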
Naib
Watchman


Joined: 21 May 2004
Posts: 5896
Location: Removed by Neddy

Posted: Thu Jun 08, 2017 11:02 pm

I am tempted to run the stress command in 2 GiB chunks of RAM to see if there are any issues.
I am either very lucky with the hardware or lucky with the configuration to not experience crashes (and I have emerged three sets of mesa in parallel, three times in a row, to check...).

That's why I set up the Google Doc thing, to see if there is some form of commonality in the issues (none so far...). From a hardware perspective, everything points towards Ryzen being very RAM-sensitive, and as a hardware engineer, track impedance is a real annoyance: cheaper mobos skimp on subtleties like impedance matching (those annoying wavy traces...). BUT crashes have been reported on both top-of-the-line and bargain-basement boards.


--edit--
I don't know if these were suitable options, but as a first run...

Code:
 stress-ng --cpu 12 --affinity-rand -vm 12 --vm-bytes 1G -t 30m
stress-ng: debug: [31292] 12 processors online, 12 processors configured
stress-ng: info:  [31292] dispatching hogs: 12 cpu, 12 vm
stress-ng: info:  [31292] cache allocate: default cache size: 8192K
stress-ng: debug: [31292] starting stressors
stress-ng: debug: [31294] stress-ng-cpu: started [31294] (instance 0)
stress-ng: debug: [31295] stress-ng-vm: started [31295] (instance 0)
stress-ng: debug: [31296] stress-ng-cpu: started [31296] (instance 1)
stress-ng: debug: [31297] stress-ng-vm: started [31297] (instance 1)
stress-ng: debug: [31298] stress-ng-cpu: started [31298] (instance 2)
stress-ng: debug: [31300] stress-ng-vm: started [31300] (instance 2)
stress-ng: debug: [31302] stress-ng-cpu: started [31302] (instance 3)
stress-ng: debug: [31303] stress-ng-vm: started [31303] (instance 3)
stress-ng: debug: [31304] stress-ng-cpu: started [31304] (instance 4)
stress-ng: debug: [31306] stress-ng-vm: started [31306] (instance 4)
stress-ng: debug: [31309] stress-ng-vm: started [31309] (instance 5)
stress-ng: debug: [31311] stress-ng-cpu: started [31311] (instance 6)
stress-ng: debug: [31292] 24 stressors spawned
stress-ng: debug: [31323] stress-ng-vm: started [31323] (instance 11)
stress-ng: debug: [31317] stress-ng-vm: started [31317] (instance 8)
stress-ng: debug: [31308] stress-ng-cpu: started [31308] (instance 5)
stress-ng: debug: [31318] stress-ng-cpu: started [31318] (instance 9)
stress-ng: debug: [31320] stress-ng-cpu: started [31320] (instance 10)
stress-ng: debug: [31312] stress-ng-vm: started [31312] (instance 6)
stress-ng: debug: [31321] stress-ng-vm: started [31321] (instance 10)
stress-ng: debug: [31319] stress-ng-vm: started [31319] (instance 9)
stress-ng: debug: [31322] stress-ng-cpu: started [31322] (instance 11)
stress-ng: debug: [31315] stress-ng-vm: started [31315] (instance 7)
stress-ng: debug: [31316] stress-ng-cpu: started [31316] (instance 8)
stress-ng: debug: [31313] stress-ng-cpu: started [31313] (instance 7)
stress-ng: debug: [31294] stress-ng-cpu: exited [31294] (instance 0)
stress-ng: debug: [31292] process [31294] terminated
stress-ng: debug: [31311] stress-ng-cpu: exited [31311] (instance 6)
stress-ng: debug: [31318] stress-ng-cpu: exited [31318] (instance 9)
stress-ng: debug: [31322] stress-ng-cpu: exited [31322] (instance 11)
stress-ng: debug: [31313] stress-ng-cpu: exited [31313] (instance 7)
stress-ng: debug: [31304] stress-ng-cpu: exited [31304] (instance 4)
stress-ng: debug: [31298] stress-ng-cpu: exited [31298] (instance 2)
stress-ng: debug: [31302] stress-ng-cpu: exited [31302] (instance 3)
stress-ng: debug: [31316] stress-ng-cpu: exited [31316] (instance 8)
stress-ng: debug: [31308] stress-ng-cpu: exited [31308] (instance 5)
stress-ng: debug: [31320] stress-ng-cpu: exited [31320] (instance 10)
stress-ng: debug: [31296] stress-ng-cpu: exited [31296] (instance 1)
stress-ng: debug: [31292] process [31296] terminated
stress-ng: debug: [31292] process [31298] terminated
stress-ng: debug: [31292] process [31302] terminated
stress-ng: debug: [31292] process [31304] terminated
stress-ng: debug: [31292] process [31308] terminated
stress-ng: debug: [31292] process [31311] terminated
stress-ng: debug: [31292] process [31313] terminated
stress-ng: debug: [31292] process [31316] terminated
stress-ng: debug: [31292] process [31318] terminated
stress-ng: debug: [31292] process [31320] terminated
stress-ng: debug: [31292] process [31322] terminated
stress-ng: debug: [31300] stress-ng-vm: exited [31300] (instance 2)
stress-ng: debug: [31317] stress-ng-vm: exited [31317] (instance 8)
stress-ng: debug: [31312] stress-ng-vm: exited [31312] (instance 6)
stress-ng: debug: [31323] stress-ng-vm: exited [31323] (instance 11)
stress-ng: debug: [31321] stress-ng-vm: exited [31321] (instance 10)
stress-ng: debug: [31295] stress-ng-vm: exited [31295] (instance 0)
stress-ng: debug: [31292] process [31295] terminated
stress-ng: debug: [31303] stress-ng-vm: exited [31303] (instance 3)
stress-ng: debug: [31306] stress-ng-vm: exited [31306] (instance 4)
stress-ng: debug: [31315] stress-ng-vm: exited [31315] (instance 7)
stress-ng: debug: [31319] stress-ng-vm: exited [31319] (instance 9)
stress-ng: debug: [31297] stress-ng-vm: exited [31297] (instance 1)
stress-ng: debug: [31309] stress-ng-vm: exited [31309] (instance 5)
stress-ng: debug: [31292] process [31297] terminated
stress-ng: debug: [31292] process [31300] terminated
stress-ng: debug: [31292] process [31303] terminated
stress-ng: debug: [31292] process [31306] terminated
stress-ng: debug: [31292] process [31309] terminated
stress-ng: debug: [31292] process [31312] terminated
stress-ng: debug: [31292] process [31315] terminated
stress-ng: debug: [31292] process [31317] terminated
stress-ng: debug: [31292] process [31319] terminated
stress-ng: debug: [31292] process [31321] terminated
stress-ng: debug: [31292] process [31323] terminated
stress-ng: info:  [31292] successful run completed in 1800.33s (30 mins, 0.33 secs)


all 12 "cores" were at 100%, 12Gig of 16gig used. Core temp settled out at 70C. no segfault, no issues reported.

I might look into other options as I am interesting in what it does with memory (ie does it just allocate or does it write and read back say 0xAAAA or 0x5555 . The CPU side seems to be via known alogithms with known results (fft, fib, etc..)
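(If your stress-ng build is new enough to support --vm-method, the vm stressor's write/verify pattern can be selected explicitly rather than left at the default; the exact method names vary by version, so check stress-ng --help:)

Code:
# walk-1a exercises a walking-one pattern on buffer accesses and verifies the read-back.
stress-ng --vm 12 --vm-bytes 1G --vm-method walk-1a -t 30m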
_________________
https://www.otw20.com/ Where you can talk
Quote:
Removed by Chiitoo
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Fri Jun 09, 2017 6:29 pm

Chewi wrote:
RayDude, if you've been following recent developments, you'll see that everyone who reported segfaults and tried disabling ASLR has found that it fixes the problem. [snip]


Thanks. Where are you reading about NUMA and ASLR? I'll check to see what I have set up, as I may have just gotten lucky.
_________________
Some day there will only be free software.
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Fri Jun 09, 2017 6:41 pm

Thanks for posting. I'm clearly out of touch.

If you wanted to beat on the cache as hard as possible, you'd want to hit addresses on bit boundaries. You'd probably have to write custom code.

The idea is, if there's a timing issue with the cache, you're most likely to see it when a lot of bits flip (lots of MOSFETs switching introduces noise), or when only one bit changes.

So, to keep it simple, let's say we only had 16 address bits (circa the 1970s):

When the address rolls from 0b0111111111111111 (0x7FFF) to 0b1000000000000000 (0x8000), it's a worst-case condition for switching, and if the timing is tight (due to crosstalk or local power issues) the array may fail to return the data in time for the read.

That's covered by an incrementing test. However, there are lots of other non-incrementing cases where things like this happen, for example (assuming 8-bit memory):

A read of 0x55 from address 0xFEFF followed by a read of 0xAA from address 0x0100 - that's a lot of bits changing at once. Of course, if you knew the layout of the RAM you could see which signals are close to each other and contrive worst-case scenarios.

I'm sure there are guys at AMD doing this, running worst-case simulations as well as triple-checking their worst-case timing simulations and reports.

Anyway, don't know why I talked about this. I guess I think it's fun.
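(Userspace tools like sys-apps/memtester already implement several of these patterns - checkerboard 0xAA/0x55, walking ones and zeroes, bit flips - albeit through the normal cache hierarchy rather than with knowledge of the physical layout:)

Code:
# Locks 2 GB of RAM and runs 3 passes of its pattern tests
# (checkerboard, walking bits, random data, ...); run as root so it can mlock().
memtester 2G 3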


Naib wrote:
I am tempted to run the stress command in 2 GiB chunks of RAM to see if there are any issues. [snip]

_________________
Some day there will only be free software.
Tony0945
Advocate


Joined: 25 Jul 2006
Posts: 4735
Location: Illinois, USA

Posted: Sat Jun 10, 2017 2:09 pm

I see that in this thread https://community.amd.com/thread/215773

But I have to disable a security precaution? And the responses in that thread from AMD sound to me like "Move on. Nothing to see here."

Seriously thinking of buying Skylake instead. It even comes with a built-in GPU, so I don't need to buy a separate card.
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Sat Jun 10, 2017 4:22 pm

Thanks Tony.

Can everyone who has a Gigabyte board send a support request for the latest BIOS?

My board is stuck at 1.0.0.4a and I want 1.0.0.6.

I sent them a message and they said they are working on it but don't know when it will be released. If a bunch of people ask for it, hopefully that will encourage them to at least set a timeline. Without a timeline, I feel it may never happen.

Thanks.
_________________
Some day there will only be free software.
wrc1944
Advocate


Joined: 15 Aug 2002
Posts: 3376
Location: Gainesville, Florida

Posted: Sat Jun 10, 2017 6:30 pm

FWIW, here's a nice seven-page rundown on AMD's AGESA 1.0.0.6 BIOS update.
As usual, there may be other issues being reported, so I'm not updating from 1.0.0.4a until I read more reviews.

Currently, my two ~amd64 gcc-7.1.0-r1/binutils-2.28-r2 systems are working pretty well for me, so there's no rush on a BIOS update.

https://www.overclock3d.net/reviews/cpu_mainboard/amd_agesa_update_1_0_0_6_-_do_bios_updates_matter/1

MORE: https://www.reddit.com/r/Amd/comments/6cfpiz/ive_been_testing_agesa_1006_for_a_day_with_the/ (Lots of feedback/discussion)

Page 7:
Quote:
Conclusion

In the past BIOS updates were not recommended very often, especially if your system was already fully functional. Updating a BIOS used to be a very risky process but today it is now much simpler thanks to utilities like ASUS' EZ Flash that are built into most modern UEFIs.

With Ryzen, we have already seen that BIOS updates can have a noteworthy impact, with AGESA 1.0.0.4 decreasing memory latency and AGESA 1.0.0.6 giving some minor speed boosts across a wide range of applications. While with AGESA version 1.0.0.6 these performance changes are minor, it must be remembered that these boosts are achieved with no hardware changes or software changes outside of ASUS' UEFI.

In 3DMARK and in Gears of War 4 we can see a noteworthy increase in benchmark scores, which while all less than 5% is still faster than what we were dealing with before the update. Ryzen has not changed here and neither has Windows or any of the benchmarks or games that we have tested, making the small boosts almost seem like magic.

AMD's optimisations in their AGESA code are having a positive impact on system performance, which combined with feature improvements like improved DDR4 compatibility and new memory timing/sub-timing options are making Ryzen more attractive platform with each and every update.

On the memory side, AGESA 1.0.0.6 offers users improved DDR4 compatibility, with our test system now booting with DIMMs that previously only greeted us with a black screen and a reboot cycle. That being said memory support is still not perfect but it is true that a lot more DDR4 kits are working now than previously. Those who are looking to build a Ryzen-based system should check their chosen motherboard's QVL list for compatible memory before making any purchases.

AGESA 1.0.0.6 may be seen as a minor update by many, but it certainly is more than enough to fill us with excitement. We have already seen performance boosts with no physical hardware changes on our Crosshair VI Hero motherboard and this is only scratching the surface of what AGESA 1.0.0.6 allows us to do.

The biggest change with this new update is the inclusion of new memory multipliers and sub-timing options in the BIOS, which opens up several more avenues that could give Ryzen users an additional speed boost. Will 3466MHz or 3600+MHz memory allow for even higher levels of system performance in games and productivity tasks?

To make a long story short, ASUS' latest BIOS, when combined with AMD's AGESA 1.0.0.6 code, offers end users a noteworthy increase in system performance which while minor is certainly worth taking advantage of. Will AMD continue to give Ryzen users a boost with future AGESA updates? I guess we will have to wait and see.

_________________
Main box- AsRock x370 Gaming K4
Ryzen 7 3700x, 3.6GHz, 16GB GSkill Flare DDR4 3200mhz
Samsung SATA 1000GB, Radeon HD R7 350 2GB DDR5
Gentoo ~amd64 plasma, glibc-2.33, gcc-10.2.0-r5, kernel-5.11.11 USE=experimental
Chewi
Developer


Joined: 01 Sep 2003
Posts: 881
Location: Edinburgh, Scotland

Posted: Tue Jun 13, 2017 9:33 pm

RayDude wrote:
Can everyone who has gigabyte boards send a support request for the latest BIOS?

My board is stuck at 1.0.0.4a and I want 1.0.0.6.

What board do you have and where are you looking? There are plenty of Gigabyte beta BIOSes here. F6F is working well on my AX370-Gaming 5. I heard some had trouble with F6G so I'll wait until it goes final now.

It turns out I was slightly off the mark with NUMA. I was changing a few settings at once to try and narrow down the freezes quicker and NUMA seemed like the most likely culprit. Turns out it was actually CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. These are enabled in Fedora. After enabling just these in my minimal config, there was no freeze. After disabling these in Fedora's config, I got a freeze. I don't know the reasoning but it seems quite conclusive. I don't yet know whether it's merely a workaround for some underlying issue or even whether it simply makes a freeze less likely. Hopefully I can get this information across to AMD.
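(If the kernel was built with CONFIG_IKCONFIG_PROC, anyone can check a running kernel for these options without hunting down its .config:)

Code:
# Requires CONFIG_IKCONFIG_PROC; otherwise grep /usr/src/linux/.config instead.
zgrep -E 'RCU_NOCB|CONFIG_NUMA=' /proc/config.gz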

Interestingly, CONFIG_RCU_NOCB_CPU is not enabled in Debian. We haven't seen swarms of Debian users reporting freezes with Ryzen so perhaps this issue is specific to the motherboard or some other factor. I did want to try a prebuilt Debian kernel but I was surprised that I couldn't find anything newer than 4.10, even among Ubuntu, Mint, and Sid.

I was also wrong about NUMA and ASLR. With NUMA disabled, I was able to produce a segfault eventually if I tried for long enough. I am now booting with norandmaps to ensure that ASLR is disabled immediately on boot. Enabling CONFIG_COMPAT_BRK in the kernel doesn't actually disable ASLR entirely.
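(A quick sanity check after booting with norandmaps, to confirm ASLR really is off from the start:)

Code:
cat /proc/cmdline                        # should include norandmaps
cat /proc/sys/kernel/randomize_va_space  # 0 means ASLR is disabled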

As to whether AMD is taking this stuff seriously, I didn't get the impression that they were being dismissive in that thread and I suspect the guy there doesn't have all the info anyway. They have received a solid lead from legendary kernel developer Matthew Dillon and there's no way they wouldn't take that seriously.
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Thu Jun 15, 2017 1:16 am

Chewi wrote:
What board do you have and where are you looking? There are plenty of Gigabyte beta BIOSes here. F6F is working well on my AX370-Gaming 5. I heard some had trouble with F6G so I'll wait until it goes final now.

I thought I already replied, but I don't see it, so I must have forgotten to press submit...

My board is a Gigabyte GA-AB350M-D3H and does not have a BIOS listed on that page.

Quote:
It turns out I was slightly off the mark with NUMA. [snip]


Can you give me a link to the Dillon lead? I'd like to read it.

I can reproduce the failure easily. For me, bash crashes as well as GCC. It feels like a failed data read; it could be cache, SMT, DDR, or anything else memory-related.

As for CONFIG_RCU_NOCB*, I have both of those options enabled in my kernel. I'm running gentoo-sources-4.11.3.

I'm running a modified version of a script suggested by someone on the AMD forum, which causes crashes very regularly. I'm trying to eliminate one feature at a time, and thus far I have found nothing that makes the problem better. Here's my summary:

Code:
BENCHMARK: After 10 runs, the average time between failures is 212 seconds
+0.1V VCORE: After 10 runs, the average time between failures is 307 seconds
DDR4 2933: After 10 runs, the average time between failures is 182 seconds
Disabled Cool and Quiet: After 10 runs, the average time between failures is 170 seconds


I tried a bunch of other things but didn't document them. I'm doing SMT next, but I don't know whether to keep -j12 or set it to -j6. I don't like changing two things at once, yet -j6 seems more appropriate with SMT off.

I had first attempted to prove it was a hold-time issue - it's so freaking random that it feels like hold time. But the test that raised the core voltage showed it's not likely a hold-time issue, as raising voltage usually makes hold-time issues worse. I tried to test at a -0.1 V core voltage offset, but the CPU wouldn't stay alive, even at 800 MHz.

My practice is to enter setup, set BIOS defaults, set the CPU multiplier to 8 and then adjust one BIOS parameter. Nothing I have tried has made very much difference. As I said, I'll try SMT next, but I already experimented with it and I know it won't make much difference.

Here's the script I'm running. Emerging mesa crashes fast.

Note: I'm using GCC 4.9.4 with no arch options. This code should be REALLY stable since it's pure AMD64 code. I've heard stories about issues with GCC, but I don't buy it. I don't think it's a GCC issue.

I'm just hoping it's a memory issue because they can change that in registers. If it's almost anything else, we will not have stable hardware...

Code:
#!/bin/bash

# Create a place to install the package in tmpfs.

if [ "`mount | grep '/var/tmp/portage'`" = "" ]
then
  echo '*** MAKE SURE YOU HAVE A RAM DISK MOUNTED AT /var/tmp/portage ***'
  exit 1
fi

if [ ! -d "/var/tmp/portage/package" ]
then
  rm -rf /var/tmp/portage/package
  mkdir /var/tmp/portage/package
  chown portage:portage /var/tmp/portage/package
  chmod 775 /var/tmp/portage/package
fi

loopcounter=0
total=0
start=$( date +%s )

while [ $loopcounter -lt 10 ]
do
  # Emerge a package (mesa by default if no argument is supplied).

  if [ -z "$1" ]
  then
    PKGDIR="/var/tmp/portage/package" emerge -B -1 mesa
  else
    PKGDIR="/var/tmp/portage/package" emerge -B -1 "$1"
  fi

  # Check for an error; the loop exits once ten failures have been seen.
  if [ $? -ne 0 ]
  then
    echo "Failure #$loopcounter Detected." | tee -a failure.log
    end=$( date +%s )
    tt=$((end-start))
    echo "start=${start}, end=${end}, tt=${tt}" | tee -a failure.log
    start=$end
    total=$((total+tt))
    ((loopcounter++))
  fi

done

echo "total=$total" | tee -a failure.log

avg=$((total/10))
#minutes=$((avg/60))
echo "After 10 runs, the average time between failures is ${avg} seconds" | tee -a failure.log


Edited the script. The math was wrong, so my results above are flawed; I had the loopcounter increment in the wrong place.
_________________
Some day there will only be free software.


Last edited by RayDude on Thu Jun 15, 2017 3:19 am; edited 1 time in total
thumper
Guru


Joined: 06 Dec 2002
Posts: 538
Location: Venice FL

Posted: Thu Jun 15, 2017 1:59 am

RayDude wrote:


Note: I'm using GCC 4.9.4 with no arch options. This code should be REALLY stable since it's pure AMD64 code. I've heard stories about issues with GCC, but I don't buy it. I don't think it's a GCC issue.


You know, you'd think so, but my experience, as it turned out... not so much.

Do yourself a favor and move to GCC 7.1. My system was crap till I moved to GCC 7.1; now it's ROCK solid stable, every issue since moving has been GCC 7.1 specific, and all the ones I've found have been fixable.

George.
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Thu Jun 15, 2017 3:20 am

thumper wrote:
[snip]

You know, you'd think so, but my experience, as it turned out... not so much.

Do yourself a favor and move to GCC 7.1. My system was crap till I moved to GCC 7.1; now it's ROCK solid stable, every issue since moving has been GCC 7.1 specific, and all the ones I've found have been fixable.

George.


Thanks George. I can't move to GCC 7.1 because I'm building CUDA. They only support GCC up to 5.3 with CUDA 8, and up to 6.3 with CUDA 9, which is not even out.

Have you tried emerging mesa over and over?
_________________
Some day there will only be free software.
Tony0945
Advocate


Joined: 25 Jul 2006
Posts: 4735
Location: Illinois, USA

Posted: Thu Jun 15, 2017 12:26 pm

RayDude wrote:
Thanks George. I can't move to GCC 7.1 because I'm building CUDA. They only support GCC up to 5.3 with CUDA 8, and up to 6.3 with CUDA 9, which is not even out.


Often you can build a package by changing the CFLAGS or CPPFLAGS. I recently resurrected abiword-2.8.6, which failed to compile because of the C++ standard version; the latest patch for that package was for gcc-4.6 and I'm running gcc-6.3.0. Adding -std=gnu++98 to CPPFLAGS solved that. Looking at the forum, it appears the standard change is the leading cause of failure with a new GCC version, although excessive ricing of flags is another. I haven't gone to 7.1, but if I ever buy a Ryzen, I will. Right now two things are stopping me: the fastest Ryzen doesn't come with a cooler, and the segfault issue at stock clocks.

The flags can be set on a per-package basis - see https://wiki.gentoo.org/wiki//etc/portage/package.env - as sketched below.
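A minimal sketch of that mechanism, using the abiword example above (the env file name is arbitrary):

Code:
# /etc/portage/env/gnu98.conf
CXXFLAGS="${CXXFLAGS} -std=gnu++98"

# /etc/portage/package.env
app-office/abiword gnu98.conf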
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Thu Jun 15, 2017 3:19 pm

Strange problem. I disabled SMT in the BIOS yesterday, ran the test, and my PC eventually locked up (at 800 MHz). I restored BIOS defaults, but SMT is still disabled, even though it's set to AUTO in the BIOS.

This must be a BIOS bug, right?

Update:

Clearing CMOS fixed it.

That means that to ensure the same starting conditions after a reboot, I have to clear the CMOS.

Boy, the BIOS is buggy.
_________________
Some day there will only be free software.
thumper
Guru


Joined: 06 Dec 2002
Posts: 538
Location: Venice FL

Posted: Fri Jun 16, 2017 2:06 am

RayDude wrote:


Thanks George. I can't move to gcc 7.1 because I'm building cuda. They only support up to 5.3 with version 8 and up to 6.3 with version 9 which is not even out.

Have you tried emerging mesa over and over?


I just emerged mesa 10 times in a loop with no errors. Time for the loop to run:
Code:

real    48m32.449s
user    30m21.264s
sys     323m50.513s


I've had no problems with long runs of 400-600+ packages; it's been rather uneventful, with the exception of GCC 7.1 issues that seem to be handled by either a patch or a version upgrade.

George
RayDude
Veteran


Joined: 29 May 2004
Posts: 1758
Location: San Jose, CA

Posted: Fri Jun 16, 2017 3:32 am

I just got stability.

I learned that the BIOS settings are broken.

So I cleared the CMOS, turned off SMT, then booted with BIOS defaults plus SMT off. The 1600 is running at 3200 MHz, with the DDR4-3600 DRAM at 2133 MHz.

I changed MAKEOPTS to -j6 and started the script.
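(In make.conf terms that's just the following; -j6 matches one job per physical core on the 1600 with SMT off:)

Code:
# /etc/portage/make.conf
MAKEOPTS="-j6"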

It's been running for 12 hours now without problem.

I'm going to crank the DDR4 clock to 2933 (because it worked there before) and see what happens. At least I know I can achieve relative stability.

Update: for some reason USB stops working, or I guess it could be a hang, when I enter the BIOS. I can't make any changes in the BIOS...
_________________
Some day there will only be free software.
tinea_pedis
n00b


Joined: 08 Mar 2017
Posts: 27

Posted: Thu Jun 22, 2017 7:10 pm

Damn. Seeing segfaults left and right now, with workloads of every kind. 1700X at stock clocks, ASRock Taichi X370 with the latest beta UEFI, which has AGESA version 1.0.0.6 (also went back to the previous UEFI, segfaults still occurring). SMT still disabled. Same memory modules as before, which work fine on my X99 setup (ran memtest for a week, no errors reported).
I've done a complete world rebuild with gcc-4.8.5 and presumably safe instructions (-march=corei7-avx; safe for _most_ of my systems, and indeed no segfaults with this setup on an FX-8350, 2500K, 3930K, or 5820K). It worked fine for a week, but no longer.
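(For reference, the rebuild in question is roughly the following; it recompiles every installed package and can take many hours:)

Code:
emerge --emptytree @world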

The system goes on the shelf for now; let's see again after a few UEFI updates and microcodes. By the way, I borrowed an Asus Crosshair VI mobo for testing (it had the latest UEFI with AGESA 1.0.0.6; I did not try previous versions, since it's not my own), and exactly the same thing happens with it.
krinn
Watchman


Joined: 02 May 2003
Posts: 7467

Posted: Thu Jun 22, 2017 7:40 pm

tinea_pedis wrote:
Same memory modules as before, which work fine on my X99 setup (ran memtest for a week, no errors reported).


Sorry, I don't know if your segfaults come from that, but what I am sure of is that RAM that is 100% functional on one motherboard doesn't mean another board will be happy with it.

There is a compatibility list for each motherboard; RAM outside that list doesn't mean it won't work, but it also doesn't mean it will work nicely. It's just unknown, or untested.

Some motherboards or chipsets just cannot use double-sided modules; some others will refuse 2xDS (double-sided) RAM mixed with 2xSS (single-sided) but will work flawlessly with 2xDS, or 2xSS, or 4xSS... pick any tricky mix you want.
Some also get mad for no reason: fine with 1x16GB of RAM, fine with 4x8GB, but crazy mad if you use 2x16GB modules...

All this to say: memory working 100% on one motherboard is, alas, really not proof that it will work flawlessly on another.
tinea_pedis
n00b


Joined: 08 Mar 2017
Posts: 27

Posted: Thu Jun 22, 2017 8:02 pm

You're right, and I didn't mean to imply that. F4-3200C16D-16GVKB (2x8GB G.Skill Ripjaws V; these use Samsung B-die chips) is not on the supported memory list for the Taichi, but it is for the Crosshair VI Hero and also for AGESA 1.0.0.6. So it is one thing to check, but most likely the memory is working correctly.

It seems DDR4 is just as picky as Rambus was (or is: I still use one P4 system). I have gone through dozens of setups with DDR2 and DDR3 and not once encountered incompatibility problems, even though the memory modules were very rarely on the supported list for those motherboards.
krinn
Watchman


Joined: 02 May 2003
Posts: 7467

Posted: Thu Jun 22, 2017 8:15 pm

In this case, I don't think DDR4 is to blame; it seems clear (to me at least) that AMD really has done something bad with Ryzen, and memory may play a role in it, but not as an active actor.
They will surely fix it, but it looks complex to track down, and the solution may not be easy; I'd expect a microcode fix from AMD rather than a BIOS hack.
Chewi
Developer


Joined: 01 Sep 2003
Posts: 881
Location: Edinburgh, Scotland

Posted: Thu Jun 22, 2017 8:23 pm

tinea_pedis wrote:
Damn. Seeing segfaults left and right now, with workloads of every kind. [snip]

You didn't say if you tried disabling ASLR. There's a very high chance this will fix it.
tinea_pedis
n00b


Joined: 08 Mar 2017
Posts: 27

Posted: Fri Jun 23, 2017 7:48 pm

Chewi wrote:
You didn't say if you tried disabling ASLR. There's a very high chance this will fix it.

Sorry. I did indeed disable ASLR, with no difference. It was enabled before, when there were no segfaults. And these segfaults appeared after a week of stable usage following a complete world rebuild, while on the 2.30 UEFI (AGESA 1.0.0.4), with no changes made to the UEFI. I went to the latest (beta) UEFI only after the segfaulting started, and with it I also tried disabling ASLR, with no difference.
Actually, the one thing left to test is disabling ASLR on the 2.30 UEFI (on which I have not tried disabling it; it worked fine for a while with it enabled).
Kernels 3.10.105 (gentoo-sources-3.10.105-r1) and 4.11.4, no difference; segfaults happen randomly, but frequently under light load (starting, say, Firefox) and during make operations.

Noteworthy is the fact that this setup (the very same HDD, with no changes whatsoever) works fine on my FX-8350 and 2500K systems (and also with the same GPU and PSU - a Seasonic X-650, whose 12V rails are very stable under low and high loads - on said systems).
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 47947
Location: 56N 3W

Posted: Fri Jun 23, 2017 8:11 pm

tinea_pedis,

That suggests that the code on the HDD is correct and the segfaults happen at run time. That's a useful datum.

The 12V rail being stable under steady-state conditions is expected.
What does it do when there is a 10A step change in the load within the duration of a single CPU clock cycle?
That's the CPU going from idle to full load.

It's both very difficult to measure and very difficult to control. It's easier on the +12V rail than on Vcore, where the step change is 100A and the tolerances are much tighter.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
wrc1944
Advocate


Joined: 15 Aug 2002
Posts: 3376
Location: Gainesville, Florida

Posted: Sun Jul 23, 2017 5:02 pm

A bunch of new Phoronix Linux compiler benchmarks on znver1 opts.

http://www.phoronix.com/scan.php?page=article&item=gcc8-clang5-znver1&num=1
Quote:
A few days back I posted some fresh AMD Ryzen compiler benchmarks of LLVM Clang now that it has its new Znver1 scheduler model, which helps out the performance of Ryzen on Linux with some of the generated binaries tested. But it was found still that Haswell-tuned binaries are sometimes still faster on Ryzen than the Zen "znver1" tuning itself. For continuing our fresh compiler benchmarks from AMD's new Ryzen platform, here are the latest GCC numbers.


I thought it was worth a look and read all of them, but I guess I'll concur with the Phoronix conclusion on page 6:
Quote:
It's really tough to draw any definitive conclusions based upon the wide spread of the results, but overall does show that more work needs to be done for Zen optimizations/tuning in both GCC and Clang.
Between GCC and Clang performance overall, it was a very tight race with Clang SVN continuing to get faster while GCC remains as the dominant free software compiler but there were a few cases of regressions.


I suppose the take-away for now is that it's still hard to settle on znver1, "native", or even haswell, plus wondering how serious the reported GCC 7 regressions really are. :? :roll:
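(In make.conf terms the choice being weighed looks something like this, assuming a GCC new enough to know znver1, i.e. 6.1 or later:)

Code:
# /etc/portage/make.conf - pick one:
CFLAGS="-march=znver1 -O2 -pipe"     # explicit Zen tuning
#CFLAGS="-march=native -O2 -pipe"    # resolves to znver1 on Ryzen with a new GCC
#CFLAGS="-march=haswell -O2 -pipe"   # the surprise contender in the benchmarks
CXXFLAGS="${CFLAGS}"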
_________________
Main box- AsRock x370 Gaming K4
Ryzen 7 3700x, 3.6GHz, 16GB GSkill Flare DDR4 3200mhz
Samsung SATA 1000GB, Radeon HD R7 350 2GB DDR5
Gentoo ~amd64 plasma, glibc-2.33, gcc-10.2.0-r5, kernel-5.11.11 USE=experimental
Tony0945
Advocate


Joined: 25 Jul 2006
Posts: 4735
Location: Illinois, USA

Posted: Sun Jul 23, 2017 7:48 pm

wrc1944 wrote:
plus wondering how serious the reported GCC 7 regressions really are. :? :roll:


I've been running gcc-6.4.0 since July 13. It's not in Portage, but if you copy the 6.3.0 ebuild to a local overlay and rename it, it will build. It can't find the .xz file it's looking for, but I downloaded the .tar.gz from upstream, extracted it, tarred it back up with the j flag, put it in distfiles, and added RESTRICT="fetch" to the ebuild.
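(A sketch of that overlay bump; it assumes a local overlay at /usr/local/portage that is already configured in repos.conf or PORTDIR_OVERLAY, and that the repacked tarball has been placed in distfiles first:)

Code:
mkdir -p /usr/local/portage/sys-devel/gcc
cp /usr/portage/sys-devel/gcc/gcc-6.3.0.ebuild \
   /usr/local/portage/sys-devel/gcc/gcc-6.4.0.ebuild
cd /usr/local/portage/sys-devel/gcc
ebuild gcc-6.4.0.ebuild manifest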