Gentoo Forums
Compilation time estimation: should I start a project?
gorye
n00b


Joined: 20 Aug 2020
Posts: 3

PostPosted: Thu Aug 20, 2020 4:55 am    Post subject: Compilation time estimation: should I start a project? Reply with quote

I am new to Gentoo, and as far as I understand, if one wants an estimate of how much time a certain program would take to compile, they should use genlop. This is the only such project I am aware of.

My understanding of genlop is the following:
It looks into the Portage logs and uses the history of previous compilation times to estimate how long it will take to compile something. If there is no history of compiling a program, it is basically useless in my experience (I get the "should compile any time now" message for anything I compile at any point, even when I start compiling a complex program on a relatively weak and old processor).
I also understand that I can use the -q flag to pull data for genlop to use from gentoo.linuxhowtos.org. However, this is not ideal, as the site lacks data for many processors (for instance, I have a Sandy Bridge i3 and the site has no data for it). It also gets the time that a program has currently been compiling for wrong (in my experience with compiling ungoogled-chromium, after 24 hours it said that Portage had been compiling it for 19 hours or so).
So, after considering all the limitations, I decided that there could be a better solution, hence the question in the topic.
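
To make sure I understand the mechanism, here is a minimal Python sketch of that log-averaging idea, assuming the standard /var/log/emerge.log format; the regexes and the per-package averaging are my own illustration, not genlop's actual code:
Code:
import re
from collections import defaultdict

# emerge.log pairs ">>> emerge" start lines with "::: completed emerge"
# end lines; each line is prefixed with a Unix timestamp.
START = re.compile(r'^(\d+):\s+>>> emerge \(\d+ of \d+\) (\S+) to')
END   = re.compile(r'^(\d+):\s+::: completed emerge \(\d+ of \d+\) (\S+) to')

def average_merge_times(log_path='/var/log/emerge.log'):
    starts = {}                    # package-version -> start timestamp
    durations = defaultdict(list)  # package-version -> [seconds, ...]
    with open(log_path) as log:
        for line in log:
            m = START.match(line)
            if m:
                starts[m.group(2)] = int(m.group(1))
                continue
            m = END.match(line)
            if m and m.group(2) in starts:
                durations[m.group(2)].append(int(m.group(1)) - starts.pop(m.group(2)))
    # Average over however many times each package-version was merged.
    return {pkg: sum(t) / len(t) for pkg, t in durations.items()}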

My idea for implementation is the following:
I want the potential program to look in system files for compilation times, amount of RAM, processor name, number of cores and threads, frequencies, etc. It would then append this data to a shared dataset, from which information from all existing clients can be pulled and analyzed. For proof of concept, a simple linear regression with multiple predictors will do, but other, more advanced models can be used later. This way, even if there is no data for a certain processor, compilation times can be extrapolated from the independent variables (number of cores, threads, frequencies, amount of RAM, enabled USE flags and optimizations). So the advantage of this method is that it is more flexible in terms of the data it needs. A sketch of the proof-of-concept model is below.
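
This is only a sketch of the proof of concept, assuming scikit-learn; the CSV layout, feature names, and example numbers are made up for illustration:
Code:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical shared dataset: one row per submitted build report
# for a single package; the column names are invented.
data = pd.read_csv('build_reports.csv')

features = ['cores', 'threads', 'base_freq_mhz', 'ram_gb', 'n_use_flags']
X = data[features]
y = data['build_seconds']

# One model per package for the proof of concept; fancier models later.
model = LinearRegression().fit(X, y)

# Extrapolate for a machine with no history, e.g. a Sandy Bridge i3:
# 2 cores, 4 threads, 3300 MHz, 8 GB RAM, 12 USE flags enabled.
print(model.predict([[2, 4, 3300, 8, 12]]))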

Now that I have laid everything out, I have got several questions:
Firstly, did I miss a solution that Just Works(TM)? Maybe I should install program X and my issues will disappear? Or have I been using genlop in an improper way for my needs?
Secondly, would anyone be interested in using this software?
Thirdly, my project will depend on existing data, so there will be a cold-start problem. Would anyone be interested in contributing data about their machines and compilation times to help train the model?
Finally, will potential users be sketched out by a program that takes data about their machine and installation logs and uploads it to the internet, even though the code will be relatively simple and open for audit, and the data will be anonymized?

I just cannot stress enough that I am new to the world of Gentoo, software development, and open-source projects. However, I feel extremely passionate about this potential project, so any input will be helpful. Please also let me know if you have any ideas regarding my concept; maybe there could be some improvement?
Thanks in advance for the feedback!

PS: I am also new to this forum, so sorry in advance if I have violated any rules.
pjp
Administrator


Joined: 16 Apr 2002
Posts: 18593

PostPosted: Thu Aug 20, 2020 6:32 pm    Post subject: Re: Compilation time estimation: should I start a project? Reply with quote

A lot can be learned from trying, and it may end up being a better solution. There is only one way to know.

gorye wrote:
It also gets the time that a program has currently been compiling for wrong (in my experience with compiling ungoogled-chromium, after 24 hours it said that Portage had been compiling it for 19 hours or so).
So, after considering all the limitations, I decided that there could be a better solution, hence the question in the topic.
If genlop is reporting inaccurate data, maybe there is an existing bug report explaining why? Or maybe filing a new bug would help resolve that inaccuracy?

gorye wrote:
I also understand that I can use the -q flag to pull data for genlop to use from gentoo.linuxhowtos.org. However, this is not ideal, as the site lacks data for many processors (for instance, I have a Sandy Bridge i3 and the site has no data for it).
gorye wrote:
Thirdly, my project will depend on existing data, so there will be a cold-start problem.
This seems to be an inherent problem with any solution that relies on community contribution of data. Put another way, what is your plan to address the data-availability problem? Without that, how do you expect to end up in a better situation than the existing solution?

gorye wrote:
Finally, will potential users be sketched out by a program that takes data about their machine and installation logs and uploads it to the internet, even though the code will be relatively simple and open for audit, and the data will be anonymized?
Have you demonstrated that the data you would collect improves the outcome of predicted compilation time? This goes back to the bug issue I mentioned previously. Make sure that you need to collect the data, then demonstrate why. Ultimately, I believe the general idea that telemetry is being collected will put off a lot of users. The transparency you mentioned could help, but I also think it would help to prove that you actually need it.

Another issue I see is a "premature optimization": making predicted times unnecessarily, and perhaps unrealistically, accurate. Your example of 24 hours vs. 19 hours initially sounds bad, and indeed, some arguments could be made that it isn't reasonably accurate. However, how much impact does 19 hours vs. 24 hours really have? Larger packages that take that much time can be loosely identified in advance, so trying to time the finished compile so that it can be used immediately upon completion seems like poor planning anyway. What if the build fails at hour 23, or hour 17? Improving accuracy would be good, but there is a perspective to keep: I consider it a guideline, not a scheduling mechanism.

An improvement to the existing approach could be to implement predictions that are a broad guideline where data for an architecture is missing. For example, given your i3, what are the nearest CPUs that are slower or faster? You could then see an estimate of "more than X, less than Y". A sketch of that bracketing is below.
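
This assumes some scalar speed rating per CPU (bogomips, a benchmark score, whatever); the ratings and times here are invented for illustration:
Code:
# Known (speed_rating, build_seconds) pairs for one package, from other users.
known = [(1200, 9500), (2100, 5200), (3400, 2900), (5000, 1700)]

def bracket(my_rating, known):
    """Bound the build time using the nearest slower and faster CPUs."""
    slower = [(r, t) for r, t in known if r < my_rating]
    faster = [(r, t) for r, t in known if r > my_rating]
    # The nearest slower CPU took longer than we should: an upper bound.
    upper = max(slower)[1] if slower else None
    # The nearest faster CPU took less time than we will: a lower bound.
    lower = min(faster)[1] if faster else None
    return lower, upper

lower, upper = bracket(2800, known)
print(f"more than {lower}s, less than {upper}s")  # more than 2900s, less than 5200s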

If you search the forums for compile times on bigger packages, you'll see that they change significantly over time, so the "next" version may have significantly different results. One example: excessive time building gcc

For the record, I don't use genlop.
_________________
Your lips move, but I can't hear what you're saying.
Fitzcarraldo
Veteran


Joined: 30 Aug 2008
Posts: 1873
Location: United Kingdom

PostPosted: Fri Aug 21, 2020 8:17 am    Post subject: Reply with quote

Just to mention another Gentoo tool for estimating merge times: qlop

https://wiki.gentoo.org/wiki/Q_applets#Extracting_information_from_emerge_logs_.28qlop.29

https://www.reddit.com/r/Gentoo/comments/62mvpw/are_there_any_alternatives_to_genlop/
_________________
Clevo W230SS: amd64 nvidia-drivers & xf86-video-intel.
Compal NBLB2: ~amd64 xf86-video-ati. Dual boot Win 7 Pro 64-bit.
OpenRC eudev elogind & KDE on both.

Fitzcarraldo's blog
Ionen
Veteran


Joined: 06 Dec 2018
Posts: 1451

PostPosted: Fri Aug 21, 2020 8:50 am    Post subject: Reply with quote

It's certainly not a perfect way to estimate (there's really none, when it comes down to it), but linuxfromscratch uses a point of reference (the "standard build unit") to estimate build times.

For example, if it takes you a certain time to compile package X (preferably something that doesn't wildly change in build time from version to version, isn't mostly scripts, and that about everyone on Gentoo is using -- Edit: it could also be a special for-benchmark-only package), then some other package Y will take, say, 2.3 times longer than X did with your hardware+settings, assuming no other major problem like swapping. A database could try to get averages of this ratio and associate them with versions+USE flags to produce a decent estimate (some USE flags do drastically change build times, most notably pgo and added abi_x86_32; optional tests being run could also complicate things). At the very least it's easier than trying to keep track of a lot of different hardware, CFLAGS, -jX, gcc versions, and other things that can have a big impact.

Even if it's not always right, the user can still estimate an offset using this and ultimately get a decent idea. A sketch of the idea:
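
The ratio table here is invented; in practice it would come from the database, keyed by version and USE flags:
Code:
# Hypothetical build-time ratios relative to a reference package X,
# measured once on some reference setup (numbers invented).
RATIO_VS_X = {
    'dev-lang/python':     1.0,   # the reference package itself
    'sys-devel/gcc':      14.0,
    'www-client/firefox': 38.0,
}

def estimate(package, my_time_for_x_seconds):
    """Scale my own measured build time for X by the package's known ratio."""
    return RATIO_VS_X[package] * my_time_for_x_seconds

# If the reference package took 600s on my hardware+settings,
# gcc should take roughly 2h20m:
print(estimate('sys-devel/gcc', 600))  # 8400.0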

It would be a limited reference, but this kind of thing could even be automated by build bots rather than user submissions (they would probably want to mostly stick to stable). That would also make it easier to keep the data in line with the point of reference as it changes, and with major changes like faster/slower toolchain updates, since a rebuild of everything could simply be scheduled to refresh it. We already have people building Gentoo all day long to find issues, but they probably build several things at the same time, so those times are probably of limited interest :)
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46753
Location: 56N 3W

PostPosted: Fri Aug 21, 2020 10:13 am    Post subject: Reply with quote

gorye,

Maybe you want to join in the fun at Gentoo Stats.

It started out as a Google Summer of Code project which produced a proof of concept, and it has been picked up and dropped several times since.
The main issues have been around keeping uploads anonymous while, at the same time, preventing poisoning of the data set.

The most recent attempt to restart the project is discussed on the -dev mailing list.

Don't reinvent the wheel. Build on the work of others; that's how OSS works best.
If it's almost but not quite what you had in mind, fork it.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
krinn
Watchman


Joined: 02 May 2003
Posts: 7466

PostPosted: Fri Aug 21, 2020 10:27 am    Post subject: Reply with quote

You will never get a good estimate; it's a utopia.
There are many corner cases you will never detect easily: does the guy use MAKEOPTS="-j1" with a 10-core CPU? Is the build using distcc, or not this time? Is the guy running two emerges, or compiling something else while he emerges? Or has the currently emerging package already been built as a binary, so that only the binary will be merged (which takes the emerge time of gcc down to a few seconds)?

I think genlop would have done better by comparing the CPU's bogomips value to a reference one. It wouldn't be good either, but it would rely on a static value (the reference CPU) and an easy-to-fetch value (the /proc/cpuinfo of the current CPU): no user data needed, no website, no nothing. It would end up just as weak as the other methods anyway.
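
A sketch of that scaling, assuming one reference CPU's bogomips and its measured build time for a package (both values invented):
Code:
# Reference machine's bogomips and its measured gcc build time,
# both invented for illustration.
REF_BOGOMIPS = 5600.0
REF_TIME_GCC = 7200  # seconds on the reference CPU

def my_bogomips():
    """Read the first bogomips entry from /proc/cpuinfo."""
    with open('/proc/cpuinfo') as f:
        for line in f:
            if line.lower().startswith('bogomips'):
                return float(line.split(':')[1])

# Naive scaling: twice the bogomips, half the time
# (ignores core count, RAM, I/O -- as weak as promised).
print(REF_TIME_GCC * REF_BOGOMIPS / my_bogomips())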
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46753
Location: 56N 3W

PostPosted: Fri Aug 21, 2020 10:37 am    Post subject: Reply with quote

krinn,

Like my Raspberry Pi :)
Code:
     Tue Dec 24 21:59:43 2019 >>> sys-devel/gcc-9.2.0-r3
       merge time: 1 minute and 3 seconds.

     Mon May 18 21:18:46 2020 >>> sys-devel/gcc-10.1.0
       merge time: 4 hours, 29 minutes and 51 seconds.

_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
krinn
Watchman


Joined: 02 May 2003
Posts: 7466

PostPosted: Fri Aug 21, 2020 11:01 am    Post subject: Reply with quote

NeddySeagoon wrote:
Like my Raspberry Pi :)

Perfect example, NeddySeagoon :)
"Oh, about a minute for gcc, I will take a coffee and it's good."
etnull
Guru


Joined: 26 Mar 2019
Posts: 466
Location: Russia

PostPosted: Fri Aug 21, 2020 3:02 pm    Post subject: Reply with quote

NeddySeagoon wrote:
krinn,

Like my Raspberry Pi :)
Code:
     Tue Dec 24 21:59:43 2019 >>> sys-devel/gcc-9.2.0-r3
       merge time: 1 minute and 3 seconds.

     Mon May 18 21:18:46 2020 >>> sys-devel/gcc-10.1.0
       merge time: 4 hours, 29 minutes and 51 seconds.

Is this cross-compilation?
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46753
Location: 56N 3W

PostPosted: Fri Aug 21, 2020 3:15 pm    Post subject: Reply with quote

etnull,

It's building the binary on a big hairy arm64 system and installing from the binhost.

The second build is on the Pi.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
Amity88
Apprentice


Joined: 03 Jul 2010
Posts: 246
Location: Third planet from the Sun

PostPosted: Thu Sep 10, 2020 4:16 pm    Post subject: Reply with quote

@Neddy

Hope you're well?

That's a beast of a machine indeed! I have an Atom that takes 6-8 hours or so for a gcc compilation. Even 4 hours on the Pi is pretty good! Is that a Pi 4?

@gorye,

I did some testing on several machines and small compute farms to figure out the impact of the hardware on compilation times. I don't have the numbers with me anymore, but I can share some observations.

1. You'd have to characterize each package, because some are compile-heavy while others are I/O- and post-processing-heavy with a few bursts of heavy CPU load. You could limit this study to a few key packages like gcc, libreoffice etc. that take up a lot of emerge time.

2. For compile-heavy packages, the time scales almost linearly with the core count. The single-core performance had much less impact than the core count. The curve tapers off slightly when you go beyond a certain core count (see the sketch after this list).

3. For packages that weren't compile-heavy, the single-core performance mattered more than the core count. Having more cores didn't help all that much.

4. Using --jobs=x helps improve the emerge speed when you have more cores, but it doesn't scale as well as #2.

5. emerge with the quiet option also seems to reduce the build times, but it's a relatively constant % reduction I think.

6. Based on these, I think you could use average CPU load during compile, core count, single-core performance, and I/O performance as the independent variables for your model, in that order of priority.
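
On point 2, a toy illustration of why the curve tapers off: Amdahl's law, assuming some fraction of the build is serial (the 0.9 parallel fraction is invented):
Code:
def speedup(cores, parallel_fraction=0.9):
    """Amdahl's law: the serial part caps the speedup no matter the core count."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)

for n in (1, 2, 4, 8, 16, 64):
    print(n, round(speedup(n), 2))
# 1 1.0, 2 1.82, 4 3.08, 8 4.71, 16 6.4, 64 8.77 -- the taper from point 2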

_________________
Ant P. wrote:
The enterprise distros sell their binaries. Canonical sells their users.


Also... Be ignorant... Be happy! :)
pa4wdh
Guru


Joined: 16 Dec 2005
Posts: 410

PostPosted: Thu Sep 10, 2020 5:14 pm    Post subject: Reply with quote

Sorry if this is a bit off topic, but:
Code:

     Tue Dec 24 21:59:43 2019 >>> sys-devel/gcc-9.2.0-r3
       merge time: 1 minute and 3 seconds.

     Mon May 18 21:18:46 2020 >>> sys-devel/gcc-10.1.0
       merge time: 4 hours, 29 minutes and 51 seconds.

Quote:
The second build is on the Pi.

What kind of Pi is this?

My RPi2B:
Code:

2018-11-13T06:09:18 >>> sys-devel/gcc-7.3.0-r3: 24:25:10
2019-04-09T02:51:32 >>> sys-devel/gcc-8.2.0-r6: 31:12:20
2019-07-07T00:12:54 >>> sys-devel/gcc-8.3.0-r1: 22:38:40
2020-01-03T00:05:26 >>> sys-devel/gcc-9.2.0-r2: 34:43:47
2020-07-11T05:29:12 >>> sys-devel/gcc-9.3.0-r1: 35:28:19

Your Pi's times are even close to my desktop's times :)
_________________
The gentoo way of bringing peace to the world:
USE="-war" emerge --newuse @world

Free as in Freedom is not limited to software only:
Music: http://www.jamendo.com
Recipes: http://www.opensourcefood.com
CaptainBlood
Veteran


Joined: 24 Jan 2010
Posts: 1955

PostPosted: Thu Sep 10, 2020 5:59 pm    Post subject: Reply with quote

Building from a binary package is also logged, and it BADLY skews the meaning of the calculated average build time.

A Portage stack tool to manually clean the log file would help in this respect.
An emerge command-line option to skip logging for the current emerge could be an alternative, or complementary.
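
Until something like that exists, a crude workaround is to filter on the analysis side; a sketch that drops implausibly fast merges (the threshold is invented, and a real tool might scale it per package):
Code:
def plausible_times(times, min_seconds=300):
    """Drop merges that finished suspiciously fast -- likely binary packages."""
    return [t for t in times if t >= min_seconds]

# The gcc example above: the 63s binhost merge is discarded,
# the 4h29m51s (16191s) source build is kept.
print(plausible_times([63, 16191]))  # [16191]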

Thks 4 ur attention, interest & support.


Last edited by CaptainBlood on Thu Sep 10, 2020 6:34 pm; edited 3 times in total
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46753
Location: 56N 3W

PostPosted: Thu Sep 10, 2020 6:13 pm    Post subject: Reply with quote

That Pi build time is on a passively cooled Pi4 with 8 GB RAM.
It never goes over 60°C, even working hard.

The big hairy build box is a 96-core Cavium ThunderX2 with 128 GB RAM, configured to build Pi4-optimised binaries.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
CaptainBlood
Veteran


Joined: 24 Jan 2010
Posts: 1955

PostPosted: Thu Sep 10, 2020 7:20 pm    Post subject: Reply with quote

NeddySeagoon wrote:
Code:
     Tue Dec 24 21:59:43 2019 >>> sys-devel/gcc-9.2.0-r3
       merge time: 1 minute and 3 seconds.

     Mon May 18 21:18:46 2020 >>> sys-devel/gcc-10.1.0
       merge time: 4 hours, 29 minutes and 51 seconds.
multilib? (if that makes any sense...)
Thks 4 ur attention, interest & support.
NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 46753
Location: 56N 3W

PostPosted: Thu Sep 10, 2020 7:43 pm    Post subject: Reply with quote

CaptainBlood,

There is no multilib on arm64.

A 32-bit instruction set is optional on arm64 hardware. The Pi3 and Pi4 have both.
The Cavium is 64-bit only.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.
saellaven
Guru


Joined: 23 Jul 2006
Posts: 577

PostPosted: Thu Sep 10, 2020 7:57 pm    Post subject: Reply with quote

I seem to remember, maybe 13-14 years or so ago, there was a project that calculated Gentoo build times and took system information (I don't remember if it was just the processor, or processor + RAM) and stored it in a database. It sticks out in my mind since I was one of the few people with a true dual-CPU setup at the time (2 x Athlon MP 1800+), and even though it was a few years old, it was smoking the dual-core CPUs on compile time. That system was retired in 2007.

Really just a heads up that the code is potentially out there...
_________________
Ryzen 3700X, Asus Prime X570-Pro, 32 GB DDR4 3200, GeForce GTX 1660 Super
openrc-0.17, ~vanilla-sources, ~nvidia-drivers, gcc-9.3.0