Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 0.10.0
- Fix Version/s: None
- Component/s: Chef Client
- Labels: None
- Environment:
CentOS 5 i386
ruby 1.8.7 (2011-02-18 patchlevel 334) [i386-linux]
CentOS 0.10.0 RBEL RPMs
SELinux disabled
Description
Looks like a Ruby GC bug affects the YUM provider included in Chef 0.10.
See ruby bug http://redmine.ruby-lang.org/issues/4856
Daniel DeLeo recommends replacing popen4 in yum.rb with shell_out.
Experimental patch: https://gist.github.com/1016667
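As I understand the two styles, popen4 hands the caller a pid and raw pipes to manage inside a block, while shell_out runs the command to completion and returns one result object. The following is only a sketch of the second style using stdlib Open3 (Ruby 1.9 or later, so not the 1.8.7 builds discussed here). `run_and_capture` and `CommandResult` are hypothetical names for illustration, not Chef's API and not the code in the gist.

```ruby
require "open3"
require "rbconfig"

# Hypothetical result object: everything the caller needs in one value,
# with no pipes or pids left to manage.
CommandResult = Struct.new(:stdout, :stderr, :exitstatus)

# Hypothetical helper: run argv to completion, capturing both streams.
def run_and_capture(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  CommandResult.new(stdout, stderr, status.exitstatus)
end

# Usage: a child Ruby process stands in for a yum invocation.
result = run_and_capture(RbConfig.ruby, "-e", "puts 'installed'; warn 'note'")
puts "stdout=#{result.stdout.inspect} exit=#{result.exitstatus}"
```

The point of this shape is that pipe handling and waiting on the child live in one place instead of in every provider.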
Issue Links
- relates to CHEF-3011: Initial Chef run crashes with Segmentation fault in provider/package/yum.rb
Activity
Keeps showing up on the mailing list - if this is the fix then we definitely need it in 0.10.2
A couple of things I think are worth adding: I've been able to reproduce this consistently on CentOS 5.4 and 5.5, but it never segfaults if I run chef-client with debug logging (chef-client -l debug). I wonder what's different about running with debug logging. I've also tried this with both the original yum provider in 0.9.16 and the shell_out patch; both segfaulted.
Also I only had this issue when I used the aegisco ruby-1.8.7.334-2.el5 RPMs. I went back and compiled ruby-1.8.7-p334 from source on a fresh VM, and I haven't been able to reproduce the segfault since.
fyi: latest 0.10.4 work I've got queued up for submission is in https://github.com/mdkent/chef/commits/yum-improvements-v2
Matthew, still getting feedback from ppl that ruby segfaults even with the shell_out patch applied, so I'm just wondering if we should add it to your branch...
Hmm, well if it's not directly fixing the issue for people and given the amount of other changes between 0.10.2 -> 0.10.4 for the yum provider it might be a good idea to wait for 0.10.6.
I see on the list you also posted a new ruby version - hopefully that helps!
Actually, the patch fixes the issue for me, but maybe I'm not pushing the envelope enough. RBEL packages have the patch anyway.
I'll be doing some more testing next week
that patch fixed the issue for me, on cookbooks with a lot of yum related tasks
went too fast
it did work one time - but i needed to re-bootstrap that instance and start from scratch. It doesn't work anymore (fresh aws instance - bootstrapped with a script inspired from centos5-gems.erb - so using ruby-1.8.7-332 from aegis repo)
tried to downgrade to 1.8.7-302 as suggested by http://lists.opscode.com/sympa/arc/chef/2011-06/msg00051.html, to no avail.
I'm giving ree a try.
ree (ruby 1.8.7 (2011-02-18 patchlevel 334) [i686-linux], MBARI 0x8770, Ruby Enterprise Edition 2011.03) works perfectly ...
I believe this support ticket:
is another instance of this bug. The user in question has attempted using the yum-improvements-v2 branch to no avail. Further, using Ruby Enterprise Edition does seem to solve the problem. If this isn't the same bug, I'd be happy to open a new ticket with the relevant information.
Does it segfault when running a cookbook?
If so, it would be great if you could share it so I can test it and try to reproduce.
Any output/log from the crash would be greatly appreciated.
Here is the latest example. Since this isn't my log file, I censored some details.
[Mon, 18 Jul 2011 10:16:46 +0000] INFO: execute[remove-censored] sh(sed -i '/censored\s=.$/d' /etc/sysctl.conf) /usr/lib/ruby/gems/1.8/gems/chef-0.10.2/bin/../lib/chef/shell_out/unix.rb:22:
[BUG] Segmentation fault ruby 1.8.7 (2011-02-18 patchlevel 334) [x86_64-linux]
Crashes have happened in other cookbooks (including ones that use the yum provider), but always point to a line in shell_out/unix.rb. I was going to post this as a separate bug since it isn't in yum.rb, but then saw that ree seems to fix this as well. I've pointed the original poster to this bug report so he may be willing to post more detailed logs.
I can confirm this on a physical (non-virtual), vanilla CentOS 5.6 x86_64 machine.
Using RBEL ruby.
/usr/lib/ruby/gems/1.8/gems/chef-0.10.2/bin/../lib/chef/provider/package/yum.rb:67: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
It's nothing specific - simply running any cookbook that has a package installation "may" provide this error. It is inconsistent, and manually running chef-client again immediately following the error usually works, but not always.
Any ideas on this segfault? I am using the RBEL bootstrap/rpms, and get a segfault, typically on the first yum statement in my run list.
Ticket is getting a bit long, trying to summarize the current state:
- segfault crash can happen on yum package or execute, or really anything using shell_out/popen4
- swapping popen4/shell_out doesn't make a difference
- i386/x86_64 doesn't matter
- Segfaults on RBEL ruby 1.8.7 p302
- Segfaults on aegisco/RBEL ruby 1.8.7 p334
- Segfaults on RBEL ruby 1.8.7 p352
- No segfault on source compiled ruby 1.8.7 p334
- No segfault on REE 1.8.7 p334
- No segfault on rvm installed 1.9.2 p180 (on my end)
- No segfault on rvm installed 1.9.2 p290 (on my end)
Questions:
Anyone seeing success with RBEL 1.9.2 p290?
Michael Garrett, can you verify you are still having success with the source compiled 1.8.7 p334? To me this is the most interesting data point - could be something being introduced in the rpm builds.
Also for anyone still experiencing segfaults it probably makes sense to gather a gdb backtrace and toss it in the ticket at http://redmine.ruby-lang.org/issues/4856 then maybe some attention could be drawn to it. This will also help verify everyone is experiencing the exact same issue.
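Until backtraces are available, the crash headers already pasted in this ticket can at least be compared mechanically. A rough sketch, assuming the two-part header format seen above (a `file.rb:LINE:` crash site followed by `[BUG] Segmentation fault` and the ruby banner); `parse_segfaults` and `crash_sites` are hypothetical helper names:

```ruby
# Matches the crash header format quoted in this ticket, whether the
# "[BUG]" marker is on the same line as the crash site or on the next one.
CRASH_RE = /(\S+\.rb):(\d+):\s*\[BUG\] Segmentation fault\s+ruby (\d+\.\d+\.\d+) \(\d{4}-\d{2}-\d{2} patchlevel (\d+)\) \[([^\]]+)\]/

# Hypothetical helper: extract every crash header found in a log.
def parse_segfaults(log_text)
  log_text.scan(CRASH_RE).map do |file, line, version, patchlevel, arch|
    { file: file, line: line.to_i, ruby: version,
      patchlevel: patchlevel.to_i, arch: arch }
  end
end

# Hypothetical helper: count crashes per [source file, line] pair.
def crash_sites(crashes)
  crashes.each_with_object(Hash.new(0)) do |crash, counts|
    counts[[File.basename(crash[:file]), crash[:line]]] += 1
  end
end
```

Feeding it a client log and printing `crash_sites(parse_segfaults(File.read(path)))` shows whether reports cluster on one source line. It says nothing about the underlying C-level fault, so it does not replace a gdb backtrace.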
Thanks Matthew. Great.
I believe you are on the right track. Ruby 1.8 RPM uses some patches from upstream Fedora packages and one patch from RBEL. Don't think it's the patches causing the issue, but maybe the build flags are.
I'm going to create different packages using different build flags and patches and see how it goes.
Uploaded new ruby packages for testing. They look solid!
Please, have a look if you are experiencing segfaults:
http://rbel.frameos.org/ruby-test/el5 (for RHEL5 distributions)
http://rbel.frameos.org/ruby-test/el6 (for RHEL6 distributions)
With the new ruby RPMs, I'm 3/3 for hitting this...
/usr/lib/ruby/gems/1.8/gems/chef-0.10.2/bin/../lib/chef/provider/package/yum.rb:67: [BUG] Segmentation fault
ruby 1.8.7 (2011-06-30 patchlevel 352) [x86_64-linux]
# rpm -q ruby
ruby-1.8.7.352-2.el5
Yes, I've been able to reproduce that too... I need to come up with a better cookbook to stress the build. Any help there greatly appreciated.
In the meantime, I've uploaded a new build (ruby-1.8.7.352-3.el5).
Needs much more testing so proceed with caution...
Another thing I've noticed is that the segfault can cause the service to halt and not recover (on CentOS 5.6), leaving the node in a state where it no longer checks in.
This can be worked around with some monit script to detect and restart the service if needed, but I'd rather not go that route.
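For anyone who needs a stopgap anyway, here is a minimal sketch of the same idea without monit: wrap a one-shot run and retry only when the child was killed by a signal, which is how a crashing interpreter normally ends. `run_with_crash_retry` is a hypothetical helper, the keyword arguments need Ruby 2.0 or later, and this hides the symptom rather than fixing anything.

```ruby
# Hypothetical helper: run argv, retrying only if the child process was
# terminated by a signal. Ordinary non-zero exits are returned as-is so
# real cookbook failures are not retried.
def run_with_crash_retry(argv, max_attempts: 3, pause: 0)
  attempts = 0
  loop do
    attempts += 1
    system(*argv)
    status = $?
    retry_needed = status.signaled? && attempts < max_attempts
    return { status: status, attempts: attempts } unless retry_needed
    sleep pause
  end
end

# Example call (assumes chef-client is on PATH and runs once, not daemonized):
#   outcome = run_with_crash_retry(["chef-client"], max_attempts: 2, pause: 30)
#   warn "gave up after #{outcome[:attempts]} attempts" if outcome[:status].signaled?
```

Capping the attempts matters here, since several reports above say an immediate rerun usually works but not always.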
I built some new RPMs for CentOS in hopes of resolving the segfault issue. So far 2/2 successful. To test you can simply bootstrap [1] and test Sergio's cookbook [2]. Alternatively, you can download from testing.aegisco.com [3]. These are only for i386 and x86_64 el5, but when they are confirmed working I'll build for other platforms as well.
[1] https://gist.github.com/1128513
[2] https://github.com/rubiojr/yum-stress-cookbook
[3] http://testing.aegisco.com/el5/
The new ruby build is ruby-1.8.7.352-1 and the new rubygems is rubygems-1.8.5-1.
ruby-1.8.7.352-1 from http://testing.aegisco.com/ solved the segfault for me too, thanks guys.
Have a look at James's mail:
http://lists.opscode.com/sympa/arc/chef/2011-08/msg00044.html
FrameOS RBEL and Aegis' repo have different purposes also:
http://blog.frameos.org/2011/04/14/announcing-rbel-frameos-org
We should answer the question of differences between aegisco and RBEL repos on the wiki in a more detailed but concise manner. I didn't know that you also maintain V8, nginx, and node.js packages; that's great.
The changes I made for the 1.8.7.352-1 build were to the underlying build system rather than the spec file (except for bumping the version). I switched to raw rpmbuild with Chef wrapping it, from using mock by hand.
I also re-found an old bug which was annoying to experience: The official ruby RPMs install to /usr/lib and are not effectively cleaned when upgraded. Because ruby defaults to a prefix of /usr/local/lib, this led to inconsistent results. Further, the rubygem package needs to be rebuilt in this case, because it inherits the prefix from ruby.
Another important lesson: do not run build systems on micro instances, especially if you're going to have to re-build many times.
If anyone who has experienced segfaults can provide answers to the following questions, it would be greatly appreciated.
ruby version [e.g., 1.8.7-352]
ruby install type [e.g., package, source]
ruby source [e.g., ruby.org, aegisco, rbel]
CentOS version [e.g., 5.5, 5.6]
CentOS source [e.g., AMI, iso, veewee]
Is prelink enabled? [1]
What resource bombed?
Does the error ever occur on the first run?
When does the first error tend to occur?
After the error occurs, do future runs work?
It would also be ideal to get traces for these segfaults.
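The machine-readable part of that questionnaire could be collected with something like the sketch below. `try_command` and `segfault_report` are hypothetical helpers, and the options-hash form of `IO.popen` needs Ruby 1.9 or later, so on an affected 1.8.7 box the same `rpm -q` queries would have to be run by hand. The prelink entry only reports whether the package is installed, not whether prelinking is enabled, and questions such as which resource failed still need a human answer.

```ruby
require "rbconfig"

# Hypothetical helper: run a command and return its trimmed stdout,
# or nil if the tool is missing or exits non-zero.
def try_command(*argv)
  output = IO.popen(argv, err: File::NULL, &:read)
  $?.success? ? output.strip : nil
rescue SystemCallError
  nil
end

# Hypothetical helper: gather the environment facts asked for above.
def segfault_report
  release_file = "/etc/redhat-release"
  {
    "ruby version"    => "#{RUBY_VERSION}-p#{RUBY_PATCHLEVEL}",
    "ruby platform"   => RUBY_PLATFORM,
    "ruby prefix"     => RbConfig::CONFIG["prefix"],
    "ruby package"    => try_command("rpm", "-q", "ruby"),
    "OS release"      => File.exist?(release_file) ? File.read(release_file).strip : nil,
    "prelink package" => try_command("rpm", "-q", "prelink")
  }
end

segfault_report.each do |question, answer|
  puts "#{question}: #{answer || 'unknown'}"
end
```

A nil "ruby package" alongside a "ruby prefix" outside /usr is a reasonable hint that the interpreter was built from source rather than installed from an RPM.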
The provider was rewritten in 0.10.2. Would it be possible to rewrite the patch against that? Or I can do it if you'd like.