Details
-
Type:
Bug
-
Status:
Closed
-
Priority:
Blocker
-
Resolution: Fixed
-
Affects Version/s: 0.6.12
-
Fix Version/s: 6.14.0
-
Component/s: None
-
Labels:None
-
Environment:Hide
Chef 0.10.8
Ohai 0.6.12-
- *Note: This issue is really tied to Ohai 0.6.12, not Chef 0.10.8.
Amazon Linux system-release-2011.02 (Beta)
Chef and Ohai are installed installed during ec2 bootstrap as gems:
gem install ohai --no-rdoc --no-ri --verbose gem install chef --no-rdoc --no-ri --verbose <%= bootstrap_version_string %>
Instances created via knife ec2 create.
ShowChef 0.10.8 Ohai 0.6.12 *Note: This issue is really tied to Ohai 0.6.12, not Chef 0.10.8. Amazon Linux system-release-2011.02 (Beta) Chef and Ohai are installed installed during ec2 bootstrap as gems: gem install ohai --no-rdoc --no-ri --verbose gem install chef --no-rdoc --no-ri --verbose <%= bootstrap_version_string %> Instances created via knife ec2 create. -
Description
We have a recipe "instance_tag" which tags our ec2-instance. In this recipe, we reference the ec2 attribute which is generated by Ohai to get the instance_id by referencing _*node['ec2']['instance_id'].
When a node is converged by calling chef-client on initial run and from the command line, the recipe executes without error.
When a node is converged by the demonized chef-client (we demonize via init.d), we get the error: NoMethodError: undefined method `[]' for nil:NilClass.
This reference is performed inside of an AmazonEC2Tag_ec2tag LWRP block (we have modified significantly AmazonEC2Tag, will make code available if required):
AmazonEC2Tag_ec2tag 'index' do aws_access_key aws['aws_access_key_id'] aws_secret_access_key aws['aws_secret_access_key'] value "#{instance_index}" instance_id node['ec2']['instance_id'] action :create end
When the issue started occurring, I noticed that it was only occurring on instances created on Thursday, March 22, but not on instances created the day before (Wednesday, March 21).
From the Chef irc - thanks to matschaffer, btm, and hoover_damm - I determined that Ohai had been updated on Thursday, March 22.
I confirmed an instance created on Thursday, March 22 was running Ohai 0.6.12 vs. 0.6.10 for an instance created on Wednesday, March 21.
Once I rolled back the use of Ohai 0.6.12, the issue quit occurring.
Issue Links
Activity
- All
- Comments
- History
- Activity
- Transitions Summary
Comment from btm @ 12:59 on March 23, 2012
Hey there,
Ohai 0.6.12 didn't modify ec2 information, but it did modify some basic network collection plugins.
I need to dig deeper, but the EC2 plugin has a two phase implementation. First it looks for a common mac address in the arp table, then it tries to contact the metadata server with a short timeout. In the past, users have seen an issues when the arp table empties on daemonized chef-client runs without network activity, and when something else slows down the Ohai run.
How is Chef installed on your system? Via Rubygems, distro packages, or Omnibus?
Bryan
All of the details with respect to configuration, etc. are captured in the above issue description.
Comment from btm @ 13:40 on March 23, 2012
I couldn't reproduce this under these circumstances:
Ubuntu 11.10 (oneiric) on EC2 with Chef 0.10.8 from Omnibus 0.10.8-2 with Ohai upgraded to 0.6.12:
$ knife node run list add i-d4e1bab0 role[base] # recipe[runit], recipe[chef-client]
$ knife ssh name:i-d4e1bab0 'sudo chef-client' -a ec2.public_hostname -x ubuntu
$ knife node show i-d4e1bab0 -a ohai_time
ohai_time: 1332534857.23604
$ knife node show i-d4e1bab0 -a ohai_time
ohai_time: 1332534953.338546
$ knife node show i-d4e1bab0 -a ec2
[snip - success]
As noted above it the description, the issue occurred:
- On Amazon Linux
- At a minimum, should attempt to reproduce on Centos vs. Ubuntu
- When node was converged by demonized chef-client
- I saw the issue by reviewing /var/log/chef/client.log
Comment from btm @ 13:56 on March 23, 2012
After setting node['chef_client']['interval'] = 60, I did this five times:
- Wait 1 minute
- knife node show i-d4e1bab0 -a ohai_time
- Ensure above number has changed, signifying a chef run has occurred and saved.
- knife node show i-d4e1bab0 -a chef_packages
- Review output of last command to verify Ohai 0.6.12 ran
- knife node show i-d4e1bab0 -a ec2
- Check that EC2 data was collected.
This worked correctly every time.
Is it possible that you have a recipe that is manipulating ec2? Because it works fine when you downgrade I think we can assume there isn't a typo of '=' when it should be '=='.
BTM -
First, I don't think the steps you are taking to reproduce the issue are equivalent to where I encountered the issue.
I believe a much better test case would be to have a recipe that references node['ec2']['instance_id'] and allow a demonized chef-client to execute that recipe.
My node['chef_client']['interval'] is 300 seconds.
No, I am not manipulating the 'ec2' attribute. In fact the only place I reference node['ec2']['instance_id'] is in the snippet of code I provided to you above.
Comment from btw @ 14:36 on March 23, 2012
I've also tested on Ubuntu 11.10 and 10.04 using Omnibus 0.10.8-3 (Which includes Ohai 0.6.12) in the same manner and I'm consistently getting an ec2 attribute saved to the server after each run chef client run under runit.
It'd be great if you could run chef-client daemonized with '-l debug' and get some output.
BTM -
Will do.
As you can understand, since this impacted our production environment, I backed it out as soon as possible and do not have this running in production (or staging) any longer.
In fact, I modified our bootstrap file to install specific versions of the ohai and chef gems.
I do not believe the environments in which you are attempting to recreate the issue mirror where I see it manifesting.
I will work to get you more data, but I do ask that you attempt to reproduce in an environment (see above for details) that mirrors what I am using.
Specifically:
- Amazon Linux
- Ruby Gem Install
- Reference to node['ec2']['instance_id'] in a recipe
Also, it should be noted that I saw this issue by tailing /var/log/chef/client.log. The issue only occurred when the node was converged by a demonized chef-client.
I transferred all of the comments from the Help Discussion to here. Going forward, let's address the issue here.
Comment from Paul Mooring @ 1645 on March 23, 2012
Hello,
We can try and re-create this issue on Amazon Linux, were you using a public AMI and chef-bootstrap I can use to re-create this? If you've already reverted ohai and are running stable I will likely leave this for btm to take a look at on Monday, if you think it's more urgent than that let me know and I'll see what I can do tonight.
thanks for using Chef!
Paul Mooring
Hi Paul -
Thank you for your reply.
While I have worked around this issue, I believe it is fairly serious and has the potential to blow up other customers.
That said, I don't think you need to work the weekend on it.
If you would get me a list of the items that would help you in reproducing it and any tools for introspection I could use to get you more data, I will set up a reproducible case and forward any and all data you need.
Yours,
Jordan
Is it possible for me to be given access to issue I report in order to edit them?
Jordan, what AMI are you using?
I found a list here https://aws.amazon.com/amazon-linux-ami/ and added one to here http://wiki.opscode.com/display/chef/AMI+List+for+Developers
For m1.large instances, our AMI is ami-8e1fece7
For cc1.4xlarge instances, our AMI is ami-321eed5b
Also, how are ruby and rubygems installed? From packages or source? What package or source version?
And how are you populating /etc/init.d/chef-client? From the chef-client cookbook or your own?
Ruby install is performed at part of the bootstrap by the bootstrap script below.
Note This has been modified to version lock the installs of Ohai and Chef. Previously, this was not done.
Bootstrap Script
bash -c ' <%= "export http_proxy=\"#{knife_config[:bootstrap_proxy]}\"" if knife_config[:bootstrap_proxy] -%> if [ ! -f /usr/bin/chef-client ]; then yum install -y ruby ruby-devel gcc gcc-c++ libxml2-devel libxslt-devel automake autoconf make yum install -y rubygems fi gem update --system gem update gem install yaml --no-rdoc --no-ri --verbose gem install ohai --no-rdoc --no-ri -v 0.6.10 --verbose gem install chef --no-rdoc --no-ri -v 0.10.8 --verbose <%= bootstrap_version_string %> gem install multi_json -v '1.0.3' --no-rdoc --no-ri --verbose gem install aws-sdk --no-rdoc --no-ri --verbose gem install fog --no-rdoc --no-ri --verbose mkdir -p /etc/chef ( cat <<'EOP' <%= validation_key %> EOP ) > /tmp/validation.pem awk NF /tmp/validation.pem > /etc/chef/validation.pem rm /tmp/validation.pem ( cat <<'EOP' <%= config_content %> EOP ) > /etc/chef/client.rb ( cat <<'EOP' <%= { "run_list" => @run_list }.to_json %> EOP ) > /etc/chef/first-boot.json <%= start_chef %>'
Population of /etc/init.d/chef-client is performed by the chef-client cookbook. Below is a the script.
/etc/init.d/chef-client
#!/bin/bash # # chef-client Startup script for the Chef client # # chkconfig: - 98 02 # description: Client component of the Chef systems integration framework. ### BEGIN INIT INFO # Provides: chef-client # Required-Start: $local_fs $network $remote_fs # Required-Stop: $local_fs $network $remote_fs # Should-Start: $named $time # Should-Stop: $named $time # Short-Description: Startup script for the Chef client # Description: Client component of the Chef systems integration framework. ### END INIT INFO # Source function library . /etc/init.d/functions exec="/usr/bin/chef-client" prog="chef-client" [ -e /etc/sysconfig/$prog ] && . /etc/sysconfig/$prog config=${CONFIG-/etc/chef/client.rb} pidfile=${PIDFILE-/var/run/chef/client.pid} lockfile=${LOCKFILE-/var/lock/subsys/$prog} logfile=${LOGFILE-/var/log/chef/client.log} interval=${INTERVAL-1800} splay=${SPLAY-20} options=${OPTIONS-} start() { [ -x $exec ] || exit 5 [ -f $config ] || exit 6 echo -n $"Starting $prog: " daemon chef-client -d -c "$config" -L "$logfile" -P "$pidfile" -i "$interval" -s "$splay" "$options" retval=$? echo [ $retval -eq 0 ] && touch $lockfile return $retval } stop() { echo -n $"Stopping $prog: " killproc -p $pidfile chef-client retval=$? echo [ $retval -eq 0 ] && rm -f $lockfile return $retval } restart () { stop start } reload() { echo -n $"Reloading $prog: " killproc -p $pidfile chef-client -HUP retval=$? echo return $retval } force_reload() { restart } rh_status() { # run checks to determine if the service is running or use generic status status -p $pidfile $prog } rh_status_q() { rh_status >/dev/null 2>&1 } case "$1" in start) rh_status_q && exit 0 $1 ;; stop) rh_status_q || exit 0 $1 ;; restart) $1 ;; reload) rh_status_q || exit 7 $1 ;; force-reload) force_reload ;; status) rh_status ;; condrestart|try-restart) rh_status_q || exit 0 restart ;; *) echo $"Usage: $0 {start|stop|status|restart|condrestart|try-restart|reload|force-reload}" exit 2 esac exit $?
Amazon Linux 2011.09 / ami-4b814f22
Chef 0.10.8 / Ohai 0.6.12 installed via gem
sudo yum install -y ruby ruby-devel gcc gcc-c++ automake autoconf rubygems make
sudo gem install chef --no-rdoc --no-ri
sudo mkdir /etc/chef
# Set up /etc/chef/*
sudo chef-client
sudo cp /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/distro/redhat/etc/init.d/chef-client /etc/init.d
sudo cp /usr/lib/ruby/gems/1.8/gems/chef-0.10.8/distro/redhat/etc/sysconfig/chef-client /etc/sysconfig/
sudo chmod 755 /etc/init.d/chef-client
sudo mkdir /var/log/chef/
sudo vi /etc/sysconfig/chef-client # Set INTERVAL=60
sudo /etc/init.d/chef-client start
# recipe file "/tmp/ec2_ohai" do owner "root" group "root" content node['ec2']['instance_id'] end
Since this recipe has to access this variable on every run to create a temporary file to compare to the existing file, it successfully ensures that the variable is populated and gives us the ability to review its content in between the daemonized chef-client runs at one minute intervals. I tailed /var/log/chef/client.log for a half dozen runs, saw no failures and verified that /tmp/ec2_ohai was properly populated after the run.
0.6.12 didn't change the EC2 plugin, perhaps we're jumping to conclusions here. EC2 is initially detected by ohai by finding fe:ff:ff:ff:ff:ff in the arp cache. Is it possible this cache is expiring, as in OHAI-305?
Hi Brayn -
I can easily reproduce this issue. If you would like me to set up a test and capture data, I am more than willing to do so.
The issue never appeared prior to the upgrade to Ohai 0.6.12 and rolling back to 0.6.10 cure it. Which correlation is not causality, it would seem a pretty strong indication that something is going on with Ohai 0.6.12.
Tell me what data I should capture and how to capture it if it is beyond the basics like the chef client log.
Jordan
With reference to OHAI-305, is it possible that something was done that cause the cache to expire faster in Ohai 0.6.12?
I am shooting in the dark here, since I don't know much about this area. But I am willing to learn.
Jordan
The ARP cache is maintained by the system (see 'arp -a') not by Ohai.
First, if you reduce your interval to 60 seconds does it still happen? (1800 seconds is a long time to wait to reproduce something)
Second, if you stay ssh'd into the node with a 'tail -F /var/log/chef/client.log' does it still occur? This should generate network traffic when the Chef run starts which will keep the arp entry to the gateway in the cache.
Third, if those don't work, does this happen on a system with only the recipe I wrote above in the run list?
I ran my test recipe overnight at 60 second intervals without an issue.
I ran my test recipe overnight at 1800 second intervals without an issue.
Bryan -
Sorry for the delay in getting back to you. We had a deployment yesterday and that is where I want fully engaged.
We converge our nodes every 300 seconds. I am sorry that I did not make that clear in my report.
As soon as I clear another issue - I am seeing a hard crash in the Chef first run which wasn't there yesterday, which is captured in CHEF-3011 - and can bring up nodes once again, I will run a test to reproduce this issue and will capture any and all logs. What would you like me to get for you?
I can't imagine that any interval in between 60 and 1800 will make a difference in reproducing it for me. Let me know what you find.
Bryan -
What are the logs and data you want me to capture for you?
Jordan
I believe I've run into the same issue - I'm seeing this blow up when trying to access node[:ec2], and this definitely worked before. I've noticed that both Ohai and Fog have been updated (Ohai to 0.6.12, and Fog from 1.1.2 on Dec 16th 2011 to 1.2 on Mar 19, 1.3 on Mar 21 and 1.3.1 on Mar 27: http://rubygems.org/gems/fog) - the error I'm seeing is:
ERROR: NoMethodError: undefined method `[]' for nil:NilClass
And the code that's throwing the error is:
public_hostname = node[:ec2][:public_hostname]
which is inside of a recipe that's doing work with Route 53 using the Opscode AWS cookbook. This works during the initial bootstrap, but then fails on the first chef-client run after that.
I don't currently have a custom bootstrap script - I'm using knife ec2 server create to build up my instances. What's the fastest way to force older versions of the Ohai and Fog gems to be used so that I can restore functionality while waiting for a fix? Is there a way to do it with my setup, or should I move to a custom bootstrap?
EDIT: I also forgot to add that we're using the Opscode Hosted Chef server, and I confirmed in the node properties for the node in question that both node[:ec2] and node[:ec2][:public_hostname] were defined on the server.
I was able to work around this in the short term by following the workaround in http://help.opscode.com/discussions/problems/1443-nodeipaddress-null-on-amazon-windows-hosts, specifically:
gem uninstall ohai gem install ohai --version=0.6.10
After I did this and did a service chef-client restart, it successfully completed a run. I also discovered an important data point for this bug, which seems to indicate a timeout (or timeout-like behaviour) is indeed the root cause - when I manually kicked the chef client a few times after initial buildout, all of the runs succeeded (and I started to get very concerned that I was discovering a Heisenbug) but when I let it go for 30 minutes and then run on its own schedule, it choked again with the same error listed above. To summarize what I saw (with interim, uninteresting log bits trimmed):
[Tue, 03 Apr 2012 14:47:35 +0000] INFO: *** Chef 0.10.8 *** [Tue, 03 Apr 2012 14:47:37 +0000] INFO: Client key /etc/chef/client.pem is not present - registering [Tue, 03 Apr 2012 14:47:40 +0000] INFO: Starting Chef Run for mn1 [Tue, 03 Apr 2012 14:51:00 +0000] INFO: Chef Run complete in 200.087256 seconds -------initial buildout complete-------------------- [Tue, 03 Apr 2012 14:51:52 +0000] INFO: Starting Chef Run for mn1 [Tue, 03 Apr 2012 14:52:59 +0000] INFO: Chef Run complete in 67.5826 seconds [Tue, 03 Apr 2012 15:05:52 +0000] FATAL: SIGTERM received, stopping [Tue, 03 Apr 2012 15:05:53 +0000] INFO: Daemonizing.. [Tue, 03 Apr 2012 15:05:53 +0000] INFO: Forked, in 16489. Priveleges: 0 0 [Tue, 03 Apr 2012 15:05:56 +0000] INFO: Starting Chef Run for mn1 [Tue, 03 Apr 2012 15:06:03 +0000] INFO: Chef Run complete in 6.633753 seconds [Tue, 03 Apr 2012 15:21:26 +0000] FATAL: SIGTERM received, stopping [Tue, 03 Apr 2012 15:21:28 +0000] INFO: Daemonizing.. [Tue, 03 Apr 2012 15:21:28 +0000] INFO: Forked, in 16821. Priveleges: 0 0 [Tue, 03 Apr 2012 15:22:08 +0000] INFO: Starting Chef Run for mn1 [Tue, 03 Apr 2012 15:22:17 +0000] INFO: Chef Run complete in 8.680373 seconds ------end manual restarts, begin 30m normal run interval------ [Tue, 03 Apr 2012 15:52:47 +0000] INFO: Starting Chef Run for mn1 [Tue, 03 Apr 2012 15:52:50 +0000] ERROR: Running exception handlers [Tue, 03 Apr 2012 15:52:51 +0000] FATAL: Saving node information to /var/cache/chef/failed-run-data.json [Tue, 03 Apr 2012 15:52:51 +0000] ERROR: Exception handlers complete [Tue, 03 Apr 2012 15:52:51 +0000] ERROR: NoMethodError: undefined method `[]' for nil:NilClass [Tue, 03 Apr 2012 15:52:51 +0000] FATAL: Stacktrace dumped to /var/cache/chef/chef-stacktrace.out [Tue, 03 Apr 2012 15:52:51 +0000] ERROR: Sleeping for 1800 seconds before trying again
My chef-client interval is 1800, and splay is 60.
Jordan,
I need you to isolate the cause. I didn't hear back regarding these tests:
First, if you reduce your interval to 60 seconds does it still happen? (1800 seconds is a long time to wait to reproduce something)
Second, if you stay ssh'd into the node with a 'tail -F /var/log/chef/client.log' does it still occur? This should generate network traffic when the Chef run starts which will keep the arp entry to the gateway in the cache.
Third, if those don't work, does this happen on a system with only the recipe I wrote above in the run list?
Jesse, the easiest automated workaround would be to copy and rename the existing bootstrap script that you're using and modify the line that installs the ohai gem to install a specific version. I would also be curious if you kept a ping to the machine going or something else to generate network traffic which would keep the arp tables populated if it would help.
It's possible that Ohai 0.6.10 was doing something in another plugin which would generate some network traffic which 0.6.12 isn't doing. If your systems are relatively inactive its entirely possible that the ec2 plugin didn't fully run. In this case, you could modify your chef-client config to set log_level to debug and then the full log from the failed run would tell us that.
Bryan,
I have a question, since the answer is unclear to me from this comment thread - is this currently considered a bug in Ohai 0.6.12 with a solution being sought, or are you still trying to determine whether this is a bug at all? It's consistently reproducible in our environment, and downgrading Ohai 0.6.12 to Ohai 0.6.10 fixes it every time.
I've tried extensively to reproduce it and cannot. I'm trying to figure out what is causing it. Ohai 0.6.12 made no changes to the ec2 plugin so while down-grading to 0.6.10 is interesting it doesn't tell us why this is happening to you. We use the arp entry to avoid making an unnecessary network call to a metadata server that may not be there. We've seen in OHAI-305 that this arp entry can time out, so I'm trying to rule this out as the cause of your issue and the cause of Jordan's issue.
I encountered a similar (possibly the same) bug this afternoon that was solved by downgrading to ohai 0.6.10. The node[:ec2] and node[:cloud] attributes were completely missing and the node[:ipaddress] attribute was blank. Downgrading to 0.6.10 brought all the attributes back, upgrading back to 0.6.12 took them away. Thanks to hoover_damn and newb_ for helping figure this out.
FWIW, I also began experiencing this inside aws::tag, and was able to reproduce it while ssh'd in, tail-following client.log on an ec2 m1.small, ami-c412cead, chef-0.10.8 & ohai-0.6.12 (installed from gems), with an interval of 60 seconds.
The behavior was similar to what others have described: the first daemonized client run would run through but subsequent runs would fail with a node['ec2'] nil dereference.
Replacing the run_list with a single recipe containing only "log node['ec2']['instance_id']" runs like clockwork, every time.
I was also unable to reproduce this behavior by sleeping for 75 seconds (the approximate time it takes to run our aws recipes) before logging node['ec2']['instance_id']. The single test recipe ran every time w/o encountering a nil dereference.
Lacking a definitive method for determining the cloud type on the host itself, the hint system developed in OHAI-267 will be our solution.
With Ohai 6.4.0, create /etc/chef/ohai/hints/ec2.json to enable EC2 attribute collection.
See Opscode Help Discussion for discussion between btw and myself on this issue.