Details
Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 0.8.14, 0.10.4
Fix Version/s: None
Component/s: None
Labels: None
Environment: chef 0.8.14 installed from gems, couchdb 0.11
Description
knife search node "*:*" -i always returns 1 result fewer than knife node list.
In addition, the line that is skipped is random.
I.e. if you run "knife search node "*:*" -i" many times (say 5 or 6), you get the same number of lines every time (with 1 missing), but the dropped line differs between runs (try diffing results from different runs).
I also tried searching via shef: it also omits 1 line from the result.
Another person (ptrstr) can reproduce this problem in a different environment (Ubuntu).
This is a big deal because it makes every recipe that relies on Solr search unreliable (meaning useless).
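Diffing the results of different runs, as suggested above, is easier to do programmatically. The following is a minimal Ruby sketch. It assumes the output of 'knife node list' and of 'knife search node "*:*" -i' has already been captured as arrays of node names. compare_runs is a hypothetical helper, and the node names are made up:

```ruby
# Hypothetical helper: compare one search run against the authoritative node
# list and report which names are missing and which are unexpected.
def compare_runs(node_list, search_result)
  expected = node_list.uniq
  got = search_result.uniq
  {
    missing: expected - got, # in the node list but not in the search result
    extra: got - expected    # returned by search but absent from the node list
  }
end

node_list = %w[web1 web2 db1 db2]  # made-up names
search_result = %w[web1 db1 db2]   # made-up search output with one node dropped
p compare_runs(node_list, search_result)
```

Running it against several captured search runs shows whether the missing name changes from run to run.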
Activity
Same here with Chef 0.8.10 and Chef 0.8.14.
The server is running Chef 0.8.14 on Debian squeeze (RubyGems install).
knife search node "*:*" returns the same number of results as Chef::Node.list.size, but one node appears twice in the search results, so another one is missing.
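An equal count plus one duplicate means one node is missing, assuming the search returns nothing outside the node list. A small helper like the one below can spot the duplicate. duplicates is hypothetical, and the sample names are made up:

```ruby
# Hypothetical helper: list the names that occur more than once in a result set.
def duplicates(names)
  names.group_by { |n| n }            # name => every occurrence of that name
       .select { |_, v| v.size > 1 }  # keep only names seen more than once
       .keys
end

results = %w[web1 web2 web2 db1]      # made-up search output
p duplicates(results)
```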
Weirdly, Solr also seems to give me the wrong number of hosts:
for a in $(seq 1 20); do
  http_proxy= curl -s -d indent=on -d 'q=hostname:[* TO *]' -d start=0 -d rows=2000 http://localhost:8983/solr/select/ | \
    awk -F '[<>]' '/X_CHEF_id_CHEF_X/ {print $3}' | \
    sort -u | \
    while read b; do
      http_proxy= curl -s http://localhost:5984/chef/$b | \
        sed -e 's/^.*,"name":"\([^"]\+\)","attributes.*$/\1/'
    done > /tmp/solr$a
  wc -l /tmp/solr$a
  md5sum /tmp/solr$a
done
This gives 122 unique entries, always the same ones.
for a in $(seq 1 20); do
  knife node list | awk '/"/ {print $2}' > /tmp/node$a
  wc -l /tmp/node$a
  md5sum /tmp/node$a
done
This gives 123 unique entries, always the same ones.
for a in $(seq 1 50); do
  knife search node "*:*" -i | sort -u > /tmp/search$a
  wc -l /tmp/search$a
  md5sum /tmp/search$a
done
This gives different results: the number of results varies between 120 and 123, and the missing nodes differ from run to run.
OK, that might help: the node missing from the Solr response is the node I created manually to use shef.
Since it never registered, I guess it was never indexed.
'knife search node "*:*"' returns it, though. It might be a paging issue.
I'd also be curious to know why knife search sometimes returns fewer than Chef::Node.list.size - 1 results.
We had the same issue. Something like 20 nodes attached to a chef server. The nodes are on Rackspace cloud, the chef server is in our colo. Our nodes check in every 5 mins. We had incredibly unstable search results in our recipes. Sometimes we'd get missing nodes, other times we'd get duplicated nodes. We are on CentOS 5.3, couchdb 0.10, chef 0.8.8
I have had pretty good success setting the search result size to a number far in excess of the number of results I expect. For example, from a recipe:
search(:node, "mysearch", nil, 0, 50000) do |node|  # index, query, sort, start, rows
  # wizardry
end
OK, thanks Eric, that looks like a working workaround.
I'm very surprised this bug doesn't get more interest from the devs.
shef:recipe >
require "md5"
(1..50).each do |t|
  z = search(:node, "*:*", nil, 0, 50000)[0].collect { |n| n.name }.sort.uniq
  puts "#{z.size} #{MD5.digest(z.to_yaml).unpack("H*").first}"
  f = File.open("/tmp/shef#{t}", "w")
  f.write(z.join("\n"))
  f.close
end; nil
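The shef session above uses [0] on the blockless search result to get the rows. Assume that this result is [rows, start, total]. That matches the Chef::Search::Query code of this era as far as we know, but check it against your version. Under that assumption, a recipe can at least detect when a single large-rows request still did not cover every match. truncated? is a hypothetical helper, and the data below is fake. It can only catch a short page. It cannot detect the duplicate/missing pairs that come from paging:

```ruby
# Assumption: a search call without a block returns [rows, start, total].
# Hypothetical helper: true when matching documents exist beyond the rows received.
def truncated?(result)
  rows, start, total = result
  start + rows.size < total
end

full    = [%w[web1 web2 db1], 0, 3] # fake result: every match returned
partial = [%w[web1 web2], 0, 3]     # fake result: one match not returned
puts truncated?(full)
puts truncated?(partial)
```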
I can confirm the above workaround works. Also, when I stop the solr-indexer and run 'knife search node "*:*" -i' multiple times, I get consistent results 100% of the time.
Same here: results are OK when the solr-indexer is stopped. As soon as I start it again, the wrong results come back.
Here's a workaround that fixed it for me (at least knife is now consistent with search; I don't know about everything else):
1. cd to your gem directory (e.g. /var/lib/gems/1.8/gems)
2. grep -r "rows=20" .
3. In every file found (two for me), replace 20 with a large number such as 50000.
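The three manual steps above can be scripted. This is a sketch, not a tested tool: patch_text and patch_rows are hypothetical helpers that edit files in place, so back up the gem directory first. One deliberate difference from the plain grep is that the regex ends with \b, so it leaves rows=200 alone:

```ruby
require "find"

ROWS_DEFAULT = /rows=20\b/ # \b keeps "rows=200" from matching

# Returns the rewritten text, or nil if the text contains no "rows=20".
def patch_text(text, new_rows = 50_000)
  return nil unless text.match?(ROWS_DEFAULT)
  text.gsub(ROWS_DEFAULT, "rows=#{new_rows}")
end

# Hypothetical helper: walk gem_dir, rewrite each file containing "rows=20",
# and return the paths that were changed. Files are read and written as binary.
def patch_rows(gem_dir, new_rows = 50_000)
  changed = []
  Find.find(gem_dir) do |path|
    next unless File.file?(path)
    patched = patch_text(File.binread(path), new_rows)
    next unless patched
    File.binwrite(path, patched)
    changed << path
  end
  changed
end
```

For example, patch_rows("/var/lib/gems/1.8/gems") would apply the change under the directory from step 1 and list the files it touched.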
Okay guys, the next thing we should look at is whether we can get consistent results by sorting nodes by name. Can you try something like:
list = []
search(:node, "*:*", "name asc") { |node| list << node.name }
Run that a few times and see whether it is consistent.
chef:recipe > z=search(:node, "*:*", "name asc")[0].collect{|n| n.name}.sort.uniq
returns an error.
The Solr error is:
SEVERE: java.lang.RuntimeException: there are more terms than documents in field "name", but it's impossible to sort on tokenized fields
Sorting by hostname doesn't fully fix it either:
(1..50).each do |t|
  z = search(:node, "*:*", "hostname asc")[0].collect { |n| n.name }.sort.uniq
  puts "#{z.size} #{MD5.digest(z.to_yaml).unpack("H*").first}"
  f = File.open("/tmp/shef#{t}", "w")
  f.write(z.join("\n"))
  f.close
end; nil
Errors still happen, though less often: 2 errors in 50 runs.
The 2 incorrect result sets are identical: each is missing 1 host.
If paging relies only on the "start" query argument, what prevents the results from changing between two paging queries?
In this example, results are not consistent at all.
for a in $(seq 1 100); do
  s=0
  while http_proxy= curl -s -d indent=on -d 'q=hostname:[* TO *]' -d start=$s -d rows=20 \
      http://localhost:8983/solr/select/ | \
      awk -F '[<>]' 'BEGIN {e=1} /X_CHEF_id_CHEF_X/ {print $3; e=0} END {exit e}'; do
    sleep 2
    s=$(($s+20))
  done | sort -u > /tmp/solr-id$a
  wc -l /tmp/solr-id$a
  md5sum /tmp/solr-id$a
done
If you add -d 'sort=hostname asc' to sort by hostname, the results are more consistent, but I still managed to see 2 errors in 50 tries.
Is there a problem with fetching all IDs from Solr without paging?
It seems there are so many changes to the Solr index that it might be a good idea not to page through Solr results at all (that's what Eric's workaround does anyway).
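Why can a single re-index between two page requests produce exactly one duplicate and one missing node? The toy Ruby model below offers one possible explanation. It is not Solr code, and it rests on two assumptions. First, for "*:*" all documents score equally, so results come back in internal index order. Second, re-indexing a document (a delete plus an add) moves it to the end of that order. Under those assumptions, a node that checks in after the first page has been read shows up again on a later page. Every node behind it also shifts forward by one, so one unread node slips into the range already paged past. Sorting on a unique, stable key keeps the order fixed across such updates, although it would not help if nodes were added or deleted mid-paging:

```ruby
# Toy model, not Solr code. Assumption: results come back in index order, and
# re-indexing a document moves it to the end of that order.
def page_through(index, rows, order: :index)
  seen = []
  start = 0
  page_no = 0
  loop do
    # :by_id mimics sorting on a unique, stable key such as the document id
    view = order == :by_id ? index.sort : index
    page = view[start, rows] || []
    break if page.empty?
    seen.concat(page)
    page_no += 1
    yield index, page_no if block_given? # caller may change the index between requests
    start += rows
  end
  seen
end

# node01 checks in (and is re-indexed) right after the first page is fetched.
reindex_node01 = lambda do |index, page_no|
  index.push(index.delete("node01")) if page_no == 1
end

nodes = (1..6).map { |i| format("node%02d", i) }
unstable = page_through(nodes.dup, 2, &reindex_node01)
stable = page_through(nodes.dup, 2, order: :by_id, &reindex_node01)

puts "index order: #{unstable.size} rows, #{unstable.uniq.size} unique"
puts "sorted by id: #{stable.size} rows, #{stable.uniq.size} unique"
```

In this model the index-order run returns six rows, with node01 twice and node03 missing, while the id-sorted run returns each of the six nodes exactly once.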
I suppose that we need to change the way solr-indexer works.
It must update all nodes in one bulk operation instead of many single inserts.
Hey everyone, just a quick update on where we're at with this:
At this point, we're pretty late in the release cycle for Chef 0.9, so we probably will not be able to get any fix in for 0.9.0. After 0.9.0, we're planning to implement the ordering workaround, since that will be something that we can get out to you guys quickly, and goes a long way to mitigating the issue.
The full solution for this will be to ship the full set of IDs matching your search to the chef clients, and then the client will ask the server for a bulk get for each page of results. This will be a more invasive change on the server, possibly breaking backwards compatibility.
Let me know if you have any questions.
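The proposed full solution can be sketched as follows: the server returns every matching ID up front, and the client then bulk-fetches one page of IDs at a time. This illustrates the idea under assumptions and is not the planned implementation. bulk_get stands for a hypothetical server call that takes a batch of IDs and returns the matching documents, with nil for any ID that no longer exists. A Hash plays the datastore. Because the ID list is fixed when the search runs, later re-indexing cannot shift pages, and a node deleted in the meantime is simply skipped:

```ruby
# Sketch of "ship all matching IDs, then bulk-get page by page".
# bulk_get is a hypothetical callable: array of IDs in, documents out
# (nil for IDs that no longer exist).
def fetch_all(ids, page_size, bulk_get)
  ids.each_slice(page_size).flat_map do |batch|
    bulk_get.call(batch).compact # drop documents deleted since the search ran
  end
end

store = {
  "id1" => { "name" => "web1" },
  "id2" => { "name" => "web2" },
  "id3" => { "name" => "db1" }
}
bulk_get = ->(batch) { batch.map { |id| store[id] } } # stand-in for a bulk read

nodes = fetch_all(%w[id1 id2 id3 id4], 2, bulk_get) # id4 was deleted meanwhile
puts nodes.map { |n| n["name"] }.join(", ")
```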
Ok, looks like sorting on 'X_CHEF_id_CHEF_X asc' might be a "safe" value to sort on. If you guys test it out, let me know how it works out.
I have pushed the "band-aid" fix for this, meaning we sort on X_CHEF_id_CHEF_X and return up to 1000 results by default. This is not the correct solution to the problem but it should mitigate the issue until we can implement the correct one. So I'm downgrading the issue to critical.
Hey, this huge bug is still there.
I've been experiencing it with Chef 0.10.8 on the server and Chef 0.10.4 on the client side.
The query is not ":" but "star : star" (Confluence eats asterisks):
knife search node "*:*" -i