‘ran out of buffer space on element’ errors in Hpricot

Hpricot is a great gem for parsing web pages, and combined with the automatic navigation provided by WWW::Mechanize, it makes it easy to write a robot that scrapes web sites.

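One problem, mentioned in this blog post, is that an ever-increasing number of ASP.NET web sites put huge amounts of data in a single HTML attribute (typically the value of the hidden __VIEWSTATE field). An attribute that large overflows Hpricot's parse buffer, and parsing fails with the ‘ran out of buffer space on element’ error.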

Instead of using the methods provided by Hpricot and WWW::Mechanize to work around this issue (as described in the blog post), I used the following monkey patch.


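require 'hpricot'

module WWW
  class Mechanize
    # Raise Hpricot's parse buffer to 262144 bytes (256 KB) so that very
    # large attribute values no longer overflow it.
    Hpricot.buffer_size = 262144  # added by naofumi
  end
end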

You can put it in an initializer if you are working in Rails.
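
The 262144 in the patch is a fixed value. If you would rather derive the size from the page you are about to parse, something like the following might work. This is only a rough sketch: suggested_buffer_size is a hypothetical helper (it is not part of Hpricot or WWW::Mechanize), and both the idea that the buffer has to hold the longest single tag and the 16384 starting value are assumptions of this example. The regular expression is also naive and will cut a tag short if an attribute value contains a ">".

```ruby
# Hypothetical helper, not part of Hpricot: guess a buffer size by finding
# the longest tag in the page and doubling a starting size until it fits.
# ASSUMPTION: the buffer must be larger than the longest single tag.
# ASSUMPTION: 16_384 is a sensible starting point.
def suggested_buffer_size(html, minimum = 16_384)
  # Naive tag match: everything from "<" up to the next ">".
  longest = html.scan(/<[^>]*>/).map { |tag| tag.bytesize }.max || 0
  size = minimum
  size *= 2 while size <= longest
  size
end

# A stand-in for an ASP.NET page with an oversized hidden field.
html = '<input type="hidden" name="__VIEWSTATE" value="' + ('A' * 100_000) + '" />'
size = suggested_buffer_size(html)

# Apply the size only if the hpricot gem is actually installed.
begin
  require 'hpricot'
  Hpricot.buffer_size = size
rescue LoadError
  # hpricot is not available here; nothing to configure.
end
```

Since the size has to be set before the page is parsed, this only helps if you fetch the raw HTML yourself first; otherwise the fixed value in the patch is the simpler option.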