Friday, May 6, 2011

hpricot with firebug's XPath

I'm trying to extract some info from a table-based website with hpricot. I get the XPath from Firebug.

/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr

This doesn't work... Apparently Firebug's XPath is the path in the rendered HTML, and not in the actual HTML served by the site. I read that removing tbody may resolve the problem.

I try with:

/html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr

And it still doesn't work... I do a little more research, and some people report that their XPath works once they remove the numbers, so I try this:

/html/body/div/table/tr/td/table/tr/td/table/tr/td/table/tr/td/table/tr

Still no luck...

So I decide to do it step by step like this:

(doc/"html/body/div/table/tr").each do |aaa|
  (aaa/"td").each do |bbb|
    pp bbb
    (bbb/"table/tr").each do |ccc|
      pp ccc
    end
  end
end

I find the info I need in bbb, but not in ccc.

What am I doing wrong, or is there a better tool to scrape HTML with long/complex XPath?

From stackoverflow
  • You are probably better off using hpricot's CSS parsing instead of XPath. _why was talking about possibly deprecating XPath at one point.

    Do you have a better example of the data? Do they use CSS ids or classes that are easily referenced?

    It's much easier to search like:

    doc.search("#id_tag > table > tr.class_tag > td").each do |aaa|
        aaa.search("blah > blah").each do |bbb|
            bbb.inner_html
        end
    end

    There was an older page on _why's website (which I can't seem to find now) that discussed hpricot, and some of the comments hinted that the CSS version was a better choice than XPath for nested searches similar to what you are doing.

    Wish I could give a better answer, but I seriously recommend giving the CSS method a shot and see how it goes before tearing your hair out with XPath.

  • I'm now using CSS and I "figured" it out with this great tool: www.selectorgadget.com

  • It's probably worth noting that Nokogiri uses the same API as Hpricot, but also supports XPath expressions.

  • Your problem is in XPather (or Firebug's XPath). I think Firefox internally fixes badly formatted tables so that they have a tbody element even if there is none in the HTML. Nokogiri does not do that; instead it allows a tr tag to sit directly inside a table.

    So there's a big chance your path looks like this to Nokogiri:

    /html/body/div/table/tr/td/table/tr[2]/td/table/tr/td[2]/table/tr[3]/td/table[3]/tr
    

    and that's how Nokogiri will accept it :)

    You might want to check out this:

    require 'open-uri'
    require 'nokogiri'
    
    class String
      def relative_to(base)
        (base == self[0..base.length-1]) &&
          self[base.length..-1]
      end
    end
    
    module Importer
      module XUtils
        module_function
    
        def match(text, source)
          case text
          when String
            source.include? text
          when Regexp
            text.match(source)
          when Array
            text.all? {|tt| source.include?(tt)}
          else
            false
          end
        end
    
        def find_xpath (doc, start, texts)
          xpath = start
          found = true
    
          while(found)
            found = [:inner_html, :inner_text].any? do |m|
              doc.xpath(xpath+"/*").any? do |tag|
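                # Replace non-breaking spaces (U+00A0) with plain spaces.
                tag_text = tag.send(m).strip.gsub(/\u00A0+/, ' ')
                # Array() lets callers pass a single string as well as an array.
                if tag_text && Array(texts).all?{|text| match(text, tag_text)}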
                  xpath = tag.path.to_s
                end
              end
            end
          end
    
          (xpath != start) && xpath
        end
    
        def fetch(url)
          # Kernel#open no longer opens URLs in Ruby 3.0+; open-uri provides URI.open.
          Nokogiri::HTML(URI.open(url).read)
        end
      end
    end
    

    I wrote this little module to help me work with Nokogiri when web scraping and data mining.

    Basic usage:

     include Importer::XUtils
     doc = fetch("http://some.url.here") # http:// is important!
    
     base = find_xpath(doc, '/html/body', ["what to find1", "What to find 2"]) # when you provide an array, it finds the element containing ALL the words
    
     precise = find_xpath(doc, base, ["what to find1"])
     precise.relative_to base
    

    Good luck

  • There is no TBODY tag in your HTML code. Firebug generates it automatically.
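
Several of the answers above come down to the same fix: drop the tbody steps that the browser adds before handing the path to the parser. A minimal sketch of that, assuming the page's own HTML really has no tbody tags. The helper name strip_tbody is made up for illustration and is not part of Hpricot or Nokogiri; the path is the one from the question.

```ruby
# Hypothetical helper (not an Hpricot or Nokogiri method): remove the
# tbody steps that Firebug reports from the browser's rendered DOM.
def strip_tbody(xpath)
  # Match "/tbody" or "/tbody[n]" only when it is a whole path step,
  # i.e. followed by another "/" or by the end of the string.
  xpath.gsub(%r{/tbody(\[\d+\])?(?=/|\z)}, '')
end

firebug_path = '/html/body/div/table/tbody/tr/td/table/tbody/tr[2]/td/' \
               'table/tbody/tr/td[2]/table/tbody/tr[3]/td/table[3]/tbody/tr'

source_path = strip_tbody(firebug_path)
puts source_path
```

The result can then be passed to doc.xpath (Nokogiri) or doc.search (Hpricot). If the site's markup does contain explicit tbody tags, this would remove steps that are really there, so it is worth checking the raw source first.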
