{"id":1666,"date":"2007-07-03T05:05:43","date_gmt":"2007-07-02T21:05:43","guid":{"rendered":"http:\/\/ihower.idv.tw\/blog\/archives\/1666"},"modified":"2011-04-09T02:15:34","modified_gmt":"2011-04-08T18:15:34","slug":"hpricot-parsing-html","status":"publish","type":"post","link":"https:\/\/ihower.tw\/blog\/1666-hpricot-parsing-html","title":{"rendered":"Parsing HTML with Hpricot"},"content":{"rendered":"<p>Update (2009): the go-to library for XML parsing in Ruby is now <a href=\"http:\/\/nokogiri.org\/\">nokogiri<\/a> (via <a href=\"http:\/\/www.engineyard.com\/blog\/2009\/xml-parsing-in-ruby\/\">The State of XML Parsing in Ruby (Circa 2009)<\/a>).<\/p>\n<p><img decoding=\"async\" src=\"http:\/\/code.whytheluckystiff.net\/hpricot\/chrome\/site\/images\/hpricot-small.png\" \/><\/p>\n<p><a href=\"http:\/\/code.whytheluckystiff.net\/hpricot\/\">Hpricot<\/a> is a fast, easy-to-use Ruby HTML parser inspired by <a href=\"http:\/\/jquery.com\/\" class=\"ext-link\"><span class=\"icon\"><font color=\"#0000bb\">jQuery<\/font><\/span><\/a>. It has two main strengths: 1. speed, because its core has been rewritten in C; 2. a convenient interface that lets you query with CSS selectors, element IDs, tag types, and so on.<\/p>\n<p>It can also parse XML, cope with invalid HTML, and even modify the document structure.<\/p>\n<p>First, install it:<\/p>\n<p><code>gem install hpricot<\/code><\/p>\n<p>Basic usage<\/p>\n<p><code>require 'rubygems'<br \/>\nrequire 'hpricot'<br \/>\ndocument = &lt;&lt;END<br \/>\n&lt;ul&gt;<br \/>\n&lt;li&gt;first item&lt;\/li&gt;<br \/>\n&lt;li&gt;second item&lt;\/li&gt;<br \/>\n&lt;\/ul&gt;<br \/>\nEND<br \/>\ndoc = Hpricot.parse(document)<br \/>\n(doc\/'li').each do |item|<br \/>\nputs item.inner_html<br \/>\nend<br 
\/>\n<\/code><\/p>\n<p>This prints first item and second item. Here (doc\/'li') is equivalent to doc.search('li'): it finds every li tag.<\/p>\n<p>Advanced usage<\/p>\n<p>You can nest loops, and besides inner_html you can also read attribute values such as attributes['href']. I practiced on a National Chiao Tung University (NCTU) site:<\/p>\n<p><code>require 'rubygems'<br \/>\nrequire 'hpricot'<br \/>\nrequire 'open-uri'<br \/>\nurl = \"<a href=\"http:\/\/www.pac.nctu.edu.tw\/news\/news_msg.php\">http:\/\/www.pac.nctu.edu.tw\/news\/news_msg.php<\/a>\"<br \/>\ndoc = Hpricot(open(url))<br \/>\ndoc.search('table tr td div.tbCopy font').each do |item|<br \/>\n(item\/'a').each do |nav|<br \/>\nputs nav.attributes['href']<br \/>\nputs nav.inner_html<br \/>\nend<br \/>\nend<\/code><\/p>\n<p>My takeaway: when the HTML is well structured, Hpricot can quickly traverse to the spot you want. For example, if elements have IDs or classes set, doc.search('table#myid') or doc.search('span.myclass') gets you there directly.<\/p>\n<p>Even for an old-style HTML site like the one above, finding the right CSS selector is not hard with the Firefox extension <a href=\"https:\/\/addons.mozilla.org\/en-US\/firefox\/addon\/60\">Web Developer<\/a>: choose CSS &gt; View Style Information and you can see the path.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Update (2009): the go-to library for XML parsing in Ruby is now nokogiri (via T &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/ihower.tw\/blog\/1666-hpricot-parsing-html\" 
class=\"more-link\">Read more<span class=\"screen-reader-text\">: Parsing HTML with Hpricot<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[5,31],"tags":[],"class_list":["post-1666","post","type-post","status-publish","format-standard","hentry","category-programming","category-ruby","entry"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1q6tG-qS","jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/1666","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/comments?post=1666"}],"version-history":[{"count":3,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/1666\/revisions"}],"predecessor-version":[{"id":5584,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/posts\/1666\/revisions\
/5584"}],"wp:attachment":[{"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/media?parent=1666"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/categories?post=1666"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ihower.tw\/blog\/wp-json\/wp\/v2\/tags?post=1666"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}