Kev++

Tired of crappy software/websites!


Posts tagged with "lxml"

Parsing HTML with lxml.html's cssselect


Code
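from lxml.html import parse
from pprint import pprint

# Fetch the page and take the root element; cssselect() is an element method.
dom = parse('http://my.opera.com/gotovoid/blog/').getroot()
# Map each tag name to the trailing digit of its <li> class attribute.
tags = {e.find('a').text: int(e.get('class')[-1]) for e in dom.cssselect('div#tagcloud li')}
# Print the tags sorted by that number, largest first.
pprint(sorted(tags.items(), key=lambda x: x[1], reverse=True))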

Output
[('encoding', 5),
 ('python', 5),
 ('vim', 5),
 ('windows', 5),
 ('i18n', 5),
 ('bash', 5),
 ('crawler', 5),
 ('jquery', 4),
 ('sqlite', 4),
 ('google', 4),
 ('webpy', 4),
 ('sox', 4),
 ('json', 4),
 ('firebug', 4),
 ('linux', 4),
 ('apache', 4),
 ('diff', 4),
 ('api', 1),
 ('awk', 1),
 ('array', 1)]
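
The technique can be tried without the live page. Below is a minimal sketch, assuming a made-up tag cloud snippet: the markup and the class names (`w5`, `w4`, `w1`) are hypothetical and only mirror what the code above expects, namely an `li` whose class ends in a digit. It uses the XPath equivalent of the `div#tagcloud li` selector, so it does not need the separate `cssselect` package that recent lxml releases require for `.cssselect()`.

```python
from lxml.html import fromstring

# Hypothetical markup: the class names are assumptions, not the real page.
SNIPPET = """
<div id="tagcloud">
  <ul>
    <li class="w5"><a href="#">python</a></li>
    <li class="w4"><a href="#">jquery</a></li>
    <li class="w1"><a href="#">awk</a></li>
  </ul>
</div>
"""


def tag_weights(html):
    """Return {tag name: trailing digit of the <li> class} for a tag cloud."""
    root = fromstring(html)
    # XPath equivalent of the CSS selector 'div#tagcloud li'.
    items = root.xpath('//div[@id="tagcloud"]//li')
    return {li.find('a').text: int(li.get('class')[-1]) for li in items}


weights = tag_weights(SNIPPET)
print(sorted(weights.items(), key=lambda kv: kv[1], reverse=True))
```

Note that `int(...[-1])` only reads a single digit, so this approach assumes the number in the class name never exceeds 9.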

XPath for parsing HTML with a parallel structure


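When parsing HTML with a parallel (flat) structure, where the fields of each record are siblings instead of being nested in a container of their own, the following-sibling and preceding-sibling axes can be used to locate the several tags you need.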

Code
from lxml.html import parse
from json import dumps
dom = parse('http://oreilly.com/store/series/sc.csp')
books = []
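# Each book starts with a cover link: an <a> that directly contains an <img>.
for e in dom.xpath('//div[@id="booklist"]/a[img]'):
    url = e.attrib['href']
    cover = e.find('img').attrib['src']
    # The title is inside the first <span> sibling after the cover link.
    title = e.xpath('following-sibling::span[1]//b')[0].text.strip()
    # Authors are the later <a> siblings whose nearest preceding <span> links to this book's URL.
    authors = [i.text for i in e.xpath('following-sibling::a[preceding-sibling::span[1]/a/@href="{}"]'.format(url))]
    # The date is the text trailing the second <br>; the price is in the second <span>.
    date = e.xpath('following-sibling::br[2]')[0].tail.strip()
    price = e.xpath('following-sibling::span[2]')[0].text.strip()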
    books.append({
        'url':url,
        'cover':cover,
        'title':title,
        'authors':authors,
        'date':date,
        'price':price
    })

print(dumps(books, indent=4))
Output
[
    {
        "title": "XML Publishing with Adobe InDesign",
        "url": "http://oreilly.com/catalog/9781449398576/",
        "price": "$9.99",
        "cover": "http://covers.oreilly.com/images/9781449398576/bkt.gif",
        "authors": [
            "Dorothy Hoskins"
        ],
        "date": "September 2010"
    },
    {
        "title": "Log4J",
        "url": "http://oreilly.com/catalog/9780596559656/",
        "price": "$4.99",
        "cover": "http://covers.oreilly.com/images/9780596559656/bkt.gif",
        "authors": [
            "J. Steven Perry",
            "Kev++"
        ],
        "date": "October 2009"
    }
]
HTML
<div id="booklist">

    <a href="http://oreilly.com/catalog/9781449398576/">
        <img alt="XML Publishing with Adobe InDesign" class="aleft" style="padding:0 10px 0 0;margin:0;" src="http://covers.oreilly.com/images/9781449398576/bkt.gif">
    </a>
    <span style="font-size: 14px; font-weight: normal">
        <a href="http://oreilly.com/catalog/9781449398576/">
            <b>XML Publishing with Adobe InDesign</b>
        </a>
    </span>
    <br>


    By <a href="http://www.oreillynet.com/pub/au/3096">Dorothy Hoskins</a>

    <br> September 2010&nbsp; <br>
    
    Ebook: <span class="special"> $9.99</span><br>

    <p style="line-height: 14px;">
    From Adobe InDesign CS2 to InDesign CS5, the ability to work with XML content has been built into every version of InDesign. Some of the useful applications are importing database…
        <a href="http://oreilly.com/catalog/9781449398576/">
            Read more.
        </a>
    </p>
    <br clear="all"><hr><br>



    <a href="http://oreilly.com/catalog/9780596559656/">
        <img alt="Log4J" class="aleft" style="padding:0 10px 0 0;margin:0;" src="http://covers.oreilly.com/images/9780596559656/bkt.gif">
    </a>

    <span style="font-size: 14px; font-weight: normal">
        <a href="http://oreilly.com/catalog/9780596559656/">
            <b>Log4J</b>
        </a>
    </span>
    <br>


    By <a href="http://www.oreillynet.com/pub/au/905">J. Steven Perry</a>
    , <a href="http://www.oreillynet.com/pub/au/8864">Kev++</a>

    <br>October 2009&nbsp;<br>
    
    Ebook: <span class="special"> $4.99</span>
    <br>
    <p style="line-height: 14px;">
        Log4j has been around for a while now, and it seems like so many applications use it. I've used it in my applications for years now, and I'll bet you…
        <a href="http://oreilly.com/catalog/9780596559656/">Read more.</a>
    </p>
    <br clear="all"><hr><br>

</div>
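
Because the live page may change or disappear, here is a minimal sketch that runs the same sibling-axis expressions against an inline copy of the HTML above. The copy is trimmed (descriptions and style attributes dropped), and the names `HTML` and `extract_books` are illustrative only. The point it shows is how the `preceding-sibling::span[1]` test keeps one book's author links from leaking into the next book.

```python
from lxml.html import fromstring

# Trimmed copy of the HTML shown above; descriptions and styles are omitted.
HTML = """
<div id="booklist">
    <a href="http://oreilly.com/catalog/9781449398576/">
        <img alt="XML Publishing with Adobe InDesign" src="http://covers.oreilly.com/images/9781449398576/bkt.gif">
    </a>
    <span>
        <a href="http://oreilly.com/catalog/9781449398576/"><b>XML Publishing with Adobe InDesign</b></a>
    </span>
    <br>
    By <a href="http://www.oreillynet.com/pub/au/3096">Dorothy Hoskins</a>
    <br> September 2010&nbsp; <br>
    Ebook: <span class="special"> $9.99</span><br>
    <br clear="all"><hr><br>
    <a href="http://oreilly.com/catalog/9780596559656/">
        <img alt="Log4J" src="http://covers.oreilly.com/images/9780596559656/bkt.gif">
    </a>
    <span>
        <a href="http://oreilly.com/catalog/9780596559656/"><b>Log4J</b></a>
    </span>
    <br>
    By <a href="http://www.oreillynet.com/pub/au/905">J. Steven Perry</a>
    , <a href="http://www.oreillynet.com/pub/au/8864">Kev++</a>
    <br>October 2009&nbsp;<br>
    Ebook: <span class="special"> $4.99</span>
    <br>
    <br clear="all"><hr><br>
</div>
"""


def extract_books(html):
    """Collect one dict per book from the flat sibling layout."""
    root = fromstring(html)
    books = []
    # A cover link is an <a> directly under the list that contains an <img>.
    for e in root.xpath('//div[@id="booklist"]/a[img]'):
        url = e.attrib['href']
        # Keep only later <a> siblings whose nearest preceding <span>
        # links back to this book's URL, i.e. this book's authors.
        author_links = e.xpath(
            'following-sibling::a[preceding-sibling::span[1]/a/@href="{}"]'.format(url)
        )
        books.append({
            'url': url,
            'title': e.xpath('following-sibling::span[1]//b')[0].text.strip(),
            'authors': [a.text for a in author_links],
            'date': e.xpath('following-sibling::br[2]')[0].tail.strip(),
            'price': e.xpath('following-sibling::span[2]')[0].text.strip(),
        })
    return books


books = extract_books(HTML)
for book in books:
    print(book['title'], book['authors'], book['date'], book['price'])
```

The second cover link is also a following `<a>` sibling of the first one, but its nearest preceding `<span>` is the price span, which contains no link, so the predicate rejects it. Formatting the URL into the expression works here because the URLs contain no double quotes; a URL with a quote in it would break the XPath string.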