lxml Scraper¶
The option to use lxml as a parser backend instead of Playwright was added in Release 0.6.0.
lxml is an optional dependency and can be installed via pip
using the command below.
pip install pydude[lxml]
Required changes to your script in order to use lxml¶
Instead of the ElementHandle objects passed when Playwright is the parser backend, lxml Element objects, "smart" strings, etc. are passed to the decorated functions.
from dude import select


@select(xpath='.//a[contains(@class, "url")]/@href')  # (1)
def result_url(href):
    return {"url": href}  # (2)


@select(css="a.url")  # (3)
def result_url_css(element):
    return {"url_css": element.attrib["href"]}  # (4)


@select(css='.title')
def result_title(element):
    return {"title": element.text}  # (5)
1. Attributes can be accessed using the XPath @href.
2. When using XPath @href (or text()), "smart" strings are returned.
3. The lxml backend supports CSS selectors via cssselect.
4. Attributes can also be accessed from lxml elements using element.attrib["href"].
5. Text content can be accessed from lxml elements using element.text.
Running Dude with lxml¶
You can use the lxml parser backend by passing the --lxml command-line argument or the parser="lxml" parameter to run().
dude scrape --url "<url>" --lxml --output data.json path/to/script.py
if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh/"], parser="lxml", output="data.json")
Limitations¶
- Setup handlers are not supported.
- Navigate handlers are not supported.
Examples¶
Examples can be found at examples/lxml_sync.py and examples/lxml_async.py.