lxml Scraper

Release 0.6.0 adds the option to use lxml as the parser backend instead of Playwright. lxml support is an optional dependency and can only be installed via pip, using the command below.

pip install pydude[lxml]

Required changes to your script in order to use lxml

When lxml is used as the parser backend, the decorated functions receive lxml objects (Element instances, "smart" strings, etc.) instead of the Playwright ElementHandle objects.

from dude import select


@select(xpath='.//a[contains(@class, "url")]/@href')  # (1)
def result_url(href):
    return {"url": href}  # (2)


@select(css="a.url")  # (3)
def result_url_css(element):
    return {"url_css": element.attrib["href"]}  # (4)


@select(css=".title")
def result_title(element):
    return {"title": element.text}  # (5)
  1. Attributes can be accessed using XPath @href.
  2. When using XPath @href (or text), "smart" strings are returned.
  3. The lxml backend supports CSS selectors via cssselect.
  4. Attributes can also be accessed from lxml elements using element.attrib["href"].
  5. Text content can be accessed from lxml elements using element.text.
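The annotations above can be illustrated with plain lxml, independent of Dude. This sketch (the markup and URL are made up for illustration) shows the two kinds of objects the decorated functions receive: "smart" strings from attribute/text XPath expressions, and Element objects from element selectors.

```python
from lxml import html

# Illustrative markup; any HTML document works the same way.
page = html.fromstring(
    '<div><a class="url" href="https://example.com">Link</a>'
    '<span class="title">Hello</span></div>'
)

# Selecting @href with XPath returns "smart" strings: str subclasses
# that also remember the element they came from.
(href,) = page.xpath('.//a[contains(@class, "url")]/@href')
print(href)                  # https://example.com
print(href.getparent().tag)  # a

# Selecting elements returns lxml Element objects, with attributes
# in .attrib and text content in .text.
(link,) = page.xpath('.//a[@class="url"]')
print(link.attrib["href"])   # https://example.com

(title,) = page.xpath('.//span[@class="title"]')
print(title.text)            # Hello
```

The same Element objects are returned when selecting with CSS via cssselect, so element.attrib and element.text work identically for css= selectors.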

Running Dude with lxml

You can select the lxml parser backend using the --lxml command-line argument, or the parser="lxml" argument to run().

dude scrape --url "<url>" --lxml --output data.json path/to/script.py
Alternatively, from a Python script:
if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh/"], parser="lxml", output="data.json")

Limitations

  1. Setup handlers are not supported.
  2. Navigate handlers are not supported.

Examples

Examples can be found in examples/lxml_sync.py and examples/lxml_async.py.