lxml Scraper¶
The option to use lxml as a parser backend instead of Playwright was added in Release 0.6.0.
lxml is an optional dependency and can be installed via pip
using the command below.
pip install pydude[lxml]
Required changes to your script in order to use lxml¶
Instead of the ElementHandle objects passed when Playwright is the parser backend, lxml Element objects, "smart" strings, etc. are passed to the decorated functions.
from dude import select


@select(xpath='.//a[contains(@class, "url")]/@href')  # (1)
def result_url(href):
    return {"url": href}  # (2)


@select(css="a.url")  # (3)
def result_url_css(element):
    return {"url_css": element.attrib["href"]}  # (4)


@select(css='.title')
def result_title(element):
    return {"title": element.text}  # (5)
1. Attributes can be accessed using the XPath @href.
2. When using XPath @href (or text()), "smart" strings are returned.
3. The lxml backend supports CSS selectors via cssselect.
4. Attributes can also be accessed from lxml elements using element.attrib["href"].
5. Text content can be accessed from lxml elements using element.text.
Running Dude with lxml¶
You can use the lxml parser backend by passing the --lxml command-line argument or the parser="lxml" parameter to run().
dude scrape --url "<url>" --lxml --output data.json path/to/script.py
if __name__ == "__main__":
    import dude

    dude.run(urls=["https://dude.ron.sh/"], parser="lxml", output="data.json")
Limitations¶
- Setup handlers are not supported.
- Navigate handlers are not supported.
Examples¶
Examples can be found at examples/lxml_sync.py and examples/lxml_async.py.