usage: dude scrape [-h] [--url URL] [--playwright | --bs4 | --parsel | --lxml | --selenium] [--headed]
                   [--browser {chromium,firefox,webkit}] [--pages PAGES] [--output OUTPUT] [--format FORMAT]
                   [--proxy-server PROXY_SERVER] [--proxy-user PROXY_USER] [--proxy-pass PROXY_PASS]
                   [--follow-urls] [--save-per-page] [--ignore-robots-txt]
                   PATH [PATH ...]
Run the dude scraper.
options:
  -h, --help            show this help message and exit

required arguments:
  PATH                  Path to Python file(s) containing the handler functions.
  --url URL             Website URL to scrape. Accepts one or more URLs (e.g. "dude scrape --url <url1> --url <url2> ...").

optional arguments:
  --playwright          Use Playwright.
  --bs4                 Use BeautifulSoup4.
  --parsel              Use Parsel.
  --lxml                Use lxml.
  --selenium            Use Selenium.
  --headed              Run headed browser.
  --browser {chromium,firefox,webkit}
                        Browser type to use.
  --pages PAGES         Maximum number of pages to crawl before exiting (default=1). This is only valid when a navigate handler is defined.
  --output OUTPUT       Output file. If not provided, prints into the terminal.
  --format FORMAT       Output file format. If not provided, uses the extension of the output file or defaults to "json". Supports "json", "yaml/yml", and "csv" but can be extended using the @save() decorator.
  --proxy-server PROXY_SERVER
                        Proxy server.
  --proxy-user PROXY_USER
                        Proxy username.
  --proxy-pass PROXY_PASS
                        Proxy password.
  --follow-urls         Automatically follow URLs.
  --save-per-page       Save data after every page extraction. If not set, saves all the data at the end. Automatically enabled when --follow-urls is set.
  --ignore-robots-txt   Flag to ignore robots.txt.
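
The PATH argument points to one or more Python files that define the handler functions. As a minimal sketch of such a file (the file name handlers.py and the CSS selector are illustrative; this assumes the default Playwright backend, where each handler receives a Playwright element):

```python
# handlers.py - a minimal, hypothetical handler file.
# Assumes the default Playwright backend, where the handler is called
# once per element matching the given selector.
from dude import select


@select(css="a.result")
def result_link(element):
    # Return a dict of the fields to save for each matched element.
    return {
        "title": element.text_content(),
        "url": element.get_attribute("href"),
    }
```

It could then be run with, for example, `dude scrape --url "https://example.com" handlers.py`.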
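The --pages limit only takes effect when a navigate handler is defined. A navigate handler drives the page transition itself; as a sketch, assuming @select() accepts a navigate=True parameter and the handler clicks through to the next page (the selector is illustrative):

```python
# Hypothetical navigate handler: clicks the "next" link so the scraper
# moves on to the next page, up to the --pages limit.
from dude import select


@select(css="a.next", navigate=True)
def next_page(element):
    element.click()
```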
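Likewise, --format can name a custom format registered with the @save() decorator. A sketch, assuming the decorated function receives the collected data and the output target and returns True on success (the "jsonl" format name is illustrative):

```python
# Hypothetical custom format: registers "jsonl" so it can be selected
# with --format jsonl. Assumes the function receives the scraped data
# and an output file name, and returns True on success.
import json

from dude import save


@save("jsonl")
def save_jsonl(data, output) -> bool:
    with open(output, "w") as f:
        for item in data:
            f.write(json.dumps(item) + "\n")
    return True
```

Selecting it would then look like `dude scrape --url "https://example.com" --format jsonl --output results.jsonl handlers.py`.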