Skip to content

Custom Storage

Dude currently support json, yaml/yml and csv formats only. However, this can be extended to support a custom storage or override the existing formats using the @save() decorator. The save function should accept 2 parameters, data (list of dictionary of scraped data) and optional output (can be filename or None). Take note that the save function must return a boolean for success.

The example below prints the output to terminal using tabulate for illustration purposes only. You can use the @save() decorator in other ways like saving the scraped data to spreadsheets, database or send it to an API.

from dude import save
import tabulate


@save("table")
def save_table(data, output) -> bool:
    """
    Prints data to stdout using tabulate.
    """
    print(tabulate.tabulate(tabular_data=data, headers="keys", maxcolwidths=50))
    return True

The custom storage above can then be called using any of the options below.

dude scrape --url "<url>" path/to/script.py --format table
if __name__ == "__main__":
    import dude

    dude.run(urls=["<url>"], pages=2, format="table")

Saving on every page

It is possible to call the save functions after each page. This is useful when running in spider mode to prevent lost of data. To make use of this option, the flag is_per_page in the @save() should be set to True.

@save("table", is_per_page=True)
def save_table(data, output) -> bool:
    ...

To run the scraper in per-page save, pass --save-per-page argument.

dude scrape --url "<url>" path/to/script.py --format table --save-per-page
if __name__ == "__main__":
    import dude

    dude.run(urls=["<url>"], pages=2, format="table", save_per_page=True)

Note

The option --save-per-page is best used with events to make sure that connections or file handles are opened and closed properly. Check the examples below.

Examples

A more extensive example can be found at examples/custom_storage.py and examples/save_per_page.py.