Grouping Results¶
When scraping a page containing a list of information, for example, containing URLs, titles and descriptions, it is important to know how data can be grouped together.
By default, all scraped results are grouped by :root
which is the root document.
To specify grouping, pass group=<selector-for-grouping>
to @select()
decorator.
In the example below, the results are grouped by an element with class custom-group
. The matched selectors should be children of this element.
Click on the annotations (+ sign) for more details.
from dude import select
@select(css=".title", group=".custom-group") # (1)
def result_title(element):
return {"title": element.text_content()}
- Group the results by the CSS selector
.custom-group
.
You can also specify groups by using the @group()
decorator and passing the argument selector="<selector-for-grouping>"
.
from dude import group, select
@group(css=".custom-group") # (1)
@select(css=".title")
def result_title(element):
return {"title": element.text_content()}
- Group the results by the CSS selector
.custom-group
.
Supported group selector types¶
The @select()
decorator does not only accept group
but also group_css
, group_xpath
, group_text
and group_regex
.
Please take note that group_css
, group_xpath
, group_text
and group_regex
are specific and group
can contain any of these types.
from dude import select
@select(css=".title", group_css="<css-selector>") # (1)
@select(css=".title", group_xpath="<xpath-selector>") # (2)
@select(css=".title", group_text="<text-selector>") # (3)
@select(css=".title", group_regex="<regex-selector>") # (4)
def handler(element):
return {"<key>": "<value-extracted-from-element>"}
- Group CSS Selector
- Group XPath Selector
- Group Text Selector
- Group Regular Expression Selector
It is possible to use 2 or more of these types at the same time but only one will be used taking the precedence group
-> css
-> xpath
-> text
-> regex
.
Like the @select()
decorator, the @group()
decorator also accepts selector
, css
, xpath
, text
and regex
.
Similarly, css
, xpath
, text
and regex
are specific and selector
can contain any of these types.
from dude import select
@group(css="<css-selector>") # (1)
@select(selector="<selector>")
def handler(element):
return {"<key>": "<value-extracted-from-element>"}
- CSS Selector
It is possible to use 2 or more of these types at the same time but only one will be used taking the precedence selector
-> css
-> xpath
-> text
-> regex
.
Why we need to group the results¶
The group
parameter or the @group()
decorator has the advantage of making sure that items are in their correct group.
Take for example the HTML source below, notice that in the second div
, there is no description.
<div class="custom-group">
<p class="title">Title 1</p>
<p class="description">Description 1</p>
</div>
<div class="custom-group">
<p class="title">Title 2</p>
</div>
<div class="custom-group">
<p class="title">Title 3</p>
<p class="description">Description 3</p>
</div>
When the group is not specified, the default grouping will be used which will result in "Description 3" being grouped with "Title 2".
[
{
"_page_number": 1,
// ...
"description": "Description 1",
"title": "Title 1"
},
{
"_page_number": 1,
// ...
"description": "Description 3",
"title": "Title 2"
},
{
"_page_number": 1,
// ...
"title": "Title 3"
}
]
By specifying the group in @select(..., group=".custom-group")
, we will be able to get a better result.
[
{
"_page_number": 1,
// ...
"description": "Description 1",
"title": "Title 1"
},
{
"_page_number": 1,
// ...
"title": "Title 2"
},
{
"_page_number": 1,
// ...
"description": "Description 3",
"title": "Title 3"
}
]
Groups simplify how you write your script¶
Info
The examples below are both acceptable way to write a scraper. You have the option to choose how you write your script.
A common way developers write scraper can be illustrated using this example below (see examples/single_handler.py for the complete script).
While this works, it can be hard to maintain.
from dude import select
@select(css=".custom-group")
def result_handler(element):
"""
Perform all the heavy-lifting in a single handler.
"""
data = {}
url = element.query_selector("a.url")
if url:
data["url"] = url.get_attribute("href")
title = element.query_selector(".title")
if title:
data["title"] = title.text_content()
description = element.query_selector(".description")
if description:
data["description"] = description.text_content()
return data
It will only require us to write 3 simple functions but is much easier to read as we don't have to deal with querying the child elements, ourselves.
from dude import group, select
@select(css="a.url", group=".custom-group")
def result_url(element):
return {"url": element.get_attribute("href")}
@select(css=".title", group=".custom-group")
def result_title(element):
return {"title": element.text_content()}
@select(css=".description", group=".custom-group")
def result_description(element):
return {"description": element.text_content()}
When are @group()
decorator and group
parameter used by Dude¶
- If the
group
parameter is present, it will be used for grouping. - If the
group
parameter is not present, the selector in the@group()
decorator will be used for grouping. - If both
group
parameter and@group()
decorator are not present, the:root
element will be used for grouping.
Info
Use @group()
decorator when using multiple @select()
decorators in one function in order to reduce repetition.
Examples¶
- Grouping by
@group()
decorator: examples/group_decorator.py. - Grouping by passing
group
parameter to@select()
decorator: examples/group_in_select.py.