socio4health.utils.standard_spider.StandardSpider

class socio4health.utils.standard_spider.StandardSpider(*args: Any, **kwargs: Any)

A standard spider for scraping links from a given URL.

url

The URL to start scraping from. If not provided, a warning is logged.

Type: str, optional

depth

The maximum depth of links to follow. Default is 0, which means no depth limit.

Type: int, optional

ext

A list of file extensions to filter links. The default includes common document formats.

Type: list, optional

key_words

A list of keywords to filter links by filename. Default is an empty list.

Type: list, optional

start_urls

A list containing the starting URL for the spider.

Type: list

A dictionary to store found links, with filenames as keys and URLs as values.

Type: dict

name

The name of the spider, used for identification in logs and output.

Type: str

__init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)

Initialize the spider with parameters.
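For example, here is a minimal usage sketch (not taken from this page; the parameter values, and the exact format of the ext entries, are assumptions) that runs the spider with Scrapy's CrawlerProcess:

    from scrapy.crawler import CrawlerProcess

    from socio4health.utils.standard_spider import StandardSpider

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(
        StandardSpider,
        url="https://example.org/datasets",  # page to start scraping from
        depth=1,                             # follow links one level deep (0 = no limit)
        ext=[".csv", ".xlsx"],               # keep only links with these extensions (format assumed)
        key_words=["census"],                # filename must contain one of these keywords
    )
    process.start()  # blocks until the crawl finishes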

Methods

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__getstate__()

Helper for pickle.

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__([url, depth, ext, key_words])

Initialize the spider with parameters.

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(cls, *args, **kwargs)

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

_parse(response, **kwargs)

_set_crawler(crawler)

close(spider, reason)

closed(reason)

Handle actions to perform when the spider is closed.

from_crawler(crawler, *args, **kwargs)

handles_request(request)

log(message[, level])

Log the given message at the given log level.

parse(response[, current_depth])

Parse the response to extract links based on the configured criteria; a sketch of this filtering logic follows the methods list below.

start()

Yield the initial Request objects to send.

start_requests()

update_settings(settings)
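As referenced in the parse() entry above, the following is a hypothetical sketch of the filtering that parse() performs, not the package's actual code. It assumes matched links are stored in the dict attribute described above (whose name is not shown on this page; self.links is a placeholder) and that depth == 0 disables the depth limit:

    import os
    from urllib.parse import urlparse

    def parse(self, response, current_depth=0):
        for href in response.css("a::attr(href)").getall():
            full_url = response.urljoin(href)
            filename = os.path.basename(urlparse(full_url).path)
            _, extension = os.path.splitext(filename)
            # Keep the link if its extension and filename match the filters.
            if extension.lower() in self.ext and (
                not self.key_words
                or any(kw.lower() in filename.lower() for kw in self.key_words)
            ):
                self.links[filename] = full_url  # placeholder attribute name
            # Otherwise follow the link deeper; depth == 0 means no limit.
            elif self.depth == 0 or current_depth < self.depth:
                yield response.follow(
                    href,
                    callback=self.parse,
                    cb_kwargs={"current_depth": current_depth + 1},
                )

In this sketch, pages whose URLs already match a wanted extension are not crawled further; the real implementation may order these checks differently.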

Attributes

__annotations__

__dict__

__doc__

__module__

__slots__

__weakref__

List of weak references to the object.

custom_settings

name

start_urls

Start URLs.