socio4health.utils.standard_spider.StandardSpider

class socio4health.utils.standard_spider.StandardSpider(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)

A standard spider for scraping links from a given URL.

url

The URL to start scraping from. If not provided, a warning is logged.

Type: str, optional

depth

The maximum depth to follow links. Default is 0 (no depth limit).

Type: int, optional

ext

A list of file extensions used to filter links. The default includes common document formats.

Type: list, optional

key_words

A list of keywords or regular expressions used to filter links by filename. Defaults to an empty list.

Type: list, optional

start_urls

A list containing the starting URL for the spider.

Type: list

A dictionary to store found links with filenames as keys and URLs as values.

Type: dict

name

The name of the spider, used for identification in logs and output.

Type: str

__init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)

Initialize the spider with parameters.
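
As a rough usage sketch, the constructor arguments could be collected and handed to a Scrapy crawler process. This assumes that StandardSpider is a Scrapy spider (its parse, closed, start_urls, and name members suggest so, but this page does not state it) and that both scrapy and socio4health are installed. The helpers build_spider_kwargs and run_spider, the placeholder URL, and the example extension and keyword values are hypothetical and not part of the library.

```python
def build_spider_kwargs(url, depth=0, ext=None, key_words=None):
    """Collect the documented constructor arguments into a dict.

    Hypothetical helper: it only mirrors the signature documented above.
    """
    kwargs = {"url": url, "depth": depth}
    # Leave ext/key_words out when unset so the spider applies its own defaults.
    if ext is not None:
        kwargs["ext"] = list(ext)
    if key_words is not None:
        kwargs["key_words"] = list(key_words)
    return kwargs


def run_spider(**spider_kwargs):
    """Run the spider once in a blocking crawler process (assumes Scrapy)."""
    # Imported lazily so this module loads even without the packages installed.
    from scrapy.crawler import CrawlerProcess
    from socio4health.utils.standard_spider import StandardSpider

    process = CrawlerProcess()
    # Scrapy forwards keyword arguments to the spider's __init__.
    process.crawl(StandardSpider, **spider_kwargs)
    process.start()  # blocks until the crawl finishes


# Example call (placeholder URL; whether extensions take a leading dot is an assumption):
# run_spider(**build_spider_kwargs("https://example.org/data", depth=1,
#                                  ext=[".csv"], key_words=["2020"]))
```

Omitting ext and key_words from the dict, rather than passing None explicitly, is a design choice of this sketch; both forms should be equivalent given the documented defaults.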

Methods

__delattr__(name, /)

Implement delattr(self, name).

__dir__()

Default dir() implementation.

__eq__(value, /)

Return self==value.

__format__(format_spec, /)

Default object formatter.

__ge__(value, /)

Return self>=value.

__getattribute__(name, /)

Return getattr(self, name).

__getstate__()

Helper for pickle.

__gt__(value, /)

Return self>value.

__hash__()

Return hash(self).

__init__([url, depth, ext, key_words])

Initialize the spider with parameters.

__init_subclass__

This method is called when a class is subclassed.

__le__(value, /)

Return self<=value.

__lt__(value, /)

Return self<value.

__ne__(value, /)

Return self!=value.

__new__(*args, **kwargs)

__reduce__()

Helper for pickle.

__reduce_ex__(protocol, /)

Helper for pickle.

__repr__()

Return repr(self).

__setattr__(name, value, /)

Implement setattr(self, name, value).

__sizeof__()

Size of object in memory, in bytes.

__str__()

Return str(self).

__subclasshook__

Abstract classes can override this to customize issubclass().

closed(reason)

Handle actions to perform when the spider is closed.

parse(response[, current_depth])

Parse the response to extract links based on criteria.

parse_item(response)

Extract a simple item from a response.
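
The filtering that parse is described as performing (links selected by extension and by keyword, then stored by filename) can be illustrated with a small standalone sketch. It is not the library's implementation: the function names filename_from_url and filter_links are hypothetical, and the choices to match keywords case-insensitively, to accept a link when any one keyword matches, and to write extensions with a leading dot are assumptions made for this example.

```python
import re
from posixpath import basename
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, with percent-escapes decoded."""
    return basename(unquote(urlparse(url).path))


def filter_links(urls, ext, key_words=()):
    """Keep URLs whose filename ends with an allowed extension and, when
    key_words is non-empty, matches at least one keyword regex.

    Returns a dict mapping filename -> URL.
    """
    allowed = tuple(e.lower() for e in ext)
    patterns = [re.compile(k, re.IGNORECASE) for k in key_words]
    links = {}
    for url in urls:
        name = filename_from_url(url)
        # Skip directory-like URLs and files with other extensions.
        if not name or not name.lower().endswith(allowed):
            continue
        # An empty keyword list means "no keyword filtering".
        if patterns and not any(p.search(name) for p in patterns):
            continue
        links[name] = url
    return links
```

Because the result is keyed by filename, two URLs that share a filename would overwrite each other in this sketch; a real crawler may need to handle that case differently.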

Attributes

__annotations__

__dict__

__doc__

__module__

__weakref__

list of weak references to the object

name