socio4health.utils.standard_spider.StandardSpider#

class socio4health.utils.standard_spider.StandardSpider(*args: Any, **kwargs: Any)[source]#

A standard spider for scraping links from a given URL.

url#

The URL to start scraping from. If not provided, a warning is logged.

depth#

The maximum depth to follow links. Default is 0 (no depth limit).

ext#

A list of file extensions to filter links. Default includes common document formats.

key_words#

A list of keywords or regular expressions used to filter links by filename. Defaults to an empty list.

start_urls#

A list containing the starting URL for the spider.

links#

A dictionary to store found links with filenames as keys and URLs as values.

name#

The name of the spider, used for identification in logs and output.
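
The way `ext`, `key_words`, and `links` relate can be illustrated with a small standalone sketch. It is not the library's implementation: the helper `filter_links`, the `DEFAULT_EXT` list, and the rule that an empty keyword list disables keyword filtering are all assumptions made for illustration.

```python
import re
from posixpath import basename
from urllib.parse import urlparse

# Hypothetical defaults; the library's actual default extension list may differ.
DEFAULT_EXT = [".csv", ".xlsx", ".pdf", ".zip"]


def filter_links(urls, ext=None, key_words=None):
    """Return {filename: url} for URLs that pass the extension and keyword filters."""
    suffixes = tuple(e.lower() for e in (ext or DEFAULT_EXT))
    patterns = [re.compile(k, re.IGNORECASE) for k in (key_words or [])]
    links = {}
    for url in urls:
        # Filter on the last path segment only, ignoring query strings.
        filename = basename(urlparse(url).path)
        if not filename.lower().endswith(suffixes):
            continue
        # Assumption: an empty keyword list means "do not filter by keyword".
        if patterns and not any(p.search(filename) for p in patterns):
            continue
        links[filename] = url
    return links


found = filter_links(
    [
        "https://example.org/data/census_2020.csv",
        "https://example.org/data/census_2021.csv",
        "https://example.org/about.html",
    ],
    ext=[".csv"],
    key_words=[r"2020"],
)
print(found)
```

Keying the dictionary by filename, as the `links` description states, means that two URLs ending in the same filename would overwrite each other in a sketch like this one.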

__init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)[source]#

Initialize the spider with parameters.
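
A minimal sketch of running the spider with the constructor arguments above, assuming `socio4health` and Scrapy are installed. `CrawlerProcess` is Scrapy's standard blocking runner; the wrapper name `run_standard_spider` and the `LOG_LEVEL` setting are illustrative choices, not part of this package.

```python
def run_standard_spider(url, depth=0, ext=None, key_words=None):
    """Run StandardSpider once in a blocking Scrapy process."""
    # Imported lazily so this sketch can be loaded without Scrapy installed.
    from scrapy.crawler import CrawlerProcess
    from socio4health.utils.standard_spider import StandardSpider

    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    # Keyword arguments passed to crawl() are forwarded to the spider's __init__.
    process.crawl(StandardSpider, url=url, depth=depth, ext=ext, key_words=key_words)
    process.start()  # blocks until the crawl finishes
```

A call such as `run_standard_spider("https://example.org/datasets", ext=[".csv"], key_words=["2020"])` would start a crawl; the URL is a placeholder. Because `process.start()` runs the Twisted reactor, it can only be called once per Python process.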

Methods

`__delattr__`(name, /)	Implement delattr(self, name).
`__dir__`()	Default dir() implementation.
`__eq__`(value, /)	Return self==value.
`__format__`(format_spec, /)	Default object formatter.
`__ge__`(value, /)	Return self>=value.
`__getattribute__`(name, /)	Return getattr(self, name).
`__getstate__`()	Helper for pickle.
`__gt__`(value, /)	Return self>value.
`__hash__`()	Return hash(self).
`__init__`([url, depth, ext, key_words])	Initialize the spider with parameters.
`__init_subclass__`	This method is called when a class is subclassed.
`__le__`(value, /)	Return self<=value.
`__lt__`(value, /)	Return self<value.
`__ne__`(value, /)	Return self!=value.
`__new__`(cls, *args, **kwargs)
`__reduce__`()	Helper for pickle.
`__reduce_ex__`(protocol, /)	Helper for pickle.
`__repr__`()	Return repr(self).
`__setattr__`(name, value, /)	Implement setattr(self, name, value).
`__sizeof__`()	Size of object in memory, in bytes.
`__str__`()	Return str(self).
`__subclasshook__`	Abstract classes can override this to customize issubclass().
`_parse`(response, **kwargs)
`_set_crawler`(crawler)
`close`(spider, reason)
`closed`(reason)	Handle actions to perform when the spider is closed.
`from_crawler`(crawler, *args, **kwargs)
`handles_request`(request)
`log`(message[, level])	Log the given message at the given log level.
`parse`(response[, current_depth])	Parse the response to extract links based on criteria.
`start`()	Yield the initial `Request` objects to send.
`start_requests`()
`update_settings`(settings)
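
The `current_depth` argument of `parse` suggests depth-tracked link following. The sketch below shows that general idea on an in-memory site map, without Scrapy. It is an illustration under assumptions, not the library's code: the `crawl` helper, the `SITE` mapping, and the `seen` set are hypothetical, and a depth of 0 is treated as "no limit" only because the `depth` description above says so.

```python
def crawl(page, get_links, max_depth=0, current_depth=0, seen=None):
    """Return the set of pages reachable from `page` within `max_depth` hops."""
    seen = set() if seen is None else seen
    if page in seen:
        return seen  # already visited; avoids looping on cyclic links
    seen.add(page)
    # Reading the `depth` description literally: 0 means no depth limit.
    if max_depth == 0 or current_depth < max_depth:
        for nxt in get_links(page):
            crawl(nxt, get_links, max_depth, current_depth + 1, seen)
    return seen


# Hypothetical site map: page -> pages it links to.
SITE = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"], "/a/1": []}


def outgoing(page):
    return SITE.get(page, [])


print(sorted(crawl("/", outgoing, max_depth=1)))  # ['/', '/a', '/b']
print(sorted(crawl("/", outgoing)))               # ['/', '/a', '/a/1', '/b']
```

With `max_depth=1` only the start page's direct links are visited; with the default of 0 the whole reachable graph is traversed, which is why the visited set is needed to terminate on the `/b` to `/` cycle.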

Attributes

`__annotations__`
`__dict__`
`__doc__`
`__module__`
`__slots__`
`__weakref__`	list of weak references to the object
`custom_settings`
`name`
`start_urls`	Start URLs.