socio4health.utils.standard_spider.StandardSpider
- class socio4health.utils.standard_spider.StandardSpider(*args: Any, **kwargs: Any)
A standard spider for scraping links from a given URL.
- ext
A list of file extensions to filter links. Default includes common document formats.
- Type:
list, optional
- key_words
A list of keywords to filter links by filename. Default is an empty list.
- Type:
list, optional
- __init__(url=None, depth=0, ext=None, key_words=None, *args, **kwargs)
Initialize the spider with parameters.
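The `ext` and `key_words` attributes describe filtering links by file extension and by filename keyword. A minimal sketch of such a filter follows, using only the standard library. It is not the socio4health implementation: the `filter_links` helper, the `DEFAULT_EXT` list, and the "match any keyword" rule are assumptions made for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical default; the actual default extension list is not shown on this page.
DEFAULT_EXT = [".csv", ".xlsx", ".pdf", ".zip"]


def filter_links(links, ext=None, key_words=None):
    """Keep links whose filename has an allowed extension and, when keywords
    are given, contains at least one of them (assumed matching rule)."""
    # Resolve None to fresh values so no mutable default is shared between calls.
    ext = DEFAULT_EXT if ext is None else ext
    key_words = [] if key_words is None else key_words
    allowed = tuple(e.lower() for e in ext)

    kept = []
    for link in links:
        # Look only at the path component, so query strings do not hide the extension.
        filename = posixpath.basename(urlparse(link).path).lower()
        if not filename.endswith(allowed):
            continue
        if key_words and not any(k.lower() in filename for k in key_words):
            continue
        kept.append(link)
    return kept


links = [
    "https://example.org/data/census_2020.csv",
    "https://example.org/data/readme.html",
    "https://example.org/files/survey_2021.zip?download=1",
]
csv_and_zip = filter_links(links)
census_only = filter_links(links, key_words=["census"])
```

With an empty `key_words` list only the extension check applies, which matches the documented default of an empty list meaning no keyword filtering.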
Methods
- __delattr__(name, /): Implement delattr(self, name).
- __dir__(): Default dir() implementation.
- __eq__(value, /): Return self==value.
- __format__(format_spec, /): Default object formatter.
- __ge__(value, /): Return self>=value.
- __getattribute__(name, /): Return getattr(self, name).
- __getstate__(): Helper for pickle.
- __gt__(value, /): Return self>value.
- __hash__(): Return hash(self).
- __init__([url, depth, ext, key_words]): Initialize the spider with parameters.
- __init_subclass__: This method is called when a class is subclassed.
- __le__(value, /): Return self<=value.
- __lt__(value, /): Return self<value.
- __ne__(value, /): Return self!=value.
- __new__(cls, *args, **kwargs)
- __reduce__(): Helper for pickle.
- __reduce_ex__(protocol, /): Helper for pickle.
- __repr__(): Return repr(self).
- __setattr__(name, value, /): Implement setattr(self, name, value).
- __sizeof__(): Size of object in memory, in bytes.
- __str__(): Return str(self).
- __subclasshook__: Abstract classes can override this to customize issubclass().
- _parse(response, **kwargs)
- _set_crawler(crawler)
- close(spider, reason)
- closed(reason): Handle actions to perform when the spider is closed.
- from_crawler(crawler, *args, **kwargs)
- handles_request(request)
- log(message[, level]): Log the given message at the given log level.
- parse(response[, current_depth]): Parse the response to extract links based on criteria.
- start(): Yield the initial Request objects to send.
- start_requests()
- update_settings(settings)
Attributes
- __annotations__
- __dict__
- __doc__
- __module__
- __slots__
- __weakref__: list of weak references to the object
- custom_settings: Start URLs.
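The `depth` constructor argument and the `current_depth` parameter of `parse` suggest a depth-limited crawl. The sketch below illustrates that idea on an in-memory link graph rather than on live HTTP responses. The `SITE` mapping, the `crawl` helper, and the reading of `depth=0` as "collect links on the start page without following any" are assumptions, not behavior confirmed by this page.

```python
from collections import deque

# Hypothetical in-memory site: page URL -> links found on that page.
SITE = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/a/report.pdf", "https://example.org/c"],
    "https://example.org/b": ["https://example.org/b/data.csv"],
    "https://example.org/c": ["https://example.org/c/deep.csv"],
}


def crawl(start_url, depth=0):
    """Collect links from the start page and from pages up to `depth` hops away."""
    seen = {start_url}
    found = []
    queue = deque([(start_url, 0)])
    while queue:
        url, current_depth = queue.popleft()
        for link in SITE.get(url, []):
            if link in seen:
                continue  # avoid revisiting pages and looping on cycles
            seen.add(link)
            found.append(link)
            # Follow the link only while the depth budget allows it.
            if current_depth < depth:
                queue.append((link, current_depth + 1))
    return found


shallow = crawl("https://example.org/", depth=0)
deeper = crawl("https://example.org/", depth=1)
```

Carrying the current depth alongside each queued URL keeps the traversal stateless per page, which is one plausible reason for `parse` to accept `current_depth` as an argument.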