socio4health.Extractor
- class socio4health.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)
A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.
- down_ext
A list of file extensions to look for when downloading files. Available options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as other file types such as .csv and .txt. This list can be customized based on the expected file types.
- Type: list
- output_path
The directory where downloaded files will be saved. Defaults to the user's data directory.
- Type: str
- is_fwf
Whether the files to be processed are fixed-width files (FWF). Defaults to False.
- Type: bool
- colnames
Column names to use when reading fixed-width files. Required if is_fwf is True.
- Type: list
- colspecs
Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.
- Type: list
- ddtype
The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.
- Type: Union[str, Dict]
- dtype
Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.
- Type: Union[str, Dict]
- engine
The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.
- Type: str
- sheet_name
The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).
- geodriver
The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile' or 'KML'). Optional.
- Type: str
Important
If is_fwf is True and fixed-width files are given, both colnames and colspecs must be provided.
See also
Extractor.extract
Extracts data from the specified input path, either by scraping online or processing local files.
Extractor.delete_download_folder
Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.
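To make the is_fwf requirement concrete, the following standard-library sketch shows how column names and column specifications pair up when slicing one fixed-width record. It assumes colspecs are (start, end) half-open character offsets, the convention used by pandas.read_fwf; this page does not confirm the exact format the class expects. The helper parse_fwf_line and the sample record are hypothetical and are not part of socio4health.

```python
from typing import Dict, List, Optional, Tuple


def parse_fwf_line(
    line: str,
    colnames: Optional[List[str]],
    colspecs: Optional[List[Tuple[int, int]]],
) -> Dict[str, str]:
    """Slice one fixed-width record into a {column name: value} mapping."""
    # Mirror the documented rule: fixed-width input needs both arguments.
    if not colnames or not colspecs:
        raise ValueError("fixed-width parsing needs both colnames and colspecs")
    if len(colnames) != len(colspecs):
        raise ValueError("colnames and colspecs must have the same length")
    # Each (start, end) pair is a half-open slice of the line (assumed format);
    # surrounding padding is stripped from every value.
    return {
        name: line[start:end].strip()
        for name, (start, end) in zip(colnames, colspecs)
    }


# Hypothetical record: 2-char region code, 3-char age, 10-char name.
record = "05 34Ana Maria "
row = parse_fwf_line(
    record,
    colnames=["region", "age", "name"],
    colspecs=[(0, 2), (2, 5), (5, 15)],
)
print(row)
```

Passing only one of the two arguments raises ValueError in this sketch, which is the kind of early failure the note above is meant to prevent.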
- __init__(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)
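As a usage sketch based only on the signature above, the snippet below assembles keyword arguments for a local CSV run and defers the actual Extractor call to a function, so it can be read without the package installed. The directory path, the separator, and the helper names local_csv_options and run_extraction are illustrative assumptions. What extract() returns is not stated on this page, so the sketch simply passes the result through.

```python
from typing import Any, Dict


def local_csv_options(folder: str) -> Dict[str, Any]:
    """Keyword arguments for reading CSV files from a local folder."""
    return {
        "input_path": folder,   # a local directory rather than a URL
        "down_ext": [".csv"],   # file extensions of interest
        "sep": ",",             # assumed delimiter for the example data
        "encoding": "latin1",   # the documented default, stated explicitly
        "is_fwf": False,        # delimited text, so no colnames/colspecs needed
    }


def run_extraction(options: Dict[str, Any]) -> Any:
    """Build an Extractor from the options and run it (requires socio4health)."""
    from socio4health import Extractor  # imported lazily on purpose

    extractor = Extractor(**options)
    return extractor.extract()


options = local_csv_options("data/survey_2020")  # hypothetical path
print(sorted(options))
```

Keeping the options in a plain dictionary makes it easy to switch to a fixed-width run later by setting is_fwf to True and adding colnames and colspecs in one place.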
Methods
- __delattr__(name, /): Implement delattr(self, name).
- __dir__(): Default dir() implementation.
- __eq__(value, /): Return self==value.
- __format__(format_spec, /): Default object formatter.
- __ge__(value, /): Return self>=value.
- __getattribute__(name, /): Return getattr(self, name).
- __getstate__(): Helper for pickle.
- __gt__(value, /): Return self>value.
- __hash__(): Return hash(self).
- __init__([input_path, depth, down_ext, ...])
- __init_subclass__: This method is called when a class is subclassed.
- __le__(value, /): Return self<=value.
- __lt__(value, /): Return self<value.
- __ne__(value, /): Return self!=value.
- __new__(*args, **kwargs)
- __reduce__(): Helper for pickle.
- __reduce_ex__(protocol, /): Helper for pickle.
- __repr__(): Return repr(self).
- __setattr__(name, value, /): Implement setattr(self, name, value).
- __sizeof__(): Size of object in memory, in bytes.
- __str__(): Return str(self).
- __subclasshook__: Abstract classes can override this to customize issubclass().
- _extract_local_mode(): Local mode extraction that now uses the shared processing logic.
- _extract_online_mode(): Optimized online data extraction with better error handling and progress tracking.
- _process_downloaded_files(downloaded_files): Process downloaded files using local mode logic.
- _process_files_locally(files): Shared local processing logic used by both modes.
- _read_csv(filepath)
- _read_excel(filepath)
- _read_file(filepath)
- _read_geospatial(filepath)
- _read_json(filepath)
- _read_parquet(filepath)
- _read_txt(filepath)
- delete_download_folder([folder_path]): Safely delete the download folder and all its contents.
- extract(): Extracts data from the specified input path, either by scraping online sources or processing local files.
Attributes
- __annotations__
- __dict__
- __doc__
- __module__
- __weakref__: List of weak references to the object.