socio4health.Extractor
- class socio4health.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)
A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.
- down_ext
A list of file extensions to look for when downloading files. Available options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as other file types such as .csv and .txt. This list can be customized based on the expected file types.
- Type:
- output_path
The directory where downloaded files will be saved. Defaults to the user’s data directory.
- Type:
- is_fwf
Whether the files to be processed are fixed-width files (FWF). Defaults to False.
- Type:
- colnames
Column names to use when reading fixed-width files. Required if is_fwf is True.
- Type:
- colspecs
Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.
- Type:
- ddtype
The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.
- Type:
Union[str, Dict]
- dtype
Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.
- Type:
Union[str, Dict]
- engine
The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.
- Type:
- sheet_name
The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or None to read all sheets. Defaults to the first sheet (0).
- geodriver
The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile' or 'KML'). Optional.
- Type:
Important
If is_fwf is True and fixed-width files are given, both colnames and colspecs must be provided.
See also
Extractor.s4h_extract: Extracts data from the specified input path, either by scraping online sources or processing local files.
Extractor.s4h_delete_download_folder: Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.
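The rule in the Important note, and the way colnames and colspecs pair up, can be illustrated with a small standard-library sketch. It does not call socio4health: check_fwf_args and split_fwf_line are hypothetical helpers, the layout and record are made up, and the sketch assumes each entry of colspecs is a half-open (start, end) character range, as in pandas.read_fwf. This page does not confirm that convention.

```python
from typing import Dict, List, Optional, Tuple


def check_fwf_args(is_fwf: bool,
                   colnames: Optional[List[str]],
                   colspecs: Optional[List[Tuple[int, int]]]) -> None:
    """Hypothetical helper mirroring the rule in the Important note."""
    if is_fwf and (not colnames or not colspecs):
        raise ValueError("is_fwf=True requires both colnames and colspecs")


def split_fwf_line(line: str,
                   colnames: List[str],
                   colspecs: List[Tuple[int, int]]) -> Dict[str, str]:
    """Slice one fixed-width record into named fields.

    Assumption: each colspec is a half-open (start, end) character range.
    """
    return {name: line[start:end].strip()
            for name, (start, end) in zip(colnames, colspecs)}


# Made-up layout and record, for illustration only.
colnames = ["id", "age", "region"]
colspecs = [(0, 4), (4, 7), (7, 12)]

check_fwf_args(True, colnames, colspecs)  # passes: both lists are provided
record = split_fwf_line("0001 34NORTH", colnames, colspecs)
print(record)  # {'id': '0001', 'age': '34', 'region': 'NORTH'}
```

Each name in colnames labels the slice described by the colspecs entry at the same position, which is why the two lists are required together.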
- __init__(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)
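As a rough sketch of how the constructor might be called, the keyword names below come from the signature above, while every path and value is a placeholder chosen for illustration. Whether down_ext also filters files in local mode is an assumption. make_extractor is a hypothetical wrapper that defers the import, so the argument dictionaries can be inspected without the package installed.

```python
# Keyword names follow the __init__ signature; values are placeholders.
csv_kwargs = {
    "input_path": "data/survey",     # hypothetical local folder
    "down_ext": [".csv", ".zip"],    # assumption: also applies to local files
    "sep": ";",
    "encoding": "latin1",            # the documented default
}

excel_kwargs = {
    "input_path": "data/workbooks",  # hypothetical local folder
    "down_ext": [".xlsx"],
    "engine": "openpyxl",
    "sheet_name": 0,                 # first sheet, per the attribute description
}


def make_extractor(**kwargs):
    """Build an Extractor from keyword arguments (hypothetical wrapper).

    The import is deferred so this module loads without socio4health.
    """
    from socio4health import Extractor
    return Extractor(**kwargs)
```

With the package installed, make_extractor(**csv_kwargs) would return a configured instance. Reading only starts when s4h_extract is called on it.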
Methods
__delattr__(name, /): Implement delattr(self, name).
__dir__(): Default dir() implementation.
__eq__(value, /): Return self==value.
__format__(format_spec, /): Default object formatter.
__ge__(value, /): Return self>=value.
__getattribute__(name, /): Return getattr(self, name).
__getstate__(): Helper for pickle.
__gt__(value, /): Return self>value.
__hash__(): Return hash(self).
__init__([input_path, depth, down_ext, ...])
__init_subclass__: This method is called when a class is subclassed.
__le__(value, /): Return self<=value.
__lt__(value, /): Return self<value.
__ne__(value, /): Return self!=value.
__new__(*args, **kwargs)
__reduce__(): Helper for pickle.
__reduce_ex__(protocol, /): Helper for pickle.
__repr__(): Return repr(self).
__setattr__(name, value, /): Implement setattr(self, name, value).
__sizeof__(): Size of object in memory, in bytes.
__str__(): Return str(self).
__subclasshook__: Abstract classes can override this to customize issubclass().
_extract_local_mode(): Local mode extraction that now uses the shared processing logic.
_extract_online_mode(): Optimized online data extraction with better error handling and progress tracking.
_process_downloaded_files(downloaded_files): Process downloaded files using local mode logic.
_process_files_locally(files): Shared local processing logic used by both modes.
_read_csv(filepath)
_read_excel(filepath)
_read_file(filepath)
_read_geospatial(filepath)
_read_json(filepath)
_read_parquet(filepath)
_read_txt(filepath)
s4h_delete_download_folder([folder_path]): Safely delete the download folder and all its contents.
s4h_extract(): Extracts data from the specified input path, either by scraping online sources or processing local files.
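The two public methods suggest an extract-then-clean-up workflow, sketched below under stated assumptions. extract_then_cleanup is a hypothetical helper, not part of the library. This page does not document the return type of s4h_extract, and passing folder_path explicitly assumes that argument names the folder to delete. The data is consumed before cleanup in case it is read lazily from the downloaded files.

```python
def extract_then_cleanup(extractor, consume, folder_path=None):
    """Run extraction, hand the result to `consume`, then delete downloads.

    `extractor` is expected to be a socio4health.Extractor instance; any
    object exposing the same two methods works. Hypothetical helper.
    """
    data = extractor.s4h_extract()
    # Finish using the data first, in case it is read lazily from disk.
    result = consume(data)
    if folder_path is None:
        # Rely on the method's own default for which folder to remove.
        extractor.s4h_delete_download_folder()
    else:
        extractor.s4h_delete_download_folder(folder_path)
    return result
```

Taking the extractor as an argument keeps the helper independent of how the instance was configured, and makes it easy to try out with a stand-in object.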
Attributes
__annotations__
__dict__
__doc__
__module__
__weakref__: list of weak references to the object