Modules#
Extractor#
Extractor class for downloading and processing data files from various sources.
This class supports both online scraping and local file processing, handling compressed files, fixed-width files, and CSV formats.
It includes methods for downloading files, extracting data, and cleaning up after processing.
- class socio4health.extractor.Extractor(input_path: str = None, depth: int = None, down_ext: list = None, output_path: str = None, key_words: list = None, encoding: str = 'latin1', is_fwf: bool = False, colnames: list = None, colspecs: list = None, sep: str = None, ddtype: str | Dict = 'object', dtype: str = None, engine: str = None, sheet_name: str = None, geodriver: str = None)[source]#
Bases: object
A class for extracting data from various sources, including online scraping and local file processing. This class supports downloading files, extracting data from compressed formats, and reading fixed-width or CSV files. It handles both online and local modes of operation, allowing for flexible data extraction workflows.
- down_ext#
A list of file extensions to look for when downloading files. Available options include compressed formats such as .zip, .7z, .tar, .gz, and .tgz, as well as other file types like .csv, .txt, etc. This list can be customized based on the expected file types.
- Type:
list
- output_path#
The directory where downloaded files will be saved. Defaults to the user’s data directory.
- Type:
str
- is_fwf#
Whether the files to be processed are fixed-width files (FWF). Defaults to False.
- Type:
bool
- colnames#
Column names to use when reading fixed-width files. Required if is_fwf is True.
- Type:
list
- colspecs#
Column specifications for fixed-width files, defining the widths of each column. Required if is_fwf is True.
- Type:
list
- ddtype#
The data type to use when reading files. Can be a single type or a dictionary mapping column names to types. Defaults to object.
- Type:
Union[str, Dict]
- dtype#
Data types to use when reading files with pandas. Can be a string (e.g., 'object') or a dictionary mapping column names to data types.
- Type:
Union[str, Dict]
- engine#
The engine to use for reading Excel files (e.g., 'openpyxl' or 'xlrd'). Leave as None to use the default engine based on file extension.
- Type:
str
- sheet_name#
The name or index of the Excel sheet to read. Can also be a list to read multiple sheets or
None to read all sheets. Defaults to the first sheet (0).
- geodriver#
The driver to use for reading geospatial files with geopandas.read_file() (e.g., 'ESRI Shapefile', 'KML', etc.). Optional.
- Type:
str
Important
If is_fwf is True and fixed-width files are given, both colnames and colspecs must be provided.
See also
Extractor.s4h_extract
Extracts data from the specified input path, either by scraping online or processing local files.
Extractor.s4h_delete_download_folder
Safely deletes the download folder and all its contents, with safety checks to prevent accidental deletion of important directories.
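The snippet below is a minimal usage sketch based on the signature and attributes above; the URL, extension list, and output directory are illustrative and not part of the package.

```python
from socio4health.extractor import Extractor

# Illustrative source URL and extension list; adjust to your own data source.
extractor = Extractor(
    input_path="https://example.org/census/downloads",  # illustrative URL
    depth=1,                      # how many link levels to follow when scraping
    down_ext=[".csv", ".zip"],    # file extensions to download
    output_path="./downloads",    # illustrative directory for downloaded files
)

# Returns a list of Dask DataFrames with the extracted data.
ddfs = extractor.s4h_extract()
```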
- s4h_delete_download_folder(folder_path: str | None = None) bool[source]#
Safely delete the download folder and all its contents.
- Parameters:
folder_path – Optional path to delete (defaults to the download_dir used in extraction)
- Returns:
True if deletion was successful, False otherwise.
- Return type:
bool
- Raises:
ValueError – If no folder path is provided and no download_dir exists
OSError – If folder deletion fails
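A short cleanup sketch; the folder path is illustrative, and by default the extractor's own download_dir is removed.

```python
from socio4health.extractor import Extractor

extractor = Extractor(output_path="./downloads")  # illustrative path
# ... after s4h_extract() has downloaded files ...

# Remove the downloaded files; pass a path explicitly or rely on the extractor's download_dir.
ok = extractor.s4h_delete_download_folder("./downloads")
print("Cleanup succeeded:", ok)
```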
- s4h_extract()[source]#
Extracts data from the specified input path, either by scraping online sources or processing local files.
This method determines the operation mode based on the input path:
- If the input path is a URL, it performs online scraping to find downloadable files.
- If the input path is a local directory, it processes files directly from that directory.
- Returns:
List of Dask DataFrames containing the extracted data.
- Return type:
list of dask.dataframe.DataFrame
- Raises:
ValueError – If extraction fails due to an invalid input path, missing column specifications for fixed-width files, or if no valid data files are found after processing.
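For local fixed-width files, both colnames and colspecs are required, as noted above. A minimal sketch, assuming pandas-style (start, end) column positions; the directory, column names, and widths are purely illustrative.

```python
from socio4health.extractor import Extractor

fwf_extractor = Extractor(
    input_path="./raw_fwf",                 # illustrative local directory with .txt files
    down_ext=[".txt"],
    is_fwf=True,
    colnames=["id", "age", "sex"],          # illustrative column names
    colspecs=[(0, 8), (8, 11), (11, 12)],   # illustrative (start, end) positions per column
    ddtype="object",
)

ddfs = fwf_extractor.s4h_extract()          # list of Dask DataFrames
```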
- socio4health.extractor.s4h_get_default_data_dir()[source]#
Returns the default data directory for storing downloaded files.
- Returns:
pathlib.Path object representing the default data directory.
- Return type:
Path
Note
This function ensures that the directory exists by creating it if necessary.
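Example call:

```python
from socio4health.extractor import s4h_get_default_data_dir

data_dir = s4h_get_default_data_dir()  # pathlib.Path; created if it does not exist
print(data_dir)
```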
Harmonizer#
Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.
- class socio4health.harmonizer.Harmonizer(min_common_columns: int = 1, similarity_threshold: float = 1, nan_threshold: float = 1.0, sample_frac: float | None = None, column_mapping: Type[Enum] | Dict[str, Dict[str, str]] | str | Path | None = None, value_mappings: Type[Enum] | Dict[str, Dict[str, Dict[str, str]]] | str | Path | None = None, theme_info: Dict[str, List[str]] | str | Path | None = None, default_country: str | None = None, strict_mapping: bool = False, dict_df: DataFrame | None = None, categories: List[str] | None = None, key_col: str | None = None, key_val: List[str | int | float] | None = None, extra_cols: List[str] | None = None, join_key: str = None, aux_key: str | None = None)[source]#
Bases: object
Initialize the Harmonizer class for harmonizing and processing Dask DataFrames in health data integration.
- min_common_columns#
Minimum number of common columns required for vertical merge (default is 1).
- Type:
int
- similarity_threshold#
Similarity threshold to consider for vertical merge (default is 0.8).
- Type:
float
- sample_frac#
Fraction of rows to sample for NaN detection (default is None).
- Type:
float or None
- column_mapping#
Column mapping configuration (default is None).
- Type:
Enum, dict, str or Path
- value_mappings#
Categorical value mapping configuration (default is None).
- Type:
Enum, dict, str or Path
- theme_info#
Theme/category information (default is None).
- Type:
dict, str or Path
- strict_mapping#
Whether to enforce strict mapping of columns and values (default is False).
- Type:
bool
- dict_df#
DataFrame with variable dictionary (default is None).
- Type:
DataFrame or None
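A minimal construction sketch. The nesting of column_mapping and value_mappings below follows the type hints in the signature (country → raw column → harmonized column, and country → column → raw value → harmonized value); the country codes, column names, and values are illustrative assumptions.

```python
from socio4health.harmonizer import Harmonizer

# Illustrative mappings; keys and values are examples only.
column_mapping = {
    "COL": {"edad": "age", "sexo": "sex"},
    "BRA": {"idade": "age", "sexo": "sex"},
}
value_mappings = {
    "COL": {"sex": {"1": "male", "2": "female"}},
    "BRA": {"sex": {"M": "male", "F": "female"}},
}

harmonizer = Harmonizer(
    min_common_columns=1,
    similarity_threshold=0.8,
    nan_threshold=0.9,
    column_mapping=column_mapping,
    value_mappings=value_mappings,
    strict_mapping=False,
)
```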
- drop_nan_columns(ddf_or_ddfs: DataFrame | List[DataFrame]) DataFrame | List[DataFrame][source]#
Drop columns where the majority of values are NaN using instance parameters.
- Parameters:
ddf_or_ddfs – A single Dask DataFrame or a list of Dask DataFrames to process.
- Returns:
The DataFrame(s) with columns dropped where the proportion of NaN values is greater than nan_threshold.
- Return type:
dask.dataframe.DataFrame or list of dask.dataframe.DataFrame
- Raises:
ValueError – If nan_threshold is not between 0 and 1, or if sample_frac is not None or a float between 0 and 1.
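A sketch of dropping mostly-NaN columns; the sample data and threshold are illustrative, and list in / list out is assumed from the signature.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

pdf = pd.DataFrame({"id": [1, 2, 3, 4], "mostly_nan": [None, None, None, 4]})
ddf = dd.from_pandas(pdf, npartitions=1)

h = Harmonizer(nan_threshold=0.5)       # drop columns with more than 50% NaN
cleaned = h.drop_nan_columns([ddf])     # accepts a single DataFrame or a list
print(cleaned[0].columns)
```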
- s4h_compare_with_dict(ddfs: List[DataFrame]) DataFrame[source]#
Compare the columns available in the DataFrames with the variables in the dictionary and return a DataFrame with the columns that do not match in both directions.
- Parameters:
ddfs (list of dask.dataframe.DataFrame) – List of DataFrames to evaluate.
- Returns:
DataFrame with mismatched columns.
- Return type:
pd.DataFrame
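A sketch, assuming dict_df is a pandas DataFrame whose rows describe the expected variables; the dictionary's column layout is a hypothetical illustration, not the package's required schema.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

# Illustrative variable dictionary and survey data.
dict_df = pd.DataFrame({"variable_name": ["age", "sex", "income"]})  # hypothetical layout
ddf = dd.from_pandas(pd.DataFrame({"age": [30], "sex": ["male"]}), npartitions=1)

h = Harmonizer(dict_df=dict_df)
mismatches = h.s4h_compare_with_dict([ddf])   # pd.DataFrame of columns that do not match
print(mismatches)
```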
- s4h_data_selector(ddfs: List[DataFrame]) List[DataFrame][source]#
Select rows from Dask DataFrames based on the instance parameters.
- Parameters:
ddfs – List of Dask DataFrames to filter.
- Returns:
List of filtered Dask DataFrames according to the key_col, key_val, categories, and extra_cols.
- Return type:
list of dask.dataframe.DataFrame
- Raises:
KeyError – If the key_col is not found in a DataFrame.
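A filtering sketch; key_col, key_val, and extra_cols below are illustrative instance parameters.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

ddf = dd.from_pandas(
    pd.DataFrame({"dept": ["05", "11", "05"], "age": [20, 35, 41], "income": [1, 2, 3]}),
    npartitions=1,
)

h = Harmonizer(key_col="dept", key_val=["05"], extra_cols=["age"])  # illustrative filters
selected = h.s4h_data_selector([ddf])    # list of filtered Dask DataFrames
print(selected[0].compute())
```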
- static s4h_get_available_columns(df_or_dfs: DataFrame | DataFrame | List[DataFrame | DataFrame]) List[str][source]#
Get a list of unique column names from a single DataFrame or a list of DataFrames. Supports both Dask and pandas DataFrames.
- Parameters:
df_or_dfs (Union[dask.dataframe.DataFrame, pandas.DataFrame, List[Union[dask.dataframe.DataFrame, pandas.DataFrame]]]) – A single DataFrame or a list of DataFrames to extract column names from. Can be Dask DataFrames, pandas DataFrames, or a mix of both.
- Returns:
Sorted list of unique column names across all provided DataFrames.
- Return type:
list of str
- Raises:
TypeError – If the input is not a DataFrame or a list of DataFrames.
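Since this is a static method, it can be called directly on the class; the sample frames are illustrative.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

pdf = pd.DataFrame({"age": [1], "sex": ["f"]})
ddf = dd.from_pandas(pd.DataFrame({"age": [2], "income": [10]}), npartitions=1)

# Works on a mix of pandas and Dask DataFrames; returns a sorted list of unique column names.
cols = Harmonizer.s4h_get_available_columns([pdf, ddf])
print(cols)  # ['age', 'income', 'sex']
```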
- s4h_harmonize_dataframes(country_dfs: Dict[str, List[DataFrame]]) Dict[str, List[DataFrame]][source]#
Harmonize Dask DataFrames using the instance parameters.
- Parameters:
country_dfs –
Dictionary mapping country names to lists of Dask DataFrames to be harmonized.
- Returns:
Dictionary mapping country names to lists of harmonized Dask DataFrames.
- Return type:
dict of str to list of dask.dataframe.DataFrame
Note
Column and value mappings are applied per country using the provided configuration.
If strict_mapping is enabled, unmapped columns or values will raise a ValueError.
Column renaming and categorical value harmonization are performed in-place.
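Building on the construction sketch above, a hedged example of harmonizing per-country DataFrames. The country keys and data are illustrative, and value_mappings is interpreted here as country → column → value map per the type hints; the exact key convention may differ.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

col_ddf = dd.from_pandas(pd.DataFrame({"edad": [30], "sexo": ["1"]}), npartitions=1)
bra_ddf = dd.from_pandas(pd.DataFrame({"idade": [41], "sexo": ["F"]}), npartitions=1)

h = Harmonizer(
    column_mapping={"COL": {"edad": "age", "sexo": "sex"},
                    "BRA": {"idade": "age", "sexo": "sex"}},
    value_mappings={"COL": {"sex": {"1": "male", "2": "female"}},
                    "BRA": {"sex": {"M": "male", "F": "female"}}},
)

harmonized = h.s4h_harmonize_dataframes({"COL": [col_ddf], "BRA": [bra_ddf]})
print(harmonized["COL"][0].compute())
```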
- s4h_join_data(ddfs: List[DataFrame]) DataFrame[source]#
Join multiple Dask DataFrames on a specified key_col, removing duplicate columns.
- Parameters:
ddfs – List of Dask DataFrames to join.
- Returns:
Merged DataFrame with duplicate columns removed.
- Return type:
dask.dataframe.DataFrame
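A join sketch, assuming the join column is the instance's key_col as stated in the docstring; the column names and data are illustrative.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

people = dd.from_pandas(pd.DataFrame({"hh_id": [1, 2], "age": [30, 41]}), npartitions=1)
homes = dd.from_pandas(pd.DataFrame({"hh_id": [1, 2], "rooms": [3, 2]}), npartitions=1)

h = Harmonizer(key_col="hh_id")            # illustrative join column
joined = h.s4h_join_data([people, homes])  # merged DataFrame, duplicate columns removed
print(joined.compute())
```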
- s4h_vertical_merge(ddfs: List[DataFrame]) List[DataFrame][source]#
Merge a list of Dask DataFrames vertically using instance parameters.
- Parameters:
ddfs –
List of Dask DataFrames to be merged.
- Returns:
List of merged Dask DataFrames, where each group contains DataFrames with sufficient column overlap and compatible data types.
- Return type:
list of dask.dataframe.DataFrame
Notes
DataFrames are grouped and merged if they share at least min_common_columns columns and their column similarity is above similarity_threshold.
Only columns with matching data types are considered compatible for merging.
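A vertical-merge sketch with illustrative data; the two DataFrames share all of their columns, so with the thresholds shown they are expected to end up in a single merged group.

```python
import pandas as pd
import dask.dataframe as dd
from socio4health.harmonizer import Harmonizer

y2019 = dd.from_pandas(pd.DataFrame({"age": [30, 41], "sex": ["m", "f"]}), npartitions=1)
y2020 = dd.from_pandas(pd.DataFrame({"age": [25, 63], "sex": ["f", "m"]}), npartitions=1)

h = Harmonizer(min_common_columns=1, similarity_threshold=0.8)
merged_groups = h.s4h_vertical_merge([y2019, y2020])   # list of merged Dask DataFrames
print(merged_groups[0].compute())
```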