socio4health.utils

utils.extractor_utils

socio4health.utils.extractor_utils.compressed2files(input_archive, target_directory, down_ext, current_depth=0, max_depth=5, found_files={})

Extract files from a compressed archive and return the paths of the extracted files.

Parameters:
  • input_archive (str) – The path to the compressed archive file.

  • target_directory (str) – The directory where the extracted files will be saved.

  • down_ext (list) – A list of file extensions to filter the extracted files.

  • current_depth (int, optional) – The current depth of extraction, used to limit recursion depth. Default is 0.

  • max_depth (int, optional) – The maximum extraction depth, used to prevent infinite recursion on nested archives. Default is 5.

  • found_files (set, optional) – A set of files already found, used to avoid duplicates. Described as defaulting to an empty set, although the signature writes the default as {}, which Python evaluates to an empty dict.

Returns:

A set containing the paths of the extracted files that match the specified extensions.

Return type:

set
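
The depth-limited recursion described above can be illustrated with a small standalone sketch. It is not the library's implementation: the helper name extract_matching is hypothetical, only ZIP archives are handled (via the standard zipfile module), and extensions are assumed to be lowercase strings with a leading dot such as ".csv".

```python
import os
import tempfile
import zipfile


def extract_matching(archive, target_dir, extensions, depth=0, max_depth=5, found=None):
    """Extract a ZIP archive, descending into nested ZIPs up to max_depth."""
    if found is None:  # create a fresh set for each top-level call
        found = set()
    if depth > max_depth:  # stop before unpacking archives nested too deeply
        return found
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        names = zf.namelist()
    for name in names:
        path = os.path.join(target_dir, name)
        if os.path.isdir(path):
            continue
        if name.lower().endswith(tuple(extensions)):
            found.add(path)  # a set keeps each matching path only once
        elif zipfile.is_zipfile(path):
            # Unpack the nested archive into a sibling directory, one level deeper.
            extract_matching(path, path + "_extracted", extensions,
                             depth + 1, max_depth, found)
    return found


# Demo: an outer archive holding one CSV, one text file, and a nested archive.
with tempfile.TemporaryDirectory() as tmp:
    inner = os.path.join(tmp, "inner.zip")
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("data.csv", "a,b\n1,2\n")
    outer = os.path.join(tmp, "outer.zip")
    with zipfile.ZipFile(outer, "w") as zf:
        zf.write(inner, "inner.zip")
        zf.writestr("top.csv", "x\n1\n")
        zf.writestr("notes.txt", "ignored")

    found = extract_matching(outer, os.path.join(tmp, "out"), [".csv"])
    names = sorted(os.path.basename(p) for p in found)

    shallow = extract_matching(outer, os.path.join(tmp, "out_shallow"), [".csv"], max_depth=0)
    shallow_names = sorted(os.path.basename(p) for p in shallow)
```

With the default depth the nested data.csv is found as well as top.csv; with max_depth=0 only the top-level match is returned. The sketch uses found=None rather than a mutable default so that repeated top-level calls do not share state.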

socio4health.utils.extractor_utils.create_unique_path(archive_path, filename, target_dir)

Generate a unique destination path.

socio4health.utils.extractor_utils.download_request(url, filename, download_dir)

Download a file from the specified URL and save it to the given directory.

Parameters:
  • url (str) – The URL of the file to download.

  • filename (str) – The name under which the downloaded file is saved.

  • download_dir (str) – The directory where the file will be saved.

Returns:

The path to the downloaded file, or None if the download failed.

Return type:

str or None
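
The "path on success, None on failure" contract can be sketched as follows. This is an illustration only: the helper name fetch_to_dir is hypothetical, and it uses the standard library's urllib rather than whatever HTTP client the library itself relies on.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def fetch_to_dir(url, filename, download_dir):
    """Save the resource at url as download_dir/filename; return the path or None."""
    os.makedirs(download_dir, exist_ok=True)
    dest = os.path.join(download_dir, filename)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)  # stream to disk instead of loading into memory
    except (OSError, ValueError):
        # URLError and HTTPError are subclasses of OSError; ValueError covers malformed URLs.
        return None
    return dest


# Demo with a local file:// URL so the example runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "source.txt"
    src.write_text("hello")
    saved = fetch_to_dir(src.as_uri(), "copy.txt", os.path.join(tmp, "downloads"))
    saved_text = pathlib.Path(saved).read_text() if saved else None

    missing = fetch_to_dir((pathlib.Path(tmp) / "absent.txt").as_uri(), "x.txt", tmp)
```

Callers of a function with this contract should check for None before using the returned path.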

socio4health.utils.extractor_utils.run_standard_spider(url, depth, down_ext, key_words)

Run the Scrapy spider to extract data from the given URL.

Parameters:
  • url (str) – The URL to start crawling from.

  • depth (int) – The depth of the crawl.

  • down_ext (list) – List of file extensions to download.

  • key_words (list) – List of keywords to filter the crawled data.

Returns:

True if the spider completed successfully, False otherwise.

Return type:

bool

socio4health.utils.extractor_utils.s4h_parse_fwf_dict(dict_df)

Parse a fixed-width format dictionary stored in a pandas DataFrame.

The DataFrame must contain at least the following columns: variable_name and initial_position. Either size or final_position must be present to compute column spans.

Parameters:

dict_df (pandas.DataFrame) – Dictionary table describing fixed-width columns.

Returns:

(colnames, colspecs) where colnames is a list of column names and colspecs is a list of (start, end) integer tuples suitable for use with pandas.read_fwf (0-based, end exclusive).

Return type:

tuple

Raises:

ValueError – If required columns are missing.
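
The conversion from dictionary rows to column spans can be sketched in plain Python. This is a minimal illustration, not the library's code: the helper name parse_fwf_records is hypothetical, rows are given as a list of dicts (for example from dict_df.to_dict("records")), and initial_position and final_position are assumed to be 1-based and inclusive, which the text above does not state.

```python
def parse_fwf_records(records):
    """Return (colnames, colspecs) with 0-based, end-exclusive (start, end) spans."""
    colnames, colspecs = [], []
    for row in records:
        missing = {"variable_name", "initial_position"} - row.keys()
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        if row.get("size") is None and row.get("final_position") is None:
            raise ValueError("either size or final_position is required")

        # Assumption: positions in the dictionary are 1-based.
        start = int(row["initial_position"]) - 1
        if row.get("size") is not None:
            end = start + int(row["size"])
        else:
            # A 1-based inclusive end equals the 0-based exclusive end.
            end = int(row["final_position"])

        colnames.append(row["variable_name"])
        colspecs.append((start, end))
    return colnames, colspecs


records = [
    {"variable_name": "region", "initial_position": 1, "size": 2},
    {"variable_name": "age", "initial_position": 3, "final_position": 5},
]
colnames, colspecs = parse_fwf_records(records)

# Slicing a raw line with the spans mirrors what a fixed-width reader does.
line = "05 42"
values = {name: line[start:end].strip()
          for name, (start, end) in zip(colnames, colspecs)}
```

Because the spans are 0-based and end-exclusive, they can be used directly as Python slices, and they are in the same form that the colspecs argument of pandas.read_fwf accepts.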

utils.harmonizer_utils

socio4health.utils.harmonizer_utils.apply_value_mappings(dfs, year, value_mappings, column_aliases=None)

socio4health.utils.harmonizer_utils.extract_and_prepare_data(year, path, ext, sep=None, output_path=None, colnames=None, colspecs=None, on_bad_lines='warn')

socio4health.utils.harmonizer_utils.group_and_onehot_encode(dfs, group_col, weight_col, id_col, value_labels_by_column=None)

socio4health.utils.harmonizer_utils.harmonize_columns_by_year(dfs, year, year_mappings)

socio4health.utils.harmonizer_utils.merge_factor(dfs, factor_col, id_col)

socio4health.utils.harmonizer_utils.select_and_filter_columns(dfs, col_cols, num_cols_threshold)