socio4health.utils
utils.extractor_utils
- socio4health.utils.extractor_utils.compressed2files(input_archive, target_directory, down_ext, current_depth=0, max_depth=5, found_files={})
Extract files from a compressed archive and return the paths of the extracted files.
- Parameters:
input_archive (str) – The path to the compressed archive file.
target_directory (str) – The directory where the extracted files will be saved.
down_ext (list) – A list of file extensions to filter the extracted files.
current_depth (int, optional) – The current depth of extraction, used to limit recursion depth. Default is 0.
max_depth (int, optional) – The maximum extraction depth, used to prevent infinite recursion. Default is 5.
found_files (set, optional) – A set to keep track of already found files, used to avoid duplicates. Default is an empty set.
- Returns:
A set containing the paths of the extracted files that match the specified extensions.
- Return type:
set
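The depth-limited recursion described above can be illustrated with a minimal standard-library sketch. It is not the socio4health implementation: the function name `extract_matching` is hypothetical, only zip archives are handled, and extensions are assumed to be lowercase with a leading dot (for example `".csv"`).

```python
import os
import zipfile


def extract_matching(archive_path, target_dir, extensions,
                     depth=0, max_depth=5, found=None):
    """Recursively extract a zip archive and collect files by extension.

    Hypothetical illustration only; not the library's own code.
    """
    if found is None:
        found = set()  # fresh set for each top-level call
    if depth > max_depth:
        return found  # stop descending into deeper nested archives

    os.makedirs(target_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target_dir)
        # Skip directory entries, which end with "/"
        names = [n for n in zf.namelist() if not n.endswith("/")]

    for name in names:
        path = os.path.join(target_dir, name)
        if zipfile.is_zipfile(path):
            # Nested archive: extract it into a sibling directory
            nested_dir = os.path.splitext(path)[0] + "_extracted"
            extract_matching(path, nested_dir, extensions,
                             depth + 1, max_depth, found)
        elif path.lower().endswith(tuple(extensions)):
            found.add(path)
    return found
```

The sketch creates the result set inside the function instead of relying on a default argument value, so repeated calls do not share state.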
- socio4health.utils.extractor_utils.create_unique_path(archive_path, filename, target_dir)
Generate a unique destination path.
- socio4health.utils.extractor_utils.download_request(url, filename, download_dir)
Download a file from the specified URL and save it to the given directory.
- socio4health.utils.extractor_utils.run_standard_spider(url, depth, down_ext, key_words)
Run the Scrapy spider to extract data from the given URL.
- socio4health.utils.extractor_utils.s4h_parse_fwf_dict(dict_df)
Parse a fixed-width format dictionary stored in a pandas DataFrame.
The DataFrame must contain at least the following columns: variable_name and initial_position. Either size or final_position must be present to compute column spans.
- Parameters:
dict_df (pandas.DataFrame) – Dictionary table describing fixed-width columns.
- Returns:
(colnames, colspecs), where colnames is a list of column names and colspecs is a list of (start, end) integer tuples suitable for use with pandas.read_fwf (0-based, end exclusive).
- Return type:
- Raises:
ValueError – If required columns are missing.
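To make the span computation concrete, here is a minimal pure-Python sketch of how such a dictionary could be turned into column names and 0-based, end-exclusive spans. It is an illustration under stated assumptions, not the library's code: the function name `fwf_colspecs` is hypothetical, the dictionary is given as a list of plain dicts rather than a DataFrame, and initial_position and final_position are assumed to be 1-based and inclusive.

```python
def fwf_colspecs(rows):
    """Build (colnames, colspecs) from fixed-width dictionary rows.

    Hypothetical illustration; assumes 1-based, inclusive positions.
    """
    colnames, colspecs = [], []
    for row in rows:
        missing = {"variable_name", "initial_position"} - row.keys()
        if missing:
            raise ValueError(f"missing required columns: {sorted(missing)}")
        if "size" not in row and "final_position" not in row:
            raise ValueError("either size or final_position is required")

        # Convert the 1-based start to a 0-based offset
        start = int(row["initial_position"]) - 1
        if "size" in row:
            end = start + int(row["size"])
        else:
            # A 1-based inclusive end equals a 0-based exclusive end
            end = int(row["final_position"])

        colnames.append(row["variable_name"])
        colspecs.append((start, end))
    return colnames, colspecs


names, specs = fwf_colspecs([
    {"variable_name": "id", "initial_position": 1, "size": 4},
    {"variable_name": "age", "initial_position": 5, "final_position": 7},
])
print(names, specs)  # ['id', 'age'] [(0, 4), (4, 7)]
```

A result of this shape can be passed to pandas.read_fwf through its colspecs and names arguments.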
utils.harmonizer_utils
- socio4health.utils.harmonizer_utils.apply_value_mappings(dfs, year, value_mappings, column_aliases=None)
- socio4health.utils.harmonizer_utils.extract_and_prepare_data(year, path, ext, sep=None, output_path=None, colnames=None, colspecs=None, on_bad_lines='warn')