socio4health.utils
utils.extractor_utils
- socio4health.utils.extractor_utils.compressed2files(input_archive, target_directory, down_ext, current_depth=0, max_depth=5, found_files={})
Extract files from a compressed archive and return the paths of the extracted files.
- Parameters:
input_archive (str) – The path to the compressed archive file.
target_directory (str) – The directory where the extracted files will be saved.
down_ext (list) – A list of file extensions to filter the extracted files.
current_depth (int, optional) – The current depth of extraction, used to limit recursion depth. Default is 0.
max_depth (int, optional) – The maximum extraction depth, used to prevent infinite recursion. Default is 5.
found_files (set, optional) – A set to keep track of already found files, used to avoid duplicates. Default is an empty set.
- Returns:
A set containing the paths of the extracted files that match the specified extensions.
- Return type:
set
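The recursion and depth limit described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `extract_matching` is hypothetical, only ZIP archives are handled, and extensions are assumed to be given with a leading dot (for example `".csv"`). The sketch also uses `None` rather than a mutable object as the default for `found_files`, so that results do not carry over between separate calls.

```python
import os
import zipfile


def extract_matching(input_archive, target_directory, down_ext,
                     current_depth=0, max_depth=5, found_files=None):
    """Hypothetical sketch: extract a ZIP recursively and collect matching paths."""
    if found_files is None:
        found_files = set()  # fresh set per top-level call
    if current_depth > max_depth:
        return found_files  # stop descending into nested archives
    os.makedirs(target_directory, exist_ok=True)
    with zipfile.ZipFile(input_archive) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            extracted = archive.extract(member, target_directory)
            if zipfile.is_zipfile(extracted):
                # Nested archive: unpack it into a sibling folder, one level deeper.
                nested_dir = os.path.splitext(extracted)[0] + "_extracted"
                extract_matching(extracted, nested_dir, down_ext,
                                 current_depth + 1, max_depth, found_files)
            elif os.path.splitext(extracted)[1].lower() in down_ext:
                found_files.add(extracted)  # a set avoids duplicate paths
    return found_files
```

Calling `extract_matching("outer.zip", "out", [".csv"])` on an archive that contains a CSV file and a nested ZIP would return the paths of the CSV files found at both levels; with `max_depth=0` only the top level is unpacked.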
- socio4health.utils.extractor_utils.create_unique_path(archive_path, filename, target_dir)
Generate a unique destination path.
- socio4health.utils.extractor_utils.download_request(url, filename, download_dir)
Download a file from the specified URL and save it to the given directory.
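As a rough illustration of this kind of helper, the following sketch streams a URL to a file using only the standard library. The name `download_to_dir`, the use of `urllib`, and the returned destination path are all assumptions made for the example; the documentation above does not state how `download_request` is implemented or what it returns.

```python
import os
import shutil
import urllib.request


def download_to_dir(url, filename, download_dir):
    """Hypothetical sketch: fetch `url` and write it to download_dir/filename."""
    os.makedirs(download_dir, exist_ok=True)
    destination = os.path.join(download_dir, filename)
    # Copy the response in chunks so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
    return destination  # assumption: returning the saved path is convenient
```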
- socio4health.utils.extractor_utils.parse_fwf_dict(dict_df)
Parse a dictionary DataFrame to extract column names and fixed-width format specifications.
- Parameters:
dict_df (pandas.DataFrame) – A DataFrame containing the dictionary information with columns:
  - 'variable_name': Column names
  - 'initial_position': Starting position (1-based) of each column
  - 'size': Width of each column
- Returns:
A tuple containing:
  - A list of column names.
  - A list of tuples representing column specifications (start, end), where start is the 0-based starting position and end is the 0-based ending position (exclusive).
- Return type:
tuple
- Raises:
ValueError – If no column names or sizes are found in the dictionary DataFrame.
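The conversion from 1-based positions and widths to 0-based, end-exclusive column specifications can be sketched as follows. This is an illustration only: `colspecs_from_dictionary` is a hypothetical name, and plain dictionaries stand in for the rows of the DataFrame so the example has no dependencies.

```python
def colspecs_from_dictionary(rows):
    """Hypothetical sketch: build (names, colspecs) from dictionary rows."""
    names, colspecs = [], []
    for row in rows:
        start = int(row["initial_position"]) - 1  # 1-based -> 0-based
        end = start + int(row["size"])            # exclusive end
        names.append(row["variable_name"])
        colspecs.append((start, end))
    if not names:
        raise ValueError("No column names or sizes found in the dictionary.")
    return names, colspecs


rows = [
    {"variable_name": "region", "initial_position": 1, "size": 2},
    {"variable_name": "age", "initial_position": 3, "size": 3},
]
names, colspecs = colspecs_from_dictionary(rows)

# Because the specs are half-open, they can be used directly as slice bounds.
line = "05042"
record = {name: line[start:end] for name, (start, end) in zip(names, colspecs)}
```

Half-open (start, end) pairs are also the form accepted by the `colspecs` argument of `pandas.read_fwf`, together with `names` for the column labels.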
utils.harmonizer_utils
- socio4health.utils.harmonizer_utils.classify_rows(data: DataFrame, col1: str, col2: str, col3: str, new_column_name: str = 'category', MODEL_PATH: str = './bert_finetuned_classifier') → DataFrame
Classify each row using a fine-tuned multiclass classification BERT model.
- Parameters:
data (pd.DataFrame) – The DataFrame with text columns.
col1 (str) – Name of the first column containing survey-related text.
col2 (str) – Name of the second column containing survey-related text.
col3 (str) – Name of the third column containing survey-related text.
new_column_name (str, optional) – Name of the new column to store the predicted categories (default is 'category').
MODEL_PATH (str) – Path to the model weights (default is './bert_finetuned_classifier').
- Returns:
pd.DataFrame with a new prediction column.
- Return type:
DataFrame
- socio4health.utils.harmonizer_utils.get_classifier(MODEL_PATH: str) → Pipeline
Load the fine-tuned BERT model for classification only once.
- Parameters:
MODEL_PATH (str)
- Returns:
A Hugging Face pipeline for text classification.
- Return type:
Pipeline
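One common way to load a model "only once" is to memoize the loader. The sketch below shows that pattern with `functools.lru_cache`; whether `get_classifier` uses this mechanism is an assumption. The name `get_cached_classifier` and the `load_count` counter are hypothetical, and a trivial stand-in replaces the real model so the example has no dependencies.

```python
from functools import lru_cache

load_count = 0  # counts how many times the expensive load actually runs


@lru_cache(maxsize=None)
def get_cached_classifier(model_path):
    """Hypothetical sketch: build the classifier once per model path."""
    global load_count
    load_count += 1
    # A real loader might instead call, for example,
    # transformers.pipeline("text-classification", model=model_path).
    # The stand-in below just echoes its input with a fixed label.
    return lambda text: {"label": "unknown", "model": model_path, "text": text}


first = get_cached_classifier("./bert_finetuned_classifier")
second = get_cached_classifier("./bert_finetuned_classifier")
```

Both calls return the same object, and the loader body runs a single time, so repeated classification calls do not pay the model-loading cost again.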
- socio4health.utils.harmonizer_utils.standardize_dict(raw_dict: DataFrame) → DataFrame
Cleans and structures a dictionary-like DataFrame of variables by standardizing text fields, grouping possible answers, and removing duplicates.
- Parameters:
raw_dict (pd.DataFrame) – DataFrame containing the required columns question, variable_name, description, value, and optionally subquestion.
- Returns:
A cleaned DataFrame grouped by question and variable_name, with an additional column possible_answers containing concatenated descriptions.
- Return type:
DataFrame