socio4health.utils#

utils.extractor_utils#

socio4health.utils.extractor_utils.compressed2files(input_archive, target_directory, down_ext, current_depth=0, max_depth=5, found_files={})[source]#

Extract files from a compressed archive and return the paths of the extracted files.

Parameters:
  • input_archive (str) – The path to the compressed archive file.

  • target_directory (str) – The directory where the extracted files will be saved.

  • down_ext (list) – A list of file extensions to filter the extracted files.

  • current_depth (int, optional) – The current depth of extraction, used to limit recursion depth. Default is 0.

  • max_depth (int, optional) – The maximum depth of extraction, used to prevent infinite recursion. Default is 5.

  • found_files (set, optional) – A set to keep track of already found files, used to avoid duplicates. Default is an empty set.

Returns:

A set containing the paths of the extracted files that match the specified extensions.

Return type:

set
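
Example (a minimal usage sketch; the archive path, target directory, and extensions below are hypothetical):

  from socio4health.utils.extractor_utils import compressed2files

  # Extract only CSV and TXT files from a (hypothetical) downloaded archive
  extracted = compressed2files(
      input_archive="downloads/survey_data.zip",
      target_directory="extracted",
      down_ext=[".csv", ".txt"],
  )
  for path in extracted:
      print(path)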

socio4health.utils.extractor_utils.create_unique_path(archive_path, filename, target_dir)[source]#

Generate a unique destination path for a file extracted from an archive.
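
Example (a minimal sketch; the paths are hypothetical and the exact naming scheme of the returned path depends on the implementation):

  from socio4health.utils.extractor_utils import create_unique_path

  # Build a collision-free destination path for a file taken from an archive
  dest = create_unique_path(
      archive_path="downloads/survey_data.zip",
      filename="respondents.csv",
      target_dir="extracted",
  )
  print(dest)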

socio4health.utils.extractor_utils.download_request(url, filename, download_dir)[source]#

Download a file from the specified URL and save it to the given directory.

Parameters:
  • url (str) – The URL of the file to download.

  • filename (str) – The name under which to save the downloaded file.

  • download_dir (str) – The directory where the file will be saved.

Returns:

The path to the downloaded file, or None if the download failed.

Return type:

str
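
Example (a minimal sketch; the URL and file name are hypothetical):

  from socio4health.utils.extractor_utils import download_request

  # Download a (hypothetical) file; the function returns the saved path,
  # or None if the download failed
  path = download_request(
      url="https://example.org/data/microdata.zip",
      filename="microdata.zip",
      download_dir="downloads",
  )
  if path is None:
      print("Download failed")
  else:
      print(f"Saved to {path}")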

socio4health.utils.extractor_utils.parse_fwf_dict(dict_df)[source]#

Parse a dictionary DataFrame to extract column names and fixed-width format specifications.

Parameters:

dict_df (pandas.DataFrame) – A DataFrame containing the dictionary information with the columns:

  • ‘variable_name’: Column names

  • ‘initial_position’: Starting position (1-based) of each column

  • ‘size’: Width of each column

Returns:

A tuple containing:

  • A list of column names.

  • A list of tuples representing column specifications (start, end), where:

    • start is the 0-based starting position

    • end is the 0-based ending position (exclusive)

Return type:

tuple

Raises:

ValueError – If no column names or sizes are found in the dictionary DataFrame.
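
Example (a minimal sketch with a hypothetical two-column dictionary, showing how the output is expected to map onto pandas.read_fwf arguments):

  import pandas as pd

  from socio4health.utils.extractor_utils import parse_fwf_dict

  # Hypothetical dictionary describing a fixed-width file with two columns
  dict_df = pd.DataFrame({
      "variable_name": ["age", "sex"],
      "initial_position": [1, 3],  # 1-based starting positions
      "size": [2, 1],              # column widths
  })

  names, colspecs = parse_fwf_dict(dict_df)
  # Given the documented 1-based to 0-based conversion, the expected output is
  # names    -> ["age", "sex"]
  # colspecs -> [(0, 2), (2, 3)]

  # The result can be passed straight to pandas (file name hypothetical):
  # df = pd.read_fwf("data.txt", colspecs=colspecs, names=names)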

socio4health.utils.extractor_utils.run_standard_spider(url, depth, down_ext, key_words)[source]#

Run the Scrapy spider to extract data from the given URL.

Parameters:
  • url (str) – The URL to start crawling from.

  • depth (int) – The depth of the crawl.

  • down_ext (list) – List of file extensions to download.

  • key_words (list) – List of keywords to filter the crawled data.

Return type:

None
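
Example (a minimal sketch; the URL and keywords are hypothetical):

  from socio4health.utils.extractor_utils import run_standard_spider

  # Crawl a (hypothetical) statistics portal two levels deep, keeping only
  # links to ZIP and CSV files that match the given keywords
  run_standard_spider(
      url="https://example.org/microdata",
      depth=2,
      down_ext=[".zip", ".csv"],
      key_words=["census", "household"],
  )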

utils.harmonizer_utils#

socio4health.utils.harmonizer_utils.classify_rows(data: DataFrame, col1: str, col2: str, col3: str, new_column_name: str = 'category', MODEL_PATH: str = './bert_finetuned_classifier') → DataFrame[source]#

Classify each row using a fine-tuned multiclass BERT classification model.

Parameters:
  • data (pd.DataFrame) – The DataFrame with text columns.

  • col1 (str) – Name of the first column containing survey-related text.

  • col2 (str) – Name of the second column containing survey-related text.

  • col3 (str) – Name of the third column containing survey-related text.

  • new_column_name (str, optional) – Name of the new column to store the predicted categories (default is category).

  • MODEL_PATH (str) – Path to the model weights (default is ./bert_finetuned_classifier).

Returns:

The input DataFrame with a new column containing the predicted categories.

Return type:

pd.DataFrame
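
Example (a minimal sketch; the column names and text are hypothetical, and the MODEL_PATH must point to an existing fine-tuned checkpoint):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import classify_rows

  # Hypothetical harmonization dictionary with three survey-related text columns
  data = pd.DataFrame({
      "question": ["What is your age?"],
      "description": ["Age of the respondent in completed years"],
      "possible_answers": ["0-120"],
  })

  classified = classify_rows(
      data,
      "question",
      "description",
      "possible_answers",
      new_column_name="category",
      MODEL_PATH="./bert_finetuned_classifier",
  )
  print(classified["category"])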

socio4health.utils.harmonizer_utils.get_classifier(MODEL_PATH: str) → Pipeline[source]#

Load the fine-tuned BERT classification model, ensuring it is loaded only once.

Parameters:

MODEL_PATH (str) – Path to the fine-tuned model weights.

Returns:

A HuggingFace pipeline for text classification.

Return type:

Pipeline
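
Example (a minimal sketch; the path assumes the fine-tuned weights are available locally):

  from socio4health.utils.harmonizer_utils import get_classifier

  # Load the classifier once and reuse it across calls
  classifier = get_classifier("./bert_finetuned_classifier")

  # The result is a standard HuggingFace text-classification pipeline
  print(classifier("Age of the respondent in completed years"))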

socio4health.utils.harmonizer_utils.standardize_dict(raw_dict: DataFrame) → DataFrame[source]#

Clean and structure a dictionary-like DataFrame of variables by standardizing text fields, grouping possible answers, and removing duplicates.

Parameters:

raw_dict (pd.DataFrame) – DataFrame containing the required columns: question, variable_name, description, value, and optionally subquestion.

Returns:

A cleaned DataFrame grouped by question and variable_name, with an additional possible_answers column containing the concatenated descriptions.

Return type:

pd.DataFrame
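
Example (a minimal sketch with a hypothetical raw dictionary; the exact formatting of possible_answers depends on the implementation):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import standardize_dict

  # Hypothetical raw dictionary with one value row per possible answer
  raw_dict = pd.DataFrame({
      "question": ["Sex of the respondent"] * 2,
      "variable_name": ["sex"] * 2,
      "description": ["Male", "Female"],
      "value": [1, 2],
  })

  clean = standardize_dict(raw_dict)
  # Expect one row per (question, variable_name), with a possible_answers
  # column concatenating the descriptions
  print(clean)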

socio4health.utils.harmonizer_utils.translate_column(data: DataFrame, column: str, language: str = 'en') → DataFrame[source]#

Translate the content of a selected column in a DataFrame using Google Translate.

Parameters:
  • data (pd.DataFrame) – The DataFrame containing the text columns.

  • column (str) – Name of the column to translate.

  • language (str) – Target language code (default is en).

Returns:

The original DataFrame with an additional column containing the translated text.

Return type:

pd.DataFrame
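
Example (a minimal sketch; the Spanish text is hypothetical, and the call requires network access to Google Translate):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import translate_column

  data = pd.DataFrame({"description": ["Edad del encuestado en años cumplidos"]})

  # Translate the description column into English
  translated = translate_column(data, column="description", language="en")
  print(translated)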