socio4health.utils#

utils.extractor_utils#

socio4health.utils.extractor_utils.compressed2files(input_archive, target_directory, down_ext, current_depth=0, max_depth=5, found_files={})[source]#

Extract files from a compressed archive and return the paths of the extracted files.

Parameters:
  • input_archive (str) – The path to the compressed archive file.

  • target_directory (str) – The directory where the extracted files will be saved.

  • down_ext (list) – A list of file extensions to filter the extracted files.

  • current_depth (int, optional) – The current depth of extraction, used to limit recursion depth. Default is 0.

  • max_depth (int, optional) – The maximum depth of extraction, used to prevent infinite recursion. Default is 5.

  • found_files (set, optional) – A set to keep track of already found files, used to avoid duplicates. Default is an empty set.

Returns:

A set containing the paths of the extracted files that match the specified extensions.

Return type:

set
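
Example (a minimal usage sketch; the archive path, target directory, and extensions below are hypothetical):

  from socio4health.utils.extractor_utils import compressed2files

  # Extract only CSV and TXT files from a (hypothetical) downloaded archive
  extracted = compressed2files(
      input_archive="downloads/survey_data.zip",
      target_directory="extracted",
      down_ext=[".csv", ".txt"],
  )
  for path in extracted:
      print(path)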

socio4health.utils.extractor_utils.create_unique_path(archive_path, filename, target_dir)[source]#

Generate a unique destination path for a file extracted from an archive.
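
Example (a minimal sketch; the paths are hypothetical and the exact naming scheme of the returned path depends on the implementation):

  from socio4health.utils.extractor_utils import create_unique_path

  # Build a collision-free destination path for a file taken from an archive
  dest = create_unique_path(
      archive_path="downloads/survey_data.zip",
      filename="respondents.csv",
      target_dir="extracted",
  )
  print(dest)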

socio4health.utils.extractor_utils.download_request(url, filename, download_dir)[source]#

Download a file from the specified URL and save it to the given directory.

Parameters:
  • url (str) – The URL of the file to download.

  • filename (str) – The name under which to save the downloaded file.

  • download_dir (str) – The directory where the file will be saved.

Returns:

The path to the downloaded file, or None if the download failed.

Return type:

str
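
Example (a minimal sketch; the URL and file name are hypothetical):

  from socio4health.utils.extractor_utils import download_request

  # Download a (hypothetical) file; the function returns the saved path,
  # or None if the download failed
  path = download_request(
      url="https://example.org/data/microdata.zip",
      filename="microdata.zip",
      download_dir="downloads",
  )
  if path is None:
      print("Download failed")
  else:
      print(f"Saved to {path}")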

socio4health.utils.extractor_utils.parse_fwf_dict(dict_df)[source]#

Parse a dictionary DataFrame to extract column names and fixed-width format specifications.

Parameters:

dict_df (pandas.DataFrame) – A DataFrame containing the dictionary information with the columns:

  • ‘variable_name’: Column names

  • ‘initial_position’: Starting position (1-based) of each column

  • ‘size’: Width of each column

Returns:

A tuple containing:

  • A list of column names.

  • A list of tuples representing column specifications (start, end), where:

    • start is the 0-based starting position

    • end is the 0-based ending position (exclusive)

Return type:

tuple

Raises:

ValueError – If no column names or sizes are found in the dictionary DataFrame.
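
Example (a minimal sketch with a hypothetical two-column dictionary, showing how the output is expected to map onto pandas.read_fwf arguments):

  import pandas as pd

  from socio4health.utils.extractor_utils import parse_fwf_dict

  # Hypothetical dictionary describing a fixed-width file with two columns
  dict_df = pd.DataFrame({
      "variable_name": ["age", "sex"],
      "initial_position": [1, 3],  # 1-based starting positions
      "size": [2, 1],              # column widths
  })

  names, colspecs = parse_fwf_dict(dict_df)
  # Given the documented 1-based to 0-based conversion, the expected output is
  # names    -> ["age", "sex"]
  # colspecs -> [(0, 2), (2, 3)]

  # The result can be passed straight to pandas (file name hypothetical):
  # df = pd.read_fwf("data.txt", colspecs=colspecs, names=names)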

socio4health.utils.extractor_utils.run_standard_spider(url, depth, down_ext, key_words)[source]#

Run the Scrapy spider to extract data from the given URL.

Parameters:
  • url (str) – The URL to start crawling from.

  • depth (int) – The depth of the crawl.

  • down_ext (list) – List of file extensions to download.

  • key_words (list) – List of keywords to filter the crawled data.

Return type:

None
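
Example (a minimal sketch; the URL and keywords are hypothetical):

  from socio4health.utils.extractor_utils import run_standard_spider

  # Crawl a (hypothetical) statistics portal two levels deep, keeping only
  # links to ZIP and CSV files that match the given keywords
  run_standard_spider(
      url="https://example.org/microdata",
      depth=2,
      down_ext=[".zip", ".csv"],
      key_words=["census", "household"],
  )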

utils.harmonizer_utils#

socio4health.utils.harmonizer_utils.classify_rows(data: DataFrame, col1: str, col2: str, col3: str, new_column_name: str = 'category', MODEL_PATH: str = './bert_finetuned_classifier') → DataFrame[source]#

Classify each row using a fine-tuned multiclass BERT classification model.

Parameters:
  • data (pd.DataFrame) – The DataFrame with text columns.

  • col1 (str) – Name of the first column containing survey-related text.

  • col2 (str) – Name of the second column containing survey-related text.

  • col3 (str) – Name of the third column containing survey-related text.

  • new_column_name (str, optional) – Name of the new column to store the predicted categories (default is category).

  • MODEL_PATH (str) – Path to the model weights (default is ./bert_finetuned_classifier).

Returns:

The input DataFrame with a new column containing the predicted categories.

Return type:

pd.DataFrame
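
Example (a minimal sketch; the column names and text are hypothetical, and the MODEL_PATH must point to an existing fine-tuned checkpoint):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import classify_rows

  # Hypothetical harmonization dictionary with three survey-related text columns
  data = pd.DataFrame({
      "question": ["What is your age?"],
      "description": ["Age of the respondent in completed years"],
      "possible_answers": ["0-120"],
  })

  classified = classify_rows(
      data,
      "question",
      "description",
      "possible_answers",
      new_column_name="category",
      MODEL_PATH="./bert_finetuned_classifier",
  )
  print(classified["category"])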

socio4health.utils.harmonizer_utils.get_classifier(MODEL_PATH: str) → Pipeline[source]#

Load the fine-tuned BERT classification model, ensuring it is loaded only once.

Parameters:

MODEL_PATH (str) – Path to the fine-tuned model weights.

Returns:

A HuggingFace pipeline for text classification.

Return type:

Pipeline
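
Example (a minimal sketch; the path assumes the fine-tuned weights are available locally):

  from socio4health.utils.harmonizer_utils import get_classifier

  # Load the classifier once and reuse it across calls
  classifier = get_classifier("./bert_finetuned_classifier")

  # The result is a standard HuggingFace text-classification pipeline
  print(classifier("Age of the respondent in completed years"))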

socio4health.utils.harmonizer_utils.standardize_dict(raw_dict: DataFrame) → DataFrame[source]#

Clean and structure a dictionary-like DataFrame of variables by standardizing text fields, grouping possible answers, and removing duplicates.

Parameters:

raw_dict (pd.DataFrame) – DataFrame containing the required columns: question, variable_name, description, value, and optionally subquestion.

Returns:

A cleaned DataFrame grouped by question and variable_name, with an additional possible_answers column containing the concatenated descriptions.

Return type:

pd.DataFrame
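
Example (a minimal sketch with a hypothetical raw dictionary; the exact formatting of possible_answers depends on the implementation):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import standardize_dict

  # Hypothetical raw dictionary with one value row per possible answer
  raw_dict = pd.DataFrame({
      "question": ["Sex of the respondent"] * 2,
      "variable_name": ["sex"] * 2,
      "description": ["Male", "Female"],
      "value": [1, 2],
  })

  clean = standardize_dict(raw_dict)
  # Expect one row per (question, variable_name), with a possible_answers
  # column concatenating the descriptions
  print(clean)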

socio4health.utils.harmonizer_utils.translate_column(data: DataFrame, column: str, language: str = 'en') → DataFrame[source]#

Translate the content of a selected column in a DataFrame using Google Translate.

Parameters:
  • data (pd.DataFrame) – The DataFrame containing the text columns.

  • column (str) – Name of the column to translate.

  • language (str) – Target language code (default is en).

Returns:

The original DataFrame with an additional column containing the translated text.

Return type:

pd.DataFrame
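
Example (a minimal sketch; the Spanish text is hypothetical, and the call requires network access to Google Translate):

  import pandas as pd

  from socio4health.utils.harmonizer_utils import translate_column

  data = pd.DataFrame({"description": ["Edad del encuestado en años cumplidos"]})

  # Translate the description column into English
  translated = translate_column(data, column="description", language="en")
  print(translated)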