
Harmonization of data#

Run the tutorial via free cloud platforms: Open In Colab

This notebook is a tutorial on processing sociodemographic and economic data from Brazilian online data sources. It assumes an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need to have the following prerequisites:

  • Python 3.10+

Additionally, you need to install the socio4health, pandas, and ipywidgets packages, which can be done using pip:

!pip install socio4health pandas ipywidgets -q

Import Libraries#

import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils
import tqdm

Extracting data from Brazil#

In this example, we will extract the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024 from the Brazilian Institute of Geography and Statistics (IBGE) website.

bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0
)

Providing the raw dictionary#

We need to provide a raw dictionary to the harmonizer that contains the column names and their corresponding data types. This is necessary for the harmonization process, as it allows the harmonizer to understand the structure of the data. To know more about how to construct the raw dictionary, you can check the documentation.

raw_dict = pd.read_excel('raw_dictionary.xlsx')
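If you do not have the Excel file at hand, note that a raw dictionary is simply a table with one row per survey variable. The sketch below is illustrative only: the column names are inferred from the standardized output shown later in this tutorial, and the authoritative schema is described in the socio4health documentation.

```python
import pandas as pd

# Hypothetical minimal raw dictionary (one variable), for illustration only.
# Column names inferred from the standardized dictionary shown below.
example_raw_dict = pd.DataFrame({
    "variable_name": ["V3002"],
    "question": ["frequenta escola?"],
    "value": ["1; 2"],
    "description": ["não aplicável"],
    "possible_answers": ["sim; não"],
})
print(example_raw_dict.shape)  # (1, 5)
```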

The raw dictionary is then standardized using the s4h_standardize_dict function, which ensures the dictionary is in a consistent format, making it easier to work with during the harmonization process.

dic = harmonizer_utils.s4h_standardize_dict(raw_dict)

Additionally, the content of the dictionary's columns can be translated into English using the s4h_translate_column function from the harmonizer_utils module. Translation is performed to facilitate understanding and processing of the data.

⚠️
Warning: s4h_translate_column may take some time depending on the size of the dictionary and the number of columns to translate. Use it only if you need the column content in English for further processing or analysis.
dic = harmonizer_utils.s4h_translate_column(dic, "question", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "description", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "possible_answers", language="en")
question translated
description translated
possible_answers translated
dic
variable_name No question value description possible_answers question_en description_en possible_answers_en
0 V3012 NaN concluiu com aprovação, pelo menos a primeira... 1; 2; 3 não aplicável concluiu; não concluiu; curso não classificado... Completed with approval, at least the first se... not applicable concluded; did not conclude; Course not classi...
1 V3014 NaN concluiu este curso que frequentou anteriormente 1; 2 não aplicável sim; não concluded this course that he attended earlier not applicable Yes; no
2 V3013B NaN concluiu os anos iniciais deste curso que fre... 1; 2 não aplicável sim; não Did you complete the early years of this cours... not applicable Yes; no
3 V4021 NaN exercia normalmente o trabalho em estabelecim... 1; 2 não aplicável sim; não Did you usually work in establishing this busi... not applicable Yes; no
4 V3002 NaN frequenta escola? 1; 2 não aplicável sim; não attend school? not applicable Yes; no
... ... ... ... ... ... ... ... ... ...
415 V405812 NaN valor em dinheiro do rendimento mensal que rec... valor em reais não aplicável r$ cash value of the monthly income that normally... not applicable r$
416 V405822 NaN valor estimado do produtos e mercadorias que r... valor em reais não aplicável r$ Estimated value of the products and goods that... not applicable r$
417 V405922 NaN valor estimado do produtos e mercadorias que r... valor em reais não aplicável r$ estimated value of products and goods that nor... not applicable r$
418 V405022 NaN valor estimado dos produtos e mercadorias que ... valor em reais não aplicável r$ estimated value of the products and goods that... not applicable r$
419 V405122 NaN valor estimado dos produtos e mercadorias que ... valor em reais não aplicável r$ estimated value of the products and goods I re... not applicable r$

420 rows × 9 columns

The s4h_classify_rows method is then used to classify the rows of the standardized dictionary based on the content of the specified columns. This classification helps in organizing the data and making it easier to work with during the harmonization process. The MODEL_PATH parameter specifies the path to a pre-trained model that is used for classification. You can provide your own model or use the default one provided in the files folder. The model is a fine-tuned BERT model for text classification. You can find more details about the model in the documentation.

dic = harmonizer_utils.s4h_classify_rows(dic, "question_en", "description_en", "possible_answers_en",
                                     new_column_name="category",
                                     MODEL_PATH="files/bert_finetuned_classifier")
Device set to use cpu

Extracting the data#

The s4h_extract method of the Extractor class is used to retrieve the data from the specified input path. It returns a list of dataframes, one per file extracted from the path.

dfs = bra_online_extractor.s4h_extract()
2025-09-09 17:08:03,445 - INFO - ----------------------
2025-09-09 17:08:03,453 - INFO - Starting data extraction...
2025-09-09 17:08:03,456 - INFO - Extracting data in online mode...
2025-09-09 17:08:03,459 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-09 17:08:13,576 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-09 17:08:13,580 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/4 [00:00<?, ?it/s]2025-09-09 17:08:27,760 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files:  25%|██▌       | 1/4 [00:14<00:42, 14.13s/it]2025-09-09 17:08:41,771 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files:  50%|█████     | 2/4 [00:28<00:28, 14.06s/it]2025-09-09 17:08:54,977 - INFO - Successfully downloaded: PNADC_032024_20250815.zip
Downloading files:  75%|███████▌  | 3/4 [00:41<00:13, 13.67s/it]2025-09-09 17:09:07,862 - INFO - Successfully downloaded: PNADC_042024_20250815.zip
Downloading files: 100%|██████████| 4/4 [00:54<00:00, 13.56s/it]
2025-09-09 17:09:07,868 - INFO - Processing (depth 0): PNADC_012024_20250815.zip
2025-09-09 17:10:00,060 - INFO - Extracted: a7db871d_PNADC_012024.txt
2025-09-09 17:10:00,107 - INFO - Processing (depth 0): PNADC_022024_20250815.zip
2025-09-09 17:10:21,652 - INFO - Extracted: 946bef7f_PNADC_022024.txt
2025-09-09 17:10:21,654 - INFO - Processing (depth 0): PNADC_032024_20250815.zip
2025-09-09 17:10:45,326 - INFO - Extracted: c8ebcb57_PNADC_032024.txt
2025-09-09 17:10:45,328 - INFO - Processing (depth 0): PNADC_042024_20250815.zip
2025-09-09 17:11:37,864 - INFO - Extracted: c1bffc2a_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [13:27<00:00, 201.91s/it]
2025-09-09 17:25:05,504 - INFO - Successfully processed 4/4 files
2025-09-09 17:25:05,516 - INFO - Extraction completed successfully.

Harmonizing the data#

First, we need to create an instance of the Harmonizer class.

har = Harmonizer()

After the dictionary is standardized and translated, it can be used to harmonize the data. For this, set the dict_df attribute of the Harmonizer instance to the standardized dictionary. This allows the harmonizer to use the information from the dictionary to process the dataframes.

har.dict_df = dic

Next, we can set the parameters for the harmonization process. The similarity_threshold parameter is used to set the threshold for the similarity among column names. The nan_threshold parameter is used to set the threshold for the number of NaN values allowed in a column. If a column has more NaN values than the specified threshold, it will be dropped from the final dataframe.

har.similarity_threshold = 0.9
har.nan_threshold = 1
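The idea behind the two thresholds can be sketched in plain pandas. This is not the socio4health implementation, just the underlying concept, assuming nan_threshold is interpreted as a fraction of missing values:

```python
import difflib
import pandas as pd

# similarity_threshold: two column names are aligned when their string
# similarity is at least the threshold (illustrative metric, not the
# library's internal one).
ratio = difflib.SequenceMatcher(None, "V1028", "V1028A").ratio()
names_match = ratio >= 0.9  # ~0.91, so these near-identical names align

# nan_threshold: columns whose NaN fraction reaches the threshold are
# dropped; with nan_threshold = 1, only all-NaN columns are removed.
toy = pd.DataFrame({"UF": ["14", "14"], "empty": [None, None]})
toy_kept = toy.loc[:, toy.isna().mean() < 1]
print(toy_kept.columns.tolist())  # ['UF']
```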

The s4h_vertical_merge method merges dataframes vertically. This means the data frames will be concatenated along the rows and aligned if their column names meet the previously set similarity threshold. The available columns can be obtained using the s4h_get_available_columns method, which returns a list of column names present in all dataframes after vertical merging.

dfs = har.s4h_vertical_merge(dfs)
available_columns = har.s4h_get_available_columns(dfs)

available_columns
Grouping DataFrames: 100%|██████████| 4/4 [00:00<00:00, 10.89it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00,  1.80it/s]
['Ano',
 'Capital',
 'Estrato',
 'RM_RIDE',
 'Trimestre',
 'UF',
 'UPA',
 'V1008',
 'V1014',
 'V1016',
 'V1022',
 'V1023',
 'V1027',
 'V1028',
 'V1028001',
 'V1028002',
 'V1028003',
 'V1028004',
 'V1028005',
 'V1028006',
 'V1028007',
 'V1028008',
 'V1028009',
 'V1028010',
 'V1028011',
 'V1028012',
 'V1028013',
 'V1028014',
 'V1028015',
 'V1028016',
 'V1028017',
 'V1028018',
 'V1028019',
 'V1028020',
 'V1028021',
 'V1028022',
 'V1028023',
 'V1028024',
 'V1028025',
 'V1028026',
 'V1028027',
 'V1028028',
 'V1028029',
 'V1028030',
 'V1028031',
 'V1028032',
 'V1028033',
 'V1028034',
 'V1028035',
 'V1028036',
 'V1028037',
 'V1028038',
 'V1028039',
 'V1028040',
 'V1028041',
 'V1028042',
 'V1028043',
 'V1028044',
 'V1028045',
 'V1028046',
 'V1028047',
 'V1028048',
 'V1028049',
 'V1028050',
 'V1028051',
 'V1028052',
 'V1028053',
 'V1028054',
 'V1028055',
 'V1028056',
 'V1028057',
 'V1028058',
 'V1028059',
 'V1028060',
 'V1028061',
 'V1028062',
 'V1028063',
 'V1028064',
 'V1028065',
 'V1028066',
 'V1028067',
 'V1028068',
 'V1028069',
 'V1028070',
 'V1028071',
 'V1028072',
 'V1028073',
 'V1028074',
 'V1028075',
 'V1028076',
 'V1028077',
 'V1028078',
 'V1028079',
 'V1028080',
 'V1028081',
 'V1028082',
 'V1028083',
 'V1028084',
 'V1028085',
 'V1028086',
 'V1028087',
 'V1028088',
 'V1028089',
 'V1028090',
 'V1028091',
 'V1028092',
 'V1028093',
 'V1028094',
 'V1028095',
 'V1028096',
 'V1028097',
 'V1028098',
 'V1028099',
 'V1028100',
 'V1028101',
 'V1028102',
 'V1028103',
 'V1028104',
 'V1028105',
 'V1028106',
 'V1028107',
 'V1028108',
 'V1028109',
 'V1028110',
 'V1028111',
 'V1028112',
 'V1028113',
 'V1028114',
 'V1028115',
 'V1028116',
 'V1028117',
 'V1028118',
 'V1028119',
 'V1028120',
 'V1028121',
 'V1028122',
 'V1028123',
 'V1028124',
 'V1028125',
 'V1028126',
 'V1028127',
 'V1028128',
 'V1028129',
 'V1028130',
 'V1028131',
 'V1028132',
 'V1028133',
 'V1028134',
 'V1028135',
 'V1028136',
 'V1028137',
 'V1028138',
 'V1028139',
 'V1028140',
 'V1028141',
 'V1028142',
 'V1028143',
 'V1028144',
 'V1028145',
 'V1028146',
 'V1028147',
 'V1028148',
 'V1028149',
 'V1028150',
 'V1028151',
 'V1028152',
 'V1028153',
 'V1028154',
 'V1028155',
 'V1028156',
 'V1028157',
 'V1028158',
 'V1028159',
 'V1028160',
 'V1028161',
 'V1028162',
 'V1028163',
 'V1028164',
 'V1028165',
 'V1028166',
 'V1028167',
 'V1028168',
 'V1028169',
 'V1028170',
 'V1028171',
 'V1028172',
 'V1028173',
 'V1028174',
 'V1028175',
 'V1028176',
 'V1028177',
 'V1028178',
 'V1028179',
 'V1028180',
 'V1028181',
 'V1028182',
 'V1028183',
 'V1028184',
 'V1028185',
 'V1028186',
 'V1028187',
 'V1028188',
 'V1028189',
 'V1028190',
 'V1028191',
 'V1028192',
 'V1028193',
 'V1028194',
 'V1028195',
 'V1028196',
 'V1028197',
 'V1028198',
 'V1028199',
 'V1028200',
 'V1029',
 'V1033',
 'V2001',
 'V2003',
 'V2005',
 'V2007',
 'V2008',
 'V20081',
 'V20082',
 'V2009',
 'V2010',
 'V3001',
 'V3002',
 'V3002A',
 'V3003',
 'V3003A',
 'V3004',
 'V3005',
 'V3005A',
 'V3006',
 'V3006A',
 'V3007',
 'V3008',
 'V3009',
 'V3009A',
 'V3010',
 'V3011',
 'V3011A',
 'V3012',
 'V3013',
 'V3013A',
 'V3013B',
 'V3014',
 'V4001',
 'V4002',
 'V4003',
 'V4004',
 'V4005',
 'V4006',
 'V4006A',
 'V4007',
 'V4008',
 'V40081',
 'V40082',
 'V40083',
 'V4009',
 'V4010',
 'V4012',
 'V40121',
 'V4013',
 'V40132',
 'V40132A',
 'V4014',
 'V4015',
 'V40151',
 'V401511',
 'V401512',
 'V4016',
 'V40161',
 'V40162',
 'V40163',
 'V4017',
 'V40171',
 'V401711',
 'V4018',
 'V40181',
 'V40182',
 'V40183',
 'V4019',
 'V4020',
 'V4021',
 'V4022',
 'V4024',
 'V4025',
 'V4026',
 'V4027',
 'V4028',
 'V4029',
 'V4032',
 'V4033',
 'V40331',
 'V403311',
 'V403312',
 'V40332',
 'V403321',
 'V403322',
 'V40333',
 'V403331',
 'V4034',
 'V40341',
 'V403411',
 'V403412',
 'V40342',
 'V403421',
 'V403422',
 'V4039',
 'V4039C',
 'V4040',
 'V40401',
 'V40402',
 'V40403',
 'V4041',
 'V4043',
 'V40431',
 'V4044',
 'V4045',
 'V4046',
 'V4047',
 'V4048',
 'V4049',
 'V4050',
 'V40501',
 'V405011',
 'V405012',
 'V40502',
 'V405021',
 'V405022',
 'V40503',
 'V405031',
 'V4051',
 'V40511',
 'V405111',
 'V405112',
 'V40512',
 'V405121',
 'V405122',
 'V4056',
 'V4056C',
 'V4057',
 'V4058',
 'V40581',
 'V405811',
 'V405812',
 'V40582',
 'V405821',
 'V405822',
 'V40583',
 'V405831',
 'V40584',
 'V4059',
 'V40591',
 'V405911',
 'V405912',
 'V40592',
 'V405921',
 'V405922',
 'V4062',
 'V4062C',
 'V4063',
 'V4063A',
 'V4064',
 'V4064A',
 'V4071',
 'V4072',
 'V4072A',
 'V4073',
 'V4074',
 'V4074A',
 'V4075A',
 'V4075A1',
 'V4076',
 'V40761',
 'V40762',
 'V40763',
 'V4077',
 'V4078',
 'V4078A',
 'V4082',
 'VD2002',
 'VD2003',
 'VD2004',
 'VD2006',
 'VD3004',
 'VD3005',
 'VD3006',
 'VD4001',
 'VD4002',
 'VD4003',
 'VD4004',
 'VD4004A',
 'VD4005',
 'VD4007',
 'VD4008',
 'VD4009',
 'VD4010',
 'VD4011',
 'VD4012',
 'VD4013',
 'VD4014',
 'VD4015',
 'VD4016',
 'VD4017',
 'VD4018',
 'VD4019',
 'VD4020',
 'VD4023',
 'VD4030',
 'VD4031',
 'VD4032',
 'VD4033',
 'VD4034',
 'VD4035',
 'VD4036',
 'VD4037',
 'filename',
 'posest',
 'posest_sxi']

To select rows from the data, we can use the s4h_data_selector method. First, we assign the categories of interest, which can be one or more of the following: Business, Educations, Fertility, Housing, Identification, Migration, Nonstandard job, Social Security. The method then selects rows based on the values in a specified column: the key_col parameter names the column used for selection, and key_val lists the values to keep. In this case, we select rows where the UF column equals 14, which corresponds to the state of Roraima, filtering the data to records from that state only.

har.categories = ["Business"]
har.key_col = 'UF'
har.key_val = ['14']
filtered_ddfs = har.s4h_data_selector(dfs)
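Conceptually, the key_col/key_val selection above is equivalent to the following pandas filter (s4h_data_selector additionally restricts the variables to the chosen categories using the classified dictionary; the toy data here is invented for illustration):

```python
import pandas as pd

# Toy data: three records, two from UF code 14.
toy = pd.DataFrame({"UF": ["14", "26", "14"], "V2009": [30, 45, 12]})

# Keep only the rows whose key column matches one of the key values.
toy_selected = toy[toy["UF"].isin(["14"])]
print(toy_selected.shape[0])  # 2
```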

Finally, we can join the filtered dataframes into a single dataframe using the s4h_join_data method. This method combines the data from the filtered dataframes into a single dataframe, aligning the columns based on their names. The resulting dataframe will contain all the columns that are present in the filtered dataframes, and it will be ready for further analysis or export as a CSV file.

joined_df = har.s4h_join_data(filtered_ddfs)
available_cols = joined_df.columns.tolist()
print(f"Available columns: {available_cols}")
print(f"Shape of the joined DataFrame: {joined_df.shape}")
print(joined_df.head())
joined_df.to_csv('data/PNADC_2024_harmonized.csv', index=False)