
Harmonization of data#
This notebook is a tutorial on how to process sociodemographic and economic data from Brazilian online data sources. It assumes an intermediate or advanced understanding of Python and data manipulation.
Setting up the environment#
To run this notebook, you need to have the following prerequisites:
Python 3.10+
Additionally, you need to install the socio4health, pandas, and ipywidgets packages, which can be done using pip:
!pip install socio4health pandas ipywidgets -q
Import Libraries#
import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils
import tqdm
Extracting data from Brazil#
In this example, we will extract the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024 from the Brazilian Institute of Geography and Statistics (IBGE) website.
bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0
)
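If you want to pull a different year or quarter, the same configuration can be reused with a different URL. The sketch below assumes the IBGE FTP keeps the same folder layout for other years, which you should verify before running; everything else is copied from the extractor above.
# Hypothetical example: the same fixed-width setup pointed at another year.
# The 2023 URL is an assumption based on the 2024 folder layout; verify it exists.
bra_online_extractor_2023 = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2023/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0
)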
Providing the raw dictionary#
We need to provide a raw dictionary to the harmonizer that contains the column names and their corresponding data types. This is necessary for the harmonization process, as it allows the harmonizer to understand the structure of the data. To know more about how to construct the raw dictionary, you can check the documentation.
raw_dict = pd.read_excel('raw_dictionary.xlsx')
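If you are building the raw dictionary yourself instead of loading it from a file, it is essentially a table with one row per variable. A minimal sketch is shown below; the column names and example variables are assumptions based on the standardized dictionary displayed later in this notebook, so check the documentation for the exact layout expected by s4h_standardize_dict.
# Hypothetical minimal raw dictionary built directly in pandas (illustrative only,
# not used in the rest of this notebook). Column names are assumed from the
# standardized output shown below; consult the documentation for the exact schema.
example_raw_dict = pd.DataFrame({
    "variable_name": ["V2007", "V2009"],
    "question": ["sexo", "idade do morador na data de referência"],
    "value": ["1; 2", "valor em anos"],
    "description": ["não aplicável", "não aplicável"],
    "possible_answers": ["homem; mulher", "idade em anos"],
})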
The raw dictionary is then standardized using the s4h_standardize_dict
method, which ensures that the dictionary is in a consistent format, making it easier to work with during the harmonization process.
dic = harmonizer_utils.s4h_standardize_dict(raw_dict)
C:\Users\isabe\PycharmProjects\socio4health\src\socio4health\utils\harmonizer_utils.py:81: FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
.apply(_process_group, include_groups=True)\
Additionally, the content of the dictionary columns can be translated into English using the s4h_translate_column function from the harmonizer_utils module. Translation is performed to facilitate understanding and processing of the data.
The s4h_translate_column function may take some time depending on the size of the dictionary and the number of columns to be translated. It is recommended to use it only if you need the content of the columns in English for further processing or analysis.
dic = harmonizer_utils.s4h_translate_column(dic, "question", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "description", language="en")
dic = harmonizer_utils.s4h_translate_column(dic, "possible_answers", language="en")
question translated
description translated
possible_answers translated
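Because translation can be slow, it can be worth saving the translated dictionary to disk once and reloading it in later runs instead of translating again. This is a plain pandas round trip; the file name below is only an example.
# Optional: cache the translated dictionary so later runs can skip translation.
dic.to_excel('standardized_dictionary_en.xlsx', index=False)
# In a later session, reload it instead of translating again:
# dic = pd.read_excel('standardized_dictionary_en.xlsx')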
dic
| | variable_name | No | question | value | description | possible_answers | question_en | description_en | possible_answers_en |
|---|---|---|---|---|---|---|---|---|---|
| 0 | V3012 | NaN | concluiu com aprovação, pelo menos a primeira... | 1; 2; 3 | não aplicável | concluiu; não concluiu; curso não classificado... | Completed with approval, at least the first se... | not applicable | concluded; did not conclude; Course not classi... |
| 1 | V3014 | NaN | concluiu este curso que frequentou anteriormente | 1; 2 | não aplicável | sim; não | concluded this course that he attended earlier | not applicable | Yes; no |
| 2 | V3013B | NaN | concluiu os anos iniciais deste curso que fre... | 1; 2 | não aplicável | sim; não | Did you complete the early years of this cours... | not applicable | Yes; no |
| 3 | V4021 | NaN | exercia normalmente o trabalho em estabelecim... | 1; 2 | não aplicável | sim; não | Did you usually work in establishing this busi... | not applicable | Yes; no |
| 4 | V3002 | NaN | frequenta escola? | 1; 2 | não aplicável | sim; não | attend school? | not applicable | Yes; no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 415 | V405812 | NaN | valor em dinheiro do rendimento mensal que rec... | valor em reais | não aplicável | r$ | cash value of the monthly income that normally... | not applicable | r$ |
| 416 | V405822 | NaN | valor estimado do produtos e mercadorias que r... | valor em reais | não aplicável | r$ | Estimated value of the products and goods that... | not applicable | r$ |
| 417 | V405922 | NaN | valor estimado do produtos e mercadorias que r... | valor em reais | não aplicável | r$ | estimated value of products and goods that nor... | not applicable | r$ |
| 418 | V405022 | NaN | valor estimado dos produtos e mercadorias que ... | valor em reais | não aplicável | r$ | estimated value of the products and goods that... | not applicable | r$ |
| 419 | V405122 | NaN | valor estimado dos produtos e mercadorias que ... | valor em reais | não aplicável | r$ | estimated value of the products and goods I re... | not applicable | r$ |
420 rows × 9 columns
The s4h_classify_rows
method is then used to classify the rows of the standardized dictionary based on the content of the specified columns. This classification helps in organizing the data and making it easier to work with during the harmonization process. The MODEL_PATH
parameter specifies the path to a pre-trained model that is used for classification. You can provide your own model or use the default one provided in the files
folder. The model is a fine-tuned BERT model for text classification. You can find more details about the model in the documentation.
dic = harmonizer_utils.s4h_classify_rows(dic, "question_en", "description_en", "possible_answers_en",
new_column_name="category",
MODEL_PATH="files/bert_finetuned_classifier")
Device set to use cpu
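After classification, it is worth checking how the dictionary rows were distributed across categories before moving on. The snippet below is a quick pandas inspection of the category column added by s4h_classify_rows.
# Quick sanity check: number of dictionary rows assigned to each category.
dic["category"].value_counts()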
Extracting the data#
The s4h_extract method of the Extractor class is used to retrieve the data from the specified input path. It returns a list of dataframes, one for each file extracted from the path.
dfs = bra_online_extractor.s4h_extract()
2025-09-09 17:08:03,445 - INFO - ----------------------
2025-09-09 17:08:03,453 - INFO - Starting data extraction...
2025-09-09 17:08:03,456 - INFO - Extracting data in online mode...
2025-09-09 17:08:03,459 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-09 17:08:13,576 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-09 17:08:13,580 - INFO - Downloading files to: ../data
Downloading files: 0%| | 0/4 [00:00<?, ?it/s]2025-09-09 17:08:27,760 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files: 25%|██▌ | 1/4 [00:14<00:42, 14.13s/it]2025-09-09 17:08:41,771 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files: 50%|█████ | 2/4 [00:28<00:28, 14.06s/it]2025-09-09 17:08:54,977 - INFO - Successfully downloaded: PNADC_032024_20250815.zip
Downloading files: 75%|███████▌ | 3/4 [00:41<00:13, 13.67s/it]2025-09-09 17:09:07,862 - INFO - Successfully downloaded: PNADC_042024_20250815.zip
Downloading files: 100%|██████████| 4/4 [00:54<00:00, 13.56s/it]
2025-09-09 17:09:07,868 - INFO - Processing (depth 0): PNADC_012024_20250815.zip
2025-09-09 17:10:00,060 - INFO - Extracted: a7db871d_PNADC_012024.txt
2025-09-09 17:10:00,107 - INFO - Processing (depth 0): PNADC_022024_20250815.zip
2025-09-09 17:10:21,652 - INFO - Extracted: 946bef7f_PNADC_022024.txt
2025-09-09 17:10:21,654 - INFO - Processing (depth 0): PNADC_032024_20250815.zip
2025-09-09 17:10:45,326 - INFO - Extracted: c8ebcb57_PNADC_032024.txt
2025-09-09 17:10:45,328 - INFO - Processing (depth 0): PNADC_042024_20250815.zip
2025-09-09 17:11:37,864 - INFO - Extracted: c1bffc2a_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [13:27<00:00, 201.91s/it]
2025-09-09 17:25:05,504 - INFO - Successfully processed 4/4 files
2025-09-09 17:25:05,516 - INFO - Extraction completed successfully.
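Before harmonizing, a quick look at what came back can catch download or parsing problems early. Depending on the backend, the returned dataframes may be lazily loaded, so the sketch below only inspects metadata that does not force a full computation.
# Each entry in dfs corresponds to one extracted file.
print(f"Number of extracted dataframes: {len(dfs)}")
for df in dfs:
    print(f"{len(df.columns)} columns")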
Harmonizing the data#
First, we need to create an instance of the Harmonizer
class.
har = Harmonizer()
After the dictionary is standardized and translated, it can be used to harmonize the data. For this, set the dict_df
attribute of the Harmonizer
instance to the standardized dictionary. This allows the harmonizer to use the information from the dictionary to process the dataframes.
har.dict_df = dic
Next, we can set the parameters for the harmonization process. The similarity_threshold
parameter is used to set the threshold for the similarity among column names. The nan_threshold
parameter is used to set the threshold for the number of NaN
values allowed in a column. If a column has more NaN
values than the specified threshold, it will be dropped from the final dataframe.
har.similarity_threshold = 0.9
har.nan_threshold = 1
The s4h_vertical_merge
method merges dataframes vertically. This means the data frames will be concatenated along the rows and aligned if their column names meet the previously set similarity threshold. The available columns can be obtained using the s4h_get_available_columns
method, which returns a list of column names present in all dataframes after vertical merging.
dfs = har.s4h_vertical_merge(dfs)
available_columns = har.s4h_get_available_columns(dfs)
available_columns
Grouping DataFrames: 100%|██████████| 4/4 [00:00<00:00, 10.89it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00, 1.80it/s]
['Ano',
'Capital',
'Estrato',
'RM_RIDE',
'Trimestre',
'UF',
'UPA',
'V1008',
'V1014',
'V1016',
'V1022',
'V1023',
'V1027',
'V1028',
'V1028001',
'V1028002',
'V1028003',
'V1028004',
'V1028005',
'V1028006',
'V1028007',
'V1028008',
'V1028009',
'V1028010',
'V1028011',
'V1028012',
'V1028013',
'V1028014',
'V1028015',
'V1028016',
'V1028017',
'V1028018',
'V1028019',
'V1028020',
'V1028021',
'V1028022',
'V1028023',
'V1028024',
'V1028025',
'V1028026',
'V1028027',
'V1028028',
'V1028029',
'V1028030',
'V1028031',
'V1028032',
'V1028033',
'V1028034',
'V1028035',
'V1028036',
'V1028037',
'V1028038',
'V1028039',
'V1028040',
'V1028041',
'V1028042',
'V1028043',
'V1028044',
'V1028045',
'V1028046',
'V1028047',
'V1028048',
'V1028049',
'V1028050',
'V1028051',
'V1028052',
'V1028053',
'V1028054',
'V1028055',
'V1028056',
'V1028057',
'V1028058',
'V1028059',
'V1028060',
'V1028061',
'V1028062',
'V1028063',
'V1028064',
'V1028065',
'V1028066',
'V1028067',
'V1028068',
'V1028069',
'V1028070',
'V1028071',
'V1028072',
'V1028073',
'V1028074',
'V1028075',
'V1028076',
'V1028077',
'V1028078',
'V1028079',
'V1028080',
'V1028081',
'V1028082',
'V1028083',
'V1028084',
'V1028085',
'V1028086',
'V1028087',
'V1028088',
'V1028089',
'V1028090',
'V1028091',
'V1028092',
'V1028093',
'V1028094',
'V1028095',
'V1028096',
'V1028097',
'V1028098',
'V1028099',
'V1028100',
'V1028101',
'V1028102',
'V1028103',
'V1028104',
'V1028105',
'V1028106',
'V1028107',
'V1028108',
'V1028109',
'V1028110',
'V1028111',
'V1028112',
'V1028113',
'V1028114',
'V1028115',
'V1028116',
'V1028117',
'V1028118',
'V1028119',
'V1028120',
'V1028121',
'V1028122',
'V1028123',
'V1028124',
'V1028125',
'V1028126',
'V1028127',
'V1028128',
'V1028129',
'V1028130',
'V1028131',
'V1028132',
'V1028133',
'V1028134',
'V1028135',
'V1028136',
'V1028137',
'V1028138',
'V1028139',
'V1028140',
'V1028141',
'V1028142',
'V1028143',
'V1028144',
'V1028145',
'V1028146',
'V1028147',
'V1028148',
'V1028149',
'V1028150',
'V1028151',
'V1028152',
'V1028153',
'V1028154',
'V1028155',
'V1028156',
'V1028157',
'V1028158',
'V1028159',
'V1028160',
'V1028161',
'V1028162',
'V1028163',
'V1028164',
'V1028165',
'V1028166',
'V1028167',
'V1028168',
'V1028169',
'V1028170',
'V1028171',
'V1028172',
'V1028173',
'V1028174',
'V1028175',
'V1028176',
'V1028177',
'V1028178',
'V1028179',
'V1028180',
'V1028181',
'V1028182',
'V1028183',
'V1028184',
'V1028185',
'V1028186',
'V1028187',
'V1028188',
'V1028189',
'V1028190',
'V1028191',
'V1028192',
'V1028193',
'V1028194',
'V1028195',
'V1028196',
'V1028197',
'V1028198',
'V1028199',
'V1028200',
'V1029',
'V1033',
'V2001',
'V2003',
'V2005',
'V2007',
'V2008',
'V20081',
'V20082',
'V2009',
'V2010',
'V3001',
'V3002',
'V3002A',
'V3003',
'V3003A',
'V3004',
'V3005',
'V3005A',
'V3006',
'V3006A',
'V3007',
'V3008',
'V3009',
'V3009A',
'V3010',
'V3011',
'V3011A',
'V3012',
'V3013',
'V3013A',
'V3013B',
'V3014',
'V4001',
'V4002',
'V4003',
'V4004',
'V4005',
'V4006',
'V4006A',
'V4007',
'V4008',
'V40081',
'V40082',
'V40083',
'V4009',
'V4010',
'V4012',
'V40121',
'V4013',
'V40132',
'V40132A',
'V4014',
'V4015',
'V40151',
'V401511',
'V401512',
'V4016',
'V40161',
'V40162',
'V40163',
'V4017',
'V40171',
'V401711',
'V4018',
'V40181',
'V40182',
'V40183',
'V4019',
'V4020',
'V4021',
'V4022',
'V4024',
'V4025',
'V4026',
'V4027',
'V4028',
'V4029',
'V4032',
'V4033',
'V40331',
'V403311',
'V403312',
'V40332',
'V403321',
'V403322',
'V40333',
'V403331',
'V4034',
'V40341',
'V403411',
'V403412',
'V40342',
'V403421',
'V403422',
'V4039',
'V4039C',
'V4040',
'V40401',
'V40402',
'V40403',
'V4041',
'V4043',
'V40431',
'V4044',
'V4045',
'V4046',
'V4047',
'V4048',
'V4049',
'V4050',
'V40501',
'V405011',
'V405012',
'V40502',
'V405021',
'V405022',
'V40503',
'V405031',
'V4051',
'V40511',
'V405111',
'V405112',
'V40512',
'V405121',
'V405122',
'V4056',
'V4056C',
'V4057',
'V4058',
'V40581',
'V405811',
'V405812',
'V40582',
'V405821',
'V405822',
'V40583',
'V405831',
'V40584',
'V4059',
'V40591',
'V405911',
'V405912',
'V40592',
'V405921',
'V405922',
'V4062',
'V4062C',
'V4063',
'V4063A',
'V4064',
'V4064A',
'V4071',
'V4072',
'V4072A',
'V4073',
'V4074',
'V4074A',
'V4075A',
'V4075A1',
'V4076',
'V40761',
'V40762',
'V40763',
'V4077',
'V4078',
'V4078A',
'V4082',
'VD2002',
'VD2003',
'VD2004',
'VD2006',
'VD3004',
'VD3005',
'VD3006',
'VD4001',
'VD4002',
'VD4003',
'VD4004',
'VD4004A',
'VD4005',
'VD4007',
'VD4008',
'VD4009',
'VD4010',
'VD4011',
'VD4012',
'VD4013',
'VD4014',
'VD4015',
'VD4016',
'VD4017',
'VD4018',
'VD4019',
'VD4020',
'VD4023',
'VD4030',
'VD4031',
'VD4032',
'VD4033',
'VD4034',
'VD4035',
'VD4036',
'VD4037',
'filename',
'posest',
'posest_sxi']
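If you already know which survey variables you need, you can confirm that they survived the vertical merge before going further. The snippet below is plain Python over the available_columns list; the variable codes are only examples.
# Example check with a few illustrative PNADC variable codes.
variables_of_interest = ['V2007', 'V2009', 'VD4016']
missing = [v for v in variables_of_interest if v not in available_columns]
print(f"Missing variables: {missing}")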
For the selection of rows from the data, we can use the s4h_data_selector method. To use this method, we first need to assign the categories of interest, which can be one or more of the following: Business, Educations, Fertility, Housing, Identification, Migration, Nonstandard job, Social Security. This method allows us to select specific rows from the data based on the values in a specified column. The key_col parameter specifies the column to be used for selection, and the key_val parameter specifies the values to be selected. In this case, we will select rows where the value in the UF column is equal to 14, which corresponds to the state of Roraima. This allows us to filter the data to include only records from this specific state.
har.categories = ["Business"]
har.key_col = 'UF'
har.key_val = ['14']
filtered_ddfs = har.s4h_data_selector(dfs)
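If you later want a different slice, the same attributes can simply be reassigned before calling s4h_data_selector again. The variation below is kept commented out so it does not overwrite the selection above; the category mix is arbitrary and UF code 26 corresponds to Pernambuco.
# Example variation (commented out): several categories and a different state.
# har.categories = ["Business", "Housing"]
# har.key_col = 'UF'
# har.key_val = ['26']
# filtered_ddfs = har.s4h_data_selector(dfs)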
Finally, we can join the filtered dataframes into a single dataframe using the s4h_join_data
method. This method combines the data from the filtered dataframes into a single dataframe, aligning the columns based on their names. The resulting dataframe will contain all the columns that are present in the filtered dataframes, and it will be ready for further analysis or export as a CSV
file.
joined_df = har.s4h_join_data(filtered_ddfs)
available_cols = joined_df.columns.tolist()
print(f"Available columns: {available_cols}")
print(f"Shape of the joined DataFrame: {joined_df.shape}")
print(joined_df.head())
joined_df.to_csv('data/PNADC_2024_harmonized.csv', index=False)
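As a final check, the exported file can be read back with pandas to confirm it was written as expected; the path assumes the export call above.
# Reload the exported CSV to verify the harmonized output on disk.
check_df = pd.read_csv('data/PNADC_2024_harmonized.csv')
print(check_df.shape)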