
Harmonization of data#


This notebook is a tutorial on processing sociodemographic and economic data from Brazilian online data sources. It assumes an intermediate or advanced understanding of Python and data manipulation.

Setting up the environment#

To run this notebook, you need to have the following prerequisites:

  • Python 3.10+

Additionally, you need to install the socio4health and pandas packages (the command below also installs ipywidgets), which can be done using pip:

!pip install socio4health pandas ipywidgets -q
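To confirm that the interpreter running the notebook meets the Python 3.10+ prerequisite, a check along these lines can be used. This is a minimal sketch using only the standard library; the helper name `meets_minimum` is hypothetical and not part of socio4health.

```python
import sys


def meets_minimum(version_info, minimum=(3, 10)):
    """Return True if the (major, minor, ...) tuple is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# Report a problem instead of failing later with a less obvious error.
if not meets_minimum(sys.version_info):
    print(f"Python 3.10+ is required, found {sys.version.split()[0]}")
```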

Import Libraries#

import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils
import tqdm as tqdm

Extracting data from Brazil#

In this example, we will extract the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024 from the Brazilian Institute of Geography and Statistics (IBGE) website.

bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0,
)
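The is_fwf, colnames, and colspecs arguments describe a fixed-width file, in which each column occupies the same range of character positions on every line. To illustrate the idea, the sketch below parses a tiny made-up fixed-width sample with pandas.read_fwf. The column names, positions, and values are hypothetical and are not the real PNADC layout, and this is not necessarily how Extractor reads the files internally.

```python
import io

import pandas as pd

# Hypothetical fixed-width sample: year (4 chars), quarter (1 char), state code (2 chars).
sample = io.StringIO("2024111\n2024235\n")

# Half-open [start, end) character ranges, one per column (assumed layout).
colspecs = [(0, 4), (4, 5), (5, 7)]
colnames = ["year", "quarter", "state_code"]

# header=None because the sample has no header row; names supplies the labels.
df = pd.read_fwf(sample, colspecs=colspecs, names=colnames, header=None)
print(df)
```

In the tutorial itself, BraColnamesEnum.PNADC.value and BraColspecsEnum.PNADC.value supply the names and positions, so you do not need to write them by hand.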

Providing the raw dictionary#

We need to provide the harmonizer with a raw dictionary that contains the column names and their corresponding data types. The harmonizer needs this dictionary to understand the structure of the data. For details on how to construct the raw dictionary, see the documentation.

raw_dict = pd.read_excel('raw_dictionary.xlsx')
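pd.read_excel raises FileNotFoundError when raw_dictionary.xlsx is not in the working directory. A minimal sketch of a guard that fails early with a more descriptive message follows. The helper name `load_raw_dictionary` is hypothetical, and reading .xlsx files assumes an Excel engine such as openpyxl is installed.

```python
from pathlib import Path

import pandas as pd


def load_raw_dictionary(path="raw_dictionary.xlsx"):
    """Read the raw dictionary, failing early with a clear message if it is missing."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(
            f"Raw dictionary not found at {file_path.resolve()}; "
            "place the file there or pass its full path."
        )
    return pd.read_excel(file_path)
```

With this helper, the loading step becomes `raw_dict = load_raw_dictionary("raw_dictionary.xlsx")`.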

The raw dictionary is then standardized with the standardize_dict function from the harmonizer_utils module, which puts the dictionary into a consistent format that is easier to work with during harmonization.

dic = harmonizer_utils.standardize_dict(raw_dict)

Additionally, the content of the dictionary's columns can be translated into English using the translate_column function from the harmonizer_utils module. Translation makes the data easier to understand and process.

⚠️ Warning: translate_column may take some time, depending on the size of the dictionary and the number of columns to translate. Use it only if you need the column content in English for further processing or analysis.
dic = harmonizer_utils.translate_column(dic, "question", language="en")
dic = harmonizer_utils.translate_column(dic, "description", language="en")
dic = harmonizer_utils.translate_column(dic, "possible_answers", language="en")
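When a column is translated one cell at a time, repeated values cost repeated requests. One way to reduce the work is to translate each distinct value once and map the results back onto the column. The sketch below illustrates that idea and is not part of harmonizer_utils: `translate_unique` is a hypothetical helper, and `translate_fn` stands in for whatever translation call you use (the stand-in here simply upper-cases text so the example runs offline).

```python
import pandas as pd


def translate_unique(series, translate_fn):
    """Translate each distinct non-null value once, then map back onto the series."""
    unique_values = series.dropna().unique()
    lookup = {value: translate_fn(value) for value in unique_values}
    # Values absent from the lookup (missing entries) stay missing after map.
    return series.map(lookup)


calls = []


def fake_translate(text):
    # Offline stand-in for a real translation call; records each invocation.
    calls.append(text)
    return text.upper()


questions = pd.Series(["sexo", "idade", "sexo", None, "idade"])
translated = translate_unique(questions, fake_translate)
print(translated.tolist())
print(f"{len(calls)} translation calls for {len(questions)} rows")
```

Survey dictionaries often repeat answer options such as yes/no lists across many variables, so the saving can be noticeable, though the actual benefit depends on how repetitive your dictionary is.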

The classify_rows function then classifies the rows of the standardized dictionary based on the content of the specified columns, which helps organize the data for harmonization. The MODEL_PATH parameter specifies the path to the pre-trained model used for classification, a fine-tuned BERT model for text classification. You can provide your own model or use the default one in the files folder. More details about the model are available in the documentation.

dic = harmonizer_utils.classify_rows(dic, "question_en", "description_en", "possible_answers_en",
                                     new_column_name="category",
                                     MODEL_PATH="files/bert_finetuned_classifier")
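classify_rows raises a ValueError if any of the three input columns is missing from the DataFrame. This happens, for example, when a translation step did not finish and the *_en columns were never created. A small check before the call, sketched here with a hypothetical helper `missing_columns` and a made-up example dictionary, makes that condition explicit.

```python
import pandas as pd


def missing_columns(data, required):
    """Return the names in `required` that are not columns of `data`."""
    return [name for name in required if name not in data.columns]


# Hypothetical dictionary where only one column has been translated so far.
dic_example = pd.DataFrame({"question": ["Sexo"], "question_en": ["Sex"]})
required = ["question_en", "description_en", "possible_answers_en"]

missing = missing_columns(dic_example, required)
if missing:
    print(f"Translate these columns before classifying: {missing}")
```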

Extracting the data#

The extract method of the Extractor class is used to retrieve the data from the specified input path. It returns a list of dataframes, one for each file extracted from the path.

dfs = bra_online_extractor.extract()
2025-08-11 11:20:47,090 - INFO - ----------------------
2025-08-11 11:20:47,091 - INFO - Starting data extraction...
2025-08-11 11:20:47,092 - INFO - Extracting data in online mode...
2025-08-11 11:20:47,094 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-08-11 11:20:48,941 - INFO - Successfully saved links to Output_scrap.json.
2025-08-11 11:20:49,089 - INFO - Downloading files to: ../data
Downloading files:   0%|          | 0/4 [00:00<?, ?it/s]2025-08-11 11:23:18,669 - INFO - Successfully downloaded: PNADC_012024.zip
Downloading files:  25%|██▌       | 1/4 [02:29<07:29, 149.90s/it]2025-08-11 11:23:31,982 - INFO - Successfully downloaded: PNADC_022024.zip
Downloading files:  50%|█████     | 2/4 [02:42<02:18, 69.36s/it] 2025-08-11 11:23:45,347 - INFO - Successfully downloaded: PNADC_032024.zip
Downloading files:  75%|███████▌  | 3/4 [02:56<00:43, 43.79s/it]2025-08-11 11:27:34,408 - INFO - Successfully downloaded: PNADC_042024.zip
Downloading files: 100%|██████████| 4/4 [06:45<00:00, 101.36s/it]
2025-08-11 11:27:34,585 - INFO - Processing (depth 0): PNADC_012024.zip
2025-08-11 11:27:48,576 - INFO - Extracted: 527fc860_PNADC_012024.txt
2025-08-11 11:27:48,594 - INFO - Processing (depth 0): PNADC_022024.zip
2025-08-11 11:28:10,944 - INFO - Extracted: 59b8bc43_PNADC_022024.txt
2025-08-11 11:28:10,949 - INFO - Processing (depth 0): PNADC_032024.zip
2025-08-11 11:28:38,720 - INFO - Extracted: 6703e676_PNADC_032024.txt
2025-08-11 11:28:38,752 - INFO - Processing (depth 0): PNADC_042024.zip
2025-08-11 11:29:05,214 - INFO - Extracted: fbbfc8d2_PNADC_042024.txt
Processing files:   0%|          | 0/4 [00:00<?, ?it/s]2025-08-11 11:29:09,412 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/527fc860_PNADC_012024.txt
Processing files:  25%|██▌       | 1/4 [04:09<12:29, 249.68s/it]2025-08-11 11:33:15,339 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/59b8bc43_PNADC_022024.txt
Processing files:  50%|█████     | 2/4 [06:50<06:35, 197.57s/it]2025-08-11 11:35:56,430 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/6703e676_PNADC_032024.txt
Processing files:  75%|███████▌  | 3/4 [09:17<02:54, 174.40s/it]2025-08-11 11:38:23,261 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,490 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,490 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,490 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,490 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,490 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,491 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,491 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:38:23,492 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:09,084 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:09,621 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:19,081 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:19,103 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:20,163 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:20,793 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:21,181 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:39:25,673 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:40:52,590 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:40:53,385 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:02,250 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:02,827 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:04,508 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:04,787 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:09,654 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:16,589 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
2025-08-11 11:41:29,174 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [13:14<00:00, 198.71s/it]
2025-08-11 11:42:20,495 - INFO - Successfully processed 4/4 files
2025-08-11 11:42:20,503 - INFO - Extraction completed successfully.

Harmonizing the data#

First, we need to create an instance of the Harmonizer class.

har = Harmonizer()

After the dictionary is standardized and translated, it can be used to harmonize the data. For this, set the dict_df attribute of the Harmonizer instance to the standardized dictionary. This allows the harmonizer to use the information from the dictionary to process the dataframes.

har.dict_df = dic

Next, we can set the parameters for the harmonization process. The similarity_threshold parameter sets the minimum similarity that two column names must reach to be treated as the same column. The nan_threshold parameter sets the limit on NaN values allowed in a column: a column with more NaN values than this threshold is dropped from the final dataframe.

har.similarity_threshold = 0.9
har.nan_threshold = 1
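
To build intuition for what these two thresholds control, the sketch below uses only the standard library and pandas. It is not socio4health's implementation: the name_similarity and drop_sparse_columns helpers are hypothetical, the similarity measure (difflib's ratio) is an assumption, and the NaN threshold is interpreted here as a maximum fraction of missing values per column.

```python
from difflib import SequenceMatcher

import pandas as pd


def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] for two column names (hypothetical helper)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_sparse_columns(df: pd.DataFrame, nan_threshold: float) -> pd.DataFrame:
    """Keep columns whose NaN fraction does not exceed nan_threshold (assumed semantics)."""
    nan_fraction = df.isna().mean()
    return df.loc[:, nan_fraction <= nan_threshold]


# Made-up data: 'age' is 25% missing, 'income' is 75% missing.
df = pd.DataFrame({"age": [31, None, 45, 52], "income": [None, None, None, 1200.0]})

print(name_similarity("household_id", "household_id"))  # identical names score 1.0
print(name_similarity("household_id", "region") >= 0.9)  # dissimilar names stay below 0.9
print(drop_sparse_columns(df, nan_threshold=0.5).columns.tolist())  # only 'age' is kept
print(drop_sparse_columns(df, nan_threshold=1).columns.tolist())  # both columns are kept
```

Under this interpretation, a nan_threshold of 1 never drops a column, since no column can be more than 100% missing.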

The vertical_merge method merges dataframes vertically: the dataframes are concatenated along the rows, and their columns are aligned when the column names meet the previously set similarity threshold. The available columns can be obtained using the get_available_columns method, which returns a list of column names present in all dataframes after vertical merging.

dfs = har.vertical_merge(dfs)
available_columns = har.get_available_columns(dfs)
2025-08-11 11:42:49,949 - WARNING - C:\Users\isabe\PycharmProjects\socio4health\.venv\Lib\site-packages\tqdm\std.py:580: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
  if rate and total else datetime.utcfromtimestamp(0))

Grouping DataFrames: 100%|██████████| 4/4 [00:00<00:00, 80.62it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00,  6.23it/s]
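
As a rough illustration of what a vertical merge does, the following sketch stacks two toy dataframes with plain pandas. It matches columns by exact name only, whereas vertical_merge also applies the similarity threshold, and the "common columns" computation is our reading of what get_available_columns reports, not the library's code. All data and variable names are made up.

```python
import pandas as pd

# Two toy quarterly extracts with partly overlapping columns (made-up data).
q1 = pd.DataFrame({"year": [2024, 2024], "quarter": [1, 1], "age": [34, 58]})
q2 = pd.DataFrame({"year": [2024], "quarter": [2], "income": [1800.0]})

# Row-wise concatenation: columns are aligned by name and missing cells become NaN.
stacked = pd.concat([q1, q2], ignore_index=True)

# Columns present in every input dataframe, in the order of the first one.
common = [col for col in q1.columns if all(col in df.columns for df in (q1, q2))]

print(stacked.shape)
print(common)
```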

To select rows from the data, we can use the data_selector method. First, assign the categories of interest, which can be one or more of the following: Business, Educations, Fertility, Housing, Identification, Migration, Nonstandard job, Social Security. The method then selects rows based on the values in a specified column: the key_col parameter specifies the column used for selection, and the key_val parameter specifies the values to keep. The key column must exist in the merged dataframes. The cell below uses DPTO with the value 25 and raises a KeyError because the PNADC dataframes have no DPTO column. To select the state of São Paulo, use the state-code column listed in available_columns together with São Paulo's IBGE state code, 35.

har.categories = ["Business"]
har.key_col = 'DPTO'
har.key_val = ['25']
filtered_ddfs = har.data_selector(dfs)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[15], line 4
      2 har.key_col = 'DPTO'
      3 har.key_val = ['25']
----> 4 filtered_ddfs = har.data_selector(dfs)

File ~\PycharmProjects\socio4health\src\socio4health\harmonizer.py:590, in Harmonizer.data_selector(self, ddfs)
    588 for ddf in ddfs:
    589     if self.key_col not in ddf.columns:
--> 590         raise KeyError(f"Key column '{self.key_col}' not found in DataFrame")
    592     filtered_ddf = ddf[ddf[self.key_col].isin(self.key_val)]
    593     if len(filtered_ddf) == 0:

KeyError: "Key column 'DPTO' not found in DataFrame"
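
A small check before filtering makes this kind of error easier to diagnose. The sketch below is a hypothetical helper written with plain pandas, not part of socio4health; the UF column name and the toy rows are assumptions used only for illustration.

```python
import pandas as pd


def select_rows(df: pd.DataFrame, key_col: str, key_vals: list[str]) -> pd.DataFrame:
    """Filter rows by key column values, failing early with the available column names."""
    if key_col not in df.columns:
        raise KeyError(
            f"Key column '{key_col}' not found; available columns: {df.columns.tolist()}"
        )
    # Compare as strings so that '35' matches both the string '35' and the integer 35.
    return df[df[key_col].astype(str).isin(key_vals)]


# Made-up rows; 'UF' stands in for whatever state-code column your data contains.
df = pd.DataFrame({"UF": ["35", "33", "35"], "age": [34, 58, 41]})

print(len(select_rows(df, "UF", ["35"])))
try:
    select_rows(df, "DPTO", ["35"])
except KeyError as err:
    print(err)
```

Listing the available columns in the error message points directly at the name to use for key_col.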

Finally, we can join the filtered dataframes into a single dataframe using the join_data method, which aligns the columns based on their names. The resulting dataframe contains all the columns present in the filtered dataframes and is ready for further analysis or export as a CSV file. This step needs the filtered_ddfs produced by data_selector, so if the previous cell failed, it raises the NameError shown below.

joined_df = har.join_data(filtered_ddfs)
available_cols = joined_df.columns.tolist()
print(f"Available columns: {available_cols}")
print(f"Shape of the joined DataFrame: {joined_df.shape}")
print(joined_df.head())
joined_df.to_csv('data/PNADC_2024_harmonized.csv', index=False)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 joined_df = har.join_data(filtered_ddfs)
      2 available_cols = joined_df.columns.tolist()
      3 print(f"Available columns: {available_cols}")

NameError: name 'filtered_ddfs' is not defined
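
To see the join-and-export step on its own, independent of the cells above, here is a minimal sketch with toy data. It assumes that joining amounts to a row-wise pandas concatenation over the union of columns, which may differ from what join_data does internally; the file name and data are made up, and the CSV is written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-ins for the filtered dataframes (made-up data).
filtered = [
    pd.DataFrame({"id": [1, 2], "age": [34, 58]}),
    pd.DataFrame({"id": [3], "income": [1800.0]}),
]

# Alignment on column names: the result holds the union of all columns.
joined = pd.concat(filtered, ignore_index=True, sort=False)

# Write to a temporary directory here; in a real run, point this at your data folder.
out_path = Path(tempfile.mkdtemp()) / "harmonized_example.csv"
joined.to_csv(out_path, index=False)

print(joined.shape)
print(pd.read_csv(out_path).columns.tolist())
```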