
Harmonization of data#
This notebook is a tutorial on processing sociodemographic and economic data from Brazilian online data sources. It assumes an intermediate or advanced understanding of Python and data manipulation.
Setting up the environment#
To run this notebook, you need to have the following prerequisites:
Python 3.10+
Additionally, you need to install the socio4health and pandas packages, which can be done using pip (the command below also installs ipywidgets):
!pip install socio4health pandas ipywidgets -q
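The prerequisites above can be checked programmatically. The following is a minimal sketch using only the standard library; the helper name `check_environment` is an illustration and is not part of socio4health.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_environment(packages=("socio4health", "pandas")):
    """Report whether Python is recent enough and which packages are installed."""
    report = {"python_ok": sys.version_info >= MIN_PYTHON}
    for name in packages:
        try:
            # Installed version string, e.g. "2.2.3"
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None  # package is not installed
    return report


print(check_environment())
```

A value of `None` for a package means the pip command still needs to be run in the active environment.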
Import Libraries#
import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils
import tqdm as tqdm
Extracting data from Brazil#
In this example, we will extract the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024 from the Brazilian Institute of Geography and Statistics (IBGE) website.
bra_online_extractor = Extractor(
    input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/",
    down_ext=['.txt', '.zip'],
    is_fwf=True,
    colnames=BraColnamesEnum.PNADC.value,
    colspecs=BraColspecsEnum.PNADC.value,
    output_path="../data",
    depth=0,
)
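The `is_fwf`, `colnames`, and `colspecs` arguments indicate that the PNADC microdata are fixed-width text files, where each field occupies a fixed range of character positions. As a rough illustration of that idea (not of how the Extractor works internally), the sketch below parses a made-up three-field layout with `pandas.read_fwf`. The sample rows, column names, and offsets are assumptions for demonstration; the real layout comes from `BraColnamesEnum.PNADC` and `BraColspecsEnum.PNADC`.

```python
import io

import pandas as pd

# Hypothetical layout: 4 characters for the year, 1 for the quarter, 2 for a state code.
sample = "2024135\n2024233\n"
colnames = ["year", "quarter", "state_code"]
colspecs = [(0, 4), (4, 5), (5, 7)]  # half-open [start, end) character offsets

# header=None because fixed-width data files carry no header row.
df = pd.read_fwf(io.StringIO(sample), colspecs=colspecs, names=colnames, header=None)
print(df)
```

Each tuple in `colspecs` pairs with the name at the same position in `colnames`, so the two lists must have the same length.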
Providing the raw dictionary#
We need to provide the harmonizer with a raw dictionary that contains the column names and their corresponding data types. The harmonizer uses it to understand the structure of the data. To learn more about how to construct the raw dictionary, see the documentation.
raw_dict = pd.read_excel('raw_dictionary.xlsx')
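`pd.read_excel` raises a `FileNotFoundError` when `raw_dictionary.xlsx` is not in the notebook's working directory. The sketch below wraps the call in a small helper that fails early with a more descriptive message. The function name `load_raw_dictionary` is hypothetical and not part of socio4health.

```python
from pathlib import Path

import pandas as pd


def load_raw_dictionary(path="raw_dictionary.xlsx"):
    """Read the raw dictionary, failing early with a descriptive message."""
    file = Path(path)
    if not file.is_file():
        # Show the resolved absolute path so the expected location is unambiguous.
        raise FileNotFoundError(
            f"Raw dictionary not found at '{file.resolve()}'. "
            "Place the file there or pass its actual location."
        )
    return pd.read_excel(file)
```

With the file in place, `raw_dict = load_raw_dictionary()` behaves like the original `pd.read_excel` call.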
The raw dictionary is then standardized using the standardize_dict function, which ensures that the dictionary is in a consistent format, making it easier to work with during the harmonization process.
dic = harmonizer_utils.standardize_dict(raw_dict)
Additionally, the content of the dictionary's columns can be translated into English using the translate_column function from the harmonizer_utils module. Translation is performed to facilitate the understanding and processing of the data.
translate_column may take some time, depending on the size of the dictionary and the number of columns to be translated. Use it only if you need the column contents in English for further processing or analysis.
dic = harmonizer_utils.translate_column(dic, "question", language="en")
dic = harmonizer_utils.translate_column(dic, "description", language="en")
dic = harmonizer_utils.translate_column(dic, "possible_answers", language="en")
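Because translation involves one network request per value, a common way to reduce the waiting time is to translate each distinct value only once and map the results back to the rows. The sketch below illustrates this pattern under the assumption that translation is otherwise applied row by row. `translate_unique` and `fake_translate` are hypothetical names, and `fake_translate` is a stand-in that upper-cases text instead of calling a translation service.

```python
import pandas as pd


def translate_unique(series, translate):
    """Translate each distinct non-null value once and map the results back."""
    unique_values = series.dropna().unique()
    mapping = {value: translate(value) for value in unique_values}
    # Values without an entry in the mapping (missing values) stay missing.
    return series.map(mapping)


calls = []  # records every call so the saving is visible


def fake_translate(text):
    """Stand-in translator for demonstration only."""
    calls.append(text)
    return text.upper()


questions = pd.Series(["sexo", "idade", "sexo", None, "idade"])
translated = translate_unique(questions, fake_translate)
print(translated.tolist())
print(f"translator called {len(calls)} times for {len(questions)} rows")
```

The saving depends on how many repeated values a column contains, so it is likely to matter most for columns such as possible_answers.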
The classify_rows function is then used to classify the rows of the standardized dictionary based on the content of the specified columns. This classification helps organize the data and makes it easier to work with during the harmonization process. The MODEL_PATH parameter specifies the path to a pre-trained model that is used for classification. You can provide your own model or use the default one provided in the files folder. The model is a fine-tuned BERT model for text classification. You can find more details about the model in the documentation.
dic = harmonizer_utils.classify_rows(dic, "question_en", "description_en", "possible_answers_en",
new_column_name="category",
MODEL_PATH="files/bert_finetuned_classifier")
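classify_rows expects the translated columns (question_en, description_en, possible_answers_en) to exist, so it fails if the translation step was skipped or interrupted. A minimal sketch of a pre-check is shown below. The helper `missing_columns` and the toy DataFrame are illustrative assumptions, not part of socio4health.

```python
import pandas as pd


def missing_columns(data, required):
    """Return the required column names that are absent from the DataFrame."""
    return [col for col in required if col not in data.columns]


# Toy dictionary in which the translation step has not been run.
dic_example = pd.DataFrame({"question": ["Sexo"], "description": ["Sexo do morador"]})
required = ["question_en", "description_en", "possible_answers_en"]

absent = missing_columns(dic_example, required)
if absent:
    print(f"Run the translation step first; missing columns: {absent}")
```

Running the same check on the real `dic` before calling classify_rows shows at a glance which translation calls still need to complete.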
Extracting the data#
The extract method of the Extractor class is used to retrieve the data from the specified input path. It returns a list of dataframes, each corresponding to a file extracted from the path.
dfs = bra_online_extractor.extract()
2025-08-11 11:20:47,090 - INFO - ----------------------
2025-08-11 11:20:47,091 - INFO - Starting data extraction...
2025-08-11 11:20:47,092 - INFO - Extracting data in online mode...
2025-08-11 11:20:47,094 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-08-11 11:20:48,941 - INFO - Successfully saved links to Output_scrap.json.
2025-08-11 11:20:49,089 - INFO - Downloading files to: ../data
Downloading files: 0%| | 0/4 [00:00<?, ?it/s]2025-08-11 11:23:18,669 - INFO - Successfully downloaded: PNADC_012024.zip
Downloading files: 25%|██▌ | 1/4 [02:29<07:29, 149.90s/it]2025-08-11 11:23:31,982 - INFO - Successfully downloaded: PNADC_022024.zip
Downloading files: 50%|█████ | 2/4 [02:42<02:18, 69.36s/it] 2025-08-11 11:23:45,347 - INFO - Successfully downloaded: PNADC_032024.zip
Downloading files: 75%|███████▌ | 3/4 [02:56<00:43, 43.79s/it]2025-08-11 11:27:34,408 - INFO - Successfully downloaded: PNADC_042024.zip
Downloading files: 100%|██████████| 4/4 [06:45<00:00, 101.36s/it]
2025-08-11 11:27:34,585 - INFO - Processing (depth 0): PNADC_012024.zip
2025-08-11 11:27:48,576 - INFO - Extracted: 527fc860_PNADC_012024.txt
2025-08-11 11:27:48,594 - INFO - Processing (depth 0): PNADC_022024.zip
2025-08-11 11:28:10,944 - INFO - Extracted: 59b8bc43_PNADC_022024.txt
2025-08-11 11:28:10,949 - INFO - Processing (depth 0): PNADC_032024.zip
2025-08-11 11:28:38,720 - INFO - Extracted: 6703e676_PNADC_032024.txt
2025-08-11 11:28:38,752 - INFO - Processing (depth 0): PNADC_042024.zip
2025-08-11 11:29:05,214 - INFO - Extracted: fbbfc8d2_PNADC_042024.txt
Processing files: 0%| | 0/4 [00:00<?, ?it/s]2025-08-11 11:29:09,412 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/527fc860_PNADC_012024.txt
Processing files: 25%|██▌ | 1/4 [04:09<12:29, 249.68s/it]2025-08-11 11:33:15,339 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/59b8bc43_PNADC_022024.txt
Processing files: 50%|█████ | 2/4 [06:50<06:35, 197.57s/it]2025-08-11 11:35:56,430 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/6703e676_PNADC_032024.txt
Processing files: 75%|███████▌ | 3/4 [09:17<02:54, 174.40s/it]2025-08-11 11:38:23,261 - DEBUG - open file: C:/Users/isabe/PycharmProjects/socio4health/docs/source/notebooks/../data/fbbfc8d2_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [13:14<00:00, 198.71s/it]
2025-08-11 11:42:20,495 - INFO - Successfully processed 4/4 files
2025-08-11 11:42:20,503 - INFO - Extraction completed successfully.
Harmonizing the data#
First, we need to create an instance of the Harmonizer class.
har = Harmonizer()
After the dictionary is standardized and translated, it can be used to harmonize the data. To do this, set the dict_df attribute of the Harmonizer instance to the standardized dictionary. This allows the harmonizer to use the information from the dictionary to process the dataframes.
har.dict_df = dic
Next, we can set the parameters for the harmonization process. The similarity_threshold parameter sets the minimum similarity that two column names must reach to be treated as the same column. The nan_threshold parameter sets the threshold for NaN values allowed in a column: a column with more NaN values than the threshold is dropped from the final dataframe.
har.similarity_threshold = 0.9
har.nan_threshold = 1
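To build intuition for these two settings, the sketch below mimics them with the standard library and pandas. It is not the Harmonizer's implementation: the use of difflib.SequenceMatcher as the similarity measure, the reading of nan_threshold as a fraction of missing values, the helper names name_similarity and drop_sparse_columns, and the toy column names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def name_similarity(a: str, b: str) -> float:
    """Similarity of two column names in [0, 1] (illustrative measure only)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def drop_sparse_columns(df: pd.DataFrame, nan_threshold: float) -> pd.DataFrame:
    """Keep only columns whose fraction of NaN values is at most nan_threshold."""
    nan_fraction = df.isna().mean()
    return df.loc[:, nan_fraction <= nan_threshold]


# Toy frame: NaN fractions are 0.0, 0.25 and 0.75 respectively.
toy = pd.DataFrame({
    "UF": [35, 35, 33, 31],
    "V2007": [1, 2, np.nan, 2],
    "V9999": [np.nan, np.nan, np.nan, 1],
})

print(name_similarity("V2007", "v2007"))
print(drop_sparse_columns(toy, nan_threshold=0.5).columns.tolist())
print(drop_sparse_columns(toy, nan_threshold=1).columns.tolist())
```

Under this reading, a nan_threshold of 1 keeps every column, while lower values progressively discard sparse ones.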
The vertical_merge method merges dataframes vertically. This means the dataframes are concatenated along the rows, and their columns are aligned if the column names meet the previously set similarity threshold. The available columns can be obtained using the get_available_columns method, which returns a list of column names present in all dataframes after vertical merging.
dfs = har.vertical_merge(dfs)
available_columns = har.get_available_columns(dfs)
2025-08-11 11:42:49,949 - WARNING - C:\Users\isabe\PycharmProjects\socio4health\.venv\Lib\site-packages\tqdm\std.py:580: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
if rate and total else datetime.utcfromtimestamp(0))
Grouping DataFrames: 100%|██████████| 4/4 [00:00<00:00, 80.62it/s]
Merging groups: 100%|██████████| 1/1 [00:00<00:00, 6.23it/s]
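As a rough picture of what a vertical merge does, the sketch below stacks two toy dataframes with plain pandas and lists the columns they share. It assumes exact column-name matches and does not reproduce the similarity-based alignment of vertical_merge; the frames q1 and q2 and their columns are made up for illustration.

```python
import pandas as pd

# Two toy "quarters" that share some, but not all, columns.
q1 = pd.DataFrame({"UF": [35, 33], "V2007": [1, 2], "V2009": [40, 27]})
q2 = pd.DataFrame({"UF": [35, 31], "V2007": [2, 2], "V2010": [1, 4]})

# Vertical merge: stack rows; columns missing from a frame are filled with NaN.
stacked = pd.concat([q1, q2], ignore_index=True, sort=False)

# Columns present in every input frame, kept in the order of the first frame.
common = [c for c in q1.columns if all(c in df.columns for df in (q1, q2))]

print(stacked.shape)
print(common)
```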
To select rows from the data, we can use the data_selector method. To use this method, we first need to assign the categories of interest, which can be one or more of the following: Business, Educations, Fertility, Housing, Identification, Migration, Nonstandard job, Social Security. The method then selects rows based on the values in a specified column: the key_col parameter specifies the column used for selection, and the key_val parameter specifies the values to keep. The key column must exist in the merged dataframes; otherwise data_selector raises a KeyError. The cell below sets key_col to DPTO and key_val to 25, a column and code that are not part of the PNADC data, so it fails with the KeyError shown. In the raw PNADC microdata the state is identified by the UF variable, in which São Paulo has the IBGE code 35, so check available_columns for the corresponding column name before choosing key_col and key_val.
har.categories = ["Business"]
har.key_col = 'DPTO'
har.key_val = ['25']
filtered_ddfs = har.data_selector(dfs)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[15], line 4
2 har.key_col = 'DPTO'
3 har.key_val = ['25']
----> 4 filtered_ddfs = har.data_selector(dfs)
File ~\PycharmProjects\socio4health\src\socio4health\harmonizer.py:590, in Harmonizer.data_selector(self, ddfs)
588 for ddf in ddfs:
589 if self.key_col not in ddf.columns:
--> 590 raise KeyError(f"Key column '{self.key_col}' not found in DataFrame")
592 filtered_ddf = ddf[ddf[self.key_col].isin(self.key_val)]
593 if len(filtered_ddf) == 0:
KeyError: "Key column 'DPTO' not found in DataFrame"
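The check that raises this KeyError can be reproduced in plain pandas. The sketch below mirrors the guard and the isin filter visible in the traceback; select_rows and the toy frame are hypothetical and not part of socio4health.

```python
import pandas as pd


def select_rows(df: pd.DataFrame, key_col: str, key_val: list) -> pd.DataFrame:
    """Return rows whose key_col value is in key_val; fail early if the column is absent."""
    if key_col not in df.columns:
        raise KeyError(
            f"Key column '{key_col}' not found; available columns: {list(df.columns)}"
        )
    return df[df[key_col].isin(key_val)]


# Toy frame with the state code stored as strings.
toy = pd.DataFrame({"UF": ["35", "33", "35"], "V2007": [1, 2, 2]})

print(len(select_rows(toy, "UF", ["35"])))

try:
    select_rows(toy, "DPTO", ["25"])
except KeyError as err:
    print(err)
```

Note that isin compares values as stored: if the key column holds integers, the string '35' will match nothing, so the values in key_val should have the same type as the column.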
Finally, we can join the filtered dataframes into a single dataframe using the join_data method. This method combines the data from the filtered dataframes, aligning the columns based on their names. The resulting dataframe contains all the columns present in the filtered dataframes and is ready for further analysis or export as a CSV file.
joined_df = har.join_data(filtered_ddfs)
available_cols = joined_df.columns.tolist()
print(f"Available columns: {available_cols}")
print(f"Shape of the joined DataFrame: {joined_df.shape}")
print(joined_df.head())
joined_df.to_csv('data/PNADC_2024_harmonized.csv', index=False)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[16], line 1
----> 1 joined_df = har.join_data(filtered_ddfs)
2 available_cols = joined_df.columns.tolist()
3 print(f"Available columns: {available_cols}")
NameError: name 'filtered_ddfs' is not defined
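One practical detail of the export step is that pandas does not create missing folders when writing a CSV file. The sketch below is a minimal pandas-only illustration, assuming toy frames and a temporary directory in place of the tutorial's data folder; it uses pd.concat as a stand-in and does not call join_data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins for the filtered dataframes.
frames = [
    pd.DataFrame({"UF": [35], "V2007": [1]}),
    pd.DataFrame({"UF": [35], "V2007": [2]}),
]
combined = pd.concat(frames, ignore_index=True)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "example_harmonized.csv"
    # Create the target folder first; to_csv fails if it does not exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    combined.to_csv(out_path, index=False)
    # Read the file back to confirm what was written.
    round_trip = pd.read_csv(out_path)

print(round_trip.shape)
```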