
Extraction of Colombia, Brazil and Peru online data#
This notebook introduces how to retrieve data from online sources through web scraping, using household survey data from Colombia, Brazil, and Peru as examples. It assumes an intermediate or advanced understanding of Python and data manipulation.
Setting up the environment#
To run this notebook, you need the following prerequisite:
Python 3.10+
Additionally, you need to install the socio4health and pandas packages, which can be done using pip:
!pip install socio4health pandas -q
Import Libraries#
The socio4health library provides the Extractor class for data extraction and the Harmonizer class for harmonizing the retrieved data. We will also use pandas for data manipulation.
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
Use case 1: Extracting data from Colombia#
To extract data from Colombia, we will use the Extractor class from the socio4health library. The Extractor class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey - GEIH - 2022 (Gran Encuesta Integrada de Hogares - GEIH - 2022) dataset from the website of the Colombian National Administrative Department of Statistics (DANE).
In this example, the Extractor class is configured with the following parameters:
input_path: The URL or local path to the data source.
down_ext: A list of file extensions to download. This can include .CSV, .csv, .zip, etc.
sep: The separator used in the data files (e.g., ; for semicolon-separated values).
output_path: The local path where the extracted data will be saved.
depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.
col_online_extractor = Extractor(input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata", down_ext=['.CSV','.csv','.zip'], sep=';', output_path="../data", depth=0)
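The down_ext list above names both .CSV and .csv. The sketch below illustrates why listing both can matter: it filters a list of file names by extension, assuming the comparison is case-sensitive. Whether socio4health compares extensions this way is an assumption, and matches_extension is a hypothetical helper, not part of the library.

```python
def matches_extension(name: str, extensions: list[str]) -> bool:
    """Return True if name ends with one of the given extensions (case-sensitive)."""
    return name.endswith(tuple(extensions))


# Illustrative file names in the style of the GEIH archives (not real output).
members = ["Ocupados.CSV", "Fuerza de trabajo.csv", "Ocupados.dta", "Ocupados.sav"]

# With both spellings listed, both CSV files are kept.
wanted = [m for m in members if matches_extension(m, [".CSV", ".csv", ".zip"])]

# With only the lower-case spelling, the upper-case .CSV file is missed.
only_lower = [m for m in members if matches_extension(m, [".csv"])]

print(wanted)
print(only_lower)
```

Under this assumption, listing each spelling explicitly is the simplest way to avoid silently skipping files.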
After the instance is set up, we can call the s4h_extract method to download and extract the data. The method returns a list of pandas DataFrames containing the extracted data.
col_dfs = col_online_extractor.s4h_extract()
2025-09-24 12:24:17,077 - INFO - ----------------------
2025-09-24 12:24:17,078 - INFO - Starting data extraction...
2025-09-24 12:24:17,079 - INFO - Extracting data in online mode...
2025-09-24 12:24:17,080 - INFO - Scraping URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata with depth 0
2025-09-24 12:24:21,753 - INFO - Spider completed successfully for URL: https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata
2025-09-24 12:24:21,755 - INFO - Downloading files to: ../data
Downloading files: 0%| | 0/12 [00:00<?, ?it/s]2025-09-24 12:24:25,589 - INFO - Successfully downloaded: GEIH_Enero_2022_Marco_2018.zip
Downloading files: 8%|▊ | 1/12 [00:03<00:41, 3.81s/it]2025-09-24 12:24:29,405 - INFO - Successfully downloaded: GEIH_Febrero_2022_Marco_2018.zip
Downloading files: 17%|█▋ | 2/12 [00:07<00:38, 3.81s/it]2025-09-24 12:24:32,439 - INFO - Successfully downloaded: GEIH_Marzo_2022_Marco_2018.zip
Downloading files: 25%|██▌ | 3/12 [00:10<00:31, 3.46s/it]2025-09-24 12:24:34,187 - INFO - Successfully downloaded: GEIH_Mayo_2022_Marco_2018.zip
Downloading files: 33%|███▎ | 4/12 [00:12<00:22, 2.78s/it]2025-09-24 12:24:37,230 - INFO - Successfully downloaded: GEIH_Junio_2022_Marco_2018.zip
Downloading files: 42%|████▏ | 5/12 [00:15<00:20, 2.88s/it]2025-09-24 12:24:40,109 - INFO - Successfully downloaded: GEIH_Julio_2022_Marco_2018.zip
Downloading files: 50%|█████ | 6/12 [00:18<00:17, 2.88s/it]2025-09-24 12:24:44,680 - INFO - Successfully downloaded: GEIH_Agosto_2022_Marco_2018.zip
Downloading files: 58%|█████▊ | 7/12 [00:22<00:17, 3.43s/it]2025-09-24 12:24:48,759 - INFO - Successfully downloaded: GEIH_Septiembre_Marco_2018.zip
Downloading files: 67%|██████▋ | 8/12 [00:26<00:14, 3.64s/it]2025-09-24 12:24:51,119 - INFO - Successfully downloaded: GEIH_Octubre_Marco_2018.zip
Downloading files: 75%|███████▌ | 9/12 [00:29<00:09, 3.24s/it]2025-09-24 12:24:54,440 - INFO - Successfully downloaded: GEIH_Diciembre_2022_Marco_2018.zip
Downloading files: 83%|████████▎ | 10/12 [00:32<00:06, 3.26s/it]2025-09-24 12:24:59,539 - INFO - Successfully downloaded: GEIH_Abril_2022_Marco_2018_Act.zip
Downloading files: 92%|█████████▏| 11/12 [00:37<00:03, 3.83s/it]2025-09-24 12:25:02,078 - INFO - Successfully downloaded: GEIH_Noviembre_2022_Marco_2018.act.zip
Downloading files: 100%|██████████| 12/12 [00:40<00:00, 3.36s/it]
2025-09-24 12:25:02,087 - INFO - Processing (depth 0): GEIH_Enero_2022_Marco_2018.zip
2025-09-24 12:25:14,100 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.csv
2025-09-24 12:25:14,118 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:14,139 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Fuerza de trabajo.csv
2025-09-24 12:25:14,160 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:14,177 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_No ocupados.csv
2025-09-24 12:25:14,210 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Ocupados.csv
2025-09-24 12:25:14,237 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otras formas de trabajo.csv
2025-09-24 12:25:14,260 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ingresos e impuestos.csv
2025-09-24 12:25:14,281 - INFO - Extracted: c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Tipo de investigación.csv
2025-09-24 12:25:15,073 - INFO - Processing (depth 0): GEIH_Febrero_2022_Marco_2018.zip
2025-09-24 12:25:54,279 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:25:54,289 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:25:54,316 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:25:54,337 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:25:54,351 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:25:54,367 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:25:54,384 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:25:54,401 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:25:54,419 - INFO - Extracted: fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:25:54,627 - INFO - Processing (depth 0): GEIH_Marzo_2022_Marco_2018.zip
2025-09-24 12:26:06,808 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:06,821 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:06,840 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:06,860 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Migración.CSV
2025-09-24 12:26:06,871 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_No ocupados.CSV
2025-09-24 12:26:06,890 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Ocupados.CSV
2025-09-24 12:26:06,908 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:06,934 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:06,951 - INFO - Extracted: 0e57b75b_GEIH_Marzo_2022_Marco_2018_csv_Tipo de investigación.CSV
2025-09-24 12:26:07,163 - INFO - Processing (depth 0): GEIH_Mayo_2022_Marco_2018.zip
2025-09-24 12:26:07,573 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:26:08,009 - INFO - Extracted: abcbb904_csv_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:08,011 - INFO - Extracted: abcbb904_csv_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:08,013 - INFO - Extracted: abcbb904_csv_Fuerza de trabajo.CSV
2025-09-24 12:26:08,015 - INFO - Extracted: abcbb904_csv_Migración.CSV
2025-09-24 12:26:08,017 - INFO - Extracted: abcbb904_csv_No ocupados.CSV
2025-09-24 12:26:08,021 - INFO - Extracted: abcbb904_csv_Ocupados.CSV
2025-09-24 12:26:08,024 - INFO - Extracted: abcbb904_csv_Otras formas de trabajo.CSV
2025-09-24 12:26:08,028 - INFO - Extracted: abcbb904_csv_Otros ingresos e impuestos.CSV
2025-09-24 12:26:08,032 - INFO - Extracted: abcbb904_csv_Tipo de investigación.CSV
2025-09-24 12:26:08,036 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:26:13,712 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:26:13,713 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:13,714 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:26:19,948 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:26:19,949 - WARNING - No files found matching the specified extensions.
2025-09-24 12:26:19,955 - INFO - Processing (depth 0): GEIH_Junio_2022_Marco_2018.zip
2025-09-24 12:26:31,636 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:26:31,648 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:26:31,662 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Fuerza de trabajo.CSV
2025-09-24 12:26:31,678 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Migración.CSV
2025-09-24 12:26:31,688 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_No ocupados.CSV
2025-09-24 12:26:31,706 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Ocupados.CSV
2025-09-24 12:26:31,727 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otras formas de trabajo.CSV
2025-09-24 12:26:31,744 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:26:31,757 - INFO - Extracted: 37710ed1_GEIH_Junio_2022_Marco_2018_CSV(1)_CSV_Tipo de investigación.CSV
2025-09-24 12:26:31,990 - INFO - Processing (depth 0): GEIH_Julio_2022_Marco_2018.zip
2025-09-24 12:27:09,303 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:09,318 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:09,337 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:09,362 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:09,373 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:09,392 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:09,411 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:09,428 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:09,442 - INFO - Extracted: 7a9c1851_GEIH_Julio_2022_Marco_2018_GEIH_Julio_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:09,665 - INFO - Processing (depth 0): GEIH_Agosto_2022_Marco_2018.zip
2025-09-24 12:27:21,343 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:21,356 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:21,376 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:21,399 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:27:21,409 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:27:21,430 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:27:21,447 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:21,464 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:21,477 - INFO - Extracted: c80488e3_GEIH_Agosto_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:27:21,659 - INFO - Processing (depth 0): GEIH_Septiembre_Marco_2018.zip
2025-09-24 12:27:21,901 - INFO - Processing (depth 1): CSV.zip
2025-09-24 12:27:22,348 - INFO - Extracted: 25ad86f4_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:22,351 - INFO - Extracted: 25ad86f4_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:22,354 - INFO - Extracted: 25ad86f4_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:22,356 - INFO - Extracted: 25ad86f4_CSV_Migración.CSV
2025-09-24 12:27:22,359 - INFO - Extracted: 25ad86f4_CSV_No ocupados.CSV
2025-09-24 12:27:22,362 - INFO - Extracted: 25ad86f4_CSV_Ocupados.CSV
2025-09-24 12:27:22,367 - INFO - Extracted: 25ad86f4_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:22,370 - INFO - Extracted: 25ad86f4_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:22,373 - INFO - Extracted: 25ad86f4_CSV_Tipo de investigación.CSV
2025-09-24 12:27:22,376 - INFO - Processing (depth 1): DTA.zip
2025-09-24 12:27:28,035 - WARNING - No matches in DTA.zip. Contents:
2025-09-24 12:27:28,036 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:28,037 - INFO - Processing (depth 1): SAV.zip
2025-09-24 12:27:33,899 - WARNING - No matches in SAV.zip. Contents:
2025-09-24 12:27:33,900 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:33,908 - INFO - Processing (depth 0): GEIH_Octubre_Marco_2018.zip
2025-09-24 12:27:34,090 - INFO - Processing (depth 1): CSV 4.zip
2025-09-24 12:27:34,294 - INFO - Extracted: 4a4be0fc_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:34,297 - INFO - Extracted: 4a4be0fc_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:34,300 - INFO - Extracted: 4a4be0fc_CSV_Fuerza de trabajo.CSV
2025-09-24 12:27:34,303 - INFO - Extracted: 4a4be0fc_CSV_Migración.CSV
2025-09-24 12:27:34,307 - INFO - Extracted: 4a4be0fc_CSV_No ocupados.CSV
2025-09-24 12:27:34,312 - INFO - Extracted: 4a4be0fc_CSV_Ocupados.CSV
2025-09-24 12:27:34,316 - INFO - Extracted: 4a4be0fc_CSV_Otras formas de trabajo.CSV
2025-09-24 12:27:34,319 - INFO - Extracted: 4a4be0fc_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:27:34,323 - INFO - Extracted: 4a4be0fc_CSV_Tipo de investigación.CSV
2025-09-24 12:27:34,326 - INFO - Processing (depth 1): DTA 4.zip
2025-09-24 12:27:39,914 - WARNING - No matches in DTA 4.zip. Contents:
2025-09-24 12:27:39,915 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:39,916 - INFO - Processing (depth 1): SAV 4.zip
2025-09-24 12:27:45,730 - WARNING - No matches in SAV 4.zip. Contents:
2025-09-24 12:27:45,731 - WARNING - No files found matching the specified extensions.
2025-09-24 12:27:45,739 - INFO - Processing (depth 0): GEIH_Diciembre_2022_Marco_2018.zip
2025-09-24 12:27:57,984 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:27:57,994 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Datos del hogar y la vivienda.CSV
2025-09-24 12:27:58,009 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Fuerza de trabajo.CSV
2025-09-24 12:27:58,022 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Migración.CSV
2025-09-24 12:27:58,049 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_No ocupados.CSV
2025-09-24 12:27:58,082 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Ocupados.CSV
2025-09-24 12:27:58,108 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otras formas de trabajo.CSV
2025-09-24 12:27:58,130 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Otros ingresos e impuestos.CSV
2025-09-24 12:27:58,143 - INFO - Extracted: 5881e2ed_GEIH_Diciembre_2022_Marco_2018_CVS_Tipo de investigación.CSV
2025-09-24 12:27:58,370 - INFO - Processing (depth 0): GEIH_Abril_2022_Marco_2018_Act.zip
2025-09-24 12:28:10,618 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:10,632 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:10,650 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:10,672 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Migración.CSV
2025-09-24 12:28:10,686 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_No ocupados.CSV
2025-09-24 12:28:10,706 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Ocupados.CSV
2025-09-24 12:28:10,726 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:10,743 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:10,755 - INFO - Extracted: 4079475d_GEIH_Abril_2022_Marco_2018_CSV_Tipo de investigación.CSV
2025-09-24 12:28:10,978 - INFO - Processing (depth 0): GEIH_Noviembre_2022_Marco_2018.act.zip
2025-09-24 12:28:22,841 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Características generales, seguridad social en salud y educación.CSV
2025-09-24 12:28:22,851 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Datos del hogar y la vivienda.CSV
2025-09-24 12:28:22,866 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Fuerza de trabajo.CSV
2025-09-24 12:28:22,882 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Migración.CSV
2025-09-24 12:28:22,892 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_No ocupados.CSV
2025-09-24 12:28:22,913 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Ocupados.CSV
2025-09-24 12:28:22,956 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otras formas de trabajo.CSV
2025-09-24 12:28:22,978 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Otros ingresos e impuestos.CSV
2025-09-24 12:28:22,993 - INFO - Extracted: 0d034efb_GEIH_Noviembre_2022_Marco_2018_CSV 5_CSV_Tipo de investigación.CSV
Processing files: 100%|██████████| 108/108 [00:28<00:00, 3.78it/s]
2025-09-24 12:28:51,874 - INFO - Successfully processed 108/108 files
2025-09-24 12:28:51,875 - INFO - Extraction completed successfully.
col_dfs[0].head()
PERIODO | DIRECTORIO | SECUENCIA_P | ORDEN | HOGAR | P7495 | P7500S1 | P7500S1A1 | P7500S2 | P7500S2A1 | ... | P3371S1 | P3371S2 | P3371S3 | P3371S4 | P3372 | P3372S1 | FEX_C18 | PER | REGIS | filename | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20220104 | 5000000 | 1 | 1 | 1 | 2 | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
1 | 20220104 | 5000000 | 1 | 2 | 1 | 2 | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
2 | 20220104 | 5000000 | 1 | 6 | 1 | 2 | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | 1432.4633227 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
3 | 20220104 | 5000001 | 1 | 1 | 1 | 2 | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | 1088.7962663 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
4 | 20220104 | 5000001 | 1 | 2 | 1 | 2 | <NA> | <NA> | <NA> | <NA> | ... | <NA> | <NA> | <NA> | <NA> | 2 | <NA> | 1088.7962663 | 2022 | 90 | c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Otros ... |
5 rows × 60 columns
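Because the extraction yields one DataFrame per file, the twelve monthly archives produce several DataFrames for the same survey module. A minimal sketch of stacking them by module is shown below. It assumes that the filename column holds the full extracted file name seen in the log and that the module name is the text after the last underscore; module_name and concat_by_module are hypothetical helpers, and the toy DataFrames stand in for col_dfs.

```python
from collections import defaultdict
from pathlib import Path

import pandas as pd


def module_name(filename: str) -> str:
    """Return the text after the last underscore, without the file extension."""
    return Path(filename).stem.rsplit("_", 1)[-1]


def concat_by_module(dfs):
    """Group non-empty DataFrames by module name and stack each group."""
    groups = defaultdict(list)
    for df in dfs:
        if df.empty:
            continue  # nothing to read a filename from
        groups[module_name(df["filename"].iloc[0])].append(df)
    return {name: pd.concat(parts, ignore_index=True) for name, parts in groups.items()}


# Toy stand-ins for the extracted DataFrames (values are illustrative).
toy = [
    pd.DataFrame({"DIRECTORIO": [1, 2],
                  "filename": "c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Ocupados.csv"}),
    pd.DataFrame({"DIRECTORIO": [3],
                  "filename": "fb8de51c_GEIH_Febrero_2022_Marco_2018_CSV_Ocupados.CSV"}),
    pd.DataFrame({"DIRECTORIO": [4],
                  "filename": "c8bca186_GEIH_Enero_2022_Marco_2018_CSV_Migración.CSV"}),
]

by_module = concat_by_module(toy)
print({name: len(df) for name, df in by_module.items()})
```

Keeping the filename column in the stacked result preserves which month each row came from.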
Use case 2: Extracting data from Brazil#
We download the Brazilian data from the Brazilian Institute of Geography and Statistics (IBGE) website using the Extractor class. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024.
The is_fwf parameter is set to True, which indicates that the data files are in fixed-width format. For fixed-width files, the colnames and colspecs parameters must be provided. In this example, they are set to the corresponding available enums for PNADC data, which define the column names and specifications for the dataset.
See more details in the socio4health.enums.data_info_enum documentation.
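To make the role of column names and column specifications concrete, here is a minimal sketch of reading fixed-width text with pandas.read_fwf. The three-column layout is invented for illustration and is not the PNADC layout; whether the Extractor uses read_fwf internally, and whether the enum values are (start, end) pairs as here, are assumptions.

```python
import io

import pandas as pd

# Toy fixed-width records: 4 characters for the year, 1 for the quarter,
# 2 for the state code. There are no delimiters between fields.
raw = "2024111\n2024235\n"

colnames = ["Ano", "Trimestre", "UF"]   # illustrative subset of names
colspecs = [(0, 4), (4, 5), (5, 7)]     # half-open [start, end) character ranges

# dtype=str keeps codes such as "01" from losing leading zeros.
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=colspecs,
    names=colnames,
    header=None,
    dtype=str,
)
print(df)
```

Each colspecs entry tells the parser where a field starts and ends on the line, which is why the names and specifications have to be supplied together.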
bra_online_extractor = Extractor(input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/", down_ext=['.txt','.zip'], is_fwf=True, colnames=BraColnamesEnum.PNADC.value, colspecs=BraColspecsEnum.PNADC.value, output_path="../data", depth=0)
bra_dfs = bra_online_extractor.s4h_extract()
2025-09-24 12:30:08,711 - INFO - ----------------------
2025-09-24 12:30:08,713 - INFO - Starting data extraction...
2025-09-24 12:30:08,713 - INFO - Extracting data in online mode...
2025-09-24 12:30:08,715 - INFO - Scraping URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/ with depth 0
2025-09-24 12:30:13,261 - INFO - Spider completed successfully for URL: https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/
2025-09-24 12:30:13,264 - INFO - Downloading files to: ../data
Downloading files: 0%| | 0/4 [00:00<?, ?it/s]2025-09-24 12:30:28,678 - INFO - Successfully downloaded: PNADC_012024_20250815.zip
Downloading files: 25%|██▌ | 1/4 [00:15<00:46, 15.41s/it]2025-09-24 12:41:31,509 - INFO - Successfully downloaded: PNADC_022024_20250815.zip
Downloading files: 50%|█████ | 2/4 [11:18<13:12, 396.25s/it]2025-09-24 12:41:57,057 - INFO - Successfully downloaded: PNADC_032024_20250815.zip
Downloading files: 75%|███████▌ | 3/4 [11:43<03:46, 226.98s/it]2025-09-24 12:42:33,947 - INFO - Successfully downloaded: PNADC_042024_20250815.zip
Downloading files: 100%|██████████| 4/4 [12:20<00:00, 185.17s/it]
2025-09-24 12:42:33,956 - INFO - Processing (depth 0): PNADC_012024_20250815.zip
2025-09-24 12:42:49,933 - INFO - Extracted: a7db871d_PNADC_012024.txt
2025-09-24 12:42:49,942 - INFO - Processing (depth 0): PNADC_022024_20250815.zip
2025-09-24 12:43:32,038 - INFO - Extracted: 946bef7f_PNADC_022024.txt
2025-09-24 12:43:32,041 - INFO - Processing (depth 0): PNADC_032024_20250815.zip
2025-09-24 12:44:02,915 - INFO - Extracted: c8ebcb57_PNADC_032024.txt
2025-09-24 12:44:02,916 - INFO - Processing (depth 0): PNADC_042024_20250815.zip
2025-09-24 12:44:16,725 - INFO - Extracted: c1bffc2a_PNADC_042024.txt
Processing files: 100%|██████████| 4/4 [04:57<00:00, 74.45s/it]
2025-09-24 12:49:14,550 - INFO - Successfully processed 4/4 files
2025-09-24 12:49:14,554 - INFO - Extraction completed successfully.
bra_dfs[0].head()
Ano | Trimestre | UF | Capital | RM_RIDE | UPA | Estrato | V1008 | V1014 | V1016 | ... | V1028192 | V1028193 | V1028194 | V1028195 | V1028196 | V1028197 | V1028198 | V1028199 | V1028200 | filename | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2024 | 1 | 11 | 11 | <NA> | 110000016 | 1110011 | 03 | 11 | 1 | ... | 000242.37393247 | 000000.00000000 | 000000.00000000 | 000132.86482247 | 000252.85458864 | 000271.03799675 | 000122.61081652 | 000125.78602243 | 000113.09511303 | a7db871d_PNADC_012024.txt |
1 | 2024 | 1 | 11 | 11 | <NA> | 110000016 | 1110011 | 06 | 11 | 1 | ... | 000405.66107457 | 000000.00000000 | 000000.00000000 | 000205.06572241 | 000410.23613176 | 000437.83686366 | 000190.08927267 | 000200.15696949 | 000182.15329508 | a7db871d_PNADC_012024.txt |
2 | 2024 | 1 | 11 | 11 | <NA> | 110000016 | 1110011 | 06 | 11 | 1 | ... | 000405.66107457 | 000000.00000000 | 000000.00000000 | 000205.06572241 | 000410.23613176 | 000437.83686366 | 000190.08927267 | 000200.15696949 | 000182.15329508 | a7db871d_PNADC_012024.txt |
3 | 2024 | 1 | 11 | 11 | <NA> | 110000016 | 1110011 | 08 | 11 | 1 | ... | 000485.38386591 | 000000.00000000 | 000000.00000000 | 000242.53160028 | 000474.75504741 | 000520.88948037 | 000223.17316781 | 000229.81795045 | 000213.22589782 | a7db871d_PNADC_012024.txt |
4 | 2024 | 1 | 11 | 11 | <NA> | 110000016 | 1110011 | 08 | 11 | 1 | ... | 000485.38386591 | 000000.00000000 | 000000.00000000 | 000242.53160028 | 000474.75504741 | 000520.88948037 | 000223.17316781 | 000229.81795045 | 000213.22589782 | a7db871d_PNADC_012024.txt |
5 rows × 421 columns
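The preview shows values such as 000242.37393247 and 03 stored as zero-padded text. A minimal sketch of converting only the weight-like columns to numbers, while leaving codes as text, could look like the following. The toy DataFrame stands in for bra_dfs[0], and selecting columns by the V1028 prefix is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in for a few columns of the PNADC preview.
toy = pd.DataFrame({
    "UF": ["11", "11"],
    "V1014": ["03", "06"],
    "V1028192": ["000242.37393247", "000405.66107457"],
})

# Pick the columns to convert; codes such as V1014 stay as text.
weight_cols = [c for c in toy.columns if c.startswith("V1028")]

for col in weight_cols:
    # errors="coerce" turns unparseable values into NaN instead of raising.
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)
```

Converting selectively avoids turning categorical codes like 03 into the integer 3.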
Use case 3: Extracting data from Peru#
Peruvian data is extracted from the National Institute of Statistics and Informatics (INEI) website. In this case, we are extracting the National Household Survey (ENAHO) for the year 2022. The down_ext parameter is set to download .csv and .zip files. Unlike the Colombian example, no sep argument is passed in the call below, so the Extractor relies on its default separator handling.
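When it is unclear which separator a delimited file uses, one simple check is to count candidate characters in the header line before choosing a value for sep. The sketch below does this on a toy semicolon-separated sample; guess_separator is a hypothetical helper, the sample is not taken from the ENAHO files, and this is not a description of how socio4health detects separators.

```python
import io

import pandas as pd


def guess_separator(header_line: str, candidates=(";", ",", "\t", "|")) -> str:
    """Pick the candidate character that occurs most often in the header line."""
    return max(candidates, key=header_line.count)


# Toy sample using column names from the preview style (values are illustrative).
text = "AÑO;MES;UBIGEO\n2022;01;010201\n2022;01;010201\n"

sep = guess_separator(text.splitlines()[0])

# dtype=str preserves leading zeros in codes such as MES and UBIGEO.
df = pd.read_csv(io.StringIO(text), sep=sep, dtype=str)
print(sep, list(df.columns))
```

A wrong separator typically shows up as a DataFrame with a single wide column, so checking the column count after loading is a useful sanity test.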
per_online_extractor = Extractor(input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip", down_ext=['.csv','.zip'], output_path="../data", depth=0)
per_dfs = per_online_extractor.s4h_extract()
2025-09-24 12:50:18,463 - INFO - ----------------------
2025-09-24 12:50:18,464 - INFO - Starting data extraction...
2025-09-24 12:50:18,465 - INFO - Extracting data in online mode...
2025-09-24 12:50:18,466 - INFO - Detected direct file download URL - skipping scraping
2025-09-24 12:50:18,467 - INFO - Downloading large file (2022.zip)...
0.00B [00:00, ?B/s]2025-09-24 12:57:51,195 - INFO - Successfully downloaded: 2022.zip
0.00B [07:32, ?B/s]
2025-09-24 12:57:51,201 - INFO - Processing (depth 0): 2022.zip
2025-09-24 12:58:01,073 - INFO - Extracted: a0643d91_784-Modulo01_Enaho01-2022-100.csv
2025-09-24 12:58:01,083 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,090 - INFO - Extracted: a0643d91_784-Modulo02_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,112 - INFO - Extracted: a0643d91_784-Modulo02_Enaho01-2022-200.csv
2025-09-24 12:58:01,261 - INFO - Extracted: a0643d91_784-Modulo03_Enaho01a-2022-300.csv
2025-09-24 12:58:01,265 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-PAISES.csv
2025-09-24 12:58:01,270 - INFO - Extracted: a0643d91_784-Modulo04_ENAHO-TABLA-UBIGEO.csv
2025-09-24 12:58:01,545 - INFO - Extracted: a0643d91_784-Modulo04_Enaho01a-2022-400.csv
2025-09-24 12:58:01,551 - INFO - Extracted: a0643d91_784-Modulo04_TABLA-UBIGEO-1874 DISTRITOS.csv
2025-09-24 12:58:01,557 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV3.csv
2025-09-24 12:58:01,564 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIIU-REV4.csv
2025-09-24 12:58:01,572 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CIUO-88.csv
2025-09-24 12:58:01,580 - INFO - Extracted: a0643d91_784-Modulo05_ENAHO-TABLA-CNO-2015.csv
2025-09-24 12:58:01,890 - INFO - Extracted: a0643d91_784-Modulo05_Enaho01a-2022-500.csv
2025-09-24 12:58:04,521 - INFO - Extracted: a0643d91_784-Modulo07_Enaho01-2022-601.csv
2025-09-24 12:58:04,602 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602.csv
2025-09-24 12:58:04,618 - INFO - Extracted: a0643d91_784-Modulo08_Enaho01-2022-602A.csv
2025-09-24 12:58:05,144 - INFO - Extracted: a0643d91_784-Modulo09_Enaho01-2022-603.csv
2025-09-24 12:58:06,157 - INFO - Extracted: a0643d91_784-Modulo10_Enaho01-2022-604.csv
2025-09-24 12:58:06,679 - INFO - Extracted: a0643d91_784-Modulo11_Enaho01-2022-605.csv
2025-09-24 12:58:07,338 - INFO - Extracted: a0643d91_784-Modulo12_Enaho01-2022-606.csv
2025-09-24 12:58:08,099 - INFO - Extracted: a0643d91_784-Modulo13_Enaho01-2022-607.csv
2025-09-24 12:58:08,837 - INFO - Extracted: a0643d91_784-Modulo15_Enaho01-2022-609.csv
2025-09-24 12:58:09,513 - INFO - Extracted: a0643d91_784-Modulo16_Enaho01-2022-610.csv
2025-09-24 12:58:10,013 - INFO - Extracted: a0643d91_784-Modulo17_Enaho01-2022-611.csv
2025-09-24 12:58:10,753 - INFO - Extracted: a0643d91_784-Modulo18_Enaho01-2022-612.csv
2025-09-24 12:58:10,778 - INFO - Extracted: a0643d91_784-Modulo22_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:10,783 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000.csv
2025-09-24 12:58:10,938 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2000A.csv
2025-09-24 12:58:11,015 - INFO - Extracted: a0643d91_784-Modulo22_Enaho02-2022-2100.csv
2025-09-24 12:58:11,104 - INFO - Extracted: a0643d91_784-Modulo23_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,174 - INFO - Extracted: a0643d91_784-Modulo23_Enaho02-2022-2200.csv
2025-09-24 12:58:11,303 - INFO - Extracted: a0643d91_784-Modulo24_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,375 - INFO - Extracted: a0643d91_784-Modulo24_Enaho02-2022-2300.csv
2025-09-24 12:58:11,473 - INFO - Extracted: a0643d91_784-Modulo25_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,602 - INFO - Extracted: a0643d91_784-Modulo25_Enaho02-2022-2400.csv
2025-09-24 12:58:11,799 - INFO - Extracted: a0643d91_784-Modulo26_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:11,816 - INFO - Extracted: a0643d91_784-Modulo26_Enaho02-2022-2500.csv
2025-09-24 12:58:12,003 - INFO - Extracted: a0643d91_784-Modulo27_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,078 - INFO - Extracted: a0643d91_784-Modulo27_Enaho02-2022-2600.csv
2025-09-24 12:58:12,111 - INFO - Extracted: a0643d91_784-Modulo28_ENAHO-TABLA-AGROPECUARIO.csv
2025-09-24 12:58:12,126 - INFO - Extracted: a0643d91_784-Modulo28_Enaho02-2022-2700.csv
2025-09-24 12:58:12,241 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022-12g.csv
2025-09-24 12:58:12,269 - INFO - Extracted: a0643d91_784-Modulo34_Sumaria-2022.csv
2025-09-24 12:58:12,343 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700.csv
2025-09-24 12:58:12,377 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700A.csv
2025-09-24 12:58:12,494 - INFO - Extracted: a0643d91_784-Modulo37_Enaho01-2022-700B.csv
2025-09-24 12:58:12,701 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-1-Preg-1-a-13.csv
2025-09-24 12:58:12,719 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-2-Preg-14-a-22.csv
2025-09-24 12:58:12,894 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-3-Preg-23.csv
2025-09-24 12:58:12,981 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-4-Preg-24.csv
2025-09-24 12:58:13,023 - INFO - Extracted: a0643d91_784-Modulo77_Enaho04-2022-5-Preg-25.csv
2025-09-24 12:58:13,814 - INFO - Extracted: a0643d91_784-Modulo78_Enaho01-2022-606D.csv
2025-09-24 12:58:13,878 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800A.csv
2025-09-24 12:58:13,948 - INFO - Extracted: a0643d91_784-Modulo84_Enaho01-2022-800B.csv
2025-09-24 12:58:14,253 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-1.csv
2025-09-24 12:58:14,415 - INFO - Extracted: a0643d91_784-Modulo85_Enaho01B-2022-2.csv
Processing files: 100%|██████████| 57/57 [00:59<00:00, 1.05s/it]
2025-09-24 12:59:14,152 - INFO - Successfully processed 57/57 files
2025-09-24 12:59:14,153 - INFO - Extraction completed successfully.
per_dfs[0].head()
AÑO | MES | CONGLOME | VIVIENDA | HOGAR | UBIGEO | DOMINIO | ESTRATO | P612N | P612 | ... | P612H | TICUEST01 | D612G | D612H | I612G | I612H | FACTOR07 | NCONGLOME | SUB_CONGLOME | filename | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2022 | 01 | 005030 | 008 | 11 | 010201 | 7 | 4 | 1 | 2 | ... | 2 | 106.890243530273 | 006618 | 00 | a0643d91_784-Modulo18_Enaho01-2022-612.csv | |||||
1 | 2022 | 01 | 005030 | 008 | 11 | 010201 | 7 | 4 | 2 | 1 | ... | 1502 | 2 | 1526.17028808594 | 152.617034912109 | 106.890243530273 | 006618 | 00 | a0643d91_784-Modulo18_Enaho01-2022-612.csv | ||
2 | 2022 | 01 | 005030 | 008 | 11 | 010201 | 7 | 4 | 3 | 2 | ... | 2 | 106.890243530273 | 006618 | 00 | a0643d91_784-Modulo18_Enaho01-2022-612.csv | |||||
3 | 2022 | 01 | 005030 | 008 | 11 | 010201 | 7 | 4 | 4 | 1 | ... | 120 | 2 | 121.931053161621 | 10.1609210968018 | 106.890243530273 | 006618 | 00 | a0643d91_784-Modulo18_Enaho01-2022-612.csv | ||
4 | 2022 | 01 | 005030 | 008 | 11 | 010201 | 7 | 4 | 5 | 2 | ... | 2 | 106.890243530273 | 006618 | 00 | a0643d91_784-Modulo18_Enaho01-2022-612.csv |
5 rows × 25 columns
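With 57 files extracted, the first DataFrame in the list is not necessarily the module of interest. A minimal sketch of locating a module through the filename column follows. It assumes every DataFrame carries that column with the extracted file name; find_by_filename is a hypothetical helper, and the toy list stands in for per_dfs.

```python
import pandas as pd


def find_by_filename(dfs, fragment):
    """Return the DataFrames whose filename column contains the given text."""
    return [
        df for df in dfs
        if "filename" in df.columns
        and df["filename"].astype(str).str.contains(fragment, regex=False).any()
    ]


# Toy stand-ins for two extracted ENAHO modules (values are illustrative).
toy = [
    pd.DataFrame({"HOGAR": ["11"],
                  "filename": ["a0643d91_784-Modulo01_Enaho01-2022-100.csv"]}),
    pd.DataFrame({"HOGAR": ["11"],
                  "filename": ["a0643d91_784-Modulo34_Sumaria-2022.csv"]}),
]

sumaria = find_by_filename(toy, "Modulo34_Sumaria-2022.csv")
print(len(sumaria))
```

Matching on a literal fragment (regex=False) avoids surprises from characters such as dots and parentheses in the file names.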
Further steps#
Harmonize the extracted data using the Harmonizer class from the socio4health library. You can follow the Harmonization tutorial for more details.