
Extraction of Colombia, Brazil and Peru online data
This notebook introduces how to retrieve data from online sources through web scraping, as well as from local files, for Colombia, Brazil, Peru, and the Dominican Republic. The tutorial assumes an intermediate or advanced understanding of Python and data manipulation.
Setting up the environment
To run this notebook, you need the following prerequisites:
Python 3.10+
Additionally, you need to install the socio4health and pandas packages, which can be done using pip:
!pip install socio4health pandas -q
Import Libraries
To perform the data extraction, the socio4health library provides the Extractor class for data extraction and the Harmonizer class for harmonizing the retrieved data. We will also use pandas for data manipulation.
import datetime
import pandas as pd
from socio4health import Extractor
from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum
from socio4health.harmonizer import Harmonizer
from socio4health.utils import harmonizer_utils
Use case 1: Extracting data from Colombia
To extract data from Colombia, we will use the Extractor class from the socio4health library. The Extractor class provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the Large Integrated Household Survey 2022 (Gran Encuesta Integrada de Hogares - GEIH - 2022) dataset from the website of the Colombian National Administrative Department of Statistics (DANE).
The Extractor class requires the following parameters:
input_path: The URL or local path to the data source.
down_ext: A list of file extensions to download, e.g. .CSV, .csv, .zip.
sep: The separator used in the data files (e.g., ; for semicolon-separated values).
output_path: The local path where the extracted data will be saved.
depth: The depth of the directory structure to traverse when downloading files. A depth of 0 means only the files in the specified directory will be downloaded.
col_online_extractor = Extractor(input_path="https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata", down_ext=['.CSV','.csv','.zip'], sep=';', output_path="../data", depth=0)
After the instance is set up, we can call the extract method to download and extract the data. The method returns a list of pandas DataFrames containing the extracted data.
col_dfs = col_online_extractor.extract()
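Once extraction succeeds, col_dfs is a list of pandas DataFrames. A quick way to inspect what was retrieved is to loop over the list and report each table's shape and columns. The snippet below is a minimal sketch that uses small stand-in frames in place of the real col_dfs, since the actual GEIH tables depend on the download:

```python
import pandas as pd

# Stand-in for the list returned by extract(); the real col_dfs would hold
# the GEIH survey tables downloaded from the DANE site.
col_dfs = [
    pd.DataFrame({"DIRECTORIO": [1, 2], "P6020": [1, 2]}),
    pd.DataFrame({"DIRECTORIO": [1, 2, 3], "INGLABO": [1200.0, 850.5, 930.0]}),
]

# Report the shape and column names of each extracted table.
for i, df in enumerate(col_dfs):
    print(f"DataFrame {i}: {df.shape[0]} rows x {df.shape[1]} columns -> {list(df.columns)}")
```

This kind of quick audit helps confirm that the expected number of tables came through before moving on to harmonization.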
Use case 2: Extracting data from Brazil
We are downloading the Brazilian data from the Brazilian Institute of Geography and Statistics (IBGE) website, again using the Extractor class. In this case, we are extracting the Brazilian National Continuous Household Sample Survey (PNADC) for the year 2024. The is_fwf parameter is set to True, which indicates that the data files are in fixed-width format. In that case, the colnames and colspecs parameters must also be provided; in this example, they are set to the corresponding enums available for PNADC data, which define the column names and specifications for the dataset. See more details in the socio4health.enums.data_info_enum documentation.
bra_online_extractor = Extractor(input_path="https://ftp.ibge.gov.br/Trabalho_e_Rendimento/Pesquisa_Nacional_por_Amostra_de_Domicilios_continua/Trimestral/Microdados/2024/", down_ext=['.txt','.zip'], is_fwf=True, colnames=BraColnamesEnum.PNADC.value, colspecs=BraColspecsEnum.PNADC.value, output_path="../data", depth=0)
bra_dfs = bra_online_extractor.extract()
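To see what the fixed-width settings do, here is a minimal, self-contained sketch that parses fixed-width records with pandas directly. The sample records, column names, and (start, end) character spans below are invented for illustration; the real PNADC layout comes from BraColnamesEnum.PNADC and BraColspecsEnum.PNADC:

```python
import io
import pandas as pd

# Two fixed-width records: fields live at fixed character positions, no separator.
raw = "2024011\n2024123\n"

# Hypothetical layout: year in chars 0-4, quarter in char 4, age in chars 5-7.
colspecs = [(0, 4), (4, 5), (5, 7)]
colnames = ["ANO", "TRIMESTRE", "IDADE"]

# pandas slices each line using the character spans above, which is the kind
# of parsing the Extractor performs when is_fwf=True.
df = pd.read_fwf(io.StringIO(raw), colspecs=colspecs, names=colnames)
print(df)
```

Getting colspecs wrong silently shifts every field, which is why the tested enums are used for the real PNADC files rather than hand-written spans.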
Use case 3: Extracting data from Peru
Peruvian data is extracted from the National Institute of Statistics and Informatics (INEI) website. In this case, we are extracting the National Household Survey (ENAHO) for the year 2022. The down_ext parameter is set to download .csv and .zip files.
per_online_extractor = Extractor(input_path="https://www.inei.gob.pe/media/DATOS_ABIERTOS/ENAHO/DATA/2022.zip", down_ext=['.csv','.zip'], output_path="../data", depth=0)
per_dfs = per_online_extractor.extract()
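The extracted tables can then be worked with using regular pandas operations. As an illustration (again with small stand-in frames in place of the real per_dfs), tables that share a schema can be stacked into a single DataFrame:

```python
import pandas as pd

# Stand-in for per_dfs; the real list would hold the downloaded ENAHO modules.
per_dfs = [
    pd.DataFrame({"CONGLOME": [1, 2], "VIVIENDA": [10, 20]}),
    pd.DataFrame({"CONGLOME": [3], "VIVIENDA": [30]}),
]

# Stack the tables and renumber the index from 0.
combined = pd.concat(per_dfs, ignore_index=True)
print(combined.shape)
```

Note that concatenation only makes sense for tables with compatible columns; survey modules with different schemas are better kept separate until harmonization.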
Further steps
Harmonize the extracted data using the Harmonizer class from the socio4health library. You can follow the Harmonization tutorial for more details.