{
"cells": [
{
"metadata": {},
"cell_type": "markdown",
"source": [
"# Extraction of Colombia, Brazil and Peru online data"
],
"id": "e2b0c70b588bed1d"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"**Run the tutorial via free cloud platforms:** [Binder](https://mybinder.org/v2/gh/harmonize-tools/socio4health/HEAD?urlpath=%2Fdoc%2Ftree%2Fdocs%2Fsource%2Fnotebooks%2Fextractor.ipynb)\n"
],
"id": "bbb9be9fde0e004a"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"This notebook introduces how to retrieve data from online sources through **web scraping**, as well as from local files, for Colombia, Brazil, Peru, and the Dominican Republic. It assumes an **intermediate** or **advanced** understanding of Python and data manipulation.\n",
"\n",
"## Setting up the environment\n",
"\n",
"To run this notebook, you need the following prerequisite:\n",
"\n",
"- Python 3.10+\n",
"\n",
"Additionally, you need to install the `socio4health` and `pandas` packages, which can be done using `pip`:\n"
],
"id": "695f4aa2c770640a"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-07-10T16:49:44.305739Z",
"start_time": "2025-07-10T16:49:30.215946Z"
}
},
"cell_type": "code",
"source": "!pip install socio4health pandas -q",
"id": "a29a453e4e438474",
"outputs": [],
"execution_count": 2
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Import Libraries\n",
"\n",
"The `socio4health` library provides the `Extractor` class for data extraction and the `Harmonizer` class for harmonizing the retrieved data. We will also use `pandas` for data manipulation.\n"
],
"id": "a9faa7b1a0405434"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-07-10T16:49:44.360496Z",
"start_time": "2025-07-10T16:49:44.349010Z"
}
},
"cell_type": "code",
"source": [
"import datetime\n",
"import pandas as pd\n",
"from socio4health import Extractor\n",
"from socio4health.enums.data_info_enum import BraColnamesEnum, BraColspecsEnum\n",
"from socio4health.harmonizer import Harmonizer\n",
"from socio4health.utils import harmonizer_utils"
],
"id": "d0e08601b93ce10d",
"outputs": [],
"execution_count": 3
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Use case 1: Extracting data from Colombia\n",
"\n",
"To extract data from Colombia, we will use the `Extractor` class from the `socio4health` library, which provides methods to retrieve data from various sources, including online databases and local files. In this example, we will extract the 2022 Large Integrated Household Survey (Gran Encuesta Integrada de Hogares, **GEIH**) dataset from the website of Colombia's National Administrative Department of Statistics (**DANE**).\n",
"\n",
"In this example, the `Extractor` is configured with the following parameters:\n",
"- `input_path`: The `URL` or local path to the data source.\n",
"- `down_ext`: A list of file extensions to download. This can include `.CSV`, `.csv`, `.zip`, etc.\n",
"- `sep`: The separator used in the data files (e.g., `;` for semicolon-separated values).\n",
"- `output_path`: The local path where the extracted data will be saved.\n",
"- `depth`: The depth of the directory structure to traverse when downloading files. A depth of `0` means only the files in the specified directory will be downloaded.\n"
],
"id": "d117d3d107ee158b"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-07-10T16:53:40.152716Z",
"start_time": "2025-07-10T16:53:40.107571Z"
}
},
"cell_type": "code",
"source": [
"col_online_extractor = Extractor(\n",
"    input_path=\"https://microdatos.dane.gov.co/index.php/catalog/771/get-microdata\",\n",
"    down_ext=['.CSV', '.csv', '.zip'],\n",
"    sep=';',\n",
"    output_path=\"../data\",\n",
"    depth=0,\n",
")"
],
"id": "d881365674fef602",
"outputs": [],
"execution_count": 4
},
{
"metadata": {},
"cell_type": "markdown",
"source": "After the instance is set up, we can call the `extract` method to download and extract the data. The method returns a list of `pandas` DataFrames containing the extracted data.",
"id": "b0583ac89ee19937"
},
{
"metadata": {
"ExecuteTime": {
"end_time": "2025-08-11T20:39:34.498226Z",
"start_time": "2025-08-11T20:39:32.660687Z"
}
},
"cell_type": "code",
"source": "col_dfs = col_online_extractor.extract()",
"id": "20265f6f8a3355ad",
"outputs": [
],
"execution_count": 1
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"## Use case 2: Extracting data from Brazil\n",
"\n",
"The Brazilian data are downloaded from the Brazilian Institute of Geography and Statistics (**IBGE**) website, again using the `Extractor` class. In this case, we extract the Brazilian National Continuous Household Sample Survey (**PNADC**) for the year 2024.\n"
],
"id": "bb348d1d04c4bc92"
},
{
"metadata": {},
"cell_type": "markdown",
"source": [
"The `is_fwf` parameter is set to `True`, which indicates that the data files are in fixed-width format. The `colnames` and `colspecs` parameters must be provided. In this example, they are set to the corresponding available enums for PNADC data, which define the column names and specifications for the dataset.\n",
"See more details in the `socio4health.enums.data_info_enum` documentation.\n",
"