Citizen Science and STEM Education with R

Teaching and Research Companion Notebook — Open Urban Air Data from Madrid (2020–2024)

Authors and affiliations

Jesús Cáceres Tello, Faculty of Computer Science, Complutense University of Madrid

José Javier Galán Hernández, Department of Computer Science, University of Alcalá

1 💡 Purpose of the Notebook

This Quarto Notebook serves as a teaching and research companion to the article
Citizen Science and STEM Education with R: Reproducible Learning from Open Urban Air Quality Data (Applied Sciences, 2025).

It reproduces the main analytical workflow implemented in the study and illustrates how R and Quarto can be integrated into STEM education to foster data literacy, environmental awareness, and methodological transparency.


2 🔁 Reproducible Data Workflow

The complete workflow integrates open data, computational reproducibility, and STEM learning.
It can be applied to other urban contexts or courses focusing on environmental informatics, statistical modelling, or sustainability transitions.

Fig. 1. Open Data and Methodological Pipeline.

3 🗂️️ Data Sources and Structure

3.1 Air Quality Data

Air quality datasets were retrieved from the Madrid Open Data Portal (Portal de Datos Abiertos del Ayuntamiento de Madrid).
Measurements include nitrogen dioxide (NO₂), ozone (O₃), particulate matter (PM₁₀, PM₂.₅), sulphur dioxide (SO₂), and carbon monoxide (CO) recorded hourly across 24 urban stations (2020–2024).

Fig. 2a. Air quality monitoring stations across the Madrid urban area (2020–2024).

3.2 Pollutant Coverage

Each station has different pollutant coverage and measurement frequency, which provides an excellent example for students to explore data completeness and measurement uncertainty in open environmental datasets.

Fig. 2b. Pollutant coverage and measurement frequency across monitoring stations.
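As a possible classroom starting point, completeness can be summarised as the share of non-missing hourly values per station and pollutant. The sketch below uses a small made-up table in long format; the station codes, column names, and values are illustrative assumptions, not the notebook's data.

```r
# Hypothetical long-format sample: one row per station, pollutant, and hour.
# Station codes and values are invented for illustration only.
obs <- data.frame(
  station   = c("S01", "S01", "S01", "S01", "S02", "S02", "S02", "S02"),
  pollutant = c("NO2", "NO2", "O3",  "O3",  "NO2", "NO2", "NO2", "NO2"),
  value     = c(41, NA, 55, 60, 38, 35, NA, NA)
)

# Share of non-missing values in each station x pollutant cell.
# Cells with no rows at all come back as NA (pollutant not measured there).
completeness <- tapply(
  !is.na(obs$value),
  list(station = obs$station, pollutant = obs$pollutant),
  mean
)

print(round(completeness, 2))
```

An `NA` cell means the station reports no rows for that pollutant, while a value below 1 means rows exist but some measurements are missing. Students can discuss why the two cases differ.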

3.3 Data Processing Workflow

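Data from both sources (the air quality records described above and the AEMET meteorological records integrated in Section 5) were processed in R through three main stages: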

  1. Reading and cleaning monthly CSVs (removing redundant columns and correcting data types).
  2. Validating records with confirmed measurements (VAL flag).
  3. Pivoting and compressing results into parquet format for efficiency and consistency.

Fig. 3. Data acquisition, validation, and harmonisation workflow implemented in R.
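The three stages above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the notebook's actual pipeline: the column names (`PROVINCIA`, `ESTACION`, `POLLUTANT`, `DATE`, `H01`/`V01`, ...) and the flag value `"V"` for a validated measurement are hypothetical, and a tiny inline table stands in for a monthly CSV.

```r
# Hypothetical excerpt of one monthly file: one row per station, pollutant,
# and day, with hourly value columns (H01, H02) and matching validation
# flag columns (V01, V02). Names and values are assumptions.
raw <- data.frame(
  PROVINCIA = 28,
  ESTACION  = c("4", "4"),
  POLLUTANT = c("NO2", "O3"),
  DATE      = c("2020-01-01", "2020-01-01"),
  H01 = c("45", "12"), V01 = c("V", "V"),
  H02 = c("50", "10"), V02 = c("N", "V"),
  stringsAsFactors = FALSE
)

# Stage 1: drop a redundant column and correct data types.
clean <- raw[, setdiff(names(raw), "PROVINCIA")]
clean$DATE <- as.Date(clean$DATE)
hour_cols <- grep("^H[0-9]{2}$", names(clean), value = TRUE)
clean[hour_cols] <- lapply(clean[hour_cols], as.numeric)

# Reshape the hourly columns into rows, pairing each value with its flag.
long <- do.call(rbind, lapply(hour_cols, function(h) {
  data.frame(
    station   = clean$ESTACION,
    pollutant = clean$POLLUTANT,
    date      = clean$DATE,
    hour      = as.integer(sub("^H", "", h)),
    value     = clean[[h]],
    flag      = clean[[sub("^H", "V", h)]]
  )
}))

# Stage 2: keep only records flagged as validated.
valid <- long[long$flag == "V",
              c("station", "pollutant", "date", "hour", "value")]

# Stage 3: pivot pollutants into columns (one row per station and hour).
wide <- reshape(
  valid,
  idvar     = c("station", "date", "hour"),
  timevar   = "pollutant",
  direction = "wide"
)

print(wide)
```

Writing the result to Parquet needs an additional package (for example, `arrow::write_parquet()`); that step is left out here so the sketch runs with base R alone.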

4 📊 Exploratory Analysis

Exploratory analysis introduces students to descriptive statistics and visual analytics in R using open environmental data.
The focus is on NO₂ (a primary traffic-related pollutant) and O₃ (a secondary pollutant formed photochemically), both key indicators of urban air quality dynamics.

4.1 Annual and Seasonal Variability

Fig. 4. Annual variability of NO₂ and O₃ concentrations (2020–2024).
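One way students might compute annual and seasonal summaries is with `aggregate()` on a daily table. The series below is simulated so that the sketch is self-contained: the seasonal shapes, noise levels, and column names are assumptions for illustration and do not reproduce the Madrid measurements.

```r
# Simulated daily series (NOT real data): NO2 peaks in winter and O3 in
# summer by construction, so the grouping logic has something to show.
set.seed(1)
dates <- seq(as.Date("2020-01-01"), as.Date("2024-12-31"), by = "day")
doy   <- as.integer(format(dates, "%j"))

daily <- data.frame(
  date = dates,
  no2  = 35 + 10 * cos(2 * pi * doy / 365) + rnorm(length(dates), sd = 5),
  o3   = 55 - 25 * cos(2 * pi * doy / 365) + rnorm(length(dates), sd = 8)
)
daily$year  <- as.integer(format(daily$date, "%Y"))
daily$month <- as.integer(format(daily$date, "%m"))

# Annual means: one row per year.
annual <- aggregate(cbind(no2, o3) ~ year, data = daily, FUN = mean)

# Seasonal profile: mean by calendar month, pooled over all years.
monthly <- aggregate(cbind(no2, o3) ~ month, data = daily, FUN = mean)

print(annual)
print(monthly)
```

With real data, replacing `daily` by the harmonised table would yield the kind of yearly and monthly summaries that an annual variability figure is built from.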

4.2 Distribution Analysis

Boxplots provide a powerful visual tool to discuss dispersion, central tendency, and outliers across pollutants.
In this context, students learn how descriptive statistics translate into environmental interpretation, reinforcing quantitative reasoning with real data.

Fig. 8. Boxplots of NO₂ and O₃ concentrations across 2020–2024.
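To connect the picture to the numbers, the quantities a boxplot draws can be inspected with `boxplot.stats()`. The sketch below uses simulated values with one artificial extreme observation; the numbers are assumptions chosen only to make the outlier rule visible.

```r
# Simulated NO2-like values (NOT real data) plus one artificial extreme value.
set.seed(42)
no2 <- c(rnorm(200, mean = 35, sd = 8), 95)

# boxplot.stats() returns the five numbers a boxplot draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
# The hinges are close to the first and third quartiles.
bs <- boxplot.stats(no2)
five <- setNames(
  bs$stats,
  c("lower_whisker", "lower_hinge", "median", "upper_hinge", "upper_whisker")
)

# Dispersion: distance between the hinges (approximately the IQR).
spread <- unname(five["upper_hinge"] - five["lower_hinge"])

print(round(five, 1))
print(round(spread, 1))

# Points beyond 1.5 x that spread from the hinges are reported as outliers.
print(bs$out)
```

The same summary can be drawn per pollutant with the formula interface, for example `boxplot(value ~ pollutant, data = ...)` on a long-format table.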

🔮 Forecasting with Prophet

Time-series forecasting introduces students to predictive modelling using open environmental data.
The Prophet model (Taylor & Letham, 2018) was selected for its interpretability, decomposition structure, and robustness to missing values — key features for teaching reproducible forecasting in R.

4.3 Model for NO₂

Students can visualise how additive components — trend, seasonality, and residuals — reveal the influence of human activity and meteorological cycles on pollutant evolution. This exercise supports reproducible experimentation with forecasting horizons, cross-validation, and performance metrics such as RMSE or MAE.

Fig. 5a. Prophet forecast for NO₂ concentrations (2020–2024).

Fig. 5b. Prophet forecast for O₃ concentrations (2020–2024).
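The performance metrics mentioned above can be illustrated without fitting Prophet itself. The sketch below defines RMSE and MAE and scores a seasonal-naive baseline (repeat the last observed week) on a hold-out period. The baseline is a deliberately simple stand-in, not the Prophet model, and the series is simulated. The column names `ds` and `y` follow Prophet's input convention, so the same two functions could score a Prophet forecast's `yhat` column against held-out observations.

```r
# Error metrics for comparing forecasts against held-out observations.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

# Simulated daily series with a weekly cycle (NOT real data).
set.seed(7)
dates  <- seq(as.Date("2023-01-01"), by = "day", length.out = 120)
series <- data.frame(
  ds = dates,
  y  = 35 + 6 * sin(2 * pi * seq_along(dates) / 7) + rnorm(120, sd = 3)
)

# Hold out the last 30 days as the forecasting horizon.
train <- series[1:90, ]
test  <- series[91:120, ]

# Seasonal-naive baseline: repeat the last observed week over the horizon.
last_week <- tail(train$y, 7)
pred <- rep(last_week, length.out = nrow(test))

scores <- c(RMSE = rmse(test$y, pred), MAE = mae(test$y, pred))
print(round(scores, 2))
```

Changing the split point or the horizon length is a simple way to experiment with training periods; a model is only worth its added complexity if it beats a baseline of this kind.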

5 🌦 Meteorological Integration

Meteorological factors shape pollutant behaviour and are fundamental in understanding atmospheric processes.
By integrating temperature, solar radiation, and wind speed data from AEMET, learners can explore multivariate relationships within an urban ecosystem.

5.1 Integration Workflow

Fig. 7. Integration workflow of meteorological and air quality data (2020–2024).
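A minimal sketch of such an integration joins daily pollutant and meteorological tables on the date and then inspects pairwise correlations. Both tables below are simulated, and the column names are assumptions; the positive link between O₃ and solar radiation is built into the simulation and should not be read as an empirical result.

```r
# Simulated daily tables (NOT real data). O3 is generated to depend on
# solar radiation so that the correlation step has a visible signal.
set.seed(3)
dates     <- seq(as.Date("2022-06-01"), by = "day", length.out = 60)
radiation <- runif(60, min = 150, max = 350)

air <- data.frame(
  date = dates,
  o3   = 20 + 0.2 * radiation + rnorm(60, sd = 5)
)

# The meteorological table misses two days, as real sources often do.
keep  <- -c(5, 6)
meteo <- data.frame(
  date        = dates[keep],
  temperature = runif(58, min = 18, max = 34),
  radiation   = radiation[keep],
  wind_speed  = runif(58, min = 0.5, max = 6)
)

# Left join on date: every air quality day is kept, gaps become NA.
merged <- merge(air, meteo, by = "date", all.x = TRUE)

# Pairwise correlations, using all complete pairs for each variable pair.
cors <- cor(
  merged[, c("o3", "temperature", "radiation", "wind_speed")],
  use = "pairwise.complete.obs"
)
print(round(cors, 2))
```

Using `all.x = TRUE` makes the missing meteorological days explicit instead of silently dropping them, which ties this exercise back to the data completeness discussion.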

6 🎓 Learning and Reproducibility Framework

Reproducibility is both a scientific and pedagogical value.
This framework unifies open data, transparent computation, and educational innovation, reinforcing the culture of open science.

Fig. 6. Reproducible learning and open-science framework for STEM education.

7 🧑‍🏫 Educational Applications

This Notebook can be directly incorporated into undergraduate or postgraduate STEM courses focused on data analysis, environmental informatics, or sustainability.

Suggested learning activities:

  1. Reproduce pollutant forecasts with modified training periods.
  2. Explore correlations between additional meteorological variables.
  3. Design inquiry-based projects connecting data to local environmental policies.
  4. Document and publish reproducible reports using Quarto and GitHub.

Through these exercises, students not only practise coding but also embrace scientific integrity and civic engagement through data.


8 🌐 Repository and Citation

All code, figures, and harmonised datasets are openly available at:
https://github.com/jcaceres-academico/OpenUrbanAirandMeteorological

When citing this educational resource, please use:

Cáceres-Tello, J.; Galán-Hernández, J.J. (2025). Citizen Science and STEM Education with R: Reproducible Learning from Open Urban Air Quality Data. Applied Sciences, 15(x), xxxx.
DOI: [placeholder]

This ensures traceability and recognition for open-source academic contributions.