TY - JOUR
T1 - Strategic imputation of groundwater data using machine learning
T2 - Insights from diverse aquifers in the Chao-Phraya River Basin
AU - Sharma, Yaggesh Kumar
AU - Kim, Seokhyeon
AU - Tayerani Charmchi, Amir Saman
AU - Kang, Doosun
AU - Batelaan, Okke
PY - 2025/2
Y1 - 2025/2
N2 - Effective groundwater monitoring is essential for sustainable water management, particularly in data-sparse regions. To address inconsistencies in groundwater level data, we developed a machine learning framework for robust data imputation, tested in the Chao-Phraya River (CPR) Basin, a region facing significant groundwater challenges due to high population density and ecological importance. Our study evaluated five models: K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), Multilayer Perceptron (MLP), Random Forest (RF), and Soft Imputation (SI). These were used to fill gaps in monthly groundwater level data across various locations, aquifer depths, and data loss scenarios. Results show that MICE performs well in high-density well environments, while SI excels with lower well density, maintaining Pearson correlation coefficients (R) above 0.80 and RMSE values below 6 even at 10% data loss. The Coefficient of Variation (COV) analysis also confirmed that imputed data remains stable and reliable. However, the study also reveals a significant decrease in model performance in regions with fewer wells, as indicated by increased RMSE and reduced R. Our findings indicate that machine learning models are capable of handling groundwater level observations with missing data. The well density in a region has a significant impact on these models' performance. Imputation techniques should be tailored to each aquifer's specific characteristics and surroundings to obtain accurate groundwater data.
AB - Effective groundwater monitoring is essential for sustainable water management, particularly in data-sparse regions. To address inconsistencies in groundwater level data, we developed a machine learning framework for robust data imputation, tested in the Chao-Phraya River (CPR) Basin, a region facing significant groundwater challenges due to high population density and ecological importance. Our study evaluated five models: K-Nearest Neighbors (KNN), Multiple Imputation by Chained Equations (MICE), Multilayer Perceptron (MLP), Random Forest (RF), and Soft Imputation (SI). These were used to fill gaps in monthly groundwater level data across various locations, aquifer depths, and data loss scenarios. Results show that MICE performs well in high-density well environments, while SI excels with lower well density, maintaining Pearson correlation coefficients (R) above 0.80 and RMSE values below 6 even at 10% data loss. The Coefficient of Variation (COV) analysis also confirmed that imputed data remains stable and reliable. However, the study also reveals a significant decrease in model performance in regions with fewer wells, as indicated by increased RMSE and reduced R. Our findings indicate that machine learning models are capable of handling groundwater level observations with missing data. The well density in a region has a significant impact on these models' performance. Imputation techniques should be tailored to each aquifer's specific characteristics and surroundings to obtain accurate groundwater data.
KW - Chao-Phraya River Basin
KW - Data imputation
KW - Drought
KW - Groundwater management
KW - Machine learning models
KW - Well density
UR - http://www.scopus.com/inward/record.url?scp=85212218062&partnerID=8YFLogxK
U2 - 10.1016/j.gsd.2024.101394
DO - 10.1016/j.gsd.2024.101394
M3 - Article
AN - SCOPUS:85212218062
SN - 2352-801X
VL - 28
JO - Groundwater for Sustainable Development
JF - Groundwater for Sustainable Development
M1 - 101394
ER -