-------------------
GENERAL INFORMATION
-------------------

1. Title of Dataset

MidwestET-500: Upscaled Daily Evapotranspiration Estimates for the U.S. Midwest (2019–2024)

2. Author Information

Principal Investigator Contact Information
Name: Aleksei Rozanov
Institution: University of Minnesota
Email: rozan012@umn.edu
ORCID: 0009-0004-6285-7066

Associate or Co-investigator Contact Information
Name: Samikshya Subedi
Institution: University of Minnesota
Email: subed036@umn.edu

Associate or Co-investigator Contact Information
Name: Vasudha Sharma
Institution: University of Minnesota
Email: vasudha@umn.edu
ORCID: 0000-0001-7957-9155

Associate or Co-investigator Contact Information
Name: Bryan Runck
Institution: University of Minnesota
Email: runck014@umn.edu
ORCID: 0000-0002-7015-1539

3. Date published or finalized for release: 2025-08-04

4. Date of data collection (single date, range, approximate date): 2019-01-01 to 2024-12-31

5. Geographic location of data collection: U.S. Upper Midwest (36°N to 49°N, 104°W to 82°W)

6. Information about funding sources that supported the collection of the data: Funding for this project was provided by the Minnesota Environment and Natural Resources Trust Fund as recommended by the Legislative-Citizen Commission on Minnesota Resources (LCCMR) ENRTF 2021-266.

7. Overview of the data (abstract):

MidwestET-500 is a high-resolution evapotranspiration (ET) dataset covering the U.S. Midwest from 2019 to 2024 at 500 m spatial and daily temporal resolution. The data were generated using a LightGBM model trained on eddy covariance flux tower data, satellite remote sensing, and gridded meteorology. Knowledge-guided features, including Penman–Monteith-derived inputs, were used to enhance performance and interpretability. The model was validated with GroupKFold cross-validation (CV) and independently benchmarked against OpenET and Minnesota Mesonet Penman–Monteith estimates.
The data are released in NetCDF4 format and support applications such as irrigation planning, hydrologic modeling, and drought monitoring.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

1. Licenses/restrictions placed on the data: CC BY 4.0 – Open access with attribution required.

2. Links to publications that cite or use the data: Rozanov, A. et al. (2025). Knowledge-Guided Tree-Based Models for Evapotranspiration Upscaling in the U.S. Midwest [in preparation]

3. Was data derived from another source? No.

4. Terms of Use: Data Repository for the University of Minnesota (DRUM): By using these files, users agree to the Terms of Use: https://conservancy.umn.edu/pages/policies/#drum-terms-of-use

---------------------
DATA & FILE OVERVIEW
---------------------

1. File List

ET_2019.nc
ET_2020.nc
ET_2021.nc
ET_2022.nc
ET_2023.nc
ET_2024.nc
lightgbm_model.txt

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

1. Description of methods used for collection/generation of data: ET (mm d-1) values were predicted using a LightGBM model trained on eddy covariance data with inputs from MODIS reflectance, gridded ERA5-Land meteorology, and site-level static features. The model included knowledge-guided predictors derived from the Penman–Monteith equation.

2. Methods for processing the data: Raw inputs were harmonized to a 500 m grid and daily temporal resolution. The model was trained using GroupKFold CV (site-year-stratified) to prevent leakage. Predictions were generated over the full Midwest domain for each day from 2019–2024.

3. Instrument- or software-specific information needed to interpret the data: NetCDF4 reader (e.g., xarray in Python, Panoply, or ncview). Coordinates are in WGS84.

4. Standards and calibration information, if appropriate: Flux data from EC towers were gap-filled and quality-controlled using FLUXNET standard procedures.

5. 
Environmental/experimental conditions: The model was trained and validated across diverse agricultural and natural land cover types in the Midwest, under varying seasonal and meteorological conditions.

6. Describe any quality-assurance procedures performed on the data: Model predictions were validated against OpenET (r = 0.94) and Penman–Monteith estimates (r = 0.89). Errors were quantified using RMSE, MAE, and R². Spatial and temporal consistency was checked by visual inspection.

7. People involved with sample collection, processing, analysis and/or submission: All authors.

8. Input data description and sources:

The selected meteorological variables reflect the primary physical processes governing ET:
- Air temperature and dewpoint temperature at 2 m control vapor pressure deficit, a key driver of atmospheric demand for ET;
- Wind components influence turbulent transfer and enhance vapor removal at the surface, while surface pressure affects air density, which modulates latent heat fluxes;
- Surface net solar radiation directly supplies the energy needed for evaporation and transpiration;
- Total evaporation serves as a process-based model proxy for moisture fluxes, and total precipitation represents water availability; together they set boundary conditions for soil evaporation and plant transpiration.

To extract proxies potentially related to ET, we included the following MOD09GA variables:
- Sensor and solar geometry (sensor zenith angle, sensor azimuth angle, solar zenith angle, solar azimuth angle), providing information about observation geometry and solar energy input;
- Surface reflectance bands 1–7 (Red, NIR1, Blue, Green, NIR2, SWIR1, SWIR2), capturing vegetation and land surface conditions closely tied to ET processes;
- Cloud QA (State 1km), which was binary decoded to flag pixels affected by clouds or shadows (1) versus clear observations (0).
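
As an illustration of the cloud QA decoding described above, the following is a minimal Python sketch, not the authors' code. It assumes the commonly documented MOD09GA State 1km bit layout (bits 0-1 for cloud state, bit 2 for cloud shadow) and assumes that "mixed" pixels count as cloud-affected; this README does not state which bits were actually used, and the function name is hypothetical.

```python
def cloud_or_shadow_flag(state_1km: int) -> int:
    """Return 1 for a cloud- or shadow-affected pixel, 0 for a clear one.

    Assumed bit layout (not confirmed by this README):
      bits 0-1: cloud state (0 = clear, 1 = cloudy, 2 = mixed, 3 = not set, assumed clear)
      bit 2:    cloud shadow (1 = shadow present)
    """
    cloud_state = state_1km & 0b11           # keep the two lowest bits
    cloud_shadow = (state_1km >> 2) & 0b1    # isolate bit 2
    # Assumption: both "cloudy" and "mixed" are treated as cloud-affected.
    return int(cloud_state in (1, 2) or cloud_shadow == 1)


# Example: clear, cloudy, and shadow-only QA words
flags = [cloud_or_shadow_flag(v) for v in (0, 1, 4)]
```

The same bit operations apply element-wise to an integer array if the QA layer is loaded with NumPy.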
Lastly, the target (dependent) variable for the supervised ML task was derived from AmeriFlux and FLUXNET eddy covariance (EC) time series at 38 sites across the U.S. Midwest, processed using the standardized ONEFlux methodology. Specifically, we used LE_F_MDS (latent heat flux) as the dependent variable, pre-filtered to include only daily values with LE_F_MDS_QC ≥ 0.75, so that only observations with no more than 25% gap-filling were retained.

Vermote, E., & Wolfe, R. (2021). MODIS/Terra Surface Reflectance Daily L2G Global 1km and 500m SIN Grid V061 [Data set]. NASA Land Processes Distributed Active Archive Center. https://doi.org/10.5067/MODIS/MOD09GA.061

Muñoz Sabater, J. (2019). ERA5-Land hourly data from 1950 to present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). https://doi.org/10.24381/cds.e2161bac

Pastorello, G., Trotta, C., Canfora, E., Chu, H., Christianson, D., Cheah, Y.W., et al. (2020). The FLUXNET2015 dataset and the ONEFlux processing pipeline for eddy covariance data. Scientific Data, 7(1), 225. https://doi.org/10.1038/s41597-020-0534-3
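
The target-variable preparation described in item 8 (keeping daily LE_F_MDS values with LE_F_MDS_QC ≥ 0.75 and expressing them as ET in mm d-1) could be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the README does not give the conversion from latent heat flux to ET, so a constant latent heat of vaporization of 2.45 MJ kg-1 is assumed, and the function names and record layout are hypothetical.

```python
SECONDS_PER_DAY = 86_400
# Assumption: constant latent heat of vaporization (J kg-1); the README does not specify one.
LATENT_HEAT_VAPORIZATION = 2.45e6


def le_to_et_mm_per_day(le_w_m2: float) -> float:
    """Convert a daily mean latent heat flux (W m-2) to ET (mm d-1).

    W m-2 * s d-1 gives J m-2 d-1; dividing by J kg-1 gives kg m-2 d-1,
    and 1 kg of water spread over 1 m2 is a 1 mm layer.
    """
    return le_w_m2 * SECONDS_PER_DAY / LATENT_HEAT_VAPORIZATION


def filter_and_convert(records, min_qc=0.75):
    """Keep (date, LE, QC) records with QC >= min_qc and return (date, ET) pairs."""
    return [
        (day, le_to_et_mm_per_day(le))
        for day, le, qc in records
        if qc >= min_qc
    ]


# Hypothetical daily records: (date, LE_F_MDS in W m-2, LE_F_MDS_QC)
sample = [
    ("2019-07-01", 100.0, 0.90),
    ("2019-07-02", 80.0, 0.50),   # dropped: QC below the threshold
    ("2019-07-03", 50.0, 0.75),   # kept: the threshold is inclusive
]
daily_et = filter_and_convert(sample)
```

With these assumptions, a daily mean flux of 100 W m-2 corresponds to roughly 3.5 mm d-1.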