Predicting Streamflow with Machine Learning
This article explores how hydrologists are combining physics-based models with data-driven techniques like LSTM and PILSTM to improve streamflow prediction. Learn how these hybrid approaches tackle challenges like data scarcity and climate change, offering smarter, more interpretable solutions for managing water resources.

A Note to the Learner

Welcome to the intersection of hydrology and data science! If you're new to these fields, you're in the right place. The goal of this guide is to introduce you to two powerful ways of modelling the world. The first is based on the laws of physics, helping us understand why water moves the way it does. The second is based on data, allowing computers to find patterns we might miss. We'll explore how these two approaches can be combined to make better predictions about something vital to us all: the flow of water in our rivers.

1. Why Predict Streamflow? Two Competing Approaches

Accurate streamflow prediction is critical for managing our planet's most precious resource: water. It helps us prepare for floods, operate reservoirs for water supply and energy production, manage irrigation for agriculture, and protect delicate ecosystems that depend on specific water levels. To make these predictions, hydrologists have traditionally used two main types of models: those based on physical processes and those driven by data.

These two methods represent fundamentally different ways of thinking about the problem, each with its own strengths and weaknesses.

Process-Based Models (The Physicist's Approach)
  • Core Principle: Rooted in governing equations and physical laws that describe how water moves through a landscape.
  • Strengths: Can be used for hypothesis testing and for understanding the role of individual hydrological components (e.g., snowmelt, groundwater).
  • Primary Challenge: Require extensive datasets on watershed topography and numerous other variables, which can be difficult and expensive to collect.

Data-Driven Models (The Data Scientist's Approach)
  • Core Principle: Discern patterns directly from input data (like rainfall) to make predictions about output data (like streamflow).
  • Strengths: Often achieve superior predictive accuracy without requiring complex, time-consuming manual calibration.
  • Primary Challenge: The lack of physical components makes it hard to enhance our understanding of the watershed; the model can be a "black box" that is difficult to interpret.


Despite the challenges of data-driven models, their predictive power is undeniable. This has led researchers to develop increasingly sophisticated versions, producing state-of-the-art models that are reshaping the field of hydrology.

2. A Closer Look at Data-Driven Models: The Power of LSTMs

One of the most powerful tools in the data scientist's toolkit for streamflow prediction is the Long Short-Term Memory (LSTM) model. An LSTM is a type of deep learning model, which is a subfield of machine learning that uses complex neural networks. It is particularly well-suited for rainfall-streamflow modelling for two key reasons:

  • Remembering the Past: LSTMs are designed to "grasp long-term dependencies." This is crucial in hydrology because the amount of water in a river today depends heavily on past conditions—not just yesterday's rain, but the rain, snow, and temperature from weeks or even months ago. LSTMs can "remember" this history to make more accurate predictions.
  • Handling Time: Streamflow is a time series—a sequence of data points ordered in time. LSTMs are built to handle this kind of data and are resistant to a common problem in simpler neural networks called "vanishing gradients", which can make it difficult for models to learn from long-term patterns.
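To make the gating mechanism behind these two properties concrete, here is a minimal from-scratch sketch of a single LSTM cell in NumPy. This is illustrative only: the weights are random rather than trained, the synthetic rainfall series is invented, and a real application would use a deep-learning library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. The cell state c carries long-term memory (e.g. the
    lingering influence of rain from weeks ago); the gates decide what to
    forget, what to store, and what to output at each timestep."""
    H = h_prev.size
    z = W @ x_t + U @ h_prev + b     # all four gate pre-activations at once
    f = sigmoid(z[0:H])              # forget gate: how much old memory to keep
    i = sigmoid(z[H:2*H])            # input gate: how much new info to store
    g = np.tanh(z[2*H:3*H])          # candidate cell update
    o = sigmoid(z[3*H:4*H])          # output gate
    c = f * c_prev + i * g           # additive update helps gradients survive
    h = o * np.tanh(c)               # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
H, D, T = 8, 1, 30                   # hidden size, input size, timesteps
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
rainfall = rng.random((T, D))        # synthetic 30-day rainfall series
for x_t in rainfall:
    h, c = lstm_cell(x_t, h, c, W, U, b)
# h now summarises the entire 30-day history in a fixed-size vector
```

The additive cell-state update (`c = f * c_prev + i * g`) is the design choice that mitigates vanishing gradients: information can flow through many timesteps without being repeatedly squashed.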

Even with their impressive performance, a key criticism of LSTMs remains their "lack of interpretability and physical consistency." A purely data-driven model might make a prediction that is mathematically plausible but physically impossible, simply because it doesn't understand the fundamental laws of nature, like the conservation of mass.

This leads to a fascinating question: What if we could combine the predictive power of an LSTM with the physical grounding of a process-based model?

3. The Best of Both Worlds: Physics-Informed Machine Learning (PILSTM)

Physics-Informed Machine Learning (PIML) is a strategy to create hybrid models that get the best of both worlds. The core idea is to prevent purely data-driven models from learning spurious (or false) relationships from the training data. By grounding the model in physical laws, we can improve its ability to generalise—that is, to make accurate predictions for conditions it has never seen before.

A powerful example of this approach is the Physics-Informed LSTM (PILSTM) model. Here is how it works in three core steps:

1. Start with a Physical Foundation. The model begins with a simple but powerful process-based rainfall-runoff equation based on the principle of water balance.

2. Incorporate It into the LSTM. The PILSTM incorporates this physical model by changing how it learns. Normally, an LSTM learns by minimising a single objective: the error between its prediction and the observed streamflow. The PILSTM's learning process, governed by its 'loss function', is given a second objective: it must also minimise the error between its prediction and the output of the water-balance model. These two objectives are balanced using a tunable weight, allowing the model to lean more on the observed data or on the physical theory depending on the situation.

3. Guide the Learning Process. This new loss function effectively penalises the model for making predictions that violate the principle of mass conservation. The constraint doesn't force the model to be identical to the simple physical model; it guides it, encouraging predictions that are more physically consistent and interpretable.
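The three steps above can be sketched in code. Everything below is an illustrative assumption rather than the dissertation's exact formulation: a single linear reservoir (`bucket_model`) stands in for the water-balance model, and `lam` is the tunable weight that balances the data term against the physics term.

```python
import numpy as np

def bucket_model(precip, et, k=0.1, s0=0.0):
    """Minimal water balance (one linear reservoir): storage obeys
    S[t] = S[t-1] + P[t] - ET[t] - Q[t], with outflow Q[t] = k * S[t-1]."""
    s, q = s0, []
    for p, e in zip(precip, et):
        outflow = k * s
        s = max(s + p - e - outflow, 0.0)   # storage cannot go negative
        q.append(outflow)
    return np.array(q)

def pilstm_loss(q_pred, q_obs, q_physics, lam=0.3):
    """Composite loss: the usual data-misfit term plus a physics term
    that penalises departures from the water-balance model's output."""
    data_term = np.mean((q_pred - q_obs) ** 2)
    physics_term = np.mean((q_pred - q_physics) ** 2)
    return data_term + lam * physics_term
```

Setting `lam = 0` recovers a standard LSTM objective; larger values pull predictions toward the physical model, which is how the physics acts as a soft constraint rather than a hard rule.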

This hybrid approach sounds promising in theory, but its true value is revealed when tested under challenging real-world conditions.

4. Putting Models to the Test: Performance Under Pressure

The dissertation tested the PILSTM against a standard LSTM and a purely physical model in two challenging scenarios that hydrologists often face.

4.1. The Data Scarcity Challenge

Hydrologists often need to make predictions for rivers or streams where there is very little historical data available to train a model. This is known as the "data-scarce" problem.
  • The Power of Pretraining: The single biggest performance boost for LSTM-based models came from "pretraining" them on a large, geographically diverse dataset (like the CAMELS dataset). This gave the models a strong foundational understanding of different rainfall-runoff behaviours before they were fine-tuned on a specific location.
  • A Critical Safeguard: However, for watersheds with unique characteristics that were poorly represented in the large pretraining dataset, this process could actually make the model worse. In these specific cases, integrating physical information using a PILSTM served as a "safeguard against poor performance", especially when the local data was very limited. This is because the physical model provides a fundamental baseline of water-balance principles, preventing the pretrained model from making physically implausible predictions when the patterns it learnt from the large dataset do not apply to the unique local watershed.
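Why pretraining helps in data-scarce settings can be shown with a toy sketch. Here a linear model fitted by gradient descent stands in for an LSTM, and the synthetic "CAMELS-like" dataset, coefficients, and sample sizes are all invented for illustration.

```python
import numpy as np

def sgd_fit(X, y, w=None, lr=0.01, epochs=200):
    """Plain least-squares via full-batch gradient descent; passing `w`
    lets training start from pretrained weights instead of zeros."""
    if w is None:
        w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(1)

# "CAMELS-like" pretraining set: many basins sharing a rainfall-runoff signal
X_big = rng.normal(size=(5000, 3))
y_big = X_big @ np.array([0.8, 0.3, -0.2]) + 0.05 * rng.normal(size=5000)
w_pre = sgd_fit(X_big, y_big)

# Data-scarce local basin: only 20 samples, with similar (not identical) physics
X_loc = rng.normal(size=(20, 3))
y_loc = X_loc @ np.array([0.7, 0.35, -0.15]) + 0.05 * rng.normal(size=20)

w_scratch = sgd_fit(X_loc, y_loc, epochs=20)                 # from zero
w_tuned = sgd_fit(X_loc, y_loc, w=w_pre.copy(), epochs=20)   # fine-tuned
```

With the same tiny budget of local training, the fine-tuned model starts close to the right answer while the from-scratch model has barely moved, which mirrors the pretraining benefit described above.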

4.2. The Climate Change Challenge (Non-Stationary Scenarios)

Climate change means that future weather patterns may not look like the past. A model trained only on historical data might fail when faced with new, more extreme conditions. This is called a "non-stationary" scenario. The study tested this by training models on historically "dry" years and testing them on "wet" years, and vice versa.
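The dry/wet experimental design can be sketched as follows. The synthetic precipitation series and the median-split rule are illustrative assumptions, not the study's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
years = np.arange(1990, 2020)
# Synthetic annual precipitation totals (mm) for each year
annual_precip = rng.gamma(shape=8.0, scale=100.0, size=years.size)

# Rank years by total precipitation and split at the median:
order = np.argsort(annual_precip)
dry_years = years[order[: years.size // 2]]
wet_years = years[order[years.size // 2:]]

# Train on dry_years and test on wet_years (and vice versa) to probe
# how a model copes with conditions unlike its training record.
```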

  • LSTM Robustness: In general, LSTM-based models proved to be quite robust. Models trained on wet conditions were still able to perform well when predicting for dry conditions.
  • Physics for Extremes: For very dry watersheds characterised by long storage times and limited streamflow variability, the purely physical model produced the best performance. This highlights that in certain extreme environments, a model based on first principles can be the most reliable.
  • PILSTM's Advantage: Overall, the PILSTM demonstrated a clear advantage in handling non-stationarity. The combination of machine learning and physical principles was highly effective at leveraging limited training data to improve predictive power compared to using either approach alone.


These tests show that there is no single "best" model for all situations, but understanding their strengths and weaknesses allows us to choose the right tool for the job.

5. A Smarter Future for Hydrology

As we've seen, the world of hydrological modelling is rich and evolving. Purely data-driven models like LSTMs have emerged as incredibly powerful predictors, often outperforming traditional methods. However, their "black box" nature can be a significant drawback, as they may lack physical interpretability and produce results that don't align with the laws of nature.

By integrating fundamental physical principles into the machine learning process, as demonstrated by the PILSTM model, we can create a promising hybrid approach. This strategy leads to predictions that are not only more accurate and robust in challenging scenarios—like data scarcity and climate change—but are also more interpretable and consistent with our scientific understanding of water-balance processes. These interdisciplinary strategies, bridging data-driven insights with the physical dynamics of hydrological systems, represent a transformative step forward in our ability to understand and manage our planet's water resources.