stats_math_utils#
- Hydrological_model_validator.Processing.stats_math_utils.compute_coverage_stats(data: ndarray, Mmask: ndarray) Tuple[ndarray, ndarray][source]#
Compute percentage of data availability and cloud coverage within a basin mask for each time step of a 3D dataset.
- Parameters:
data (np.ndarray) – 3D array of shape (time, y, x), e.g., Schl_complete.
Mmask (np.ndarray) – 2D boolean mask (True = ocean, False = land), shape (y, x).
- Returns:
data_available_percent : % of non-NaN data inside mask per time step.
cloud_coverage_percent : % of NaN data (cloud coverage) inside mask per time step.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If input dimensions do not match expected shapes.
Examples
>>> data = np.random.rand(10, 3, 3) >>> data[:, 1:, 1:] = np.nan # simulate cloud-covered region >>> mask = np.array([[1, 1, 1], ... [1, 1, 1], ... [1, 1, 1]], dtype=bool) >>> compute_coverage_stats(data, mask) (array([...]), array([...]))
- Hydrological_model_validator.Processing.stats_math_utils.compute_fft(data: ndarray | Dict[str, ndarray | generic], dt: float = 1.0) Tuple[ndarray, ndarray | Dict[str, ndarray]][source]#
Compute the Fast Fourier Transform (FFT) and corresponding positive frequencies.
- Parameters:
data (np.ndarray, 1D array-like, or dict of 1D arrays) – Input data to transform. Can be a single 1D array or a dictionary of 1D arrays.
dt (float, optional) – Sampling interval between data points (default is 1).
- Returns:
freqs (np.ndarray) – Array of positive FFT frequencies.
fft_result (np.ndarray or dict of np.ndarray) – FFT results for each input array; if input was dict, returns dict of FFT arrays.
- Raises:
ValueError – If input dict is empty.
Examples
>>> import numpy as np >>> data = np.array([1, 2, 3, 4, 5, 6]) >>> freqs, fft_vals = compute_fft(data, dt=1) >>> freqs array([0. , 0.16666667, 0.33333333]) >>> list(fft_vals) [21.+0.j, -3.+5.19615242j, -3.+1.73205081j]
>>> data_dict = {'a': np.array([1,2,3,4]), 'b': np.array([4,3,2,1])} >>> freqs, fft_dict = compute_fft(data_dict) >>> freqs array([0. , 0.25]) >>> fft_dict['a'] array([10.+0.j, -2.+2.j])
- Hydrological_model_validator.Processing.stats_math_utils.compute_lagged_correlations(series1: Series, series2: Series, max_lag: int = 30) Series[source]#
Compute Pearson correlation coefficients between two pandas Series over a range of time lags.
- Parameters:
series1 (pd.Series) – Reference time series.
series2 (pd.Series) – Time series to be shifted and correlated with series1.
max_lag (int, optional) – Maximum lag (positive and negative) to compute correlations for (default is 30).
- Returns:
Correlation coefficients indexed by lag values (from -max_lag to +max_lag).
- Return type:
pd.Series
Examples
>>> s1 = pd.Series([1, 2, 3, 4, 5]) >>> s2 = pd.Series([5, 4, 3, 2, 1]) >>> compute_lagged_correlations(s1, s2, max_lag=2) -2 -0.5 -1 -0.5 0 -1.0 1 -0.5 2 -0.5 dtype: float64
- Hydrological_model_validator.Processing.stats_math_utils.corr_no_nan(series1: Series, series2: Series) float[source]#
Compute the Pearson correlation coefficient between two pandas Series, ignoring NaNs.
- Parameters:
series1 (pd.Series) – First time series.
series2 (pd.Series) – Second time series.
- Returns:
Pearson correlation coefficient computed over overlapping non-NaN values.
- Return type:
float
Examples
>>> s1 = pd.Series([1, 2, 3, None, 5]) >>> s2 = pd.Series([2, 2, 3, 4, None]) >>> corr_no_nan(s1, s2) 0.9819805060619657
- Hydrological_model_validator.Processing.stats_math_utils.cross_correlation(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#
Compute the Pearson correlation coefficient between model and observation data.
- Parameters:
m (np.ndarray, pd.Series, or xr.DataArray) – Model data.
o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.
time_dim (str, optional) – Name of the time dimension over which to compute the correlation (default is ‘time’).
- Returns:
Pearson correlation coefficient.
- Return type:
float or xr.DataArray
Examples
>>> m = np.array([1.0, 2.0, 3.0]) >>> o = np.array([1.1, 1.9, 3.2]) >>> cross_correlation(m, o) 0.9981908926857269
- Hydrological_model_validator.Processing.stats_math_utils.detrend_dim(da: DataArray, dim: str, mask: DataArray | None = None, min_valid_points: int = 5) DataArray[source]#
Detrend a DataArray along a dimension using scipy.signal.detrend, optionally masking out points and skipping those with insufficient valid data.
- Parameters:
da (xr.DataArray) – Input data with dimension dim.
dim (str) – Dimension name along which to detrend (e.g., ‘time’).
mask (xr.DataArray, optional) – Boolean mask to apply before detrending. True = valid points.
min_valid_points (int, default=5) – Minimum number of valid (non-NaN) points along dim required to perform detrending.
- Returns:
Detrended data array.
- Return type:
xr.DataArray
- Hydrological_model_validator.Processing.stats_math_utils.detrend_linear(data: ndarray | List[float] | Series) ndarray[source]#
Remove a linear trend from 1D numeric data using least squares linear regression.
- Parameters:
data (np.ndarray or list or pd.Series) – 1D numeric input data from which the linear trend will be removed.
- Returns:
Detrended 1D array with the linear trend removed.
- Return type:
np.ndarray
Examples
>>> import numpy as np >>> data = np.arange(10) + np.random.rand(10) >>> detrended = detrend(data)
- Hydrological_model_validator.Processing.stats_math_utils.detrend_poly_dim(data_array: DataArray, dim: str, degree: int = 1) DataArray[source]#
Remove a polynomial trend of specified degree along a given dimension in an xarray.DataArray.
- Parameters:
data_array (xarray.DataArray) – Input data array containing the data to detrend.
dim (str) – Dimension name along which the polynomial trend will be fitted and removed.
degree (int, optional) – Degree of the polynomial to fit and remove (default is 1, linear detrend).
- Returns:
Detrended data array with the polynomial trend removed along the specified dimension.
- Return type:
xarray.DataArray
Examples
>>> import xarray as xr >>> data = xr.DataArray(np.random.rand(10, 5), dims=['time', 'space']) >>> detrended = detrend_dim(data, dim='time', degree=1)
- Hydrological_model_validator.Processing.stats_math_utils.detrended_monthly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#
Calculate detrended monthly anomalies by removing the linear trend first, then removing the monthly climatology.
- Parameters:
data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, typically monthly.
- Returns:
Tuple containing: - anomalies: Detrended monthly anomalies. - monthly_climatology: Monthly climatology of the detrended data.
- Return type:
Tuple[xarray.DataArray, xarray.DataArray]
Examples
>>> anomalies, climatology = detrended_monthly_anomaly(data_array)
- Hydrological_model_validator.Processing.stats_math_utils.extract_multidecadal_peak(freqs: ndarray, amps: ndarray, frequency_threshold: float = 0.1) Dict[str, float] | None[source]#
Extract amplitude and frequency/period of the largest peak within a frequency threshold region.
- Parameters:
freqs (np.ndarray) – Array of frequencies from power spectrum.
amps (np.ndarray) – Corresponding amplitude array from power spectrum.
frequency_threshold (float, optional) – Frequency threshold defining multidecadal region (default 1/10 cycles per year).
- Returns:
Dictionary with keys ‘Peak Amplitude’, ‘Peak Frequency (cycles/year)’, ‘Peak Period (years)’ for the largest amplitude peak in the multidecadal region, or None if no frequencies below threshold.
- Return type:
dict or None
Examples
>>> peak = extract_multidecadal_peak(freqs, amps)
- Hydrological_model_validator.Processing.stats_math_utils.extract_multidecadal_peaks_from_spectra(power_spectra: Dict[str, Tuple[ndarray, ...]], frequency_threshold: float = 0.1) DataFrame[source]#
Extract multidecadal peak amplitude and frequency information from multiple regions’ power spectra.
- Parameters:
power_spectra (dict) – Dictionary mapping region names to tuples containing frequency array, amplitude array, and others.
frequency_threshold (float, optional) – Threshold frequency below which frequencies are considered multidecadal (default 1/10 cycles/year).
- Returns:
DataFrame with peak amplitude, frequency, and period for each region.
- Return type:
pd.DataFrame
Examples
>>> df = extract_multidecadal_peaks_from_spectra(power_spectra)
- Hydrological_model_validator.Processing.stats_math_utils.fit_huber(mod_data: ndarray, sat_data: ndarray) Tuple[ndarray, ndarray][source]#
Fit a robust linear regression (Huber) model to 1D predictor and response arrays.
- Parameters:
mod_data (np.ndarray) – 1D array of model data (predictor variable). Must be the same length as sat_data.
sat_data (np.ndarray) – 1D array of satellite data (response variable). Must be the same length as mod_data.
- Returns:
Tuple (x_vals, y_vals) containing 100 points on the fitted Huber regression line.
- Return type:
Tuple[np.ndarray, np.ndarray]
- Raises:
ValueError – If inputs are not 1D NumPy arrays of the same length, or are empty.
Examples
>>> mod = np.array([1, 2, 3, 4]) >>> sat = np.array([1.2, 1.9, 3.1, 3.9]) >>> x, y = fit_huber(mod, sat) >>> len(x), len(y) (100, 100)
- Hydrological_model_validator.Processing.stats_math_utils.fit_lowess(mod_data: ndarray, sat_data: ndarray, frac: float = 0.3) ndarray[source]#
Fit a LOWESS (Locally Weighted Scatterplot Smoothing) curve to satellite vs. model data.
- Parameters:
mod_data (np.ndarray) – 1D array of model data (predictor). Must be the same length as sat_data.
sat_data (np.ndarray) – 1D array of satellite data (response). Must be the same length as mod_data.
frac (float, optional) – Fraction of the data used when estimating each y-value. Must be in the interval (0, 1]. Default is 0.3.
- Returns:
Array of shape (N, 2) with columns [sorted_mod_data, smoothed_sat_data], suitable for plotting.
- Return type:
np.ndarray
- Raises:
ValueError – If inputs are not 1D NumPy arrays of equal length, or if frac is not in (0, 1].
Examples
>>> mod = np.array([1, 2, 3, 4]) >>> sat = np.array([1.2, 1.8, 2.5, 3.9]) >>> smooth = fit_lowess(mod, sat, frac=0.5) >>> smooth.shape (4, 2)
- Hydrological_model_validator.Processing.stats_math_utils.identify_extreme_events(time_series: ndarray | Series | DataArray, threshold_multiplier: float = 1.5, step: int | None = None, comparison_index: int | None = None) Dict[str, ndarray | List[int]][source]#
Identify extreme positive and negative events in a time series using a threshold based on standard deviation. Optionally, sample these extreme events at fixed intervals starting at a given index.
- Parameters:
time_series (np.ndarray or pd.Series or xarray.DataArray) – 1D time series data.
threshold_multiplier (float, optional) – Multiplier for standard deviation to define extreme events (default 1.5).
step (int, optional) – Step interval to sample the time series for extreme events (e.g., 12 for monthly December events).
comparison_index (int, optional) – Starting index for sampling when step is provided (e.g., 12 to start from December).
- Returns:
Dictionary containing: - ‘positive_events_mask’: Boolean mask array for extreme positive events. - ‘negative_events_mask’: Boolean mask array for extreme negative events. - ‘sampled_positive_indices’: Indices of sampled positive extreme events (if step provided). - ‘sampled_negative_indices’: Indices of sampled negative extreme events (if step provided).
- Return type:
dict
Examples
>>> events = identify_extreme_events(n34_array, threshold_multiplier=1.5, step=12, comparison_index=12)
- Hydrological_model_validator.Processing.stats_math_utils.mean_bias(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#
Compute the mean bias between model and observations over the specified time dimension.
- Parameters:
m (np.ndarray, pd.Series, or xr.DataArray) – Model data.
o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.
time_dim (str, optional) – Name of the time dimension to compute mean over (default is ‘time’).
- Returns:
Mean bias: model minus observation, averaged over time.
- Return type:
float or xr.DataArray
Examples
>>> m = np.array([1.1, 2.2, 3.3]) >>> o = np.array([1.0, 2.0, 3.0]) >>> mean_bias(m, o) 0.19999999999999973
- Hydrological_model_validator.Processing.stats_math_utils.monthly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#
Calculate monthly mean anomalies without detrending by removing the monthly climatology.
- Parameters:
data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, must be monthly or higher frequency.
- Returns:
Tuple containing: - anomalies: Monthly anomalies computed by subtracting monthly means. - monthly_climatology: Mean values for each month.
- Return type:
Tuple[xarray.DataArray, xarray.DataArray]
Examples
>>> anomalies, climatology = monthly_anomaly(data_array)
- Hydrological_model_validator.Processing.stats_math_utils.np_correlation(field_time_xy: ndarray, index_time: ndarray) ndarray[source]#
Calculate Pearson correlation coefficient map between a 3D spatial-temporal field and a 1D index time series.
- Parameters:
field_time_xy (np.ndarray) – 3D array (time, y, x) representing spatial field over time.
index_time (np.ndarray) – 1D array representing time series index.
- Returns:
2D correlation map with shape (y, x).
- Return type:
np.ndarray
Examples
>>> corr = np_correlation(field, index)
- Hydrological_model_validator.Processing.stats_math_utils.np_covariance(field_time_xy: ndarray, index_time: ndarray) ndarray[source]#
Calculate covariance between a 3D spatial-temporal field and a 1D index time series.
- Parameters:
field_time_xy (np.ndarray) – 3D array with shape (time, y, x), spatial field over time.
index_time (np.ndarray) – 1D array with length equal to time dimension in field_time_xy.
- Returns:
2D covariance map with shape (y, x).
- Return type:
np.ndarray
Examples
>>> cov = np_covariance(field, index)
- Hydrological_model_validator.Processing.stats_math_utils.np_regression(field_time_xy: ndarray, index_time: ndarray, std_units: str = 'yes') ndarray[source]#
Compute regression coefficients between a 3D spatial-temporal field and a 1D index time series.
- Parameters:
field_time_xy (np.ndarray) – 3D spatial-temporal field array (time, y, x).
index_time (np.ndarray) – 1D index time series.
std_units (str, optional) – If ‘yes’, normalize regression by standard deviation of index_time.
- Returns:
2D regression coefficients map (y, x).
- Return type:
np.ndarray
Examples
>>> regression = np_regression(field, index)
- Hydrological_model_validator.Processing.stats_math_utils.round_up_to_nearest(x: float | int, base: float = 1.0) float[source]#
Round up the given number to the nearest multiple of the specified base.
- Parameters:
x (float or int) – Number to round up.
base (float, optional) – The base multiple to round up to. Must be positive (default is 1.0).
- Returns:
The rounded up value as a multiple of the base.
- Return type:
float
- Raises:
ValueError – If base is not positive.
Examples
>>> round_up_to_nearest(5.3, base=2) 6.0 >>> round_up_to_nearest(7, base=0.5) 7.0
- Hydrological_model_validator.Processing.stats_math_utils.spatial_mean(data_array: DataArray, mask: DataArray | ndarray) DataArray[source]#
Compute the spatial mean of a DataArray over latitude and longitude, applying a boolean mask.
- Parameters:
data_array (xr.DataArray) – Input data array with dimensions including ‘lat’ and ‘lon’.
mask (xr.DataArray or np.ndarray) – 2D boolean mask with shape matching (lat, lon). True indicates valid points.
- Returns:
Spatial mean over lat/lon of data within the mask, ignoring NaNs.
- Return type:
xr.DataArray
Examples
>>> import numpy as np >>> import xarray as xr >>> data = xr.DataArray(np.random.rand(3,4,5), dims=('time', 'lat', 'lon')) >>> mask = xr.DataArray(np.array([[True, False, True, True, False], ... [True, True, False, False, True], ... [False, True, True, False, True], ... [True, False, True, True, True]]), ... dims=('lat', 'lon')) >>> spatial_mean(data.isel(time=0), mask) <xarray.DataArray ()> array(0.5267)
- Hydrological_model_validator.Processing.stats_math_utils.standard_deviation_error(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#
Compute the difference between the standard deviations of model and observation data.
- Parameters:
m (np.ndarray, pd.Series, or xr.DataArray) – Model data.
o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.
time_dim (str, optional) – Name of the time dimension to compute std over (default is ‘time’).
- Returns:
The difference between model and observation standard deviations (model - obs).
- Return type:
float or xr.DataArray
- Raises:
ValueError – If m and o have different lengths along the time dimension.
Examples
>>> m = np.array([1.0, 2.0, 3.0]) >>> o = np.array([1.5, 2.5, 3.5]) >>> standard_deviation_error(m, o) 0.0
- Hydrological_model_validator.Processing.stats_math_utils.std_dev(da: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#
Compute the standard deviation along the specified time dimension.
- Parameters:
da (np.ndarray, pd.Series, or xr.DataArray) – Input data array or series.
time_dim (str, optional) – Name of the time dimension to compute std over (default is ‘time’).
- Returns:
Standard deviation along the time dimension.
- Return type:
float or xr.DataArray
Examples
>>> import numpy as np >>> data = np.array([1.0, 2.0, 3.0, 4.0]) >>> std_dev(data) 1.118033988749895
- Hydrological_model_validator.Processing.stats_math_utils.unbiased_rmse(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#
Compute the unbiased Root Mean Square Error (centered RMSE) between model and observations.
- Parameters:
m (np.ndarray, pd.Series, or xr.DataArray) – Model data.
o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.
time_dim (str, optional) – Name of the time dimension to compute RMSE over (default is ‘time’).
- Returns:
Unbiased RMSE value.
- Return type:
float or xr.DataArray
Examples
>>> m = np.array([1.0, 2.0, 3.0]) >>> o = np.array([1.1, 1.9, 3.1]) >>> unbiased_rmse(m, o) 0.10000000000000009
- Hydrological_model_validator.Processing.stats_math_utils.yearly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#
Calculate yearly mean anomalies without detrending by removing the yearly climatology.
- Parameters:
data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, typically annual or higher frequency.
- Returns:
Tuple containing: - anomalies: Yearly anomalies computed by subtracting yearly means. - yearly_climatology: Mean values for each year.
- Return type:
Tuple[xarray.DataArray, xarray.DataArray]
Examples
>>> anomalies, climatology = yearly_anomaly(data_array)