stats_math_utils#

Hydrological_model_validator.Processing.stats_math_utils.compute_coverage_stats(data: ndarray, Mmask: ndarray) Tuple[ndarray, ndarray][source]#

Compute percentage of data availability and cloud coverage within a basin mask for each time step of a 3D dataset.

Parameters:
  • data (np.ndarray) – 3D array of shape (time, y, x), e.g., Schl_complete.

  • Mmask (np.ndarray) – 2D boolean mask (True = ocean, False = land), shape (y, x).

Returns:

  • data_available_percent : % of non-NaN data inside mask per time step.

  • cloud_coverage_percent : % of NaN data (cloud coverage) inside mask per time step.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If input dimensions do not match expected shapes.

Examples

>>> data = np.random.rand(10, 3, 3)
>>> data[:, 1:, 1:] = np.nan  # simulate cloud-covered region
>>> mask = np.array([[1, 1, 1],
...                  [1, 1, 1],
...                  [1, 1, 1]], dtype=bool)
>>> compute_coverage_stats(data, mask)
(array([...]), array([...]))
Hydrological_model_validator.Processing.stats_math_utils.compute_fft(data: ndarray | Dict[str, ndarray | generic], dt: float = 1.0) Tuple[ndarray, ndarray | Dict[str, ndarray]][source]#

Compute the Fast Fourier Transform (FFT) and corresponding positive frequencies.

Parameters:
  • data (np.ndarray, 1D array-like, or dict of 1D arrays) – Input data to transform. Can be a single 1D array or a dictionary of 1D arrays.

  • dt (float, optional) – Sampling interval between data points (default is 1).

Returns:

  • freqs (np.ndarray) – Array of positive FFT frequencies.

  • fft_result (np.ndarray or dict of np.ndarray) – FFT results for each input array; if input was dict, returns dict of FFT arrays.

Raises:

ValueError – If input dict is empty.

Examples

>>> import numpy as np
>>> data = np.array([1, 2, 3, 4, 5, 6])
>>> freqs, fft_vals = compute_fft(data, dt=1)
>>> freqs
array([0.        , 0.16666667, 0.33333333])
>>> list(fft_vals)
[21.+0.j, -3.+5.19615242j, -3.+1.73205081j]
>>> data_dict = {'a': np.array([1,2,3,4]), 'b': np.array([4,3,2,1])}
>>> freqs, fft_dict = compute_fft(data_dict)
>>> freqs
array([0. , 0.25])
>>> fft_dict['a']
array([10.+0.j, -2.+2.j])
Hydrological_model_validator.Processing.stats_math_utils.compute_lagged_correlations(series1: Series, series2: Series, max_lag: int = 30) Series[source]#

Compute Pearson correlation coefficients between two pandas Series over a range of time lags.

Parameters:
  • series1 (pd.Series) – Reference time series.

  • series2 (pd.Series) – Time series to be shifted and correlated with series1.

  • max_lag (int, optional) – Maximum lag (positive and negative) to compute correlations for (default is 30).

Returns:

Correlation coefficients indexed by lag values (from -max_lag to +max_lag).

Return type:

pd.Series

Examples

>>> s1 = pd.Series([1, 2, 3, 4, 5])
>>> s2 = pd.Series([5, 4, 3, 2, 1])
>>> compute_lagged_correlations(s1, s2, max_lag=2)
-2   -0.5
-1   -0.5
 0   -1.0
 1   -0.5
 2   -0.5
dtype: float64
Hydrological_model_validator.Processing.stats_math_utils.corr_no_nan(series1: Series, series2: Series) float[source]#

Compute the Pearson correlation coefficient between two pandas Series, ignoring NaNs.

Parameters:
  • series1 (pd.Series) – First time series.

  • series2 (pd.Series) – Second time series.

Returns:

Pearson correlation coefficient computed over overlapping non-NaN values.

Return type:

float

Examples

>>> s1 = pd.Series([1, 2, 3, None, 5])
>>> s2 = pd.Series([2, 2, 3, 4, None])
>>> corr_no_nan(s1, s2)
0.9819805060619657
Hydrological_model_validator.Processing.stats_math_utils.cross_correlation(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#

Compute the Pearson correlation coefficient between model and observation data.

Parameters:
  • m (np.ndarray, pd.Series, or xr.DataArray) – Model data.

  • o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.

  • time_dim (str, optional) – Name of the time dimension over which to compute the correlation (default is ‘time’).

Returns:

Pearson correlation coefficient.

Return type:

float or xr.DataArray

Examples

>>> m = np.array([1.0, 2.0, 3.0])
>>> o = np.array([1.1, 1.9, 3.2])
>>> cross_correlation(m, o)
0.9981908926857269
Hydrological_model_validator.Processing.stats_math_utils.detrend_dim(da: DataArray, dim: str, mask: DataArray | None = None, min_valid_points: int = 5) DataArray[source]#

Detrend a DataArray along a dimension using scipy.signal.detrend, optionally masking out points and skipping those with insufficient valid data.

Parameters:
  • da (xr.DataArray) – Input data with dimension dim.

  • dim (str) – Dimension name along which to detrend (e.g., ‘time’).

  • mask (xr.DataArray, optional) – Boolean mask to apply before detrending. True = valid points.

  • min_valid_points (int, default=5) – Minimum number of valid (non-NaN) points along dim required to perform detrending.

Returns:

Detrended data array.

Return type:

xr.DataArray

Hydrological_model_validator.Processing.stats_math_utils.detrend_linear(data: ndarray | List[float] | Series) ndarray[source]#

Remove a linear trend from 1D numeric data using least squares linear regression.

Parameters:

data (np.ndarray or list or pd.Series) – 1D numeric input data from which the linear trend will be removed.

Returns:

Detrended 1D array with the linear trend removed.

Return type:

np.ndarray

Examples

>>> import numpy as np
>>> data = np.arange(10) + np.random.rand(10)
>>> detrended = detrend(data)
Hydrological_model_validator.Processing.stats_math_utils.detrend_poly_dim(data_array: DataArray, dim: str, degree: int = 1) DataArray[source]#

Remove a polynomial trend of specified degree along a given dimension in an xarray.DataArray.

Parameters:
  • data_array (xarray.DataArray) – Input data array containing the data to detrend.

  • dim (str) – Dimension name along which the polynomial trend will be fitted and removed.

  • degree (int, optional) – Degree of the polynomial to fit and remove (default is 1, linear detrend).

Returns:

Detrended data array with the polynomial trend removed along the specified dimension.

Return type:

xarray.DataArray

Examples

>>> import xarray as xr
>>> data = xr.DataArray(np.random.rand(10, 5), dims=['time', 'space'])
>>> detrended = detrend_dim(data, dim='time', degree=1)
Hydrological_model_validator.Processing.stats_math_utils.detrended_monthly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#

Calculate detrended monthly anomalies by removing the linear trend first, then removing the monthly climatology.

Parameters:

data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, typically monthly.

Returns:

Tuple containing: - anomalies: Detrended monthly anomalies. - monthly_climatology: Monthly climatology of the detrended data.

Return type:

Tuple[xarray.DataArray, xarray.DataArray]

Examples

>>> anomalies, climatology = detrended_monthly_anomaly(data_array)
Hydrological_model_validator.Processing.stats_math_utils.extract_multidecadal_peak(freqs: ndarray, amps: ndarray, frequency_threshold: float = 0.1) Dict[str, float] | None[source]#

Extract amplitude and frequency/period of the largest peak within a frequency threshold region.

Parameters:
  • freqs (np.ndarray) – Array of frequencies from power spectrum.

  • amps (np.ndarray) – Corresponding amplitude array from power spectrum.

  • frequency_threshold (float, optional) – Frequency threshold defining multidecadal region (default 1/10 cycles per year).

Returns:

Dictionary with keys ‘Peak Amplitude’, ‘Peak Frequency (cycles/year)’, ‘Peak Period (years)’ for the largest amplitude peak in the multidecadal region, or None if no frequencies below threshold.

Return type:

dict or None

Examples

>>> peak = extract_multidecadal_peak(freqs, amps)
Hydrological_model_validator.Processing.stats_math_utils.extract_multidecadal_peaks_from_spectra(power_spectra: Dict[str, Tuple[ndarray, ...]], frequency_threshold: float = 0.1) DataFrame[source]#

Extract multidecadal peak amplitude and frequency information from multiple regions’ power spectra.

Parameters:
  • power_spectra (dict) – Dictionary mapping region names to tuples containing frequency array, amplitude array, and others.

  • frequency_threshold (float, optional) – Threshold frequency below which frequencies are considered multidecadal (default 1/10 cycles/year).

Returns:

DataFrame with peak amplitude, frequency, and period for each region.

Return type:

pd.DataFrame

Examples

>>> df = extract_multidecadal_peaks_from_spectra(power_spectra)
Hydrological_model_validator.Processing.stats_math_utils.fit_huber(mod_data: ndarray, sat_data: ndarray) Tuple[ndarray, ndarray][source]#

Fit a robust linear regression (Huber) model to 1D predictor and response arrays.

Parameters:
  • mod_data (np.ndarray) – 1D array of model data (predictor variable). Must be the same length as sat_data.

  • sat_data (np.ndarray) – 1D array of satellite data (response variable). Must be the same length as mod_data.

Returns:

Tuple (x_vals, y_vals) containing 100 points on the fitted Huber regression line.

Return type:

Tuple[np.ndarray, np.ndarray]

Raises:

ValueError – If inputs are not 1D NumPy arrays of the same length, or are empty.

Examples

>>> mod = np.array([1, 2, 3, 4])
>>> sat = np.array([1.2, 1.9, 3.1, 3.9])
>>> x, y = fit_huber(mod, sat)
>>> len(x), len(y)
(100, 100)
Hydrological_model_validator.Processing.stats_math_utils.fit_lowess(mod_data: ndarray, sat_data: ndarray, frac: float = 0.3) ndarray[source]#

Fit a LOWESS (Locally Weighted Scatterplot Smoothing) curve to satellite vs. model data.

Parameters:
  • mod_data (np.ndarray) – 1D array of model data (predictor). Must be the same length as sat_data.

  • sat_data (np.ndarray) – 1D array of satellite data (response). Must be the same length as mod_data.

  • frac (float, optional) – Fraction of the data used when estimating each y-value. Must be in the interval (0, 1]. Default is 0.3.

Returns:

Array of shape (N, 2) with columns [sorted_mod_data, smoothed_sat_data], suitable for plotting.

Return type:

np.ndarray

Raises:

ValueError – If inputs are not 1D NumPy arrays of equal length, or if frac is not in (0, 1].

Examples

>>> mod = np.array([1, 2, 3, 4])
>>> sat = np.array([1.2, 1.8, 2.5, 3.9])
>>> smooth = fit_lowess(mod, sat, frac=0.5)
>>> smooth.shape
(4, 2)
Hydrological_model_validator.Processing.stats_math_utils.identify_extreme_events(time_series: ndarray | Series | DataArray, threshold_multiplier: float = 1.5, step: int | None = None, comparison_index: int | None = None) Dict[str, ndarray | List[int]][source]#

Identify extreme positive and negative events in a time series using a threshold based on standard deviation. Optionally, sample these extreme events at fixed intervals starting at a given index.

Parameters:
  • time_series (np.ndarray or pd.Series or xarray.DataArray) – 1D time series data.

  • threshold_multiplier (float, optional) – Multiplier for standard deviation to define extreme events (default 1.5).

  • step (int, optional) – Step interval to sample the time series for extreme events (e.g., 12 for monthly December events).

  • comparison_index (int, optional) – Starting index for sampling when step is provided (e.g., 12 to start from December).

Returns:

Dictionary containing: - ‘positive_events_mask’: Boolean mask array for extreme positive events. - ‘negative_events_mask’: Boolean mask array for extreme negative events. - ‘sampled_positive_indices’: Indices of sampled positive extreme events (if step provided). - ‘sampled_negative_indices’: Indices of sampled negative extreme events (if step provided).

Return type:

dict

Examples

>>> events = identify_extreme_events(n34_array, threshold_multiplier=1.5, step=12, comparison_index=12)
Hydrological_model_validator.Processing.stats_math_utils.mean_bias(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#

Compute the mean bias between model and observations over the specified time dimension.

Parameters:
  • m (np.ndarray, pd.Series, or xr.DataArray) – Model data.

  • o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.

  • time_dim (str, optional) – Name of the time dimension to compute mean over (default is ‘time’).

Returns:

Mean bias: model minus observation, averaged over time.

Return type:

float or xr.DataArray

Examples

>>> m = np.array([1.1, 2.2, 3.3])
>>> o = np.array([1.0, 2.0, 3.0])
>>> mean_bias(m, o)
0.19999999999999973
Hydrological_model_validator.Processing.stats_math_utils.monthly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#

Calculate monthly mean anomalies without detrending by removing the monthly climatology.

Parameters:

data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, must be monthly or higher frequency.

Returns:

Tuple containing: - anomalies: Monthly anomalies computed by subtracting monthly means. - monthly_climatology: Mean values for each month.

Return type:

Tuple[xarray.DataArray, xarray.DataArray]

Examples

>>> anomalies, climatology = monthly_anomaly(data_array)
Hydrological_model_validator.Processing.stats_math_utils.np_correlation(field_time_xy: ndarray, index_time: ndarray) ndarray[source]#

Calculate Pearson correlation coefficient map between a 3D spatial-temporal field and a 1D index time series.

Parameters:
  • field_time_xy (np.ndarray) – 3D array (time, y, x) representing spatial field over time.

  • index_time (np.ndarray) – 1D array representing time series index.

Returns:

2D correlation map with shape (y, x).

Return type:

np.ndarray

Examples

>>> corr = np_correlation(field, index)
Hydrological_model_validator.Processing.stats_math_utils.np_covariance(field_time_xy: ndarray, index_time: ndarray) ndarray[source]#

Calculate covariance between a 3D spatial-temporal field and a 1D index time series.

Parameters:
  • field_time_xy (np.ndarray) – 3D array with shape (time, y, x), spatial field over time.

  • index_time (np.ndarray) – 1D array with length equal to time dimension in field_time_xy.

Returns:

2D covariance map with shape (y, x).

Return type:

np.ndarray

Examples

>>> cov = np_covariance(field, index)
Hydrological_model_validator.Processing.stats_math_utils.np_regression(field_time_xy: ndarray, index_time: ndarray, std_units: str = 'yes') ndarray[source]#

Compute regression coefficients between a 3D spatial-temporal field and a 1D index time series.

Parameters:
  • field_time_xy (np.ndarray) – 3D spatial-temporal field array (time, y, x).

  • index_time (np.ndarray) – 1D index time series.

  • std_units (str, optional) – If ‘yes’, normalize regression by standard deviation of index_time.

Returns:

2D regression coefficients map (y, x).

Return type:

np.ndarray

Examples

>>> regression = np_regression(field, index)
Hydrological_model_validator.Processing.stats_math_utils.round_up_to_nearest(x: float | int, base: float = 1.0) float[source]#

Round up the given number to the nearest multiple of the specified base.

Parameters:
  • x (float or int) – Number to round up.

  • base (float, optional) – The base multiple to round up to. Must be positive (default is 1.0).

Returns:

The rounded up value as a multiple of the base.

Return type:

float

Raises:

ValueError – If base is not positive.

Examples

>>> round_up_to_nearest(5.3, base=2)
6.0
>>> round_up_to_nearest(7, base=0.5)
7.0
Hydrological_model_validator.Processing.stats_math_utils.spatial_mean(data_array: DataArray, mask: DataArray | ndarray) DataArray[source]#

Compute the spatial mean of a DataArray over latitude and longitude, applying a boolean mask.

Parameters:
  • data_array (xr.DataArray) – Input data array with dimensions including ‘lat’ and ‘lon’.

  • mask (xr.DataArray or np.ndarray) – 2D boolean mask with shape matching (lat, lon). True indicates valid points.

Returns:

Spatial mean over lat/lon of data within the mask, ignoring NaNs.

Return type:

xr.DataArray

Examples

>>> import numpy as np
>>> import xarray as xr
>>> data = xr.DataArray(np.random.rand(3,4,5), dims=('time', 'lat', 'lon'))
>>> mask = xr.DataArray(np.array([[True, False, True, True, False],
...                               [True, True, False, False, True],
...                               [False, True, True, False, True],
...                               [True, False, True, True, True]]),
...                     dims=('lat', 'lon'))
>>> spatial_mean(data.isel(time=0), mask)
<xarray.DataArray ()>
array(0.5267)
Hydrological_model_validator.Processing.stats_math_utils.standard_deviation_error(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#

Compute the difference between the standard deviations of model and observation data.

Parameters:
  • m (np.ndarray, pd.Series, or xr.DataArray) – Model data.

  • o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.

  • time_dim (str, optional) – Name of the time dimension to compute std over (default is ‘time’).

Returns:

The difference between model and observation standard deviations (model - obs).

Return type:

float or xr.DataArray

Raises:

ValueError – If m and o have different lengths along the time dimension.

Examples

>>> m = np.array([1.0, 2.0, 3.0])
>>> o = np.array([1.5, 2.5, 3.5])
>>> standard_deviation_error(m, o)
0.0
Hydrological_model_validator.Processing.stats_math_utils.std_dev(da: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#

Compute the standard deviation along the specified time dimension.

Parameters:
  • da (np.ndarray, pd.Series, or xr.DataArray) – Input data array or series.

  • time_dim (str, optional) – Name of the time dimension to compute std over (default is ‘time’).

Returns:

Standard deviation along the time dimension.

Return type:

float or xr.DataArray

Examples

>>> import numpy as np
>>> data = np.array([1.0, 2.0, 3.0, 4.0])
>>> std_dev(data)
1.118033988749895
Hydrological_model_validator.Processing.stats_math_utils.unbiased_rmse(m: ndarray | Series | DataArray, o: ndarray | Series | DataArray, time_dim: str = 'time') float | DataArray[source]#

Compute the unbiased Root Mean Square Error (centered RMSE) between model and observations.

Parameters:
  • m (np.ndarray, pd.Series, or xr.DataArray) – Model data.

  • o (np.ndarray, pd.Series, or xr.DataArray) – Observation data.

  • time_dim (str, optional) – Name of the time dimension to compute RMSE over (default is ‘time’).

Returns:

Unbiased RMSE value.

Return type:

float or xr.DataArray

Examples

>>> m = np.array([1.0, 2.0, 3.0])
>>> o = np.array([1.1, 1.9, 3.1])
>>> unbiased_rmse(m, o)
0.10000000000000009
Hydrological_model_validator.Processing.stats_math_utils.yearly_anomaly(data_array: DataArray) Tuple[DataArray, DataArray][source]#

Calculate yearly mean anomalies without detrending by removing the yearly climatology.

Parameters:

data_array (xarray.DataArray) – Time series data with a ‘time’ coordinate, typically annual or higher frequency.

Returns:

Tuple containing: - anomalies: Yearly anomalies computed by subtracting yearly means. - yearly_climatology: Mean values for each year.

Return type:

Tuple[xarray.DataArray, xarray.DataArray]

Examples

>>> anomalies, climatology = yearly_anomaly(data_array)