Filtering and data processing#

Introduction#

The xoa.filter module provides comprehensive filtering and data processing utilities for oceanographic and atmospheric data. It includes spatial filtering with convolution kernels, specialized temporal filters for tidal signals, mask erosion operations, and data decimation tools.

Key features:

  • Spatial filtering: N-dimensional convolution with NaN handling

  • Temporal filtering: Tidal filters (Demerliac, Godin) for time series

  • Mask operations: Iterative mask erosion and coast filling

  • Data decimation: Intelligent undersampling based on spatial proximity

Spatial filtering with convolution#

The core of spatial filtering in xoa is the convolve() function, which performs N-dimensional convolution while properly handling missing data (NaNs).

Basic convolution#

In [1]: import xarray as xr

In [2]: import numpy as np

In [3]: import xoa.filter

# Create sample data with some noise
In [4]: np.random.seed(42)

In [5]: data = xr.DataArray(
   ...:     np.random.normal(size=(50, 70)),
   ...:     dims=('y', 'x'),
   ...:     name='temperature'
   ...: )
   ...: 

# Add some missing data
In [6]: data.values[10:20, 10:20] = np.nan

# Apply spatial smoothing with a simple box kernel
In [7]: smoothed = xoa.filter.convolve(data, kernel=5, normalize=True, na_thres=0.5)

The convolve function has important parameters:

  • normalize: If True, divides by local sum of weights (creates weighted average)

  • na_thres: Controls NaN contamination tolerance (0=strict, 1=permissive)

Smoothing shortcut#

For convenience, smooth() is a shortcut that automatically sets normalize=True:

# Equivalent to convolve with normalize=True
In [8]: smoothed = xoa.filter.smooth(data, kernel=5, na_thres=0.5)

Window functions#

The get_window_func() function provides access to various window functions from NumPy and SciPy:

# Get a Gaussian window function
In [9]: gaussian_func = xoa.filter.get_window_func("gaussian", std=0.2)

# Get a Hamming window
In [10]: hamming_func = xoa.filter.get_window_func("hamming")

# Create custom window from array
In [11]: custom_func = xoa.filter.get_window_func([1, 2, 5, 2, 1])

Available window types include:

  • NumPy windows: bartlett, blackman, hamming, hanning, kaiser

  • SciPy windows: gaussian, tukey, bohman, parzen, and many more

  • Custom arrays: Provide your own weight array

Kernel generation#

Kernels can be specified in multiple ways:

1. Simple integer (box kernel):

# 5x5 box kernel
In [12]: smoothed = xoa.filter.smooth(data, kernel=5)

2. Dictionary for anisotropic kernels:

# Different sizes along different dimensions
In [13]: kernel_spec = {'x': 7, 'y': 3}

In [14]: smoothed = xoa.filter.smooth(data, kernel=kernel_spec)

3. Dictionary with a window function:

# Gaussian window along both dimensions
In [15]: kernel_spec = {'x': 9, 'y': 5}

In [16]: smoothed = xoa.filter.smooth(data, kernel=kernel_spec,
   ....:     kernel_kwargs=dict(window_func='gaussian'))
   ....: 

4. Explicit arrays:

# Custom 1D weights
In [17]: kernel_spec = {'x': [1, 2, 5, 2, 1], 'y': [1, 2, 1]}

In [18]: smoothed = xoa.filter.smooth(data, kernel=kernel_spec)

Isotropic kernels#

For isotropic (radially symmetric) kernels, use generate_isotropic_kernel():

# Create 2D isotropic Gaussian kernel
In [19]: iso_kernel = xoa.filter.generate_isotropic_kernel(
   ....:     shape=(15, 15),
   ....:     window_func='gaussian'
   ....: )
   ....: 

In [20]: print(iso_kernel.shape)
(15, 15)

Shapiro kernel#

The Shapiro kernel is a specialized filter kernel commonly used in numerical models for smoothing:

# 2D Shapiro kernel
In [21]: shapiro = xoa.filter.shapiro_kernel(('y', 'x'))

In [22]: print(shapiro.values)
[[0. 1. 0.]
 [1. 2. 1.]
 [0. 1. 0.]]

# Apply Shapiro smoothing
In [23]: smoothed = xoa.filter.smooth(data, kernel=shapiro)

Temporal filtering for tidal signals#

Oceanographic time series often contain tidal signals that need to be filtered out. The xoa.filter module provides specialized tidal filters.

Demerliac filter#

The Demerliac filter is a widely-used tidal filter designed to remove tidal components while preserving lower-frequency signals:

# Create hourly time series with tidal signal
In [24]: import pandas as pd

In [25]: time = pd.date_range('2020-01-01', periods=30*24, freq='h')

# Simulate signal: trend + tidal component + noise
In [26]: t_hours = np.arange(len(time))

In [27]: signal = (
   ....:     0.5 * t_hours / 24 +  # trend
   ....:     0.3 * np.sin(2 * np.pi * t_hours / 12.42) +  # M2 tide
   ....:     0.1 * np.random.normal(size=len(time))  # noise
   ....: )
   ....: 

In [28]: sea_level = xr.DataArray(
   ....:     signal,
   ....:     coords={'time': time},
   ....:     dims='time',
   ....:     name='sea_level'
   ....: )
   ....: 

# Apply Demerliac filter
In [29]: filtered = xoa.filter.demerliac(sea_level, na_thres=0.2)

The Demerliac filter uses pre-defined weights optimized for hourly data. It automatically handles sub-hourly data by interpolating the weights.

Other tidal filters#

The tidal_filter() function provides access to multiple tidal filter types:

# Godin filter (alternative tidal filter)
In [30]: godin_filtered = xoa.filter.tidal_filter(sea_level, 'godin')

# Simple moving average (24-hour)
In [31]: mean_filtered = xoa.filter.tidal_filter(sea_level, 'mean')

Available filter types:

  • demerliac: Demerliac tidal filter (default for demerliac())

  • godin: Godin tidal filter

  • mean: Simple 24-hour moving average

Important notes for tidal filtering:

  • Data must have a valid time coordinate

  • Time step should be hourly or sub-hourly (sub-hourly is interpolated)

  • Time step should be relatively constant (checked with dt_tol parameter)

  • The na_thres parameter controls tolerance for missing data in the time series

Mask operations#

Mask erosion and coast filling are important operations for preparing ocean model output or observational data.

Eroding masks#

The erode_mask() function iteratively fills masked values using smoothing:

# Create data with a masked region
In [32]: data_with_mask = data.copy()

In [33]: data_with_mask.values[20:30, 30:40] = np.nan

# Erode mask by 3 iterations
In [34]: eroded = xoa.filter.erode_mask(data_with_mask, until=3)

The erosion process:

  1. Smooth the data (using Shapiro kernel by default)

  2. Fill NaN values with smoothed values

  3. Repeat for specified number of iterations

You can also erode until a specific mask is satisfied:

# Define a target mask (True = should remain masked)
In [35]: target_mask = xr.DataArray(
   ....:     np.zeros((50, 70), dtype=bool),
   ....:     dims=('y', 'x')
   ....: )
   ....: 

In [36]: target_mask.values[25:27, 35:37] = True  # Keep small region masked

# Erode until only target_mask regions remain masked
In [37]: eroded = xoa.filter.erode_mask(data_with_mask, until=target_mask)

Coast erosion#

The erode_coast() function is specialized for horizontal (geographic) dimensions and automatically identifies X and Y dimensions:

# Create data with land mask
In [38]: lon = xr.DataArray(np.linspace(-10, 0, 70), dims='lon')

In [39]: lat = xr.DataArray(np.linspace(40, 50, 50), dims='lat')

In [40]: sst = xr.DataArray(
   ....:     np.random.normal(15, 2, size=(50, 70)),
   ....:     coords={'lat': lat, 'lon': lon},
   ....:     dims=('lat', 'lon'),
   ....:     name='sst'
   ....: )
   ....: 

# Add coastal mask
In [41]: sst.values[:, :10] = np.nan  # "land" on the left

# Fill coastal regions
In [42]: sst_filled = xoa.filter.erode_coast(sst, until=5)

The function automatically:

  • Identifies longitude and latitude dimensions using CF conventions

  • Uses Shapiro kernel on horizontal dimensions only

  • Preserves vertical or temporal dimensions

Data decimation#

When working with very dense observational data (e.g., satellite altimetry, drifter trajectories), it’s often necessary to reduce the number of points while preserving spatial coverage. The decimate() function provides intelligent spatial undersampling.

Basic decimation#

# Create dense random points
In [43]: np.random.seed(123)

In [44]: npts = 500

In [45]: lons = np.random.uniform(-20, -10, npts)

In [46]: lats = np.random.uniform(40, 50, npts)

In [47]: values = np.random.normal(15, 2, npts)

In [48]: dense_data = xr.Dataset({
   ....:     'temp': (['npts'], values)
   ....: }, coords={
   ....:     'lon': (['npts'], lons),
   ....:     'lat': (['npts'], lats)
   ....: })
   ....: 

# Decimate to ~50km spacing
In [49]: sparse_data = xoa.filter.decimate(dense_data, radius=50e3, method='pick')

In [50]: print(f"Original points: {npts}")
Original points: 500

In [51]: print(f"Decimated points: {sparse_data.sizes['npts']}")
Decimated points: 155

Decimation methods#

Two methods are available:

1. Pick method (method=’pick’): Simply selects points ensuring minimum spacing

In [52]: decimated_pick = xoa.filter.decimate(dense_data, radius=50e3, method='pick')

2. Average method (method=’average’): At each selected point, averages all nearby points within a radius

# Average with smoothing factor
In [53]: decimated_avg = xoa.filter.decimate(
   ....:     dense_data,
   ....:     radius=50e3,
   ....:     method='average',
   ....:     smooth_factor=1.5  # Average over 75km radius
   ....: )
   ....: 

The smooth_factor parameter controls the averaging radius:

  • smooth_factor=0: Equivalent to ‘pick’ method

  • smooth_factor=1: Average over same radius as spacing

  • smooth_factor>1: Larger averaging radius (smoother result)

Typical use cases#

1. Preparing data for kriging:

# Reduce dense satellite data before kriging
satellite_decimated = xoa.filter.decimate(
    satellite_obs,
    radius=25e3,  # 25 km spacing
    method='average',
    smooth_factor=1.0
)

2. Visualization of dense trajectories:

# Thin drifter trajectories for clearer plotting
drifter_thinned = xoa.filter.decimate(
    drifter_positions,
    radius=10e3,  # 10 km spacing
    method='pick'
)

3. Reducing computational cost:

# Subsample before expensive geostatistical processing
data_subset = xoa.filter.decimate(
    observations,
    radius=100e3,  # 100 km spacing
    method='average',
    smooth_factor=2.0  # Smooth over 200 km
)

Best practices and tips#

Handling missing data#

The na_thres parameter is critical for controlling how NaNs propagate:

# Strict: output masked if any input is masked
strict = xoa.filter.smooth(data, kernel=5, na_thres=0)

# Moderate: output masked if >50% input is masked
moderate = xoa.filter.smooth(data, kernel=5, na_thres=0.5)

# Permissive: output masked only if all input is masked
permissive = xoa.filter.smooth(data, kernel=5, na_thres=1.0)

Choosing kernel sizes#

General guidelines:

  • Smaller kernels (3-5): Preserve features, less smoothing

  • Medium kernels (7-15): Balance smoothing and feature preservation

  • Larger kernels (>15): Heavy smoothing, removes fine structure

For anisotropic smoothing:

# Smooth more along one direction
kernel = {'x': 15, 'y': 5}  # More smoothing in x-direction
smoothed = xoa.filter.smooth(data, kernel=kernel)

Computational considerations#

  • Convolution is computationally intensive for large kernels

  • Consider using FFT-based methods for very large kernels (not yet in xoa)

  • For repeated filtering, pre-generate kernels:

# Generate kernel once
kernel = xoa.filter.generate_kernel({'x': 11, 'y': 11}, data, window_func='gaussian')

# Reuse for multiple datasets
smoothed1 = xoa.filter.smooth(data1, kernel=kernel)
smoothed2 = xoa.filter.smooth(data2, kernel=kernel)

Working with multi-dimensional data#

Filtering works seamlessly with multi-dimensional arrays:

# 4D data: time, depth, lat, lon
kernel_4d = {
    'time': 3,      # Light temporal smoothing
    'depth': 1,     # No vertical smoothing
    'lat': 5,       # Spatial smoothing
    'lon': 5
}
smoothed_4d = xoa.filter.smooth(data_4d, kernel=kernel_4d)

See also#