ResamplePolynomial

class sofia_redux.toolkit.resampling.ResamplePolynomial(coordinates, data, error=None, mask=None, window=None, robust=None, negthresh=None, window_estimate_bins=10, window_estimate_percentile=50, window_estimate_oversample=2.0, leaf_size=40, order=1, fix_order=True, large_data=None, check_memory=True, memory_buffer=0.25, **distance_kwargs)[source]

Bases: ResampleBase

Class to resample data using local polynomial fits.

Parameters:
coordinatesarray_like of float

(n_features, n_samples) array of independent values. A local internal copy will be created if it is not a numpy.float64 type.

dataarray_like of float

(n_sets, n_samples) or (n_samples,) array of dependent values. multiple (n_sets) sets of data are supplied, then n_sets solutions will be calculated at each resampling point.

errorarray_like of float, optional

(n_sets, n_samples) or (n_samples,) array of error (1-sigma) values associated with the data array. error will be used to weight fits, and be propagated to the output error values. If not supplied, the error may still be calculated from residuals to the fit during ResamplePolynomial.__call__().

maskarray_like of bool, optional

(n_sets, n_data) or (n_data,) array of bool where True indicates a valid data point that can be included the fitting and False indicates data points that should be excluded from the fit. Masked points will be reflected in the output counts array. If not supplied, all samples are considered valid.

windowarray_like or float or int, optional

(n_features,) array or single value specifying the maximum distance (distance definition is handled by distance_kwargs) of a data sample from a resampling point such that it can be included in a local fit. window may be declared for each feature. For example, when fitting 2-dimensional (x, y) data, a window of 1.0 would create a circular fitting window around each resampling point, whereas a window of (1.0, 0.5) would create an elliptical fitting window with a semi-major axis of 1.0 in x and semi-minor axis of 0.5 in y. If not supplied, window is estimated via ResamplePolynomial.estimate_feature_windows().

orderarray_like or int, optional

(n_features,) array or single integer value specifying the polynomial fit order for each feature.

fix_orderbool, optional

In order for local polynomial fitting to occur, the basic requirement is that n_samples >= (order + 1) ** n_features, where n_samples is the number of data samples within window. If fix_order is True and this condition is not met, then local fitting will be aborted for that point, and a value of cval will be returned instead. If fix_order is False, then order will be reduced to the maximum value where this condition can be met. NOTE: this is only available if order is symmetrical. i.e. it was passed in as a single integer to be applied across all features. Otherwise, it is unclear as to which feature order should be reduced to meet the condition.

robustfloat, optional

Specifies an outlier rejection threshold for data. A data point is identified as an outlier if abs(x_i - x_med)/MAD > robust, where x_med is the median, and MAD is the Median Absolute Deviation defined as 1.482 * median(abs(x_i - x_med)).

negthreshfloat, optional

Specifies a negative value rejection threshold such that data < (-stddev(data) * negthresh) will be excluded from the fit.

window_estimate_binsint, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

window_estimate_percentileint or float, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

window_estimate_oversampleint or float, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

leaf_sizeint, optional

Number of points at which to switch to brute-force during the ball tree query algorithm. See sklearn.neighbours.BallTree for further details.

large_databool, optional

If True, indicates that this resampling algorithm will run on a large set of data, and the ball tree should be created on subsets of the data.

check_memorybool, optional

If True, check the memory requirements for resampling the supplied data.

memory_bufferfloat, optional

A fraction (positive or negative) with which to modify the memory estimates for the process memory requirements. Memory estimates are scaled by the factor 1 + memory_buffer.

distance_kwargsdict, optional

Optional keyword arguments passed into sklearn.neighbors.DistanceMetric(). The default is to use the “minkowski” definition with p=2, i.e., the Euclidean definition. This is important in determining which samples lie inside the window region of a resampling point, and when deriving distance weighting factors.

Raises:
ValueErrorInvalid inputs to __init__ or __call__

Attributes Summary

fit_tree

Return the fitting tree representative of points to fit.

order

Return the order of polynomial fit.

Methods Summary

__call__(*args[, smoothing, ...])

Resample data defined during initialization to new coordinates.

calculate_adaptive_smoothing(settings)

Calculate the adaptive distance weighting kernel.

calculate_minimum_points()

Return the minimum number of points for a polynomial fit.

check_order(order, n_features, n_samples)

Check the order is of correct format and enough samples exist.

pre_fit(settings, *args[, ...])

Perform pre-fitting steps and build the fitting tree.

process_block(args, block)

Run solve_fits() on each block.

reduction_settings([smoothing, ...])

Define a set of reduction instructions based on user input.

set_sample_tree(coordinates[, radius, ...])

Build the sample tree from input coordinates.

Attributes Documentation

fit_tree

Return the fitting tree representative of points to fit.

Returns:
PolynomialTree
order

Return the order of polynomial fit.

Returns:
orderint or numpy.ndarray (int)

A symmetrical order, or the order for each feature.

Methods Documentation

__call__(*args, smoothing=0.0, relative_smooth=False, adaptive_threshold=None, adaptive_algorithm='scaled', fit_threshold=0.0, cval=nan, edge_threshold=0.0, edge_algorithm='distribution', order_algorithm='bounded', error_weighting=True, estimate_covariance=False, is_covar=False, jobs=None, use_threading=None, use_processes=None, adaptive_region_coordinates=None, get_error=False, get_counts=False, get_weights=False, get_distance_weights=False, get_rchi2=False, get_cross_derivatives=False, get_offset_variance=False, **kwargs)[source]

Resample data defined during initialization to new coordinates.

Parameters:
argsarray_like or n-tuple of array_like

args can take one of the following formats:

  • grid : n-tuple of array_like Here n is the number of dimensions and each array should be of shape (n_data,). This indicates that resampling should occur on a grid where the first argument corresponds to the coordinates along the first dimension.

  • single point : n-tuple of float n is the number of dimensions. The coordinate to resample onto.

  • irregular : array_like An array of shape (n_dimensions, n_ndata) defining a set of coordinates onto which to resample.

smoothingfloat or array_like of float, optional

If set, weights the polynomial fitting by distance from the resampling coordinate to the sample coordinate. The weighting factor applied is exp(-dx^2 / smoothing). A value may be defined for each dimension by providing smoothing as an array of shape (n_dimensions,). However, if adaptive_threshold > 0 for a certain feature, the corresponding smoothing element has a different meaning. In this case, it gives the 1-sigma uncertainty in the coordinate position for that feature. As a reference, for astronomical observations using a Gaussian beam with known FWHM, smoothing should be set to FWHM / (2 * sqrt(2 * log(2))) so long as adaptive weighting is enabled.

relative_smoothbool, optional

If True, the supplied smoothing value is defined in units of window. Otherwise, smoothing is in the same units as the supplied coordinates.

adaptive_thresholdfloat or array_like (n_features,)

If a single value is supplied, each feature will use this value. Otherwise, each feature should specify an adaptive threshold as an element of adaptive_threshold. If a non-zero value is supplied, adaptive weighting will be enabled for that feature. These values define the size of the initial weighting kernel used during fitting. The nominal value is 1.

adaptive_algorithmstr, optional

May be one of {‘scaled’, ‘shaped’, None}. None disables adaptive weighting. The “scaled” algorithm allows the weighting kernel to change in size only. The “shaped” algorithm allows the kernel to change in size, rotate, and stretch each primary axis. Please see resample_utils.scaled_adaptive_weight_matrix() and resample_utils.shaped_adaptive_weight_matrix() for further details.

fit_thresholdfloat, optional

If nonzero, rejects a polynomial fit if it deviates by |fit_threshold| * the RMS of the samples. If it is rejected, that value will be replaced by the mean (error weighted if set), if fit_threshold > 0 or NaN if fit_threshold < 0.

cvalfloat, optional

During fitting, any fit that does not meet the order_algorithm requirement will be set to this value. This will be NaN by default.

edge_thresholdfloat or array_like or float

If set to a value > 0, edges of the fit will be masked out according to edge_algorithm. Values close to zero will result in a low degree of edge clipping, while values close to 1 clip edges to a greater extent. The exact definition of edge_threshold depends on the algorithm. For further details, please see resampling.resample_utils.check_edges().

edge_algorithmstr, optional

Describes how to clip edges if edge_threshold is non-zero. The available algorithms are:

  • ‘distribution’ (default): Statistics on sample distributions are calculated, and if the resampling point is > 1/threshold standard deviations away from the sample mean, it will be clipped.

  • ‘ellipsoid’: If the samples used to fit a resampling point deviate from the resampling point location by more than this amount, it will be clipped.

  • ‘box’: If the flattened 1-dimensional distribution of samples center-of-mass deviates from the resampling point location in any dimension, it will be clipped.

  • ‘range’: Over each dimension, check the distribution of points is greater than edge_threshold to the “left” and “right” of the resampling point.

order_algorithmstr, optional

The type of check to perform on whether the sample distribution for each resampling point is adequate to derive a polynomial fit. Depending on order and fix_order, if the distribution does not meet the criteria for order_algorithm, either the fit will be aborted, returning a value of cval, or the fit order will be reduced. Available algorithms are:

  • ‘bounded’: Require that there are order samples in both the negative and positive directions of each feature from the resampling point.

  • ‘counts’: Require that there are (order + 1) ** n_features samples within the window of each resampling point.

  • ‘extrapolate’: Attempt to fit regardless of the sample distribution.

Note that ‘bounded’ is the most robust mode as it ensures that no singular values will be encountered during the least-squares fitting of polynomial coefficients.

error_weightingbool, optional

If True (default), weight polynomial fitting by the error values of each sample.

estimate_covariancebool, optional

If True, calculate errors on the fit using resample_utils.estimated_covariance_matrix_inverse(). Otherwise, use resample_utils.covariance_matrix_inverse().

is_covarbool, optional

If True, the input data is treated as a covariance instead of a flux, and is propagated as if through a weighted mean.

jobsint, optional

Specifies the maximum number of concurrently running jobs. An attempt will be made to parallel process using a thread-pool if available, but will otherwise revert to the “loky” backend. Values of 0 or 1 will result in serial processing. A negative value sets jobs to n_cpus + 1 + jobs such that -1 would use all cpus, and -2 would use all but one cpu.

use_threadingbool, optional

If True, force use of threads during multiprocessing.

use_processesbool, optional

If True, force use of sub-processes during multiprocessing.

get_errorbool, optional

If True, If True returns the error which is given as the weighted RMS of the samples used for each resampling point.

get_countsbool, optional

If True returns the number of samples used to fit each resampling point.

get_weightsbool, optional

If True, returns the sum of all sample weights (error and distance) used in the fit at each resampling point.

get_distance_weightsbool, optional

If True, returns the sum of all sample distance weights (not including error) used in the fit at each resampling point.

get_rchi2bool, optional

If True, returns the reduced chi-squared statistic of the fit at each resampling point. Note that this is only valid if errors were supplied during ResamplePolynomial.__init__().

get_cross_derivativesbool, optional

If True, returns the derivative mean-squared-cross-product of the fit derivatives at each resampling point.

get_offset_variancebool, optional

If True, returns the offset of the resampling point from the center of the sample distribution used to generate the fit as a variance. i.e., a return value of 9 indicates a 3-sigma deviation of the resampling point from the sample distribution.

Returns:
fit_data, [fit_error], [fit_counts]

The data fit at args and optionally, the error associated with the fit, and the number of samples used to generate each point. Will be of a shape determined by args. fit_data and fit_error will be of type np.float64, and fit_counts will be of type np.int64.

calculate_adaptive_smoothing(settings)[source]

Calculate the adaptive distance weighting kernel.

Calculates a weighting kernel for each sample in the reduction. i.e, each sample will have a different weighting factor based on its distance (and optionally, direction) from each point at which a fit is required. This is done by performing a fit centered on each sample, and measuring how that fit deviates from the actual observed sample value. The first step is to decide upon the distance weighting factor used to perform the initial fit.

It is assumed that the coordinate error is known (or approximately known), and supplied as a \(1 \sigma\) error as the ‘alpha’ keyword in settings. For example, the error in astronomical observations taken using a Gaussian beam with known FWHM is:

\[\sigma = \frac{\text{FWHM}}{2 \sqrt{2 \ln{2}}}\]

However, adaptive weighting may be calculated over select dimensions, leaving some fixed. In this case, the alpha keyword will always apply a weighting factor in a “fixed” dimension \(k\) as:

\[w_k = exp \left( \frac{-\Delta x_k^2}{\alpha_{fixed, k} \Omega_k^2} \right)\]

where \(\Delta x_k\) is the distance from the resampling point to the sample in question, and \(\Omega_k\) is the window principle axis in dimension \(k\). Note that in the weighting function, \(\sigma\) is related to \(\alpha\) by:

\[\alpha_k = 2 \sigma_k^2\]

and weighting is applied using the definition of \(w_k\) above. To signify that \(\alpha_{fixed, k}\) should be used instead of \(\sigma\) for dimension \(k\), the “adaptive_threshold” array in settings should be set to zero for the corresponding dimension. For example, if we had:

settings['alpha'] = [0.3, 0.3]
settings['adaptive_threshold'] = [1, 0]

the first dimension would have \(\alpha_0=0.18\) (\(2 \times 0.3^2\)), and the second would have \(\alpha_1=0.3\). In this example, \(\alpha_0\) would be allowed to vary per sample, while \(\alpha_1\) would be fixed for each sample at 0.3. An initial fit is then performed at each sample coordinate using a test distance weighting parameter:

\[ \begin{align}\begin{aligned}\sigma_{test, k} = \frac{\pi \sigma_k}{\sqrt{2 ln{2}}}\\\alpha_{test, k} = 2 \sigma_{test, k}^2\end{aligned}\end{align} \]

Using this initial weighting parameter, the adaptive kernels are derived using either resample_utils.shaped_adaptive_weight_matrix() or resample_utils.scaled_adaptive_weight_matrix() depending on whether the “shaped” keyword in settings is set to True or False respectively.

The weighting kernels are stored in the “adaptive_alpha” keyword value in settings.

Parameters:
settingsdict

Reduction settings, as returned by ResamplePolynomial.reduction_settings().

Returns:
None
calculate_minimum_points()[source]

Return the minimum number of points for a polynomial fit.

Returns:
minimum_pointsint

The minimum number of points for a polynomial fit

static check_order(order, n_features, n_samples)[source]

Check the order is of correct format and enough samples exist.

Parameters:
orderint or array_like of int (n_features,)

The polynomial order to fit, either supplied as an integer to be applied over all dimensions, or as an array to give the order in each dimension.

n_featuresint

The number of features (dimensions) of the sample coordinates.

n_samplesint

The number of samples.

Returns:
orderint or numpy.ndarray of int (n_features,)

The formatted polynomial order.

Raises:
ValueError

If there are too few samples for the desired order, or order is not formatted correctly.

pre_fit(settings, *args, adaptive_region_coordinates=None)[source]

Perform pre-fitting steps and build the fitting tree.

Parameters:
settingsdict

Settings calculated via reduction_settings to be applied if necessary.

argsn-tuple

The call input arguments.

adaptive_region_coordinatesnumpy.ndarray (float), optional

The coordinates determined from a previous adaptive smoothing algorithm.

Returns:
None
classmethod process_block(args, block)[source]

Run solve_fits() on each block.

Utility function that parses the settings and tree objects to something usable by the numba JIT compiled resampling functions. This is not meant to be called directly.

Parameters:
args2-tuple

A tuple of form (filename, iteration) where the filename is a string pointing towards a previously saved pickle file containing the relevant information for the reduction if required. If set to None, the arguments are retrieved from the _global_resampling_values global parameter.

blockint

The block index to process.

Returns:
results9-tuple of numpy.ndarray

The first element contains the fit point indices to be fit. For the remaining elements, please see solve_fits() return values.

reduction_settings(smoothing=0.0, relative_smooth=False, adaptive_threshold=None, adaptive_algorithm='scaled', error_weighting=True, fit_threshold=0.0, cval=nan, edge_threshold=0.0, edge_algorithm='distribution', order_algorithm='bounded', is_covar=False, estimate_covariance=False, jobs=None, adaptive_region_coordinates=None, use_threading=None, use_processes=None)[source]

Define a set of reduction instructions based on user input.

This method is responsible for determining, formatting, and checking a number variables required for the resampling algorithm based on user input. For detailed descriptions of user options, please see ResamplePolynomial.__call__().

Parameters:
smoothingfloat or array_like (n_features,), optional
relative_smoothbool, optional
adaptive_thresholdfloat or array_like (n_features,), optional
adaptive_algorithmstr, optional
error_weightingbool, optional
fit_thresholdfloat, optional
cvalfloat, optional
edge_thresholdfloat or array_like (n_features,), optional
edge_algorithmstr, optional
order_algorithmstr, optional
is_covarbool, optional
estimate_covariancebool, optional
jobsint, optional
adaptive_region_coordinatesarray_like, optional
use_threadingbool, optional

If True, force use of threads during multiprocessing.

use_processesbool, optional

If True, force use of sub-processes during multiprocessing.

Returns:
settingsdict

The reduction settings. Also, stored as ResamplePolynomial.fit_settings().

set_sample_tree(coordinates, radius=None, window_estimate_bins=10, window_estimate_percentile=50, window_estimate_oversample=2.0, leaf_size=40, large_data=None, check_memory=True, memory_buffer=0.25, **distance_kwargs)[source]

Build the sample tree from input coordinates.

Parameters:
coordinatesnumpy.ndarray (float)

The input coordinates of shape (n_features, n_samples).

radiusfloat or sequence (float), optional

The radius of the window around each fitting point used to determine sample selection for fit. If not supplied, will be estimated using ResamplePolynomial.estimate_feature_windows().

window_estimate_binsint, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

window_estimate_percentileint or float, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

window_estimate_oversampleint or float, optional

Used to estimate the window if not supplied using ResamplePolynomial.estimate_feature_windows().

leaf_sizeint, optional

Number of points at which to switch to brute-force during the ball tree query algorithm. See sklearn.neighbours.BallTree for further details.

large_databool, optional

If True, indicates that this resampling algorithm will run on a large set of data, and the ball tree should be created on subsets of the data.

check_memorybool, optional

If True, check the memory requirements for resampling the supplied data.

memory_bufferfloat, optional

A fraction (positive or negative) with which to modify the memory estimates for the process memory requirements. Memory estimates are scaled by the factor 1 + memory_buffer.

distance_kwargsdict, optional

Optional keyword arguments passed into sklearn.neighbors.DistanceMetric(). The default is to use the “minkowski” definition with p=2, i.e., the Euclidean definition. This is important in determining which samples lie inside the window region of a resampling point, and when deriving distance weighting factors.

Returns:
None