ResamplePolynomial¶
- class sofia_redux.toolkit.resampling.ResamplePolynomial(coordinates, data, error=None, mask=None, window=None, robust=None, negthresh=None, window_estimate_bins=10, window_estimate_percentile=50, window_estimate_oversample=2.0, leaf_size=40, order=1, fix_order=True, large_data=None, check_memory=True, memory_buffer=0.25, **distance_kwargs)[source]¶
Bases:
ResampleBase
Class to resample data using local polynomial fits.
- Parameters:
- coordinates : array_like of float
(n_features, n_samples) array of independent values. A local internal copy will be created if it is not a numpy.float64 type.
- data : array_like of float
(n_sets, n_samples) or (n_samples,) array of dependent values. If multiple (n_sets) sets of data are supplied, then n_sets solutions will be calculated at each resampling point.
- error : array_like of float, optional
(n_sets, n_samples) or (n_samples,) array of error (1-sigma) values associated with the data array. error will be used to weight fits, and be propagated to the output error values. If not supplied, the error may still be calculated from residuals to the fit during ResamplePolynomial.__call__().
- mask : array_like of bool, optional
(n_sets, n_data) or (n_data,) array of bool where True indicates a valid data point that can be included in the fitting and False indicates data points that should be excluded from the fit. Masked points will be reflected in the output counts array. If not supplied, all samples are considered valid.
- window : array_like or float or int, optional
(n_features,) array or single value specifying the maximum distance (distance definition is handled by distance_kwargs) of a data sample from a resampling point such that it can be included in a local fit. window may be declared for each feature. For example, when fitting 2-dimensional (x, y) data, a window of 1.0 would create a circular fitting window around each resampling point, whereas a window of (1.0, 0.5) would create an elliptical fitting window with a semi-major axis of 1.0 in x and a semi-minor axis of 0.5 in y. If not supplied, window is estimated via ResamplePolynomial.estimate_feature_windows().
- order : array_like or int, optional
(n_features,) array or single integer value specifying the polynomial fit order for each feature.
- fix_order : bool, optional
In order for local polynomial fitting to occur, the basic requirement is that n_samples >= (order + 1) ** n_features, where n_samples is the number of data samples within window. If fix_order is True and this condition is not met, then local fitting will be aborted for that point, and a value of cval will be returned instead. If fix_order is False, then order will be reduced to the maximum value where this condition can be met. NOTE: this is only available if order is symmetrical, i.e., it was passed in as a single integer to be applied across all features. Otherwise, it is unclear as to which feature order should be reduced to meet the condition.
- robust : float, optional
Specifies an outlier rejection threshold for data. A data point is identified as an outlier if abs(x_i - x_med) / MAD > robust, where x_med is the median and MAD is the Median Absolute Deviation, defined as 1.482 * median(abs(x_i - x_med)).
- negthresh : float, optional
Specifies a negative value rejection threshold such that data < (-stddev(data) * negthresh) will be excluded from the fit.
- window_estimate_bins : int, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- window_estimate_percentile : int or float, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- window_estimate_oversample : int or float, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- leaf_size : int, optional
Number of points at which to switch to brute-force during the ball tree query algorithm. See sklearn.neighbors.BallTree for further details.
- large_data : bool, optional
If True, indicates that this resampling algorithm will run on a large set of data, and the ball tree should be created on subsets of the data.
- check_memory : bool, optional
If True, check the memory requirements for resampling the supplied data.
- memory_buffer : float, optional
A fraction (positive or negative) with which to modify the memory estimates for the process memory requirements. Memory estimates are scaled by the factor 1 + memory_buffer.
- distance_kwargs : dict, optional
Optional keyword arguments passed into sklearn.neighbors.DistanceMetric(). The default is to use the “minkowski” definition with p=2, i.e., the Euclidean definition. This is important in determining which samples lie inside the window region of a resampling point, and when deriving distance weighting factors.
- Raises:
- ValueError
Invalid inputs to __init__ or __call__.
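As a usage illustration (not part of the generated reference), the following is a minimal construction sketch; the sample data, window, and order values are invented, and the import path is assumed from the class path above:
import numpy as np
from sofia_redux.toolkit.resampling import ResamplePolynomial

# Invented example: 500 scattered (x, y) samples of a smooth surface.
rand = np.random.RandomState(42)
coordinates = rand.uniform(low=0.0, high=10.0, size=(2, 500))  # (n_features, n_samples)
data = np.sin(coordinates[0]) + np.cos(coordinates[1])
error = np.full(data.shape, 0.05)  # 1-sigma uncertainties used to weight fits

# window=(1.5, 1.0) gives an elliptical fitting region around each resampling
# point; order=2 fits a second-order polynomial in each feature.
resampler = ResamplePolynomial(coordinates, data, error=error,
                               window=(1.5, 1.0), order=2)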
Attributes Summary
- fit_tree: Return the fitting tree representative of points to fit.
- order: Return the order of polynomial fit.
Methods Summary
- __call__(*args[, smoothing, ...]): Resample data defined during initialization to new coordinates.
- calculate_adaptive_smoothing(settings): Calculate the adaptive distance weighting kernel.
- calculate_minimum_points(): Return the minimum number of points for a polynomial fit.
- check_order(order, n_features, n_samples): Check the order is of correct format and enough samples exist.
- pre_fit(settings, *args[, ...]): Perform pre-fitting steps and build the fitting tree.
- process_block(args, block): Run solve_fits() on each block.
- reduction_settings([smoothing, ...]): Define a set of reduction instructions based on user input.
- set_sample_tree(coordinates[, radius, ...]): Build the sample tree from input coordinates.
Attributes Documentation
- fit_tree¶
Return the fitting tree representative of points to fit.
- Returns:
- PolynomialTree
- order¶
Return the order of polynomial fit.
- Returns:
- order : int or numpy.ndarray (int)
A symmetrical order, or the order for each feature.
Methods Documentation
- __call__(*args, smoothing=0.0, relative_smooth=False, adaptive_threshold=None, adaptive_algorithm='scaled', fit_threshold=0.0, cval=nan, edge_threshold=0.0, edge_algorithm='distribution', order_algorithm='bounded', error_weighting=True, estimate_covariance=False, is_covar=False, jobs=None, use_threading=None, use_processes=None, adaptive_region_coordinates=None, get_error=False, get_counts=False, get_weights=False, get_distance_weights=False, get_rchi2=False, get_cross_derivatives=False, get_offset_variance=False, **kwargs)[source]¶
Resample data defined during initialization to new coordinates.
- Parameters:
- args : array_like or n-tuple of array_like
args can take one of the following formats:
- grid : n-tuple of array_like. Here n is the number of dimensions and each array should be of shape (n_data,). This indicates that resampling should occur on a grid where the first argument corresponds to the coordinates along the first dimension.
- single point : n-tuple of float. Here n is the number of dimensions, and the tuple gives the coordinate to resample onto.
- irregular : array_like. An array of shape (n_dimensions, n_data) defining a set of coordinates onto which to resample.
- smoothing : float or array_like of float, optional
If set, weights the polynomial fitting by distance from the resampling coordinate to the sample coordinate. The weighting factor applied is exp(-dx^2 / smoothing). A value may be defined for each dimension by providing smoothing as an array of shape (n_dimensions,). However, if adaptive_threshold > 0 for a certain feature, the corresponding smoothing element has a different meaning. In this case, it gives the 1-sigma uncertainty in the coordinate position for that feature. As a reference, for astronomical observations using a Gaussian beam with known FWHM, smoothing should be set to FWHM / (2 * sqrt(2 * log(2))) so long as adaptive weighting is enabled.
- relative_smooth : bool, optional
If True, the supplied smoothing value is defined in units of window. Otherwise, smoothing is in the same units as the supplied coordinates.
- adaptive_threshold : float or array_like (n_features,)
If a single value is supplied, each feature will use this value. Otherwise, each feature should specify an adaptive threshold as an element of adaptive_threshold. If a non-zero value is supplied, adaptive weighting will be enabled for that feature. These values define the size of the initial weighting kernel used during fitting. The nominal value is 1.
- adaptive_algorithm : str, optional
May be one of {‘scaled’, ‘shaped’, None}. None disables adaptive weighting. The “scaled” algorithm allows the weighting kernel to change in size only. The “shaped” algorithm allows the kernel to change in size, rotate, and stretch each primary axis. Please see resample_utils.scaled_adaptive_weight_matrix() and resample_utils.shaped_adaptive_weight_matrix() for further details.
- fit_threshold : float, optional
If nonzero, rejects a polynomial fit if it deviates by |fit_threshold| * the RMS of the samples. If it is rejected, that value will be replaced by the mean (error weighted if set) if fit_threshold > 0, or by NaN if fit_threshold < 0.
- cval : float, optional
During fitting, any fit that does not meet the order_algorithm requirement will be set to this value. This will be NaN by default.
- edge_threshold : float or array_like of float
If set to a value > 0, edges of the fit will be masked out according to edge_algorithm. Values close to zero will result in a low degree of edge clipping, while values close to 1 clip edges to a greater extent. The exact definition of edge_threshold depends on the algorithm. For further details, please see resampling.resample_utils.check_edges().
- edge_algorithm : str, optional
Describes how to clip edges if edge_threshold is non-zero. The available algorithms are:
- ‘distribution’ (default): Statistics on sample distributions are calculated, and if the resampling point is > 1/threshold standard deviations away from the sample mean, it will be clipped.
- ‘ellipsoid’: If the samples used to fit a resampling point deviate from the resampling point location by more than this amount, it will be clipped.
- ‘box’: If the flattened 1-dimensional distribution of samples center-of-mass deviates from the resampling point location in any dimension, it will be clipped.
- ‘range’: Over each dimension, check the distribution of points is greater than edge_threshold to the “left” and “right” of the resampling point.
- order_algorithm : str, optional
The type of check to perform on whether the sample distribution for each resampling point is adequate to derive a polynomial fit. Depending on order and fix_order, if the distribution does not meet the criteria for order_algorithm, either the fit will be aborted, returning a value of cval, or the fit order will be reduced. Available algorithms are:
- ‘bounded’: Require that there are order samples in both the negative and positive directions of each feature from the resampling point.
- ‘counts’: Require that there are (order + 1) ** n_features samples within the window of each resampling point.
- ‘extrapolate’: Attempt to fit regardless of the sample distribution.
Note that ‘bounded’ is the most robust mode as it ensures that no singular values will be encountered during the least-squares fitting of polynomial coefficients.
- error_weighting : bool, optional
If True (default), weight polynomial fitting by the error values of each sample.
- estimate_covariance : bool, optional
If True, calculate errors on the fit using resample_utils.estimated_covariance_matrix_inverse(). Otherwise, use resample_utils.covariance_matrix_inverse().
- is_covar : bool, optional
If True, the input data is treated as a covariance instead of a flux, and is propagated as if through a weighted mean.
- jobs : int, optional
Specifies the maximum number of concurrently running jobs. An attempt will be made to parallel process using a thread-pool if available, but will otherwise revert to the “loky” backend. Values of 0 or 1 will result in serial processing. A negative value sets jobs to n_cpus + 1 + jobs, such that -1 would use all cpus, and -2 would use all but one cpu.
- use_threading : bool, optional
If True, force use of threads during multiprocessing.
- use_processes : bool, optional
If True, force use of sub-processes during multiprocessing.
- get_error : bool, optional
If True, returns the error, which is given as the weighted RMS of the samples used for each resampling point.
- get_counts : bool, optional
If True, returns the number of samples used to fit each resampling point.
- get_weights : bool, optional
If True, returns the sum of all sample weights (error and distance) used in the fit at each resampling point.
- get_distance_weights : bool, optional
If True, returns the sum of all sample distance weights (not including error) used in the fit at each resampling point.
- get_rchi2 : bool, optional
If True, returns the reduced chi-squared statistic of the fit at each resampling point. Note that this is only valid if errors were supplied during ResamplePolynomial.__init__().
- get_cross_derivatives : bool, optional
If True, returns the derivative mean-squared-cross-product of the fit derivatives at each resampling point.
- get_offset_variance : bool, optional
If True, returns the offset of the resampling point from the center of the sample distribution used to generate the fit as a variance, i.e., a return value of 9 indicates a 3-sigma deviation of the resampling point from the sample distribution.
- Returns:
- fit_data, [fit_error], [fit_counts]
The data fit at args and, optionally, the error associated with the fit and the number of samples used to generate each point. Will be of a shape determined by args. fit_data and fit_error will be of type np.float64, and fit_counts will be of type np.int64.
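A hedged sketch of the three call formats and the optional outputs described above; the resampler construction repeats the earlier example and the output grid values are arbitrary:
import numpy as np
from sofia_redux.toolkit.resampling import ResamplePolynomial

rand = np.random.RandomState(0)
coordinates = rand.uniform(0.0, 10.0, size=(2, 500))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])
resampler = ResamplePolynomial(coordinates, data, window=1.5, order=2)

# Grid format: one coordinate array per dimension.
x_out = np.linspace(0.0, 10.0, 50)
y_out = np.linspace(0.0, 10.0, 50)
fit, fit_error, fit_counts = resampler(
    x_out, y_out,
    smoothing=0.3,    # distance weighting factor exp(-dx^2 / smoothing)
    get_error=True,   # also return the weighted RMS of contributing samples
    get_counts=True)  # also return the number of samples used per point

# Single point and irregular formats.
value_at_point = resampler(5.0, 5.0)
irregular_fit = resampler(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))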
- calculate_adaptive_smoothing(settings)[source]¶
Calculate the adaptive distance weighting kernel.
Calculates a weighting kernel for each sample in the reduction, i.e., each sample will have a different weighting factor based on its distance (and optionally, direction) from each point at which a fit is required. This is done by performing a fit centered on each sample, and measuring how that fit deviates from the actual observed sample value. The first step is to decide upon the distance weighting factor used to perform the initial fit.
It is assumed that the coordinate error is known (or approximately known), and supplied as a \(1 \sigma\) error as the ‘alpha’ keyword in settings. For example, the error in astronomical observations taken using a Gaussian beam with known FWHM is:
\[\sigma = \frac{\text{FWHM}}{2 \sqrt{2 \ln{2}}}\]
However, adaptive weighting may be calculated over select dimensions, leaving some fixed. In this case, the alpha keyword will always apply a weighting factor in a “fixed” dimension \(k\) as:
\[w_k = \exp \left( \frac{-\Delta x_k^2}{\alpha_{fixed, k} \Omega_k^2} \right)\]
where \(\Delta x_k\) is the distance from the resampling point to the sample in question, and \(\Omega_k\) is the window principal axis in dimension \(k\). Note that in the weighting function, \(\sigma\) is related to \(\alpha\) by:
\[\alpha_k = 2 \sigma_k^2\]
and weighting is applied using the definition of \(w_k\) above. To signify that \(\alpha_{fixed, k}\) should be used instead of \(\sigma\) for dimension \(k\), the “adaptive_threshold” array in settings should be set to zero for the corresponding dimension. For example, if we had:
settings['alpha'] = [0.3, 0.3]
settings['adaptive_threshold'] = [1, 0]
the first dimension would have \(\alpha_0 = 0.18\) (\(2 \times 0.3^2\)), and the second would have \(\alpha_1 = 0.3\). In this example, \(\alpha_0\) would be allowed to vary per sample, while \(\alpha_1\) would be fixed for each sample at 0.3. An initial fit is then performed at each sample coordinate using a test distance weighting parameter:
\[\sigma_{test, k} = \frac{\pi \sigma_k}{\sqrt{2 \ln{2}}}\]
\[\alpha_{test, k} = 2 \sigma_{test, k}^2\]
Using this initial weighting parameter, the adaptive kernels are derived using either resample_utils.shaped_adaptive_weight_matrix() or resample_utils.scaled_adaptive_weight_matrix(), depending on whether the “shaped” keyword in settings is set to True or False respectively.
The weighting kernels are stored in the “adaptive_alpha” keyword value in settings.
- Parameters:
- settings : dict
Reduction settings, as returned by ResamplePolynomial.reduction_settings().
- Returns:
- None
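As a quick numerical check of the relations above (the FWHM value is arbitrary):
import numpy as np

fwhm = 4.5                                    # arbitrary Gaussian beam FWHM
sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))   # 1-sigma error supplied as 'alpha'
alpha = 2 * sigma ** 2                        # alpha_k = 2 * sigma_k^2

# Test weighting parameter used for the initial per-sample fits.
sigma_test = np.pi * sigma / np.sqrt(2 * np.log(2))
alpha_test = 2 * sigma_test ** 2
print(sigma, alpha, sigma_test, alpha_test)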
- calculate_minimum_points()[source]¶
Return the minimum number of points for a polynomial fit.
- Returns:
- minimum_pointsint
The minimum number of points for a polynomial fit
- static check_order(order, n_features, n_samples)[source]¶
Check the order is of correct format and enough samples exist.
- Parameters:
- order : int or array_like of int (n_features,)
The polynomial order to fit, either supplied as an integer to be applied over all dimensions, or as an array to give the order in each dimension.
- n_features : int
The number of features (dimensions) of the sample coordinates.
- n_samples : int
The number of samples.
- Returns:
- order : int or numpy.ndarray of int (n_features,)
The formatted polynomial order.
- Raises:
- ValueError
If there are too few samples for the desired order, or order is not formatted correctly.
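A hedged sketch relating check_order to the n_samples >= (order + 1) ** n_features requirement; the numbers are arbitrary:
from sofia_redux.toolkit.resampling import ResamplePolynomial

# A 2-feature, order-2 fit requires at least (2 + 1) ** 2 = 9 samples.
order = ResamplePolynomial.check_order(2, n_features=2, n_samples=100)
print(order)  # the formatted order (2)

# With too few samples, a ValueError is expected.
try:
    ResamplePolynomial.check_order(2, n_features=2, n_samples=4)
except ValueError as err:
    print(err)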
- pre_fit(settings, *args, adaptive_region_coordinates=None)[source]¶
Perform pre-fitting steps and build the fitting tree.
- Parameters:
- settings : dict
Settings calculated via reduction_settings to be applied if necessary.
- args : n-tuple
The call input arguments.
- adaptive_region_coordinates : numpy.ndarray (float), optional
The coordinates determined from a previous adaptive smoothing algorithm.
- Returns:
- None
- classmethod process_block(args, block)[source]¶
Run solve_fits() on each block.
Utility function that parses the settings and tree objects to something usable by the numba JIT compiled resampling functions. This is not meant to be called directly.
- Parameters:
- args : 2-tuple
A tuple of form (filename, iteration) where the filename is a string pointing towards a previously saved pickle file containing the relevant information for the reduction, if required. If set to None, the arguments are retrieved from the _global_resampling_values global parameter.
- block : int
The block index to process.
- Returns:
- results : 9-tuple of numpy.ndarray
The first element contains the fit point indices to be fit. For the remaining elements, please see solve_fits() return values.
- reduction_settings(smoothing=0.0, relative_smooth=False, adaptive_threshold=None, adaptive_algorithm='scaled', error_weighting=True, fit_threshold=0.0, cval=nan, edge_threshold=0.0, edge_algorithm='distribution', order_algorithm='bounded', is_covar=False, estimate_covariance=False, jobs=None, adaptive_region_coordinates=None, use_threading=None, use_processes=None)[source]¶
Define a set of reduction instructions based on user input.
This method is responsible for determining, formatting, and checking a number of variables required for the resampling algorithm based on user input. For detailed descriptions of user options, please see ResamplePolynomial.__call__().
- Parameters:
- smoothing : float or array_like (n_features,), optional
- relative_smooth : bool, optional
- adaptive_threshold : float or array_like (n_features,), optional
- adaptive_algorithm : str, optional
- error_weighting : bool, optional
- fit_threshold : float, optional
- cval : float, optional
- edge_threshold : float or array_like (n_features,), optional
- edge_algorithm : str, optional
- order_algorithm : str, optional
- is_covar : bool, optional
- estimate_covariance : bool, optional
- jobs : int, optional
- adaptive_region_coordinates : array_like, optional
- use_threading : bool, optional
If True, force use of threads during multiprocessing.
- use_processes : bool, optional
If True, force use of sub-processes during multiprocessing.
- Returns:
- settings : dict
The reduction settings. Also stored as ResamplePolynomial.fit_settings().
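A hedged sketch of building a settings dictionary directly; the sample data are invented, and the exact keys stored in the returned dictionary are implementation details, so only the keys are printed:
import numpy as np
from sofia_redux.toolkit.resampling import ResamplePolynomial

rand = np.random.RandomState(1)
coordinates = rand.uniform(0.0, 10.0, size=(2, 200))
data = np.sin(coordinates[0]) + np.cos(coordinates[1])
resampler = ResamplePolynomial(coordinates, data, window=1.5, order=2)

# Format and check the reduction options without running a full resampling.
settings = resampler.reduction_settings(smoothing=0.3,
                                        order_algorithm='bounded',
                                        error_weighting=True)
print(sorted(settings.keys()))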
- set_sample_tree(coordinates, radius=None, window_estimate_bins=10, window_estimate_percentile=50, window_estimate_oversample=2.0, leaf_size=40, large_data=None, check_memory=True, memory_buffer=0.25, **distance_kwargs)[source]¶
Build the sample tree from input coordinates.
- Parameters:
- coordinates : numpy.ndarray (float)
The input coordinates of shape (n_features, n_samples).
- radius : float or sequence (float), optional
The radius of the window around each fitting point used to determine sample selection for fit. If not supplied, will be estimated using ResamplePolynomial.estimate_feature_windows().
- window_estimate_bins : int, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- window_estimate_percentile : int or float, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- window_estimate_oversample : int or float, optional
Used to estimate the window if not supplied, using ResamplePolynomial.estimate_feature_windows().
- leaf_size : int, optional
Number of points at which to switch to brute-force during the ball tree query algorithm. See sklearn.neighbors.BallTree for further details.
- large_data : bool, optional
If True, indicates that this resampling algorithm will run on a large set of data, and the ball tree should be created on subsets of the data.
- check_memory : bool, optional
If True, check the memory requirements for resampling the supplied data.
- memory_buffer : float, optional
A fraction (positive or negative) with which to modify the memory estimates for the process memory requirements. Memory estimates are scaled by the factor 1 + memory_buffer.
- distance_kwargs : dict, optional
Optional keyword arguments passed into sklearn.neighbors.DistanceMetric(). The default is to use the “minkowski” definition with p=2, i.e., the Euclidean definition. This is important in determining which samples lie inside the window region of a resampling point, and when deriving distance weighting factors.
- Returns:
- None
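In normal use the sample tree is built automatically during initialization; the following is a hedged sketch of rebuilding it with an explicit per-feature radius (the data and radius values are arbitrary):
import numpy as np
from sofia_redux.toolkit.resampling import ResamplePolynomial

rand = np.random.RandomState(2)
coordinates = rand.uniform(0.0, 10.0, size=(2, 300))
data = np.sin(coordinates[0]) * np.cos(coordinates[1])
resampler = ResamplePolynomial(coordinates, data, window=1.0, order=1)

# Rebuild the sample tree with an explicit window radius for each feature
# instead of relying on estimate_feature_windows().
resampler.set_sample_tree(coordinates, radius=(1.5, 1.0), leaf_size=40)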