Outlier Rejection

The sofia_redux.toolkit.stats.stats module incorporates outlier rejection in several functions. Given a user defined threshold, sample \(i\) in distribution \(x\) is defined as an outlier if:

\begin{eqnarray} & \frac{| x_{i} - median\{x\} |}{MAD} > threshold \\ & \text{where,} \\ & MAD = 1.482 \times median\{ | x_{i} - median\{x\} | \} \end{eqnarray}

The find_outliers() function is used to identify outliers in a sample distribution supplied as an N-dimensional array. The search may be applied along a single dimension, or over the full set. Output supplied as a Boolean mask with the same shape as the input array where False indicates an outlier.

import numpy as np
import matplotlib.pyplot as plt
from sofia_redux.toolkit.stats import find_outliers

rand = np.random.RandomState(42)
x = rand.rand(16, 16) - 0.5
outliers = find_outliers(x, threshold=5)
assert outliers.all()  # Verify no outliers identified at this stage

# Insert a single bad row
x[4] += 100

# Insert a single bad column
x[:, 4] += 100

# Find outliers from the entire distribution, and along each dimension
full_outliers = find_outliers(x, threshold=5)
row_outliers = find_outliers(x, axis=0, threshold=5)
col_outliers = find_outliers(x, axis=1, threshold=5)

plt.figure(figsize=(10,5))
plt.subplot(131)
plt.imshow(full_outliers, cmap='gray')
plt.title("Ouliers over full distribution")
plt.subplot(132)
plt.imshow(row_outliers, cmap='gray')
plt.title("Outliers along dimension 0")
plt.subplot(133)
plt.imshow(col_outliers, cmap='gray')
plt.title("Outliers along dimension 1")

(Source code, png, hires.png, pdf)

../../../_images/stats-1.png

Mean and Median Combination

The meancomb() function returns the weighted or straight mean of a data array over the entire set or along an arbitrary dimension. Likewise, the medcomb() function will return the median in the same fashion. If provided, variance will be propagated accordingly. NaNs may either be ignored and excluded (default) or propagated through all calculations.

from sofia_redux.toolkit.stats import meancomb, medcomb
import numpy as np

test_array = np.arange(16).reshape((4, 4)).astype(float)
variance = np.full_like(test_array, 1)

print(test_array)
# [[ 0.  1.  2.  3.]
# [ 4.  5.  6.  7.]
# [ 8.  9. 10. 11.]
# [12. 13. 14. 15.]]
print(meancomb(test_array, datavar=variance))
# (7.5, 0.0625)
print(medcomb(test_array, variance=variance))
# (7.5, 0.09817477042468103)
print(meancomb(test_array, datavar=variance, axis=0))
# (array([6., 7., 8., 9.]), array([0.25, 0.25, 0.25, 0.25]))

Statistics

The moments() function is used to provide statistics on a set of data, optionally avoiding deviant data points as described in Outlier Rejection above.

from sofia_redux.toolkit.stats import moments
import numpy as np

rand = np.random.RandomState(42)
data = rand.rand(100, 3)
data[50, 1] = 1e6  # insert a single spurious data value

stats = moments(data, axis=0)
stats.keys()
# dict_keys(['mask', 'mean', 'var', 'stddev', 'skewness', 'kurtosis'])
stats['mean']
# array([4.76849233e-01, 1.00005101e+04, 4.96303194e-01])
stats['stddev']
# array([2.85957102e-01, 9.99999485e+04, 2.88113989e-01])

# Repeat using outlier rejection
stats = moments(data, axis=0, threshold=5)
stats['mask'].sum()
# 299  # one sample has been masked out
stats['mean']
# array([0.47684923, 0.51521793, 0.49630319])
stats['stddev']
# array([0.2859571 , 0.31056431, 0.28811399])