Tools
Independence Sims
Linear
hyppo.tools.linear(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate linear data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Linear \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= w^T X + \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import linear
>>> x, y = linear(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
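The generative model in the Notes can be sketched in plain Python without hyppo. This is a minimal illustration, not hyppo's implementation: the helper name `linear_sketch`, the weight vector `w` (taken here as w_d = 1/d), and the rule that `noise` switches \(\kappa\) between 0 and 1 are all assumptions, since this section does not define them.

```python
import random


def linear_sketch(n, p, noise=False, low=-1.0, high=1.0, seed=None):
    """Draw (x, y) from Y = w^T X + kappa * eps with X ~ U(low, high)^p."""
    rng = random.Random(seed)
    # Assumed weights w_d = 1/d; the section does not define w.
    w = [1.0 / (d + 1) for d in range(p)]
    # Assumed noise scale: kappa = 1 when noise is requested, else 0.
    kappa = 1.0 if noise else 0.0
    # x is an n-by-p nested list of uniform draws.
    x = [[rng.uniform(low, high) for _ in range(p)] for _ in range(n)]
    # y is an n-by-1 nested list, one response per sample.
    y = [
        [sum(wd * xd for wd, xd in zip(w, row)) + kappa * rng.gauss(0.0, 1.0)]
        for row in x
    ]
    return x, y


x, y = linear_sketch(100, 2, seed=0)
print(len(x), len(x[0]), len(y), len(y[0]))  # 100 2 100 1
```

The nested lists mirror the documented shapes (n, p) and (n, 1); the hyppo function returns NumPy arrays instead.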
Exponential
hyppo.tools.exponential(n, p, noise=False, low=0, high=3)
Simulates univariate or multivariate exponential data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: 0)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 3)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Exponential \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(0, 3)^p \\ Y &= \exp (w^T X) + 10 \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import exponential
>>> x, y = exponential(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Cubic
hyppo.tools.cubic(n, p, noise=False, low=-1, high=1, cubs=[-12, 48, 128], scale=0.3333333333333333)
Simulates univariate or multivariate cubic data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
- cubs : list of ints (default: [-12, 48, 128])
Coefficients of the cubic polynomial, listed in increasing order of degree: the linear, quadratic, and cubic terms.
- scale : float (default: 1/3)
The center of the cubic, that is, the offset subtracted from \(w^T X\) before the polynomial is applied.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Cubic \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= 128 \left( w^T X - \frac{1}{3} \right)^3 + 48 \left( w^T X - \frac{1}{3} \right)^2 - 12 \left( w^T X - \frac{1}{3} \right) + 80 \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import cubic
>>> x, y = cubic(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Joint Normal
hyppo.tools.joint_normal(n, p, noise=False)
Simulates univariate or multivariate joint-normal data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Joint Normal \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): Let \(\rho = \frac{1}{2p}\), \(I_p\) be the identity matrix of size \(p \times p\), \(J_p\) be the matrix of ones of size \(p \times p\) and \(\Sigma = \begin{bmatrix} I_p & \rho J_p \\ \rho J_p & (1 + 0.5\kappa) I_p \end{bmatrix}\). Then,
\[(X, Y) \sim \mathcal{N}(0, \Sigma)\]
Examples
>>> from hyppo.tools import joint_normal
>>> x, y = joint_normal(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Step
hyppo.tools.step(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate step data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Step \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= \mathbb{1}_{w^T X > 0} + \epsilon\end{split}\]where \(\mathbb{1}\) is the indicator function.
Examples
>>> from hyppo.tools import step
>>> x, y = step(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Quadratic
hyppo.tools.quadratic(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate quadratic data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Quadratic \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= (w^T X)^2 + 0.5 \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import quadratic
>>> x, y = quadratic(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
W-Shaped
hyppo.tools.w_shaped(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate W-shaped data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
W-Shaped \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\): \(U \sim \mathcal{U}(-1, 1)^p\),
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= \left[ \left( (w^T X)^2 - \frac{1}{2} \right)^2 + \frac{w^T U}{500} \right] + 0.5 \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import w_shaped
>>> x, y = w_shaped(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Spiral
hyppo.tools.spiral(n, p, noise=False, low=0, high=5)
Simulates univariate or multivariate spiral data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: 0)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 5)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Spiral \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\): \(U \sim \mathcal{U}(0, 5)\), \(\epsilon \sim \mathcal{N}(0, 1)\)
\[\begin{split}X_{|d|} &= U \sin(\pi U) \cos^d(\pi U)\ \mathrm{for}\ d = 1,...,p-1 \\ X_{|p|} &= U \cos^p(\pi U) \\ Y &= U \sin(\pi U) + 0.4 p \epsilon\end{split}\]
Examples
>>> from hyppo.tools import spiral
>>> x, y = spiral(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
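The spiral equations in the Notes translate almost line for line into code. The following is a plain-Python sketch of those equations only, not hyppo's implementation; the name `spiral_sketch` and the choice to drop the \(\epsilon\) term entirely when `noise=False` are assumptions.

```python
import math
import random


def spiral_sketch(n, p, noise=False, low=0.0, high=5.0, seed=None):
    """Draw (x, y) following the spiral equations with U ~ U(low, high)."""
    rng = random.Random(seed)
    # Assumption: noise=False removes the 0.4 * p * eps term.
    kappa = 1.0 if noise else 0.0
    x, y = [], []
    for _ in range(n):
        u = rng.uniform(low, high)
        s, c = math.sin(math.pi * u), math.cos(math.pi * u)
        # X_d = U sin(pi U) cos^d(pi U) for d = 1, ..., p - 1
        row = [u * s * c ** d for d in range(1, p)]
        # X_p = U cos^p(pi U)
        row.append(u * c ** p)
        x.append(row)
        # Y = U sin(pi U) + 0.4 p eps
        y.append([u * s + kappa * 0.4 * p * rng.gauss(0.0, 1.0)])
    return x, y


x, y = spiral_sketch(100, 2, seed=0)
print(len(x), len(x[0]), len(y), len(y[0]))  # 100 2 100 1
```

For p = 1 without noise, each point satisfies \(X^2 + Y^2 = U^2\), so the radius grows with U, which is what produces the spiral shape.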
Bernoulli
hyppo.tools.uncorrelated_bernoulli(n, p, noise=False, prob=0.5)
Simulates univariate or multivariate uncorrelated Bernoulli data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- prob : float, (default: 0.5)
The probability of the Bernoulli distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Uncorrelated Bernoulli \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\): \(U \sim \mathcal{B}(0.5)\), \(\epsilon_1 \sim \mathcal{N}(0, I_p)\), \(\epsilon_2 \sim \mathcal{N}(0, 1)\),
\[\begin{split}X &= \mathcal{B}(0.5)^p + 0.5 \epsilon_1 \\ Y &= (2U - 1) w^T X + 0.5 \epsilon_2\end{split}\]
Examples
>>> from hyppo.tools import uncorrelated_bernoulli
>>> x, y = uncorrelated_bernoulli(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Logarithmic
hyppo.tools.logarithmic(n, p, noise=False)
Simulates univariate or multivariate logarithmic data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Logarithmic \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(\epsilon \sim \mathcal{N}(0, I_p)\),
\[\begin{split}X &\sim \mathcal{N}(0, I_p) \\ Y_{|d|} &= 2 \log_2 (|X_{|d|}|) + 3 \kappa \epsilon_{|d|} \ \mathrm{for}\ d = 1, ..., p\end{split}\]
Examples
>>> from hyppo.tools import logarithmic
>>> x, y = logarithmic(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Fourth Root
hyppo.tools.fourth_root(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate fourth root data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Fourth Root \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\):
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= |w^T X|^\frac{1}{4} + \frac{\kappa}{4} \epsilon\end{split}\]
Examples
>>> from hyppo.tools import fourth_root
>>> x, y = fourth_root(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Sine \(4\pi\)
hyppo.tools.sin_four_pi(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate sine \(4\pi\) data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Sine \(4\pi\) \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)\), \(V \sim \mathcal{N}(0, 1)^p\), \(\theta = 4 \pi\),
\[\begin{split}X_{|d|} &= U + 0.02 p V_{|d|}\ \mathrm{for}\ d = 1, ..., p \\ Y &= \sin (\theta X) + \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import sin_four_pi
>>> x, y = sin_four_pi(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Sine \(16\pi\)
hyppo.tools.sin_sixteen_pi(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate sine \(16\pi\) data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Sine \(16\pi\) \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)\), \(V \sim \mathcal{N}(0, 1)^p\), \(\theta = 16 \pi\),
\[\begin{split}X_{|d|} &= U + 0.02 p V_{|d|}\ \mathrm{for}\ d = 1, ..., p \\ Y &= \sin (\theta X) + \kappa \epsilon\end{split}\]
Examples
>>> from hyppo.tools import sin_sixteen_pi
>>> x, y = sin_sixteen_pi(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Square
hyppo.tools.square(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate square data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Square \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)\), \(V \sim \mathcal{N}(0, 1)^p\), \(\theta = -\frac{\pi}{8}\),
\[\begin{split}X_{|d|} &= U \cos(\theta) + V \sin(\theta) + 0.05 p \epsilon_{|d|} \ \mathrm{for}\ d = 1, ..., p \\ Y_{|d|} &= -U \sin(\theta) + V \cos(\theta)\end{split}\]
Examples
>>> from hyppo.tools import square
>>> x, y = square(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Two Parabolas
hyppo.tools.two_parabolas(n, p, noise=False, low=-1, high=1, prob=0.5)
Simulates univariate or multivariate two parabolas data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
- prob : float, (default: 0.5)
The probability of the Bernoulli distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, 1) where n is the number of samples and p is the number of dimensions.
Notes
Two Parabolas \((X, Y) \in \mathbb{R}^p \times \mathbb{R}\): \(U \sim \mathcal{B}(0.5)\),
\[\begin{split}X &\sim \mathcal{U}(-1, 1)^p \\ Y &= ((w^T X)^2 + 2 \kappa \epsilon) \times \left( U - \frac{1}{2} \right)\end{split}\]
Examples
>>> from hyppo.tools import two_parabolas
>>> x, y = two_parabolas(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 1)
Circle
hyppo.tools.circle(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate circle data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Circle \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)^p\), \(\epsilon \sim \mathcal{N}(0, I_p)\), \(r = 1\),
\[\begin{split}X_{|d|} &= r \left( \sin(\pi U_{|d+1|}) \prod_{j=1}^d \cos(\pi U_{|j|}) + 0.4 \epsilon_{|d|} \right)\ \mathrm{for}\ d = 1, ..., p-1 \\ X_{|p|} &= r \left( \prod_{j=1}^p \cos(\pi U_{|j|}) + 0.4 \epsilon_{|p|} \right) \\ Y_{|d|} &= \sin(\pi U_{|1|})\end{split}\]
Examples
>>> from hyppo.tools import circle
>>> x, y = circle(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Ellipse
hyppo.tools.ellipse(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate ellipse data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Ellipse \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)^p\), \(\epsilon \sim \mathcal{N}(0, I_p)\), \(r = 5\),
\[\begin{split}X_{|d|} &= r \left( \sin(\pi U_{|d+1|}) \prod_{j=1}^d \cos(\pi U_{|j|}) + 0.4 \epsilon_{|d|} \right)\ \mathrm{for}\ d = 1, ..., p-1 \\ X_{|p|} &= r \left( \prod_{j=1}^p \cos(\pi U_{|j|}) + 0.4 \epsilon_{|p|} \right) \\ Y_{|d|} &= \sin(\pi U_{|1|})\end{split}\]
Examples
>>> from hyppo.tools import ellipse
>>> x, y = ellipse(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Diamond
hyppo.tools.diamond(n, p, noise=False, low=-1, high=1)
Simulates univariate or multivariate diamond data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: False)
Whether or not to include noise in the simulation.
- low : float, (default: -1)
The lower limit of the uniform distribution simulated from.
- high : float, (default: 1)
The upper limit of the uniform distribution simulated from.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Diamond \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{U}(-1, 1)\), \(V \sim \mathcal{N}(0, 1)^p\), \(\theta = -\frac{\pi}{4}\),
\[\begin{split}X_{|d|} &= U \cos(\theta) + V \sin(\theta) + 0.05 p \epsilon_{|d|}\ \mathrm{for}\ d = 1, ..., p \\ Y_{|d|} &= -U \sin(\theta) + V \cos(\theta)\end{split}\]
Examples
>>> from hyppo.tools import diamond
>>> x, y = diamond(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Multiplicative Noise
hyppo.tools.multiplicative_noise(n, p)
Simulates univariate or multivariate multiplicative noise data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Multiplicative Noise \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{N}(0, I_p)\),
\[\begin{split}X &\sim \mathcal{N}(0, I_p) \\ Y_{|d|} &= U_{|d|} X_{|d|}\ \mathrm{for}\ d = 1, ..., p\end{split}\]
Examples
>>> from hyppo.tools import multiplicative_noise
>>> x, y = multiplicative_noise(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
Multimodal Independence
hyppo.tools.multimodal_independence(n, p, prob=0.5, sep1=3, sep2=2)
Simulates univariate or multivariate multimodal independence data.
Parameters: - n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- prob : float, (default: 0.5)
The probability of the Bernoulli distribution simulated from.
- sep1, sep2: float, (default: 3, 2)
The separation between clusters of normally distributed data.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n, p) and (n, p) where n is the number of samples and p is the number of dimensions.
Notes
Multimodal Independence \((X, Y) \in \mathbb{R}^p \times \mathbb{R}^p\): \(U \sim \mathcal{N}(0, I_p)\), \(V \sim \mathcal{N}(0, I_p)\), \(U^\prime \sim \mathcal{B}(0.5)^p\), \(V^\prime \sim \mathcal{B}(0.5)^p\),
\[\begin{split}X &= \frac{U}{3} + 2 U^\prime - 1 \\ Y &= \frac{V}{3} + 2 V^\prime - 1\end{split}\]
Examples
>>> from hyppo.tools import multimodal_independence
>>> x, y = multimodal_independence(100, 2)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
K-Sample Sims
2-Sample Rotated Simulation
hyppo.tools.rot_2samp(sim, n, p, noise=True, degree=90)
Rotates input simulations to produce a 2-sample simulation.
Parameters: - sim : callable()
The simulation (from the hyppo.tools module) that is to be rotated.
- n : int
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: True)
Whether or not to include noise in the simulation.
- degree : float, (default: 90)
The number of degrees to rotate the input simulation by (in first dimension).
Returns: - samp1, samp2 : ndarray
Rotated data matrices. samp1 and samp2 have shapes (n, p+1) and (n, p+1) or (n, 2p) and (n, 2p) depending on the independence simulation. Here, n is the number of samples and p is the number of dimensions.
Examples
>>> from hyppo.tools import rot_2samp, linear
>>> x, y = rot_2samp(linear, 100, 1)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
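To make the rotation step concrete, the sketch below rotates a two-column sample by a given number of degrees. It is an illustration under stated assumptions rather than hyppo's code: the helper `rotate_first_plane` is hypothetical, the rotation is assumed to act on the plane spanned by the first two columns, and the first sample is a hand-built stand-in for a noiseless linear simulation with p = 1.

```python
import math
import random


def rotate_first_plane(sample, degree):
    """Rotate each row by `degree` degrees in the plane of its first two columns."""
    theta = math.radians(degree)
    c, s = math.cos(theta), math.sin(theta)
    rotated = []
    for row in sample:
        a, b = row[0], row[1]
        # Standard 2-D rotation; any remaining columns are left unchanged.
        rotated.append([c * a - s * b, s * a + c * b] + list(row[2:]))
    return rotated


rng = random.Random(0)
# Hypothetical stand-in for a noiseless linear simulation with p = 1 (y = x),
# stored as one (n, p + 1) matrix with x and y side by side.
values = [rng.uniform(-1.0, 1.0) for _ in range(100)]
samp1 = [[v, v] for v in values]
samp2 = rotate_first_plane(samp1, 90)
print(len(samp1), len(samp1[0]), len(samp2), len(samp2[0]))  # 100 2 100 2
```

With a 90 degree rotation, the line y = x in the first sample becomes the line y = -x in the second, so the two samples come from visibly different distributions.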
2-Sample Translated Simulation
hyppo.tools.trans_2samp(sim, n, p, noise=True, degree=90, trans=0.3)
Translates and rotates input simulations to produce a 2-sample simulation.
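Parameters: - sim : callable()
The simulation (from the hyppo.tools module) that is to be translated and rotated.
- n : int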
The number of samples desired by the simulation.
- p : int
The number of dimensions desired by the simulation.
- noise : bool, (default: True)
Whether or not to include noise in the simulation.
- degree : float, (default: 90)
The number of degrees to rotate the input simulation by (in first dimension).
- trans : float, (default: 0.3)
The amount to translate the second simulation by (in first dimension).
Returns: - samp1, samp2 : ndarray
Translated/rotated data matrices. samp1 and samp2 have shapes (n, p+1) and (n, p+1) or (n, 2p) and (n, 2p) depending on the independence simulation. Here, n is the number of samples and p is the number of dimensions.
Examples
>>> from hyppo.tools import trans_2samp, linear
>>> x, y = trans_2samp(linear, 100, 1)
>>> print(x.shape, y.shape)
(100, 2) (100, 2)
3-Sample Gaussian Simulation
hyppo.tools.gaussian_3samp(n, epsilon=1, weight=0, case=1)
Generates three samples of Gaussians corresponding to one of 5 cases.
Parameters: - n : int
The number of samples desired by the simulation.
- epsilon : float, (default: 1)
The amount to translate simulation by (amount depends on case).
- weight : float, (default: 0)
Number between 0 and 1 corresponding to weight of the second Gaussian (used in case 4 and 5 to produce a mixture of Gaussians)
- case : {1, 2, 3, 4, 5}, (default: 1)
The case in which to evaluate statistical power for each test.
Returns: - sims : list of ndarray
List of 3 two-dimensional multivariate Gaussian samples, each corresponding to the desired case.
Examples
>>> from hyppo.tools import gaussian_3samp
>>> sims = gaussian_3samp(100)
>>> print(sims[0].shape, sims[1].shape, sims[2].shape)
(100, 2) (100, 2) (100, 2)
Time-Series Sims
Independent AR Process
hyppo.tools.indep_ar(n, lag=1, phi=0.5, sigma=1)
Simulates two independent, stationary, autoregressive time series.
Parameters: - n : int
The number of samples desired by the simulation.
- lag : float, optional (default: 1)
The maximum time lag considered between x and y.
- phi : float, optional (default: 0.5)
The AR coefficient.
- sigma : float, optional (default: 1)
The variance of the noise.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n,) and (n,) where n is the number of samples.
Notes
\(X_t\) and \(Y_t\) are univariate AR(lag) with \(\phi = 0.5\) for both series. Noise follows \(\mathcal{N}(0, \sigma)\). With lag (1), this is
\[\begin{split}\begin{bmatrix} X_t \\ Y_t \end{bmatrix} = \begin{bmatrix} \phi & 0 \\ 0 & \phi \end{bmatrix} \begin{bmatrix} X_{t - 1} \\ Y_{t - 1} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ \eta_t \end{bmatrix}\end{split}\]
Examples
>>> from hyppo.tools import indep_ar
>>> x, y = indep_ar(100)
>>> print(x.shape, y.shape)
(100,) (100,)
Linear AR Process
hyppo.tools.cross_corr_ar(n, lag=1, phi=0.5, sigma=1)
Simulates two linearly dependent time series.
Parameters: - n : int
The number of samples desired by the simulation.
- lag : float, optional (default: 1)
The maximum time lag considered between x and y.
- phi : float, optional (default: 0.5)
The AR coefficient.
- sigma : float, optional (default: 1)
The variance of the noise.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n,) and (n,) where n is the number of samples.
Notes
\(X_t\) and \(Y_t\) are together a bivariate AR(lag) with \(\phi = \begin{bmatrix} 0 & 0.5 \\ 0.5 & 0 \end{bmatrix}\) for both series. Noise follows \(\mathcal{N}(0, \sigma)\). With lag (1), this is
\[\begin{split}\begin{bmatrix} X_t \\ Y_t \end{bmatrix} = \begin{bmatrix} 0 & \phi \\ \phi & 0 \end{bmatrix} \begin{bmatrix} X_{t - 1} \\ Y_{t - 1} \end{bmatrix} + \begin{bmatrix} \epsilon_t \\ \eta_t \end{bmatrix}\end{split}\]
Examples
>>> from hyppo.tools import cross_corr_ar
>>> x, y = cross_corr_ar(100)
>>> print(x.shape, y.shape)
(100,) (100,)
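The lag-1 recursion above, and the one for the independent AR process, differ only in the 2-by-2 coefficient matrix. The following plain-Python sketch simulates both from that matrix. It is not hyppo's implementation: the name `var1_sketch`, the initialization of the series with a single noise draw, and the reading of `sigma` as a variance (so the standard deviation is its square root) are assumptions.

```python
import math
import random


def var1_sketch(n, coef, sigma=1.0, seed=None):
    """Simulate [X_t, Y_t] = coef @ [X_{t-1}, Y_{t-1}] + [eps_t, eta_t]."""
    rng = random.Random(seed)
    # The parameter list describes sigma as the noise variance.
    sd = math.sqrt(sigma)
    # Assumption: both series start from one noise draw.
    x, y = [rng.gauss(0.0, sd)], [rng.gauss(0.0, sd)]
    for _ in range(1, n):
        prev_x, prev_y = x[-1], y[-1]
        x.append(coef[0][0] * prev_x + coef[0][1] * prev_y + rng.gauss(0.0, sd))
        y.append(coef[1][0] * prev_x + coef[1][1] * prev_y + rng.gauss(0.0, sd))
    return x, y


phi = 0.5
# Off-diagonal coefficients: each series depends on the other's past value.
x_dep, y_dep = var1_sketch(100, [[0.0, phi], [phi, 0.0]], seed=0)
# Diagonal coefficients: each series depends only on its own past value.
x_ind, y_ind = var1_sketch(100, [[phi, 0.0], [0.0, phi]], seed=0)
print(len(x_dep), len(y_dep))  # 100 100
```

Writing the coefficient as a matrix makes the contrast explicit: zeros off the diagonal give independent series, while zeros on the diagonal give series that are only cross-correlated at lag 1.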
Nonlinear AR Process
hyppo.tools.nonlinear_process(n, lag=1, phi=1, sigma=1)
Simulates two nonlinearly dependent time series.
Parameters: - n : int
The number of samples desired by the simulation.
- lag : float, optional (default: 1)
The maximum time lag considered between x and y.
- phi : float, optional (default: 1)
The AR coefficient.
- sigma : float, optional (default: 1)
The variance of the noise.
Returns: - x, y : ndarray
Simulated data matrices. x and y have shapes (n,) and (n,) where n is the number of samples.
Notes
\(X_t\) and \(Y_t\) are together a bivariate nonlinear process. Noise follows \(\mathcal{N}(0, \sigma)\). With lag (1), this is
\[\begin{split}\begin{bmatrix} X_t \\ Y_t \end{bmatrix} = \begin{bmatrix} \phi \epsilon_t Y_{t - 1} \\ \eta_t \end{bmatrix}\end{split}\]
Examples
>>> from hyppo.tools import nonlinear_process
>>> x, y = nonlinear_process(100)
>>> print(x.shape, y.shape)
(100,) (100,)
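A plain-Python sketch of the lag-1 nonlinear recursion from the Notes follows. It is only an illustration of the equation, not hyppo's code: the name `nonlinear_sketch`, the extra unobserved draw used for \(Y_0\), and the reading of `sigma` as a variance are assumptions.

```python
import math
import random


def nonlinear_sketch(n, phi=1.0, sigma=1.0, seed=None):
    """Simulate X_t = phi * eps_t * Y_{t-1} and Y_t = eta_t."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma)  # sigma is documented as the noise variance
    # Assumption: Y_0 is an extra noise draw that is not returned.
    prev_y = rng.gauss(0.0, sd)
    x, y = [], []
    for _ in range(n):
        eps, eta = rng.gauss(0.0, sd), rng.gauss(0.0, sd)
        x.append(phi * eps * prev_y)
        y.append(eta)
        prev_y = eta
    return x, y


x, y = nonlinear_sketch(100, seed=0)
print(len(x), len(y))  # 100 100
```

Because \(\epsilon_t\) has mean zero and is independent of \(Y_{t-1}\), \(X_t\) is uncorrelated with \(Y_{t-1}\) even though its spread depends on it; this is the kind of dependence that a purely linear measure misses.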
Misc
Kernel Matrix Computation
hyppo.tools.compute_kern(x, y, metric='gaussian', workers=1, **kwargs)
Compute kernel similarity matrix for the input matrices.
Parameters: - x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be kernel similarity matrices, where the shapes must both be (n, n); in that case no kernel will be computed.
- metric : str, optional (default: "gaussian")
A function that computes the kernel similarity among the samples within each data matrix. Valid strings for metric are, as defined in sklearn.metrics.pairwise.pairwise_kernels,
['additive_chi2', 'chi2', 'linear', 'poly', 'polynomial', 'gaussian', 'laplacian', 'sigmoid', 'cosine']
Set to None or precomputed if x and y are already similarity matrices. To call a custom function, either create the similarity matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise similarities are calculated and kwargs are extra arguments to send to your custom function.
- workers : int, optional (default: 1)
The number of cores to parallelize the computation over. Supply -1 to use all cores available to the process.
- **kwargs : optional
Optional arguments provided to
sklearn.metrics.pairwise.pairwise_kernels
or a custom kernel function.
Returns: - simx, simy : ndarray
Similarity matrices based on the metric provided by the user.
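As a point of reference for what a similarity matrix looks like, the sketch below builds a Gaussian kernel matrix by hand for one data matrix. It is a minimal illustration and not what compute_kern does internally: the helper name `gaussian_kernel_matrix` and the fixed bandwidth parameter `gamma` are assumptions, and the library may choose the bandwidth differently.

```python
import math


def gaussian_kernel_matrix(data, gamma=1.0):
    """Return the n x n matrix K[i][j] = exp(-gamma * ||data[i] - data[j]||^2)."""
    n = len(data)
    kern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between samples i and j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
            kern[i][j] = math.exp(-gamma * sq_dist)
    return kern


x = [[0.0], [1.0], [3.0]]  # three samples, one dimension
simx = gaussian_kernel_matrix(x)
print(len(simx), len(simx[0]))  # 3 3
```

The result is symmetric with ones on the diagonal, and entries shrink toward zero as samples move apart, which is the opposite orientation from a distance matrix.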
Distance Matrix Computation
hyppo.tools.compute_dist(x, y, metric='euclidean', workers=None, **kwargs)
Compute distance matrix for the input matrices.
Parameters: - x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n); in that case no distance will be computed.
- metric : str, optional (default: "euclidean")
A function that computes the distance among the samples within each data matrix. Valid strings for metric are, as defined in sklearn.metrics.pairwise_distances,
- From scikit-learn: ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan']. See the documentation for scipy.spatial.distance for details on these metrics.
- From scipy.spatial.distance: ['braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule']. See the documentation for scipy.spatial.distance for details on these metrics.
Set to None or precomputed if x and y are already distance matrices. To call a custom function, either create the distance matrix beforehand or create a function of the form metric(x, **kwargs), where x is the data matrix for which pairwise distances are calculated and kwargs are extra arguments to send to your custom function.
- workers : int, optional (default: None)
The number of cores to parallelize the computation over. Supply -1 to use all cores available to the process.
- **kwargs : optional
Optional arguments provided to
sklearn.metrics.pairwise_distances
or a custom distance function.
Returns: - distx, disty : ndarray
Distance matrices based on the metric provided by the user.
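The custom-function form metric(x, **kwargs) described for the metric parameter can be sketched as follows. The function `manhattan_metric` and its `scale` keyword are hypothetical names invented for this illustration; the block only calls the function directly and does not show how hyppo invokes it.

```python
def manhattan_metric(x, **kwargs):
    """Custom metric of the form metric(x, **kwargs).

    Takes one data matrix (a list of rows) and returns the n x n matrix of
    pairwise L1 distances. `scale` is a hypothetical extra keyword argument.
    """
    scale = kwargs.get("scale", 1.0)
    return [
        [scale * sum(abs(a - b) for a, b in zip(row_i, row_j)) for row_j in x]
        for row_i in x
    ]


x = [[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]]
distx = manhattan_metric(x)
print(distx[0])  # [0.0, 3.0, 4.0]
```

A function with this shape receives a single data matrix and returns a square matrix, so it would be applied to x and y separately to produce distx and disty.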
Permutation Test
hyppo.tools.perm_test(calc_stat, x, y, reps=1000, workers=1, is_distsim=True, perm_blocks=None)
Calculate the p-value for a nonparametric test via permutation.
This process is completed by first randomly permuting \(y\) to estimate the null distribution and then calculating the probability of observing a test statistic, under the null, at least as extreme as the observed test statistic.
Parameters: - calc_stat : callable()
The method used to calculate the test statistic (must use hyppo API)
- x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
- reps : int, optional (default: 1000)
The number of replications used to estimate the null distribution when using the permutation test used to calculate the p-value.
- workers : int, optional (default: 1)
The number of cores to parallelize the p-value computation over. Supply -1 to use all cores available to the Process.
- is_distsim : bool, optional (default: True)
Whether or not x and y are distance or similarity matrices. Changes the permutation style of y.
Returns: - stat : float
The computed test statistic.
- pvalue : float
The computed p-value.
- null_dist : list
The approximated null distribution of shape (reps,).
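The procedure described above (permute y, recompute the statistic, compare against the observed value) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not hyppo's implementation: `perm_test_sketch` and `abs_corr` are hypothetical helpers, the inputs are plain 1-D lists rather than data or distance matrices, there is no parallelism or block permutation, and the add-one p-value correction is an assumed convention.

```python
import math
import random


def perm_test_sketch(calc_stat, x, y, reps=1000, seed=None):
    """Estimate a p-value by permuting y to approximate the null distribution."""
    rng = random.Random(seed)
    stat = calc_stat(x, y)
    null_dist = []
    for _ in range(reps):
        perm_y = list(y)
        rng.shuffle(perm_y)  # break any pairing between x and y
        null_dist.append(calc_stat(x, perm_y))
    # Assumed add-one correction so the estimated p-value is never exactly 0.
    pvalue = (1 + sum(s >= stat for s in null_dist)) / (1 + reps)
    return stat, pvalue, null_dist


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a stand-in test statistic."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


x = [float(v) for v in range(20)]
y = [2.0 * v + 1.0 for v in x]  # perfectly linear in x
stat, pvalue, null_dist = perm_test_sketch(abs_corr, x, y, reps=200, seed=0)
print(len(null_dist), pvalue < 0.05)  # 200 True
```

The three return values mirror the documented outputs: the observed statistic, the p-value, and the list of statistics computed under permutation.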
Chi-Squared Approximation
hyppo.tools.chi2_approx(calc_stat, x, y)
Calculate the p-value for Dcorr and Hsic via a chi-squared approximation.
In the case of distance and kernel methods, Dcorr (and by extension Hsic [2]) can be approximated via a chi-squared distribution [1]. This approximation is also applicable for the nonparametric MANOVA via independence testing method in our package [3].
Parameters: - calc_stat : callable()
The method used to calculate the test statistic (must use hyppo API).
- x, y : ndarray
Input data matrices. x and y must have the same number of samples. That is, the shapes must be (n, p) and (n, q) where n is the number of samples and p and q are the number of dimensions. Alternatively, x and y can be distance matrices, where the shapes must both be (n, n).
Returns: - stat : float
The computed test statistic.
- pvalue : float
The computed p-value.
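A sketch of how such a p-value could be obtained from a statistic is given below. The specific rule used here, comparing \(n \cdot \mathrm{stat} + 1\) against a chi-squared distribution with one degree of freedom, is an assumption based on the approach in reference [1] and is not confirmed by this page; the helper names are hypothetical. The survival function uses the identity \(P(\chi^2_1 > v) = \mathrm{erfc}(\sqrt{v/2})\), so no external library is needed.

```python
import math


def chi2_sf_1dof(value):
    """Survival function P(X > value) for a chi-squared variable with 1 dof."""
    if value <= 0:
        return 1.0
    return math.erfc(math.sqrt(value / 2.0))


def chi2_pvalue_sketch(stat, n):
    """Assumed rule: p-value = P(chi2_1 > n * stat + 1)."""
    return chi2_sf_1dof(n * stat + 1.0)


# A larger statistic at the same sample size gives a smaller p-value.
print(chi2_pvalue_sketch(0.01, 100) > chi2_pvalue_sketch(0.1, 100))  # True
```

Unlike the permutation test, this needs only one evaluation of the statistic, which is why the approximation is attractive for large samples.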
References
[1] Shen, C., & Vogelstein, J. T. (2019). The Chi-Square Test of Distance Correlation. arXiv preprint arXiv:1912.12150.
[2] Shen, C., & Vogelstein, J. T. (2018). The exact equivalence of distance and kernel methods for hypothesis testing. arXiv preprint arXiv:1806.05514.
[3] Panda, S., Shen, C., Perry, R., Zorn, J., Lutz, A., Priebe, C. E., & Vogelstein, J. T. (2019). Nonparametric MANOVA via Independence Testing. arXiv e-prints, arXiv-1910.