SamsRegressor

class pyoptex.analysis.estimators.sams.estimator.SamsRegressor(factors=(), Y2X=<function identityY2X>, random_effects=(), dependencies=None, mode=None, forced_model=None, model_size=None, nb_models=10000, skipn='auto', est_ratios=None, allow_duplicate_sample=False, max_cluster=8, ncluster=None, topn_bnb=4, nterms_bnb=None, bnb_timeout=180, entropy_sampler=<function sample_model_dep_onebyone>, entropy_sampling_N=10000, entropy_model_order=None, tqdm=True)

Regression model selection using the SAMS procedure. This class implements MultiRegressionMixin and outputs multiple well-fitting models. SAMS was originally devised by Wolters and Bingham (2012) and is adapted here to support any dependency matrix and random effects.

Note

This class also includes all parameters and attributes from MultiRegressionMixin.

Note

A more detailed guide on SAMS can be found at Simulated annealing model selection (SAMS).

Attributes

dependencies : np.array(2d)

The dependency matrix of size (N, N) with N the number of terms in the encoded model (output from Y2X). Term i depends on term j if dep(i, j) = true.

mode : None or ‘weak’ or ‘strong’

The heredity mode during sampling.

forced_model : np.array(1d)

Any terms that must be included in the model. Commonly np.array([0], dtype=np.int64) is used to force the intercept when the intercept is the first column in the normalized, encoded model matrix. This model must itself fulfill the heredity constraints.

model_size : int

The size of the overfitted models. Defaults to the number of runs divided by three. The overfitted model includes the forced model, and its size must thus be larger than the forced model.

nb_models : int or ‘all’

The number of unique models to accept during the SAMS procedure. If ‘all’, all possible models are fitted.

skipn : int, float or ‘auto’

The number of worst-fitting models to skip during branch-and-bound and entropy calculations. When specified as a float, it must be a number between 0 and 1 to indicate a fraction. Defaults to ‘auto’ which uses the elbow method and an additional 1% safety margin. Any int must be smaller than nb_models.

est_ratios : None or np.array(1d)

The estimated variance ratios to be used during SAMS. These ratios are used to make the SAMS procedure computationally feasible in mixed models. For every random effect, a ratio should be provided. Defaults to 1 for each random effect if None is specified.

allow_duplicate_sample : bool

Whether or not to allow duplicate samples to be stored in the final results of the SAMS sampling procedure.

max_cluster : int

The maximum number of clusters to try when specifying ‘auto’ as ncluster. At least three are required for the elbow method. This number is inclusive.

ncluster : None or int or ‘auto’

The number of clusters to fit on the raster plot using the Hamming distance. If None, no kmeans clustering is performed. If ‘auto’, every number of clusters between 1 and max_cluster is tried and the best is selected using the elbow method.

topn_bnb : int

The number of top submodels for a fixed size to retrieve for entropy calculations.

nterms_bnb : None or int or iterable(int)

The fixed sizes of submodels to apply the branch-and-bound algorithm on. If None, every size from one to model_size - 2 (inclusive) is tested, as recommended by the original paper. If an int, every size from one until the specified number is tested. If an iterable, only the values from the iterable are tested.

bnb_timeout : int

The maximum number of seconds to run the branch-and-bound algorithm for. Clear submodels in the raster plot do not require much time in the branch-and-bound algorithm. If the algorithm needs too much time, the result is therefore most likely low-entropy models, and the computation can be halted prematurely. Defaults to three minutes.

entropy_sampler : func(dep, model_size, N, forced, mode)

The sampler to use when generating random hereditary models. See the documentation on customizing SAMS for an indication of which sampler to use.

entropy_sampling_N : int

The number of random samples to draw using the sampler to compute the theoretical frequencies of the submodels.

entropy_model_order : dict(str: (‘lin’ or ‘tfi’ or ‘quad’))

The order of the terms in the model. Please read the warning in the documentation on customizing SAMS.

tqdm : bool

Whether to use tqdm to track the progress.

sams_model_ : Model

A SAMS model used in sampling and fitting data during the SAMS procedure.

results_ : np.array(1d)

A numpy array with a special datatype where each element contains two arrays of size model_size, (‘model’, np.int64) and (‘coeff’, np.float64), and one scalar, (‘metric’, np.float64). The array contains nb_models elements. These are the models returned by the SAMS procedure.

models_ : list(np.array(1d))

The list of models, ordered by entropy.

entropies_ : np.array(1d)

The entropy of each exported model in models_. In case of multiple clusters, the entropies are calculated within the respective cluster.

selection_metrics_ : np.array(1d)

Alias for entropies_.

frequencies_ : np.array(1d)

The occurrence frequency of each submodel in models_.

kmeans_ : None or sklearn.cluster.KMeans

A kmeans object used to cluster the raster plot. It carries an added parameter skips, equal to 5% of the cluster size, to be skipped for entropy calculations.

metric_name_ : str

The name of the selection metric.
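
The layout of results_ described above can be pictured as a NumPy structured array. The following is a minimal sketch with made-up sizes and values, not output from a fitted estimator; it assumes that the ‘model’ field holds term indices, and the descending sort is there only to show field access, not to claim which direction of the metric is better.

```python
import numpy as np

# Illustrative sizes only; in practice these come from the estimator settings.
model_size = 4
nb_models = 3

# Structured dtype mirroring the fields named in the description of results_.
results_dtype = np.dtype([
    ('model', np.int64, (model_size,)),    # assumed: indices of the terms in the model
    ('coeff', np.float64, (model_size,)),  # one coefficient per term
    ('metric', np.float64),                # scalar metric of the model
])

# Placeholder array standing in for a fitted results_ attribute.
results = np.zeros(nb_models, dtype=results_dtype)
results['model'] = [[0, 1, 2, 3], [0, 1, 2, 4], [0, 2, 3, 5]]
results['metric'] = [0.91, 0.95, 0.88]

# Fields are accessed by name; each element is one sampled model.
order = np.argsort(results['metric'])[::-1]  # descending metric, for illustration
best = results[order[0]]
print(best['model'], best['metric'])
```

Indexing a field on the whole array, such as results['model'], yields a regular 2d array of shape (nb_models, model_size).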

__init__(factors=(), Y2X=<function identityY2X>, random_effects=(), dependencies=None, mode=None, forced_model=None, model_size=None, nb_models=10000, skipn='auto', est_ratios=None, allow_duplicate_sample=False, max_cluster=8, ncluster=None, topn_bnb=4, nterms_bnb=None, bnb_timeout=180, entropy_sampler=<function sample_model_dep_onebyone>, entropy_sampling_N=10000, entropy_model_order=None, tqdm=True)

Initializes the class.

Parameters

factors : list(Factor)

A list of factors to be used during fitting. It contains the categorical encoding, continuous normalization, etc.

Y2X : func(Y)

The function to transform a design matrix Y to a model matrix X.

random_effects : list(str)

The names of any random effect columns. Every random effect is interpreted as a string column and encoded using effect encoding.

dependencies : np.array(2d)

The dependency matrix of size (N, N) with N the number of terms in the encoded model (output from Y2X). Term i depends on term j if dep(i, j) = true.

mode : None or ‘weak’ or ‘strong’

The heredity mode during sampling.

forced_model : np.array(1d)

Any terms that must be included in the model. Commonly np.array([0], dtype=np.int64) is used to force the intercept when the intercept is the first column in the normalized, encoded model matrix. This model must itself fulfill the heredity constraints.

model_size : int

The size of the overfitted models. Defaults to the number of runs divided by three. The overfitted model includes the forced model, and its size must thus be larger than the forced model.

nb_models : int or ‘all’

The number of unique models to accept during the SAMS procedure. If ‘all’, all possible models are fitted.

skipn : int, float or ‘auto’

The number of worst-fitting models to skip during branch-and-bound and entropy calculations. When specified as a float, it must be a number between 0 and 1 to indicate a fraction. Defaults to ‘auto’ which uses the elbow method and an additional 1% safety margin. Any int must be smaller than nb_models.

est_ratios : None or np.array(1d)

The estimated variance ratios to be used during SAMS. These ratios are used to make the SAMS procedure computationally feasible in mixed models. For every random effect, a ratio should be provided. Defaults to 1 for each random effect if None is specified.

allow_duplicate_sample : bool

Whether or not to allow duplicate samples to be stored in the final results of the SAMS sampling procedure.

max_cluster : int

The maximum number of clusters to try when specifying ‘auto’ as ncluster. At least three are required for the elbow method. This number is inclusive.

ncluster : None or int or ‘auto’

The number of clusters to fit on the raster plot using the Hamming distance. If None, no kmeans clustering is performed. If ‘auto’, every number of clusters between 1 and max_cluster is tried and the best is selected using the elbow method.

topn_bnb : int

The number of top submodels for a fixed size to retrieve for entropy calculations.

nterms_bnb : None or int or iterable(int)

The fixed sizes of submodels to apply the branch-and-bound algorithm on. If None, every size from one to model_size - 2 (inclusive) is tested, as recommended by the original paper. If an int, every size from one until the specified number is tested. If an iterable, only the values from the iterable are tested.

bnb_timeout : int

The maximum number of seconds to run the branch-and-bound algorithm for. Clear submodels in the raster plot do not require much time in the branch-and-bound algorithm. If the algorithm needs too much time, the result is therefore most likely low-entropy models, and the computation can be halted prematurely. Defaults to three minutes.

entropy_sampler : func(dep, model_size, N, forced, mode)

The sampler to use when generating random hereditary models. See the documentation on customizing SAMS for an indication of which sampler to use.

entropy_sampling_N : int

The number of random samples to draw using the sampler to compute the theoretical frequencies of the submodels.

entropy_model_order : dict(str: (‘lin’ or ‘tfi’ or ‘quad’))

The order of the terms in the model. Please read the warning in the documentation on customizing SAMS.

tqdm : bool

Whether to use tqdm to track the progress.
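
To illustrate how dependencies, mode, and forced_model relate, the sketch below builds a small dependency matrix and checks a model against a heredity mode. The four-term model and the helper fulfills_heredity are hypothetical and not part of the library; the check assumes the usual definitions, where ‘strong’ requires every term a term depends on to be present and ‘weak’ requires at least one.

```python
import numpy as np

# Hypothetical encoded model with four terms: 0 = intercept, 1 = A, 2 = B, 3 = A*B.
n_terms = 4
dep = np.zeros((n_terms, n_terms), dtype=bool)
dep[3, 1] = True  # A*B depends on A
dep[3, 2] = True  # A*B depends on B


def fulfills_heredity(model, dep, mode=None):
    """Check a model (array of term indices) against a heredity mode.

    Assumed semantics: 'strong' needs every parent term present,
    'weak' needs at least one, and None imposes no constraint.
    """
    if mode not in (None, 'weak', 'strong'):
        raise ValueError(f"unknown heredity mode: {mode!r}")
    if mode is None:
        return True
    model = np.asarray(model, dtype=np.int64)
    included = np.zeros(dep.shape[0], dtype=bool)
    included[model] = True
    for term in model:
        parents = dep[term]          # boolean row: terms that `term` depends on
        if not parents.any():
            continue                 # no parents, nothing to check
        present = included[parents]  # which of those parents are in the model
        if mode == 'strong' and not present.all():
            return False
        if mode == 'weak' and not present.any():
            return False
    return True


forced_model = np.array([0], dtype=np.int64)  # force only the intercept
print(fulfills_heredity(forced_model, dep, 'strong'))
print(fulfills_heredity([0, 1, 3], dep, 'weak'))
print(fulfills_heredity([0, 1, 3], dep, 'strong'))
```

In this example the intercept-only forced model passes under either mode, while the model containing A*B without B passes under weak heredity only.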

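The three accepted forms of nterms_bnb can be read as different ways of spelling out a list of submodel sizes. The helper below is a hypothetical sketch of that reading, not the library's code; in particular, treating an int as an inclusive upper bound is an assumption, since the description only says "until the specified number".

```python
def resolve_nterms_bnb(nterms_bnb, model_size):
    """Expand nterms_bnb into an explicit list of submodel sizes (sketch)."""
    if nterms_bnb is None:
        # Every size from 1 to model_size - 2, inclusive.
        return list(range(1, model_size - 1))
    if isinstance(nterms_bnb, int):
        # Assumption: the specified number itself is included.
        return list(range(1, nterms_bnb + 1))
    # Any other iterable is used as given.
    return [int(n) for n in nterms_bnb]


print(resolve_nterms_bnb(None, model_size=6))
print(resolve_nterms_bnb(3, model_size=6))
print(resolve_nterms_bnb((2, 4), model_size=6))
```
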
Methods

SamsRegressor.fit(X, y)

Fits the data.

SamsRegressor.formula([labels, idx])

Creates the prediction formula of the fit for the encoded and normalized data.

SamsRegressor.model_formula(model[, idx])

Creates the prediction formula of the fit for the encoded and normalized data.

SamsRegressor.plot_selection([ntop, model])

Creates a raster plot of the fitted SAMS procedure.

SamsRegressor.pred_var(X)

Prediction variances for the new values specified in X.

SamsRegressor.predict(X)

Predict on new data after fitting.

SamsRegressor.preprocess_fit(X, y)

Preprocesses before fitting the data.

SamsRegressor.preprocess_predict(X)

Preprocesses the incoming data before prediction.

SamsRegressor.score(X, y[, sample_weight])

Return coefficient of determination on test data.

SamsRegressor.summary()

Generates a summary of the fit if it was stored during training in the fit_ attribute.

Attributes

SamsRegressor.M_

Alias for information_matrix

SamsRegressor.Minv_

Alias for inv_information_matrix

SamsRegressor.V_

Alias for obs_cov

SamsRegressor.Vinv_

Alias for inv_obs_cov

SamsRegressor.information_matrix

The information matrix of the fitted data.

SamsRegressor.inv_information_matrix

The inverse of the information matrix.

SamsRegressor.inv_obs_cov

The inverse of the observation covariance matrix.

SamsRegressor.is_fitted

Checks whether the regressor has been fitted.

SamsRegressor.obs_cov

The observation covariance matrix \(V = var(Y)\).

SamsRegressor.total_var

The total variance on the normalized y-values.