SamsRegressor
- class pyoptex.analysis.estimators.sams.estimator.SamsRegressor(factors=(), Y2X=<function identityY2X>, random_effects=(), dependencies=None, mode=None, forced_model=None, model_size=None, nb_models=10000, skipn='auto', est_ratios=None, allow_duplicate_sample=False, max_cluster=8, ncluster=None, topn_bnb=4, nterms_bnb=None, bnb_timeout=180, entropy_sampler=<function sample_model_dep_onebyone>, entropy_sampling_N=10000, entropy_model_order=None, tqdm=True)
Regression model selection using the SAMS procedure. This implements the MultiRegressionMixin and outputs multiple good-fitting models. SAMS was originally devised in Wolters and Bingham (2012) and is adapted here to support any dependency matrix and random effects.
Note
It also includes all parameters and attributes from MultiRegressionMixin.
Note
A more detailed guide on SAMS can be found at Simulated annealing model selection (SAMS).
Attributes
- dependencies : np.array(2d)
The dependency matrix of size (N, N) with N the number of terms in the encoded model (output from Y2X). Term i depends on term j if dep(i, j) = true.
- mode : None or ‘weak’ or ‘strong’
The heredity mode during sampling.
- forced_model : np.array(1d)
Any terms that must be included in the model. Commonly np.array([0], dtype=np.int64) is used to force the intercept when the intercept is the first column in the normalized, encoded model matrix. This model must itself fulfill the heredity constraints.
- model_size : int
The size of the overfitted models. Defaults to the number of runs divided by three. The overfitted model includes the forced model, and its size must thus be larger than the forced model.
- nb_models : int or ‘all’
The number of unique models to accept during the SAMS procedure. If ‘all’, then all possible models are fitted.
- skipn : int, float or ‘auto’
The number of worst-fitting models to skip during branch-and-bound and entropy calculations. When specified as a float, it must be a number between 0 and 1 to indicate a fraction. Defaults to ‘auto’, which uses the elbow method and an additional 1% safety margin. Any int must be smaller than nb_models.
- est_ratios : None or np.array(1d)
The estimated variance ratios to be used during SAMS. These ratios are used to make the SAMS procedure computationally feasible in mixed models. For every random effect, a ratio should be provided. Defaults to 1 for each random effect if None is specified.
- allow_duplicate_sample : bool
Whether or not to allow duplicate samples to be stored in the final results of the SAMS sampling procedure.
- max_cluster : int
The maximum number of clusters to try when specifying ‘auto’ as ncluster. At least three are required for the elbow method. This number is inclusive.
- ncluster : None or int or ‘auto’
The number of clusters to fit on the raster plot using the Hamming distance. If None, no kmeans clustering is performed. If ‘auto’, every number of clusters between 1 and max_cluster is tried and the best is selected using the elbow method.
- topn_bnb : int
The number of top submodels for a fixed size to retrieve for entropy calculations.
- nterms_bnb : None or int or iterable(int)
The fixed sizes of submodels to apply the branch-and-bound algorithm on. If None, every size from one to model_size - 2 (inclusive) is tested as recommended by the original paper. If an int, every size from one until the specified number is tested. If an iterable, only the values from the iterable are tested.
- bnb_timeout : int
The maximum number of seconds to run the branch-and-bound algorithm for. Clear submodels in the raster plot will not require much time in the branch-and-bound algorithm. Therefore, if the branch-and-bound algorithm would require too much time, most likely low entropy models are the result and the computation can be halted prematurely. Defaults to three minutes.
- entropy_sampler : func(dep, model_size, N, forced, mode)
The sampler to use when generating random hereditary models. See the documentation on customizing SAMS for an indication on which sampler to use.
- entropy_sampling_N : int
The number of random samples to draw using the sampler to compute the theoretical frequencies of the submodels.
- entropy_model_order : dict(str: (‘lin’ or ‘tfi’ or ‘quad’))
The order of the terms in the model. Please read the warning in the documentation on customizing SAMS.
- tqdm : bool
Whether to use tqdm to track the progress.
- sams_model_ : Model
A SAMS model used in sampling and fitting data during the SAMS procedure.
- results_ : np.array(1d)
A numpy array with a special datatype where each element contains two arrays of size model_size, (‘model’, np.int64) and (‘coeff’, np.float64), and one scalar (‘metric’, np.float64). Results contains nb_models elements. These are the returned models from the SAMS procedure.
- models_ : list(np.array(1d))
The list of models, ordered by entropy.
- entropies_ : np.array(1d)
The entropy of each exported model in models_. In case of multiple clusters, the entropies are calculated within the respective cluster.
- selection_metrics_ : np.array(1d)
Alias for entropies_.
- frequencies_ : np.array(1d)
The occurrence frequency of each submodel in models_.
- kmeans_ : None or sklearn.cluster.KMeans
A kmeans object used to cluster the raster plot. An additional parameter skips is attached, equal to 5% of the cluster size, to be skipped for entropy calculations.
- metric_name_ : str
The name of the selection metric.
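As an illustration of the results_ layout described above, the following is a minimal NumPy sketch of a structured array with that shape. The values of model_size and nb_models, the variable names, and all numbers are placeholders invented for the example; the exact dtype the library builds internally may differ.

```python
import numpy as np

# Placeholder sizes, chosen only for this illustration.
model_size = 4
nb_models = 3

# Assumed layout: two sub-arrays of length model_size and one scalar,
# matching the ('model', 'coeff', 'metric') fields named in the docs.
results_dtype = np.dtype([
    ("model", np.int64, (model_size,)),
    ("coeff", np.float64, (model_size,)),
    ("metric", np.float64),
])

results = np.zeros(nb_models, dtype=results_dtype)

# Fill the first element field by field with made-up values.
results["model"][0] = [0, 1, 2, 5]
results["coeff"][0] = [10.0, 1.5, -0.7, 0.2]
results["metric"][0] = 0.93

# Fields are accessed by name; sub-array fields gain an extra axis.
all_models = results["model"]     # shape (nb_models, model_size)
all_metrics = results["metric"]   # shape (nb_models,)

# Indices that sort the stored models by their metric value.
order = np.argsort(all_metrics)
```

Accessing a field by name returns a view over all stored models, which is convenient for sorting or filtering on the metric without unpacking each element.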
- __init__(factors=(), Y2X=<function identityY2X>, random_effects=(), dependencies=None, mode=None, forced_model=None, model_size=None, nb_models=10000, skipn='auto', est_ratios=None, allow_duplicate_sample=False, max_cluster=8, ncluster=None, topn_bnb=4, nterms_bnb=None, bnb_timeout=180, entropy_sampler=<function sample_model_dep_onebyone>, entropy_sampling_N=10000, entropy_model_order=None, tqdm=True)
Initializes the class.
Parameters
- factors : list(Factor)
A list of factors to be used during fitting. It contains the categorical encoding, continuous normalization, etc.
- Y2X : func(Y)
The function to transform a design matrix Y to a model matrix X.
- random_effects : list(str)
The names of any random effect columns. Every random effect is interpreted as a string column and encoded using effect encoding.
- dependencies : np.array(2d)
The dependency matrix of size (N, N) with N the number of terms in the encoded model (output from Y2X). Term i depends on term j if dep(i, j) = true.
- mode : None or ‘weak’ or ‘strong’
The heredity mode during sampling.
- forced_model : np.array(1d)
Any terms that must be included in the model. Commonly np.array([0], dtype=np.int64) is used to force the intercept when the intercept is the first column in the normalized, encoded model matrix. This model must itself fulfill the heredity constraints.
- model_size : int
The size of the overfitted models. Defaults to the number of runs divided by three. The overfitted model includes the forced model, and its size must thus be larger than the forced model.
- nb_models : int or ‘all’
The number of unique models to accept during the SAMS procedure. If ‘all’, then all possible models are fitted.
- skipn : int, float or ‘auto’
The number of worst-fitting models to skip during branch-and-bound and entropy calculations. When specified as a float, it must be a number between 0 and 1 to indicate a fraction. Defaults to ‘auto’, which uses the elbow method and an additional 1% safety margin. Any int must be smaller than nb_models.
- est_ratios : None or np.array(1d)
The estimated variance ratios to be used during SAMS. These ratios are used to make the SAMS procedure computationally feasible in mixed models. For every random effect, a ratio should be provided. Defaults to 1 for each random effect if None is specified.
- allow_duplicate_sample : bool
Whether or not to allow duplicate samples to be stored in the final results of the SAMS sampling procedure.
- max_cluster : int
The maximum number of clusters to try when specifying ‘auto’ as ncluster. At least three are required for the elbow method. This number is inclusive.
- ncluster : None or int or ‘auto’
The number of clusters to fit on the raster plot using the Hamming distance. If None, no kmeans clustering is performed. If ‘auto’, every number of clusters between 1 and max_cluster is tried and the best is selected using the elbow method.
- topn_bnb : int
The number of top submodels for a fixed size to retrieve for entropy calculations.
- nterms_bnb : None or int or iterable(int)
The fixed sizes of submodels to apply the branch-and-bound algorithm on. If None, every size from one to model_size - 2 (inclusive) is tested as recommended by the original paper. If an int, every size from one until the specified number is tested. If an iterable, only the values from the iterable are tested.
- bnb_timeout : int
The maximum number of seconds to run the branch-and-bound algorithm for. Clear submodels in the raster plot will not require much time in the branch-and-bound algorithm. Therefore, if the branch-and-bound algorithm would require too much time, most likely low entropy models are the result and the computation can be halted prematurely. Defaults to three minutes.
- entropy_sampler : func(dep, model_size, N, forced, mode)
The sampler to use when generating random hereditary models. See the documentation on customizing SAMS for an indication on which sampler to use.
- entropy_sampling_N : int
The number of random samples to draw using the sampler to compute the theoretical frequencies of the submodels.
- entropy_model_order : dict(str: (‘lin’ or ‘tfi’ or ‘quad’))
The order of the terms in the model. Please read the warning in the documentation on customizing SAMS.
- tqdm : bool
Whether to use tqdm to track the progress.
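To make the dependencies, mode and forced_model parameters more concrete, here is a small NumPy sketch of a dependency matrix for a four-term model and a check of whether a candidate model respects heredity. The term layout, the helper satisfies_heredity, and the reading of ‘strong’ as "all parent terms present" and ‘weak’ as "at least one parent term present" are assumptions based on the usual definition of heredity, not taken from the library's code.

```python
import numpy as np

# Hypothetical encoded model: columns are intercept, A, B and A*B.
terms = ["1", "A", "B", "A*B"]
n_terms = len(terms)

# dep[i, j] = True means term i depends on term j.
# Assumption for this sketch: only the interaction has parents (A and B).
dep = np.zeros((n_terms, n_terms), dtype=bool)
dep[3, 1] = True
dep[3, 2] = True


def satisfies_heredity(model, dep, mode):
    """Hypothetical helper: check a model (term indices) against dep."""
    if mode is None:
        return True  # no heredity constraint
    present = np.zeros(dep.shape[0], dtype=bool)
    present[model] = True
    for i in model:
        parents = dep[i]
        if not parents.any():
            continue  # a term without parents is always allowed
        if mode == "strong" and not present[parents].all():
            return False  # strong: every parent must be in the model
        if mode == "weak" and not present[parents].any():
            return False  # weak: at least one parent must be in the model
    return True


# Forcing only the intercept, as suggested for forced_model.
forced_model = np.array([0], dtype=np.int64)
forced_ok = satisfies_heredity(forced_model, dep, "strong")

# A model holding the intercept, A and A*B, but not B.
candidate = np.array([0, 1, 3], dtype=np.int64)
candidate_strong = satisfies_heredity(candidate, dep, "strong")
candidate_weak = satisfies_heredity(candidate, dep, "weak")
```

Under these assumptions, a check of this kind could be run on forced_model before constructing the regressor, since the forced model must itself fulfill the heredity constraints.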
Methods
SamsRegressor.fit(X, y): Fits the data.
SamsRegressor.formula([labels, idx]): Creates the prediction formula of the fit for the encoded and normalized data.
SamsRegressor.model_formula(model[, idx]): Creates the prediction formula of the fit for the encoded and normalized data.
SamsRegressor.plot_selection([ntop, model]): Creates a raster plot of the fitted SAMS procedure.
Prediction variances for the new values specified in X.
Predict on new data after fitting.
Preprocesses before fitting the data.
Preprocessing the incoming data before prediction.
SamsRegressor.score(X, y[, sample_weight]): Return coefficient of determination on test data.
Generates a summary of the fit in case it was stored during training in the fit_ attribute.
Attributes
Alias for information_matrix.
Alias for inv_information_matrix.
Alias for obs_cov.
Alias for inv_obs_cov.
The information matrix of the fitted data.
The inverse of the information matrix.
The inverse of the observation covariance matrix.
Checks whether the regressor has been fitted.
The observation covariance matrix \(V = var(Y)\).
The total variance on the normalized y-values.
- factorslist(