.. |link-qc-pre| raw:: html
.. |link-qc-post| raw:: html
.. _a_example_scenarios:
Example scenarios
=================
.. _a_drop_p_value:
Dropping based on p-values
--------------------------
A common strategy is to fit a very large model and drop terms
one by one based on their p-value significance. Similar to the quickstart,
we will fit a response surface model with three continuous variables. The full
Python script is found at
|link-qc-pre|\ |version|\ |link-qc-mid0|\ drop_pvalue.py\ |link-qc-mid1|\ drop_pvalue.py\ |link-qc-post|.
Start with the imports
>>> import numpy as np
>>> import pandas as pd
>>>
>>> from pyoptex.utils import Factor
>>> from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
>>> from pyoptex.analysis import PValueDropRegressor
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics
Define the factors
>>> factors = [
>>> Factor('A'), Factor('B'), Factor('C')
>>> ]
Generate the random simulation data. The true model in our case is
:math:`y = 5 + 2*A + 3*C - 4*A*B + \epsilon`.
>>> # The number of random observations
>>> N = 200
>>>
>>> # Define the data
>>> data = pd.DataFrame(np.random.rand(N, 3) * 2 - 1, columns=[str(f.name) for f in factors])
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
>>> + np.random.normal(0, 1, N)
Next, we create the response surface model, which contains all potential terms
we wish to investigate.
>>> model = partial_rsm_names({str(f.name): 'quad' for f in factors})
>>> Y2X = model2Y2X(model, factors)
Then we need to decide on the model constraints. There are three types of models:
* No heredity: This means any term can occur in the model, without any restrictions.
* Weak heredity: This means that if a term such as :math:`A \times B` occurs in the model,
either :math:`A`, :math:`B`, or both must also occur in the model. Similar for
:math:`A^2` to occur, :math:`A` must also be in the model.
* Strong heredity: Extends weak heredity by forcing that when :math:`A \times B` is in the
model, **both** :math:`A` and :math:`B` must occur.
All these dependencies can be represented by a `dependency matrix`. This matrix has
the same number of rows and columns as there are terms (in the encoded model). Term
i depends on term j if dep(i, j) = true.
An easy method to create such a dependency matrix from a generic model is
:py:func:`order_dependencies `.
>>> dependencies = order_dependencies(model, factors)
Finally, we fit the regressor on the data using weak heredity
and a threshold for the p-value of 5%.
>>> regr = PValueDropRegressor(
>>> factors, Y2X,
>>> threshold=0.05, dependencies=dependencies, mode='weak'
>>> )
>>> regr.fit(data.drop(columns='Y'), data['Y'])
Once fitted, we can display the summary of the fit
>>> print(regr.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.848
Model: OLS Adj. R-squared: 0.845
Method: Least Squares F-statistic: 363.6
Date: Tue, 07 Jan 2025 Prob (F-statistic): 8.43e-80
Time: 10:50:08 Log-Likelihood: -95.613
No. Observations: 200 AIC: 199.2
Df Residuals: 196 BIC: 212.4
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0734 0.028 2.619 0.010 0.018 0.129
x1 0.7703 0.048 15.933 0.000 0.675 0.866
x2 1.1347 0.046 24.525 0.000 1.043 1.226
x3 -1.5286 0.082 -18.645 0.000 -1.690 -1.367
==============================================================================
Omnibus: 1.348 Durbin-Watson: 1.859
Prob(Omnibus): 0.510 Jarque-Bera (JB): 1.233
Skew: 0.192 Prob(JB): 0.540
Kurtosis: 2.995 Cond. No. 2.96
==============================================================================
Or the prediction formula. Note that indeed the correct model was
selected in this case.
>>> print(regr.model_formula(model=model))
0.073 * cst + 0.770 * A + 1.135 * C + -1.529 * A * B
Predicting remains the same
>>> data['pred'] = regr.predict(data.drop(columns='Y'))
Just like the residual diagnostics
>>> plot_res_diagnostics(
>>> data, y_true='Y', y_pred='pred',
>>> textcols=[str(f.name) for f in factors],
>>> ).show()
In some cases, the user is interested in strong heredity models.
However, forcing strong heredity during the model selection process
often puts too much pressure on the main effects, meaning the interactions
are often missed. Besides forcing strong heredity, we could force
weak heredity instead and transform the final model to a strong
heredity model.
Instead of simply predicting based on the `regr`, we can transform
the result to a strong model
>>> terms_strong = model2strong(regr.terms_, dependencies)
>>> model = model.iloc[terms_strong]
>>> Y2X = model2Y2X(model, factors)
And fit a simple model
>>> regr_simple = SimpleRegressor(factors, Y2X).fit(data.drop(columns='Y'), data['Y'])
The full Python script can be found at
|link-qc-pre|\ |version|\ |link-qc-mid0|\ drop_pvalue_strong.py\ |link-qc-mid1|\ drop_pvalue_strong.py\ |link-qc-post|.
Mixed linear model
------------------
Mixed models occur very often when having hard-to-change factors. Every result from a
split-plot design, split-split-plot design, strip-plot design, staggered-level design, etc.
should be modelled by a mixed model. The :ref:`Zs ` must be specified in the
dataframe to use random effects. Similar to the quickstart,
we will fit a response surface model with three continuous variables. The full
Python script is found at
|link-qc-pre|\ |version|\ |link-qc-mid0|\ simple_model_mixedlm.py\ |link-qc-mid1|\ simple_model_mixedlm.py\ |link-qc-post|.
Start with the imports
>>> import numpy as np
>>> import pandas as pd
>>>
>>> from pyoptex.utils import Factor
>>> from pyoptex.utils.model import model2Y2X, partial_rsm_names
>>> from pyoptex.analysis import SimpleRegressor
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics
Create the factors
>>> factors = [
>>> Factor('A'), Factor('B'), Factor('C')
>>> ]
Generate random simulation data. We also generate a random effect
to show the mixed modelling. Five groups will be made, spaced over
200 ovbservations.
>>> # The number of random observations
>>> N = 200
>>> nre = 5
>>>
>>> # Define the data
>>> data = pd.DataFrame(np.random.rand(N, 3) * 2 - 1, columns=[str(f.name) for f in factors])
>>> data['RE'] = np.array([f'L{i}' for i in range(nre)])[np.repeat(np.arange(nre), N//nre)]
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
>>> + np.random.normal(0, 1, N)\
>>> + np.repeat(np.random.normal(0, 1, nre), N//nre)
Similar to the quickstart, we create the response surface model
>>> model = partial_rsm_names({str(f.name): 'quad' for f in factors})
>>> Y2X = model2Y2X(model, factors)
Then we fit the mixed model by specifying the column 'RE' in the data
as the random effect. Each random effect column is intepreted as a
categorical column (with strings).
>>> # Define random effects
>>> random_effects = ('RE',)
>>>
>>> # Create the regressor
>>> regr = SimpleRegressor(factors, Y2X, random_effects)
>>> regr.fit(data.drop(columns='Y'), data['Y'])
Once fitted, we can display the summary of the fit
>>> print(regr.summary())
Mixed Linear Model Regression Results
========================================================
Model: MixedLM Dependent Variable: y
No. Observations: 200 Method: REML
No. Groups: 1 Scale: 0.1408
Min. group size: 200 Log-Likelihood: -109.1928
Max. group size: 200 Converged: Yes
Mean group size: 200.0
--------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
--------------------------------------------------------
const 0.084 0.193 0.438 0.662 -0.293 0.462
x1 0.716 0.048 14.992 0.000 0.622 0.809
x2 -0.049 0.045 -1.085 0.278 -0.137 0.039
x3 1.073 0.045 23.981 0.000 0.985 1.161
x4 -1.420 0.080 -17.761 0.000 -1.577 -1.263
x5 -0.081 0.082 -0.986 0.324 -0.242 0.080
x6 0.001 0.074 0.011 0.991 -0.144 0.146
x7 0.014 0.092 0.151 0.880 -0.166 0.194
x8 -0.065 0.089 -0.734 0.463 -0.240 0.109
x9 0.017 0.088 0.197 0.844 -0.155 0.189
g0 Var 0.166 0.324
========================================================
Print the prediction formula on the encoded, normalized model matrix.
See :py:func:`model_formula `
for more information.
>>> print(regr.model_formula(model=model))
0.084 * cst + 0.716 * A + -0.049 * B + 1.073 * C + -1.420 * A * B + -0.081 * A * C + 0.001 * B * C + 0.014 * A^2 + -0.065 * B^2 + 0.017 * C^2
Predicting is still the same.
>>> data['pred'] = regr.predict(data.drop(columns='Y'))
When plotting the residual diagnostics, we can also
indicate the random effect groups with a color.
>>> plot_res_diagnostics(
>>> data, y_true='Y', y_pred='pred',
>>> textcols=[str(f.name) for f in factors],
>>> color='RE'
>>> ).show()
.. figure:: /assets/img/res_diag_mixedlm.png
:width: 100%
:alt: The residual diagnostics
:align: center
Simulated Annealing Model Selection (SAMS)
------------------------------------------
While model selection is often performed based on p-values or metrics such
as the AICc or BIC, SAMS improves on most of them. For more extensive
information on the algorithm, see :ref:`a_cust_sams`.
In this example, we have six factors, A through F, and wish to detect
the weak heredity model :math:`A + C + A \times B`. The full Python
script is at
|link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_generic.py\ |link-qc-mid1|\ sams_generic.py\ |link-qc-post|.
First, the imports
>>> import numpy as np
>>> import pandas as pd
>>>
>>> from pyoptex.utils import Factor
>>> from pyoptex.utils.model import model2Y2X, order_dependencies, partial_rsm_names
>>> from pyoptex.analysis import SamsRegressor
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics
Next, we define the factors and simulate some data.
>>> # Define the factors
>>> factors = [
>>> Factor('A'), Factor('B'), Factor('C'),
>>> Factor('D'), Factor('E'), Factor('F'),
>>> ]
>>>
>>> # The number of random observations
>>> N = 200
>>>
>>> # Define the data
>>> data = pd.DataFrame(np.random.rand(N, len(factors)) * 2 - 1, columns=[str(f.name) for f in factors])
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
>>> + np.random.normal(0, 1, N)
Then, as in any analysis, we define the Y2X function, which is a full
response surface model, and the corresponding heredity dependencies.
>>> # Create the model
>>> model_order = {str(f.name): 'quad' for f in factors}
>>> model = partial_rsm_names(model_order)
>>> Y2X = model2Y2X(model, factors)
>>>
>>> # Define the dependencies
>>> dependencies = order_dependencies(model, factors)
Finally, we fit the SAMS model
>>> regr = SamsRegressor(
>>> factors, Y2X,
>>> dependencies=dependencies, mode='weak',
>>> forced_model=np.array([0], np.int64),
>>> model_size=6, nb_models=5000, skipn=1000,
>>> )
>>> regr.fit(data.drop(columns='Y'), data['Y'])
.. note::
By specifying the `entropy_model_order` parameter in the
:py:class:`SamsRegressor `,
we can use the exact entropy caluclations. For more information,
see :ref:`a_cust_sams_entropy`. The full Python script is at
|link-qc-pre|\ |version|\ |link-qc-mid0-sams|\ sams_partial_rsm.py\ |link-qc-mid1|\ sams_partial_rsm.py\ |link-qc-post|.
Finally, we can analyze the generated models. To manually extract a model, use
the :py:func:`plot_selection `
function.
>>> regr.plot_selection().show()
.. figure:: /assets/img/raster_plot.png
:width: 100%
:alt: The raster plot of the SAMS algorithm.
:align: center
:py:class:`SamsRegressor `
is a :py:class:`MultiRegressionMixin `,
meaning it finds multiple good-fitting models and orders them. By default, the best
can be analyzed as before
>>> # Print the summary
>>> print(regr.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.858
Model: OLS Adj. R-squared: 0.856
Method: Least Squares F-statistic: 394.5
Date: Tue, 07 Jan 2025 Prob (F-statistic): 9.15e-83
Time: 15:23:33 Log-Likelihood: -88.642
No. Observations: 200 AIC: 185.3
Df Residuals: 196 BIC: 198.5
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0043 0.027 -0.159 0.874 -0.058 0.049
x1 0.8045 0.048 16.689 0.000 0.709 0.900
x2 1.1409 0.045 25.356 0.000 1.052 1.230
x3 -1.7373 0.084 -20.769 0.000 -1.902 -1.572
==============================================================================
Omnibus: 1.979 Durbin-Watson: 2.166
Prob(Omnibus): 0.372 Jarque-Bera (JB): 1.934
Skew: -0.238 Prob(JB): 0.380
Kurtosis: 2.932 Cond. No. 3.17
==============================================================================
Or
>>> # Print the formula in encoded form
>>> print(regr.model_formula(model=model))
-0.004 * cst + 0.805 * A + 1.141 * C + -1.737 * A * B
Prediction is still the same.
>>> data['pred'] = regr.predict(data.drop(columns='Y'))
And the residual plot of the highest entropy model can be found using
>>> plot_res_diagnostics(
>>> data, y_true='Y', y_pred='pred',
>>> textcols=[str(f.name) for f in factors],
>>> ).show()
.. note::
If the best model is not the desired model, you can extract any other model
in the list by accessing :py:attr:`models\_ `
after fitting. These are ordered by highest entropy first.
Once you selected a model, you can refit it, similar to the
:ref:`p-value example `. Instead of simply predicting based on the `regr`,
we can select the desired model
>>> terms = regr.models_[i]
>>> model = model.iloc[terms]
>>> Y2X = model2Y2X(model, factors)
And fit a simple model
>>> regr_simple = SimpleRegressor(factors, Y2X).fit(data.drop(columns='Y'), data['Y'])