Quickstart
Installation
If you have not installed Python and setup a project, follow this guide Installing Python.
If you have not already installed PyOptEx, first install it using pip:
(.venv) $ pip install pyoptex
Analyze your first dataset
Create a predetermined linear model is very easy. The full Python script for this example can be found at simple_model.py.
Let us assume that we have 200 random observations from a process with three continuous variables A, B, and C.
Start with the necessary imports
>>> import numpy as np
>>> import pandas as pd
>>>
>>> from pyoptex.utils import Factor
>>> from pyoptex.utils.model import model2Y2X, partial_rsm_names
>>> from pyoptex.analysis import SimpleRegressor
>>> from pyoptex.analysis.utils.plot import plot_res_diagnostics
Next, we define the factors in our simulation
>>> # Define the factors
>>> factors = [
>>> Factor('A'), Factor('B'), Factor('C')
>>> ]
Then, we define the data for our simulation
>>> # The number of random observations
>>> N = 200
>>>
>>> # Define the data
>>> data = pd.DataFrame(np.random.rand(N, 3) * 2 - 1, columns=[str(f.name) for f in factors])
>>> data['Y'] = 2*data['A'] + 3*data['C'] - 4*data['A']*data['B'] + 5\
>>> + np.random.normal(0, 1, N)
Just like in design of experiments, we define the model we want to fit. In this case, it is a response surface model (or a full quadratic model containing the intercept and all main effects, interactions, and quadratic effects).
First, we create a matrix representation of our model using the
partial_rsm_names function.
Then, we convert this model to a function which transforms Y to X using the
model2Y2X function.
>>> model = partial_rsm_names({str(f.name): 'quad' for f in factors})
>>> Y2X = model2Y2X(model, factors)
Finally, we create the
SimpleRegressor
and fit it to the data
>>> regr = SimpleRegressor(factors, Y2X)
>>> regr.fit(data.drop(columns='Y'), data['Y'])
To analyze the results, we can do a few things. First, we can print the summary of the fit which includes the coefficients of the normalized data, the scale of the data, etc.
>>> print(regr.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.850
Model: OLS Adj. R-squared: 0.843
Method: Least Squares F-statistic: 119.5
Date: Tue, 07 Jan 2025 Prob (F-statistic): 2.59e-73
Time: 09:57:03 Log-Likelihood: -94.165
No. Observations: 200 AIC: 208.3
Df Residuals: 190 BIC: 241.3
Df Model: 9
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.0791 0.065 1.213 0.226 -0.049 0.208
x1 0.7721 0.049 15.677 0.000 0.675 0.869
x2 -0.0519 0.048 -1.090 0.277 -0.146 0.042
x3 1.1409 0.047 24.225 0.000 1.048 1.234
x4 -1.5120 0.084 -17.989 0.000 -1.678 -1.346
x5 -0.0666 0.086 -0.774 0.440 -0.236 0.103
x6 0.0065 0.076 0.085 0.933 -0.144 0.157
x7 0.0230 0.096 0.238 0.812 -0.167 0.213
x8 -0.0686 0.092 -0.743 0.458 -0.251 0.114
x9 0.0377 0.092 0.410 0.683 -0.144 0.219
==============================================================================
Omnibus: 1.938 Durbin-Watson: 1.839
Prob(Omnibus): 0.379 Jarque-Bera (JB): 1.933
Skew: 0.236 Prob(JB): 0.380
Kurtosis: 2.900 Cond. No. 4.93
==============================================================================
We can also print the prediction formula using
model_formula.
Note
model_formula
is only possible if we have created Y2X using
model2Y2X
as done in this example. Otherwise, use
formula
and specify your own labels.
Warning
The prediction formula is based on the encoded model. Make sure
to first normalize the data between -1 and 1, and encode the
categorical variables. See
formula
for the complete warning.
>>> print(regr.model_formula(model=model))
0.079 * cst + 0.772 * A + -0.052 * B + 1.141 * C + -1.512 * A * B + -0.067 * A * C + 0.006 * B * C + 0.023 * A^2 + -0.069 * B^2 + 0.038 * C^2
Prediction is as easy as calling .predict()
>>> data['pred'] = regr.predict(data.drop(columns='Y'))
Finally, to investigate how good it fits, we introduced
plot_res_diagnostics
>>> plot_res_diagnostics(
>>> data, y_true='Y', y_pred='pred',
>>> textcols=[str(f.name) for f in factors],
>>> ).show()
The upper left plot indicates the predicted vs. fit plot. Ideally, all elements are on the black diagonal. The upper right plot provides the error vs. the prediction. A positive errors means an overprediction. In case a trend or divergence is observed, it could indicate a lack of fit. Ideally, the data is normally distributed around the x-axis, in a rectangular block. The lower left plot is the quantile-quantile distribution of the errors against a normal distribution. If all points are on the black diagonal, the errors are normally distributed. Significant deviations may indicate outliers. The lower right plot is the error vs. the run. Again, a positive error means an overprediction. This plot is useful to analyze potential time trends in the data if your data is sorted in time.