Linear regression is a basic tool. It works on the assumption that there exists a linear relationship between the dependent and independent variable, also known as the explanatory variables and output. However, not all problems have such a linear relationship. In fact, many of the problems we see today are nonlinear in nature. A very basic example is our own decision making process which involves deciding an outcome based on various questions. For example, when we decide to have dinner, our thought process is not linear. It is based a combination of our tastes, our budget, our past experiences with a restaurant, alternatives available, weather conditions etc. There can be other simple nonlinear cases such as quadratic or exponential dependencies which are not too difficult to imagine. This is how non-linear regression came into practice – a powerful alternative to linear regression for nonlinear situations. Similar to linear regression, nonlinear regression draws a line through the set of available data points in such a way that the line fits to the data with the only difference that the line is not a straight line or in other words, not linear.
Non-linear Regression – An Illustration
In R, we have lm() function for linear regression while nonlinear regression is supported by nls() function which is an abbreviation for nonlinear least squares function. To apply nonlinear regression, it is very important to know the relationship between the variables. Looking at the data, one should be able to determine the generalized equation of the model which will fit the data. This model is then specified as the ‘formula’ parameter in nls() function. The function then determines the coefficients of the parameters in the model. Let’s try linear and nonlinear regression models on an exponential data. I will use the runif() function to generate an exponential set of values for y. Here I will use x as a sequence from 0 to 100.
I will also use a set.seed() function so that the values are regenerated for you.
#set a seed value
set.seed(23)
#Generate x as 100 integers using seq function
x<-seq(0,100,1)
#Generate y as a*e^(bx)+c
y<-runif(1,0,20)*exp(runif(1,0.005,0.075)*x)+runif(101,0,5)
#How does our data look like? Lets plot it
plot(x,y)
This seems a fairly smooth non-linear plot. To illustrate the difference between linear and nonlinear models, let’s fit them both:
#Linear model
lin_mod=lm(y~x)
#Plotting the model
plot(x,y)
abline(lin_mod)
There is little overlap between the actual values and the fitted plot. Now let’s try the nonlinear model and specify the formula
nonlin_mod=nls(y~a*exp(b*x),start=list(a=13,b=0.1)) #a is the starting value and b is the exponential start
#This new plot can be made by using the lines() function
plot(x,y)
lines(x,predict(nonlin_mod),col=”red”)
This is a much better fit and clearly passes through most of the data. For more clarity, we will now calculate the errors for both the models
#Error calculation
error <- lin_mod$residuals
lm_error <- sqrt(mean(error^2)) #5.960544
error2=y-predict(nonlin_mod)
nlm_error <- sqrt(mean(error2^2)) #1.527064
The linear model has more than twice the error than that of nonlinear one. This shows that the nonlinear model fits better for nonlinear data.
Understanding the nls() function
There are a few parameters that the nls() function requires. I used two parameters to define the model in the above illustration – the formula and the start parameters. Nonlinear function requires us to look at the data first and estimate the model to fit in. This estimated model is specified as the formula parameter. We can also specify the coefficients as variables to be estimated. The next step involves specifying the start parameter. This parameter specifies the starting values of the coefficients we used in the formula. Here we have ‘a’ and ‘b’ as the coefficients. I took ‘a’ as the nearest integer to minimum value of y (which is approximately 13.19) and ‘b’ as the increment for the exponent. Using these values, the nls() function determines the optimal values of ‘a’ and ‘b’. It is very important to set the right starting parameter values otherwise the model may give us absurd results or even fail. Let’s see what are the estimated values of ‘a’ and ‘b’ for this dataset:
nonlin_mod
Nonlinear regression model
model: y ~ a * exp(b * x)
data: parent.frame()
a b
13.60391 0.01911
residual sum-of-squares: 235.5
Number of iterations to convergence: 15
Achieved convergence tolerance: 4.975e-07
The values for ‘a’ and ‘b’ estimated for this model are 13.60391 and 0.01911 respectively which are very close to those we provided as starting values. This shows that that the model estimated by the nls() function is y=13.60391*e^(0.01911*x). Further, the estimated values of ‘a’ and ‘b’ are very close to the starting values we provided. These results will remain the same if we keep ‘b’ as 0.01 or even 0.001 or keep ‘a’ as 10 or 100 or 1000. As long as the model is able to converge at the optimal estimation, some approximation is admissible. However, if the values of ‘a’ and ‘b’ are completely out of range, say 1 and 1, we get an error as the model fails. The right set of starting values need to be estimated by looking at the data before implementing the model.
Self-Starting Functions
The problem arises when one is beginning with nonlinear functions and does not know what value should be estimated for the parameters. To illustrate this problem, I will now use a non-linear dataset available in R. The Puromycin data shows the concentration and reaction rate for enzymatic reaction of Puromycin antibiotic. I will plot the data to understand the data and estimate the formula equation
attach(Puromycin)
plot(Puromycin$conc,Puromycin$rate)
This data is specific to biological reactions and can be estimated using the famous enzyme kinetics equation known as the Michaelis-Menten equation. For this, we will separate the dataset based on whether the state is “treated” or “untreated” and define a function for the equation
#Define a function to apply Michaelis-Menten equation
mm=function(conc,vmax,k) vmax*conc/(k+conc)
#Use the nls data over the first subset of treated data. I will set the starting values as 50 and 0.05
mm1=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”treated”)
#Use a similar function for the second subset of untreated data
mm2=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”untreated”)
Both the models, mm1 and mm2 make good estimations of the data and fit the model. However, it is hard to estimate the starting values looking at the plot of Puromycin conc. vs rate. The Puromycin concentration vs rate plot suggested that the minimum conc. on the x- axis is around 0.01 and the maximum rate (vmax) on the y-axis is around 200 yet I purposely used values which are very different from these estimations so that the model will fit while converging slowly. In this case, I am taking a risk on the estimation ability of the model. This is where “self-starting” functions come into the picture. As the name suggests, a self-starting function does not need a starting value for the parameters and do estimate themselves. We can rewrite the above two functions using the SSmicmen function which is a self starting function for Michaelis-Menten equation. The new models are:
mm3=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”treated”)
mm4=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”untreated”)
Let us compare the corresponding models by calling the model variables in R. We will first look at the models with state=”treated” which are mm1 and mm3 and compare the vmax and k values. We will then compare the models with state=”untreated” which are mm2 and mm4.:
#Print the model summary and estimated parameters for mm1
mm1
Nonlinear regression model
model: rate ~ mm(conc, vmax, k)
data: Puromycin
vmax k
212.68369 0.06412
residual sum-of-squares: 1195
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.703e-06
#Print the model summary and estimated parameters for mm3
mm3
Nonlinear regression model
model: rate ~ SSmicmen(conc, vmax, k)
data: Puromycin
vmax k
212.68371 0.06412
residual sum-of-squares: 1195
Number of iterations to convergence: 0
Achieved convergence tolerance: 1.937e-06
#Print the model summary and estimated parameters for mm2
mm2
Nonlinear regression model
model: rate ~ mm(conc, vmax, k)
data: Puromycin
vmax k
160.28001 0.04771
residual sum-of-squares: 859.6
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.039e-06
#Print the model summary and estimated parameters for mm4
mm4
Nonlinear regression model
model: rate ~ SSmicmen(conc, vmax, k)
data: Puromycin
vmax k
160.28012 0.04771
residual sum-of-squares: 859.6
Number of iterations to convergence: 5
Achieved convergence tolerance: 3.942e-06
The corresponding models have estimated the same coefficients up to the third decimal. This shows that self-starting functions fairly well in place of functions where I need to define the start parameters. The big limitation of estimating the starting parameters can be avoided using the self-starting functions. R has many self-starting functions available. A list of the same can be obtained by using the apropos function:
apropos(“^SS”)
[1] “SSasymp” “SSasympOff” “SSasympOrig” “SSbiexp” “SSD”
[6] “SSfol” “SSfpl” “SSgompertz” “SSlogis” “SSmicmen”
[11] “SSweibull”
With the exception of the SSD function, there are 10 self-starting functions here in R. The final step in the model. I have given a brief description of what all these functions are defined for (in alphabetical order)
- SSasymp asymptotic regression models
- SSasympOff asymptotic regression models with an offset
- SSasympOrig asymptotic regression models through the origin
- SSbiexp biexponential models
- SSfol first-order compartment models
- SSfpl four-parameter logistic models
- SSgompertz Gompertz growth models
- SSlogis logistic models
- SSmicmen Michaelis–Menten models
- SSweibull Weibull growth curve models
Goodness of Fit
As an additional verification step, I will also check the goodness of fit of the model. This can be done by looking that the correlation between the values predicted by the model and the actual y values.
#Goodness of fit for first nonlinear function
cor(y,predict(nonlin_mod)) #0.9976462
#Goodness of fit for treated values of Puromycin function
cor(subset(Puromycin$rate,state==”treated”),predict(mm3)) #0.9817072
cor(subset(Puromycin$rate,state==”treated”),predict(mm1)) #0.9817072
#Goodness of fit for untreated values of Puromycin function
cor(subset(Puromycin$rate,state==”untreated”),predict(mm2)) #0.9699776
cor(subset(Puromycin$rate,state==”untreated”),predict(mm4)) #0.9699777
All our models have a high correlation value which indicates that the values are very close to each other and accurate. The corresponding model summary and estimation parameters also show the same observation.
Summary
Regression is a fundamental technique to estimate the relationships among variables and nonlinear regression is a handy technique if that relationship is nonlinear. It is similar to linear regression and provides a powerful method to fit a nonlinear curve based on the estimated formula while minimizing the error using nonlinear least squares method. There are a variety of other nonlinear models available such as SVM and Decision trees. Nonlinear regression is a robust technique over such models because it provides a parametric equation to explain the data. As the models becomes complex, nonlinear regression becomes less accurate over the data. This article gives an overview of the basics of nonlinear regression and understand the concepts by application of the concepts in R. Here is the complete R code used in the article.
#set a seed value
set.seed(23)
#Generate x as 100 integers using seq function
x<-seq(0,100,1)
#Generate y as a*e^(bx)+c
y<-runif(1,0,20)*exp(runif(1,0.005,0.075)*x)+runif(101,0,5)
#How does our data look like? Lets plot it
plot(x,y)
#Linear model
lin_mod=lm(y~x)
#Plotting the model
plot(x,y)
abline(lin_mod)
nonlin_mod=nls(y~a*exp(b*x),start=list(a=13,b=0.1)) #a is the starting value and b is the exponential start
#This new plot can be made by using the lines() function
plot(x,y)
lines(x,predict(nonlin_mod),col=”red”)
#Error calculation
error <- lin_mod$residuals
lm_error <- sqrt(mean(error^2)) #5.960544
error2=y-predict(nonlin_mod)
nlm_error <- sqrt(mean(error2^2)) #1.527064
nonlin_mod
attach(Puromycin)
plot(Puromycin$conc,Puromycin$rate)
#Define a function to apply Michaelis-Menten equation
mm=function(conc,vmax,k) vmax*conc/(k+conc)
#Use the nls data over the first subset of treated data. I will set the starting values as 50 and 0.05
mm1=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”treated”)
#Use a similar function for the second subset of untreated data
mm2=nls(rate~mm(conc,vmax,k),data=Puromycin,start=c(vmax=50,k=0.05),subset=state==”untreated”)
mm3=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”treated”)
mm4=nls(rate~SSmicmen(conc,vmax,k),data=Puromycin,subset=state==”untreated”)
#Print the model summary and estimated parameters for mm1
mm1
#Print the model summary and estimated parameters for mm3
mm3
#Print the model summary and estimated parameters for mm2
mm2
#Print the model summary and estimated parameters for mm4
mm4
#Print the names of all functions in R which start with SS
apropos(“^SS”)
#Goodness of fit for first nonlinear function
cor(y,predict(nonlin_mod)) #0.9976462
#Goodness of fit for treated values of Puromycin function
cor(subset(Puromycin$rate,state==”treated”),predict(mm3)) #0.9817072
cor(subset(Puromycin$rate,state==”treated”),predict(mm1)) #0.9817072
#Goodness of fit for untreated values of Puromycin function
cor(subset(Puromycin$rate,state==”untreated”),predict(mm2)) #0.9699776
cor(subset(Puromycin$rate,state==”untreated”),predict(mm4)) #0.9699777