Single-equation regression models¶
Introduction¶
Economic theory makes statements or hypotheses that are primarily qualitative in nature. The main concern of mathematical economics, on the other hand, is to express economic theory in mathematical form (equations) without considering whether the theory can be measured or empirically verified. Econometrics, in turn, is mainly concerned with the verification of economic theory. Among the related fields, we may also mention economic statistics, which is primarily concerned with the collection, processing, and presentation of economic data. The information collected constitutes the raw data for econometric work.
We may also distinguish between theoretical econometrics and applied econometrics. The former deals with the development of appropriate methods for measuring the economic relationships specified in econometric models, whereas the latter uses these tools to study one or more fields of economics.
In general terms, the traditional econometric methodology follows these steps:
Statement of the theory or hypothesis.
Specification of the mathematical model of the theory.
Specification of the statistical or econometric model.
Collection of data.
Estimation of the parameters of the econometric model.
Hypothesis testing.
Projection or forecasting.
Use of the model for control or policy purposes.
To illustrate these steps, let us consider the well-known Keynesian theory of consumption.
1. Statement of the theory or hypothesis
The fundamental psychological law [...] is that men [women] are disposed, as a rule and on average, to increase their consumption as their income increases, but not by as much as the increase in income. (Keynes)
Keynes postulated that the marginal propensity to consume (MPC), the rate of change in consumption resulting from a one-unit change in income, is greater than zero but less than 1.
2. Specification of the mathematical model of the theory
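Consumption function (relates income and consumption):
$$Y = \beta_0 + \beta_1 X, \qquad 0 < \beta_1 < 1$$
$\beta_0$ and $\beta_1$ are parameters (intercept and slope coefficient); the slope $\beta_1$ measures the MPC
$Y$ ($X$) is the dependent (independent or explanatory) variable, here consumption (income)
3. Specification of the statistical or econometric model
The deterministic model must be modified, because the relationship between consumption and income is not exact:
$$Y = \beta_0 + \beta_1 X + u$$
$u$: called the disturbance (or error term), is a random (or stochastic) variable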
4. Collection of data
To estimate the model, data are required.
5. Estimation of the parameters of the econometric model
It is necessary to estimate the values of the parameters from the data. Regression analysis is the main tool for this purpose.
6. Hypothesis testing
It is necessary to perform a hypothesis test (or statistical inference) to verify whether the result is statistically significant in supporting or rejecting the theory or hypothesis proposed in step 1.
7. Projection or forecasting
If the model does not refute the hypothesis under consideration, it may be used to predict future values.
8. Use of the model for control or policy purposes
An estimated model may be used for control purposes or for policy formulation.
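The steps above can be sketched on synthetic data. This is only an illustration: the "true" intercept 50, the MPC of 0.7, the noise level, and the income range are assumptions chosen for the example, not real consumption data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Steps 2-4: assumed model Y = 50 + 0.7 X + u and simulated "data"
n = 200
income = rng.uniform(100, 1000, n)
u = rng.normal(0, 20, n)
consumption = 50 + 0.7 * income + u

# Step 5: estimate intercept and slope (the MPC) by least squares
X = np.column_stack([np.ones(n), income])
beta_hat, *_ = np.linalg.lstsq(X, consumption, rcond=None)
b0, mpc = beta_hat

# Step 6: standard error of the slope and t statistics for
# H0: MPC = 0 and H0: MPC = 1
resid = consumption - X @ beta_hat
sigma2 = resid @ resid / (n - 2)
cov = sigma2 * np.linalg.inv(X.T @ X)
se_mpc = np.sqrt(cov[1, 1])
t_zero = mpc / se_mpc
t_one = (mpc - 1) / se_mpc

# Step 7: point forecast for a hypothetical income of 1200
forecast = b0 + mpc * 1200

print(f"MPC estimate: {mpc:.3f} (s.e. {se_mpc:.4f})")
print(f"t for MPC=0: {t_zero:.1f}, t for MPC=1: {t_one:.1f}")
print(f"Forecast at income 1200: {forecast:.1f}")
```

A large positive t statistic against MPC = 0 together with a large negative one against MPC = 1 is what the Keynesian hypothesis $0 < \text{MPC} < 1$ would lead us to expect in this simulated setting.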
Some important issues:
The main idea behind regression analysis is the statistical dependence of one variable (the dependent variable) on one or more other variables (the explanatory variables).
The objective of this analysis is to estimate and/or predict the mean value of the dependent variable based on the known or fixed values of the explanatory variables.
In practice, the success of regression analysis depends on the availability of adequate data.
In addition, regarding notation, we denote by $E(Y \mid X_i)$ the expected value of $Y$ given the value of $X_i$, that is, a conditional mean, in contrast to $E(Y)$, which is an unconditional mean. Furthermore, we may write:
$$E(Y \mid X_i) = f(X_i)$$
That is, $f(X_i)$ represents a function of the explanatory variable $X_i$. This function may be called: the conditional expectation function (CEF), the population regression function (PRF), or the population regression (PR).
Evidently, if we have pairs of observations $(X_i, Y_i)$, and assuming the PRF is linear in $X_i$, we may write:
$$Y_i = E(Y \mid X_i) + u_i = \beta_0 + \beta_1 X_i + u_i$$
Here the stochastic disturbance $u_i$ gives the difference between the particular value $Y_i$ and the expected value $E(Y \mid X_i)$ for the corresponding $X_i$. By definition, the expected value, that is, the mean of $u_i$, is 0: $E(u_i \mid X_i) = 0$.
It is important to note that up to this point the discussion has focused on data from a population. Frequently, however, we only have a sample from this population. In this case, we have the sample regression function (SRF), and we place a “hat” on the dependent variable and the parameters:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$$
By this we mean that $\hat{Y}_i$ is an estimator of $E(Y \mid X_i)$. We may say that the observed $Y_i$ can be expressed as:
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$$
We say that this equation is the stochastic version of the SRF, and $\hat{u}_i$ now takes the name of residual term (in the sample).
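To make the distinction between the population and the sample concrete, here is a minimal sketch. The population (five fixed values of $X$, an intercept of 10, a slope of 2, and normal disturbances) is entirely hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: X takes a few fixed values, Y = 10 + 2 X + u
x_levels = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_pop = np.repeat(x_levels, 20_000)
Y_pop = 10 + 2 * X_pop + rng.normal(0, 3, X_pop.size)

# PRF: the conditional mean E(Y | X = x) at each level of X
cond_means = np.array([Y_pop[X_pop == x].mean() for x in x_levels])

# SRF: fit a line to a small random sample drawn from the population
idx = rng.choice(X_pop.size, size=50, replace=False)
X_s, Y_s = X_pop[idx], Y_pop[idx]
b1_hat, b0_hat = np.polyfit(X_s, Y_s, 1)  # slope first, then intercept
Y_hat = b0_hat + b1_hat * X_s
residuals = Y_s - Y_hat  # sample counterpart of the disturbance

print("conditional means:", np.round(cond_means, 2))
print("SRF intercept, slope:", round(b0_hat, 2), round(b1_hat, 2))
print("mean residual:", residuals.mean())
```

The conditional means trace out the population line, while the fitted line changes from sample to sample. When the fit includes an intercept, the residuals average to zero by construction.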
Some additional relevant issues:
The key concept underlying regression analysis is that of the conditional expectation function (CEF) or population regression function (PRF). Our objective in regression analysis is to examine how the mean value of the dependent variable (or regressand) varies with the value of the explanatory variable (regressor).
This book deals mainly with linear PRFs, that is, regressions that are linear in the parameters. They may or may not be linear in the regressand or in the regressors.
For empirical purposes, what matters is the stochastic PRF. The stochastic error term, $u_i$, plays a fundamental role in the estimation of the PRF.
The PRF is an idealized concept, since in practice we very rarely have access to the entire population of interest. In general, we only have a sample of observations from the population. Therefore, we use stochastic sample regression functions (SRFs) to estimate the PRF.
Ordinary Least Squares (OLS) Method¶
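We have the two-variable PRF:
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
But we must estimate it through the SRF:
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i = \hat{Y}_i + \hat{u}_i$$
We may then write the residuals as:
$$\hat{u}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$
We therefore want to minimize the sum of squared residuals:
$$\sum_{i=1}^{N} \hat{u}_i^2 = \sum_{i=1}^{N} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
Squaring allows us to ignore the signs, that is, if $\hat{u}_i = -c$, we obtain the same result as when $\hat{u}_i = c$. It also increases the importance of larger residuals in absolute value. Since the data are obtained from the sample, we want to estimate the values $\hat{\beta}_0$ and $\hat{\beta}_1$. Therefore, the residual sum of squares is a function of the estimators:
$$\sum_{i=1}^{N} \hat{u}_i^2 = f(\hat{\beta}_0, \hat{\beta}_1)$$
Solving by the process of differentiation, we obtain (with $N$ the sample size and $\bar{X}$, $\bar{Y}$ the sample means):
$$\hat{\beta}_1 = \frac{\sum_i X_i Y_i - N \bar{X} \bar{Y}}{\sum_i X_i^2 - N \bar{X}^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$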
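As a quick numerical illustration of what "minimizing" means here, the sketch below computes the closed-form estimates for a small made-up dataset and checks that nudging either coefficient increases the residual sum of squares. The data and the helper name `rss` are assumptions for this example only.

```python
import numpy as np

def rss(b0, b1, x, y):
    """Residual sum of squares for the line b0 + b1 * x."""
    resid = y - (b0 + b1 * x)
    return float(np.sum(resid ** 2))

# Small illustrative dataset (made-up numbers)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form OLS estimates for the two-variable model
n = len(x)
b1_hat = (np.sum(x * y) - n * x.mean() * y.mean()) / (
    np.sum(x ** 2) - n * x.mean() ** 2
)
b0_hat = y.mean() - b1_hat * x.mean()

best = rss(b0_hat, b1_hat, x, y)

# Any nearby line has a strictly larger RSS
perturbed = [
    rss(b0_hat + d0, b1_hat + d1, x, y)
    for d0, d1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]
]

print("estimates:", b0_hat, b1_hat)
print("RSS at estimates:", best)
print("RSS at perturbed lines:", perturbed)
```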
Calculus¶
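We can apply the method of linear regression to a large dataset in order to obtain the line that best describes that dataset. Evidently, for this method to be useful, there must exist a line that satisfactorily describes the data. If we have a single independent variable that we expect to describe a single dependent variable linearly (a straight line), then, for a set of input points $x_i$, we want to find a set of output points $y_i$ that can be written as follows:
$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
where:
$\beta_0$ is the regression constant;
$\beta_1$ is the regression coefficient;
$\epsilon_i$ is the error obtained when we try to predict $y_i$ from $x_i$.
When we apply the linear regression model, we want to minimize the error. To do so, we need a measure of the error. We will use the sum of squared errors:
$$S = \sum_{i=1}^{N} \epsilon_i^2$$
Considering that we have $N$ points, we want to find $\beta_0$ and $\beta_1$ that minimize this quantity. We may rewrite this as:
$$S(\beta_0, \beta_1) = \sum_{i=1}^{N} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$
We may use the gradient to find the minimum of a function. The minimum must satisfy $\nabla S = 0$. Since we want to find the minimum with respect to both terms $(\beta_0, \beta_1)$, we have:
$$\nabla S = \left(\frac{\partial S}{\partial \beta_0}, \frac{\partial S}{\partial \beta_1}\right)$$
Calculating the derivative with respect to $\beta_0$:
$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{N} \left(y_i - \beta_0 - \beta_1 x_i\right)$$
Proceeding similarly for $\beta_1$:
$$\frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{N} x_i \left(y_i - \beta_0 - \beta_1 x_i\right)$$
To minimize, we require the following condition to be satisfied:
$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0$$
Then:
$$\sum_{i=1}^{N} \left(y_i - \beta_0 - \beta_1 x_i\right) = 0, \qquad \sum_{i=1}^{N} x_i \left(y_i - \beta_0 - \beta_1 x_i\right) = 0$$
To ensure that this is a minimum rather than a maximum, we must verify that the second derivatives are positive:
$$\frac{\partial^2 S}{\partial \beta_0^2} = 2N, \qquad \frac{\partial^2 S}{\partial \beta_1^2} = 2 \sum_{i=1}^{N} x_i^2$$
Since $2N > 0$ and $2\sum_i x_i^2 > 0$, and since the determinant of the Hessian, $4\left(N \sum_i x_i^2 - (\sum_i x_i)^2\right)$, is also positive whenever the $x_i$ are not all equal, we can be certain that this is a minimum. Manipulating the first equation in the system, we obtain:
$$\sum_i y_i - N \beta_0 - \beta_1 \sum_i x_i = 0 \quad \Rightarrow \quad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
And for the second equation in the system, we have:
$$\sum_i x_i y_i - \beta_0 N \bar{x} - \beta_1 \sum_i x_i^2 = 0$$
Combining the results, we obtain:
$$\sum_i x_i y_i - \left(\bar{y} - \beta_1 \bar{x}\right) N \bar{x} - \beta_1 \sum_i x_i^2 = 0$$
Therefore, $\beta_1$ is:
$$\beta_1 = \frac{\sum_i x_i y_i - N \bar{x} \bar{y}}{\sum_i x_i^2 - N \bar{x}^2}$$
That is:
$$\beta_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1 \bar{x}$$
From these constants, we construct the following line equation:
$$\hat{y} = \beta_0 + \beta_1 x$$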
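The first- and second-order conditions can also be checked numerically. The sketch below, on illustrative simulated data, evaluates the gradient of the sum of squared errors at the closed-form solution and inspects the eigenvalues of the Hessian. The variable names and the simulated data are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 30)
y = 1.5 + 0.8 * x + rng.normal(0, 1, 30)  # illustrative data
N = len(x)

# Closed-form solution of the first-order conditions
b1 = (np.sum(x * y) - N * x.mean() * y.mean()) / (
    np.sum(x ** 2) - N * x.mean() ** 2
)
b0 = y.mean() - b1 * x.mean()

# Gradient of S(b0, b1) = sum (y - b0 - b1 x)^2 at the solution
resid = y - b0 - b1 * x
grad = np.array([-2 * resid.sum(), -2 * (x * resid).sum()])

# Hessian of S: constant, it does not depend on (b0, b1)
hessian = 2 * np.array([[N, x.sum()], [x.sum(), np.sum(x ** 2)]])
eigenvalues = np.linalg.eigvalsh(hessian)

print("gradient:", grad)
print("Hessian eigenvalues:", eigenvalues)
```

A gradient that vanishes up to rounding, together with strictly positive Hessian eigenvalues, indicates that the stationary point is a minimum.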
# Notebook magic: installs the panel-data package used in the later cells
%pip install -q linearmodels
import numpy as np
import statsmodels.api as sm

# Manual
I = np.random.randint(3, 100)  # sample size; at least 3 so the slope is defined
x = np.random.randint(0, 100, I)
y = np.random.randint(0, 100, I)
# x = np.array([1,2,3,4,5,6,7,8,9,10])
# y = np.array([5,8,9,13,15,18,20,25,27,30])
N = len(x)
xm = np.mean(x)
ym = np.mean(y)
s1 = 0; s2 = 0
for i in range(N):
    s1 += x[i]*y[i]  # sum of x_i * y_i
    s2 += x[i]*x[i]  # sum of x_i^2
b1 = (s1 - N*ym*xm)/(s2 - N*xm*xm)  # slope
b0 = ym - b1*xm                     # intercept
print("Manual")
print("const: "+str(b0)+"\nx1: "+str(b1))

# Auto
X = sm.add_constant(x)
# OLS regression
modelo = sm.OLS(y, X).fit()
# results
# print(modelo.summary())
print("\nAuto")
print("const: "+str(modelo.params[0])+"\nx1: "+str(modelo.params[1]))
print("\nThe slope coefficients are {:.2f}% equal.".format(100*b1/modelo.params[1]))
Manual
const: 48.71590353827139
x1: -0.06073409789888189
Auto
const: 48.7159035382714
x1: -0.06073409789888215
The slope coefficients are 100.00% equal.
OLS with Mean Transformation (FE)¶
If we have time-series data for more than one country, we may be interested only in the variation of each series over time, disregarding average level differences across countries. A method typically employed in this case is to subtract from each observation the time mean of its respective series, so that all series become centered around zero. OLS is then applied to the transformed data. In this case, the estimated coefficient depends only on the internal variations of each series over time. Thus, we are interested only in the slope coefficient and not in the intercept obtained through OLS.
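A minimal sketch of why the mean transformation matters, using two hypothetical "countries" that share the same slope (0.5) but sit at very different average levels. All numbers, and the helper name `slope`, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50  # observations per country (assumed equal length)

# Two hypothetical countries: same slope 0.5, very different levels
x_a = rng.uniform(0, 10, T)
x_b = rng.uniform(20, 30, T)
y_a = 100 + 0.5 * x_a + rng.normal(0, 0.5, T)
y_b = 0.5 * x_b + rng.normal(0, 0.5, T)

def slope(x, y):
    """OLS slope of y on x (model with intercept)."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (xc @ xc))

# Pooled OLS on raw data mixes within- and between-country variation
pooled = slope(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]))

# Within estimator: demean each series separately, then pool
x_w = np.concatenate([x_a - x_a.mean(), x_b - x_b.mean()])
y_w = np.concatenate([y_a - y_a.mean(), y_b - y_b.mean()])
within = slope(x_w, y_w)

print(f"pooled slope: {pooled:.3f}, within slope: {within:.3f}")
```

In this constructed example the pooled slope is dominated by the level gap between the two countries, whereas the within slope stays close to the common value.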
from linearmodels.panel import PanelOLS
import pandas as pd

# Manual
# Second series, same length I as x and y from the previous cell
x1 = np.random.randint(0, 100, I)
y1 = np.random.randint(0, 100, I)
# Subtract from each series its own time mean
nx = x - np.mean(x)
nx1 = x1 - np.mean(x1)
ny = y - np.mean(y)
ny1 = y1 - np.mean(y1)
# Stack the centered series and apply the OLS formulas
x2 = np.concatenate((nx, nx1))
y2 = np.concatenate((ny, ny1))
N = len(x2)
xm = np.mean(x2)
ym = np.mean(y2)
s1 = 0; s2 = 0
for i in range(N):
    s1 += x2[i]*y2[i]
    s2 += x2[i]*x2[i]
b1 = (s1 - N*ym*xm)/(s2 - N*xm*xm)
b0 = ym - b1*xm  # zero up to rounding, since the data are centered
print("Manual")
print("x1: "+str(b1))

# Auto
# panel dataframe
df = pd.DataFrame({
    'id': [1]*I + [2]*I,
    'tempo': list(range(I)) + list(range(I)),
    'x': np.concatenate([x, x1]),
    'y': np.concatenate([y, y1])
})
# print(df)
# panel index (entity, time)
df = df.set_index(['id', 'tempo'])
# FE model
modelo = PanelOLS.from_formula(
    'y ~ x + EntityEffects',
    data=df
)
resultado = modelo.fit()
print("\nAuto")
print("x1: "+str(resultado.params['x']))
print("\nThe slope coefficients are {:.2f}% equal.".format(100*b1/resultado.params['x']))
Manual
x1: 0.1577693520786498
Auto
x1: 0.15776935207864973
The slope coefficients are 100.00% equal.
OLS with Mean Transformation and Weighting (FE + Weighted)¶
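Now that we are dealing with different countries, it may also be the case that, when applying the final linear estimation, we wish to assign different weights to different observations. To do so, we may return to the error function minimized by OLS, $S = \sum_i \epsilon_i^2$ with $\epsilon_i = y_i - \beta_0 - \beta_1 x_i$, and assign a weight $w_i$ to each error term:
$$S = \sum_{i=1}^{N} w_i \epsilon_i^2 = \sum_{i=1}^{N} w_i \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$
We recover the usual OLS case when $w_i = 1$ for all observations. The remainder of the derivation of $\beta_1$ is analogous, now considering the term $w_i$.
The derivatives with respect to $\beta_0$ and $\beta_1$ become, respectively:
$$\frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{N} w_i \left(y_i - \beta_0 - \beta_1 x_i\right), \qquad \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{N} w_i x_i \left(y_i - \beta_0 - \beta_1 x_i\right)$$
Summing the weights over all data points, $\sum_{i=1}^{N} w_i$, is equivalent to summing the weights over all observations of each one of the series, $\sum_{s=1}^{K} \sum_{t=1}^{M} w_{s,t}$. We are considering that all $K$ series have the same length $M$, such that $N = KM$. Since we may interchange the order of summation, then:
$$\sum_{i=1}^{N} w_i = \sum_{t=1}^{M} \sum_{s=1}^{K} w_{s,t} = \sum_{t=1}^{M} 1 = M$$
This is because our weights are distributed such that, if we sum all the weights of all series corresponding to any given time $t$, we must necessarily obtain 1, that is, $\sum_{s} w_{s,t} = 1$. Making the additional assumption that the weights of each series remain constant over time, $w_{s,t} = w_s$, we may factor $w_s$ out of the summation over time, so that $\sum_i w_i y_i = M \sum_s w_s \bar{y}_s$ and $\sum_i w_i x_i = M \sum_s w_s \bar{x}_s$, where $\bar{x}_s$ and $\bar{y}_s$ are the time means of series $s$. Recalling that, in order to minimize the error, we must satisfy $\partial S / \partial \beta_0 = 0$, we may arrive at the first equation:
$$\beta_0 = \sum_{s} w_s \bar{y}_s - \beta_1 \sum_{s} w_s \bar{x}_s$$
That is, we obtain something quite similar to the usual OLS case. The difference is that, instead of having only a single mean over all observations, we now have a weighted sum of the mean of each series, where the weights correspond to those assigned to each series. Essentially, we replace $\bar{x} \to \sum_s w_s \bar{x}_s$ and $\bar{y} \to \sum_s w_s \bar{y}_s$. Now, regarding the second equation, $\partial S / \partial \beta_1 = 0$, we have:
$$\sum_{s} w_s \sum_{t} x_{s,t}\, y_{s,t} - \beta_0 M \sum_{s} w_s \bar{x}_s - \beta_1 \sum_{s} w_s \sum_{t} x_{s,t}^2 = 0$$
Combining the results, we obtain:
$$\sum_{s} w_s \sum_{t} x_{s,t}\, y_{s,t} - M \left(\sum_{s} w_s \bar{y}_s - \beta_1 \sum_{s} w_s \bar{x}_s\right) \sum_{s} w_s \bar{x}_s - \beta_1 \sum_{s} w_s \sum_{t} x_{s,t}^2 = 0$$
Therefore, $\beta_1$ is:
$$\beta_1 = \frac{\sum_{s} w_s \sum_{t} x_{s,t}\, y_{s,t} - M \left(\sum_{s} w_s \bar{x}_s\right)\left(\sum_{s} w_s \bar{y}_s\right)}{\sum_{s} w_s \sum_{t} x_{s,t}^2 - M \left(\sum_{s} w_s \bar{x}_s\right)^2}$$
Or, if we wish to return to the notation over the entire dataset, associating a weight $w_i$ with each observation $i$, in a more general form:
$$\beta_1 = \frac{\sum_i w_i x_i y_i - \frac{1}{M} \left(\sum_i w_i x_i\right)\left(\sum_i w_i y_i\right)}{\sum_i w_i x_i^2 - \frac{1}{M} \left(\sum_i w_i x_i\right)^2}$$
However, we should remember that we obtain this result only under the assumption that all series have the same length and that the sum of the weights across all series for any given $t$ must be equal to 1. Comparing this expression with the unweighted $\beta_1 = \left(\sum_i x_i y_i - \frac{1}{N} \sum_i x_i \sum_i y_i\right) / \left(\sum_i x_i^2 - \frac{1}{N} (\sum_i x_i)^2\right)$, we replaced the total population size $N$ with the size of each series $M$ and inserted a weight $w_i$ into each summation.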
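As a sanity check of the weighted slope under the stated assumptions (equal-length series, weights constant within each series and summing to 1 across series), the sketch below compares the closed-form expression with a generic weighted least-squares fit obtained by rescaling each row by $\sqrt{w_i}$. The data, the weights 0.25 and 0.75, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
M = 40                   # length of each series
w_series = [0.25, 0.75]  # one weight per series, summing to 1

# Two illustrative series stacked into one dataset
x = rng.uniform(0, 10, 2 * M)
y = 3.0 + 1.2 * x + rng.normal(0, 1, 2 * M)
w = np.repeat(w_series, M)  # weight attached to each observation

# Closed-form weighted slope and intercept (relies on sum(w) == M)
b1 = (np.sum(w * x * y) - np.sum(w * x) * np.sum(w * y) / M) / (
    np.sum(w * x * x) - np.sum(w * x) ** 2 / M
)
b0 = (np.sum(w * y) - b1 * np.sum(w * x)) / M

# Reference: weighted least squares via rows rescaled by sqrt(w)
A = np.column_stack([np.ones_like(x), x]) * np.sqrt(w)[:, None]
ref, *_ = np.linalg.lstsq(A, y * np.sqrt(w), rcond=None)

print("closed form:", b0, b1)
print("lstsq      :", ref[0], ref[1])
```

If the weights did not sum to the series length, the factor $1/M$ in the closed form would have to be replaced by $1/\sum_i w_i$ for the two results to agree.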
M = len(nx)  # original size of each series
N = len(y2)
s1 = 0; s2 = 0; s3 = 0; s4 = 0; s5 = 0
for i in range(N):
    # first M observations belong to series 1, the rest to series 2
    w = 0.25 if (i < M) else 0.75
    s1 += w*x2[i]*y2[i]   # weighted sum of x*y
    s2 += w*y2[i]         # weighted sum of y
    s3 += w*x2[i]         # weighted sum of x
    s4 += w*x2[i]*x2[i]   # weighted sum of x^2
    s5 += w*x2[i]         # weighted sum of x (same as s3)
b1 = (s1 - s2*s3/M)/(s4 - s5*s5/M)
print("Manual")
print("x1: "+str(b1))

# Auto
df = pd.DataFrame({
    'id': [1]*I + [2]*I,
    'tempo': list(range(I)) + list(range(I)),
    'x': np.concatenate([x, x1]),
    'y': np.concatenate([y, y1])
})
# one weight per series
df['peso'] = np.where(df['id'] == 1, 0.25, 0.75)
df = df.set_index(['id', 'tempo'])
modelo = PanelOLS.from_formula(
    'y ~ x + EntityEffects',
    data=df,
    weights=df['peso']
)
resultado = modelo.fit()
print("\nAuto")
print("x1: "+str(resultado.params['x']))
print("\nThe slope coefficients are {:.2f}% equal.".format(100*b1/resultado.params['x']))
Manual
x1: 0.2969533853308433
Auto
x1: 0.29695338533084326
The slope coefficients are 100.00% equal.
Correlation¶
Pearson¶
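We should begin with Pearson correlation, and I believe the best way is through a geometric interpretation. If we have a set of points $x = (x_1, \dots, x_N)$ and another set $y = (y_1, \dots, y_N)$ then, if we subtract the mean values, we obtain:
$$\tilde{x}_i = x_i - \bar{x}, \qquad \tilde{y}_i = y_i - \bar{y}$$
We may think of this as shifting the point distributions so that they are centered around the origin. We can then analyze how each distribution departs from the origin. If we have only three points, then $\tilde{x}$ and $\tilde{y}$ denote vectors in three-dimensional space. The dot product of both is given by:
$$\tilde{x} \cdot \tilde{y} = \|\tilde{x}\| \, \|\tilde{y}\| \cos\theta$$
Then:
$$\cos\theta = \frac{\tilde{x} \cdot \tilde{y}}{\|\tilde{x}\| \, \|\tilde{y}\|} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \, \sqrt{\sum_i (y_i - \bar{y})^2}} = r$$
Therefore, Pearson correlation gives us the cosine of the angle between the vectors formed by the variables after removing their mean values. In other words, it measures the alignment between the variations of $x$ and $y$ around their respective means, interpreted as vectors in an $N$-dimensional space.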
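The three-point case mentioned above is small enough to work through directly. The sketch below uses made-up values, computes the cosine and the angle between the centered vectors, and checks that shifting or positively rescaling one variable leaves the result unchanged.

```python
import numpy as np

# Three made-up points: the centered data are vectors in 3-D space
x = np.array([1.0, 2.0, 4.0])
y = np.array([1.0, 3.0, 4.0])

xc = x - x.mean()
yc = y - y.mean()

cos_theta = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))
angle_deg = np.degrees(np.arccos(cos_theta))

# Shifting or positively rescaling y does not change the angle
y_shifted = 10 + 3 * y
yc2 = y_shifted - y_shifted.mean()
cos_shifted = xc @ yc2 / (np.linalg.norm(xc) * np.linalg.norm(yc2))

print("cosine:", cos_theta)
print("angle (degrees):", angle_deg)
print("np.corrcoef:", np.corrcoef(x, y)[0, 1])
```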
from scipy.stats import pearsonr

I = np.random.randint(3, 100)  # sample size; at least 3 points
x = np.random.randint(0, 100, I)
y = np.random.randint(0, 100, I)
# center both variables around zero
xn = x - np.mean(x)
yn = y - np.mean(y)
p1 = np.dot(xn, yn)  # dot product of the centered vectors
p2 = np.dot(xn, xn)  # squared norm of xn
p3 = np.dot(yn, yn)  # squared norm of yn
c = p1/(np.sqrt(p2)*np.sqrt(p3))  # cosine of the angle
print("Manual")
print("Correlation: "+str(c))
corr, _ = pearsonr(x, y)
print("\nAuto")
print("Correlation: "+str(corr))
print("\nThe correlation coefficients are {:.2f}% equal.".format(100*c/corr))
Manual
Correlation: 0.012785408476462182
Auto
Correlation: 0.012785408476462127
The correlation coefficients are 100.00% equal.
Spearman¶
It does not differ much from the previous case, except that instead of working with the original data, we transform them into ranks and only then apply the Pearson correlation.
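A small sketch of what the rank transformation buys us, using a monotone but strongly nonlinear relationship chosen for illustration. The helper names `ranks` and `pearson` are hypothetical, and the simple ranking used here assumes there are no ties.

```python
import numpy as np

def ranks(a):
    """Ranks 1..n of a 1-D array; assumes there are no ties."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(1, len(a) + 1)
    return r

def pearson(a, b):
    """Pearson correlation as the cosine between centered vectors."""
    ac, bc = a - a.mean(), b - b.mean()
    return float(ac @ bc / np.sqrt((ac @ ac) * (bc @ bc)))

x = np.linspace(0, 5, 20)
y = np.exp(x)  # monotone but strongly nonlinear in x

r_pearson = pearson(x, y)
r_spearman = pearson(ranks(x), ranks(y))

print(f"Pearson: {r_pearson:.3f}, Spearman: {r_spearman:.3f}")
```

Because the ranks of a monotonically increasing relationship coincide, the rank-based coefficient reaches 1 even though the raw data do not lie on a straight line.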
In both this section and the previous one, we should keep in mind that we do not analyze standard errors or p-values. In general, this additional analysis is important for evaluating the statistical significance of the results, especially in correlation tests. However, in many cases linear regression may be used merely as a descriptive tool, that is, with the objective of finding the line that best represents the distribution of the points, independently of statistical inference regarding the coefficients.
from scipy.stats import rankdata, spearmanr

# replace the observations by their ranks (ties get average ranks)
rx = rankdata(x)
ry = rankdata(y)
# from here on, the same steps as for Pearson, applied to the ranks
xn = rx - np.mean(rx)
yn = ry - np.mean(ry)
p1 = np.dot(xn, yn)
p2 = np.dot(xn, xn)
p3 = np.dot(yn, yn)
c = p1/(np.sqrt(p2)*np.sqrt(p3))
print("Manual")
print("Correlation: "+str(c))
corr, _ = spearmanr(x, y)
print("\nAuto")
print("Correlation: "+str(corr))
print("\nThe correlation coefficients are {:.2f}% equal.".format(100*c/corr))
Manual
Correlation: 0.07954084943381937
Auto
Correlation: 0.07954084943381935
The correlation coefficients are 100.00% equal.