How to do regression analysis in excel. Konrad Carlberg. Regression analysis in Microsoft Excel

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when combining them.

Regression Analysis in Excel

Shows the influence of some values ​​(independent, independent) on the dependent variable. For example, how does the number of economically active population depend on the number of enterprises, wages and other parameters. Or: how do foreign investments, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to highlight priorities. And based on the main factors, predict, plan the development of priority areas, and make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx 2);
  • exponential (y = a * exp(bx));
  • power (y = a*x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * 1n(x) + a);
  • exponential (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of quitting employees were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

The linear regression model has the following form:

Y = a 0 + a 1 x 1 +…+a k x k.

Where a are regression coefficients, x are influencing variables, k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can help you calculate the parameters of a linear regression model. But the “Analysis Package” add-on will do this faster.

We activate a powerful analytical tool:

Once activated, the add-on will be available in the Data tab.

Now let's do the regression analysis itself.



First of all, we pay attention to R-squared and coefficients.

R-squared is the coefficient of determination. In our example – 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model. Good - above 0.8. Bad – less than 0.5 (such an analysis can hardly be considered reasonable). In our example – “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, the average monthly salary within this model affects the number of quitters with a weight of -0.16285 (this is a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a connection, then does an increase in one parameter lead to an increase (positive correlation) or a decrease (negative) of the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted by r. Varies from +1 to -1. The classification of correlations for different areas will be different. When the coefficient is 0, there is no linear relationship between samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” - the first range of values ​​– machine operating time: A2:A14.
  3. Argument “Array 2” - second range of values ​​– repair cost: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute number of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use “Data Analysis” (the “Analysis Package” add-on). You need to select correlation from the list and designate the array. All.

The resulting coefficients will be displayed in the correlation matrix. Like this:

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

The MS Excel package allows you to do most of the work very quickly when constructing a linear regression equation. It is important to understand how to interpret the results obtained.

Requires an add-on to work Analysis package, which must be enabled in the menu item Service\Add-ons

In Excel 2007, to enable the analysis package, you need to click go to block Excel Options by clicking the button in the upper left corner and then the " Excel Options"at the bottom of the window:



To build a regression model, you must select the item Service\Data Analysis\Regression. (In Excel 2007, this mode is in the block Data/Data Analysis/Regression). A dialog box will appear that you need to fill out:

1) Input interval Y¾ contains a link to cells that contain the values ​​of the resulting characteristic y. The values ​​must be arranged in a column;

2) Input interval X¾ contains a link to cells that contain factor values. The values ​​must be arranged in columns;

3) Sign Tags set if the first cells contain explanatory text (data labels);

4) Reliability level¾ is the confidence level, which is considered to be 95% by default. If you are not satisfied with this value, then you need to enable this flag and enter the required value;

5) Sign Constant-zero is included if it is necessary to construct an equation in which the free variable is ;

6) Output Options determine where the results should be placed. By default builds mode New worksheet;

7) Block Leftovers allows you to include the output of residuals and the construction of their graphs.

As a result, information is displayed containing all the necessary information and grouped into three blocks: Regression statistics, Analysis of variance, Withdrawal of balance. Let's take a closer look at them.

1. Regression statistics:

multiple R is determined by the formula ( Pearson correlation coefficient);

R (coefficient of determination);

Normalized R-square is calculated using the formula (used for multiple regression);

Standard error S calculated by the formula ;

Observations ¾ is the amount of data n.

2. Analysis of variance, line Regression:

Parameter df equals m(number of factor sets x);

Parameter SS is determined by the formula ;

Parameter MS is determined by the formula ;

Statistics F is determined by the formula ;

Significance F. If the resulting number exceeds , then the hypothesis is accepted (there is no linear relationship), otherwise the hypothesis is accepted (there is a linear relationship).


3. Analysis of variance, line Remainder:

Parameter df equal to ;

Parameter SS is determined by the formula ;

Parameter MS is determined by the formula.

4. Analysis of variance, line Total contains the sum of the first two columns.

5. Analysis of variance, line Y-intersection contains the coefficient, standard error and t-statistics.

P-value ¾ is the value of the significance levels corresponding to the calculated t-statisticians. Determined by the function STUDIST( t-statistics; ). If P-value exceeds , then the corresponding variable is statistically insignificant and can be excluded from the model.

Bottom 95% And Top 95%¾ are the lower and upper limits of the 95 percent confidence intervals for the coefficients of the theoretical linear regression equation. If the confidence probability value in the data input block was left at its default value, then the last two columns will duplicate the previous ones. If the user has entered a confidence value, the last two columns contain the lower and upper bound values ​​for the specified confidence level.

6. Analysis of variance, lines contain the values ​​of coefficients, standard errors, t-statistician, P-values ​​and confidence intervals for the corresponding .

7. Block Withdrawal of balance contains the predicted values y(in our notation this is ) and residues .

The linear regression method allows us to describe a straight line that best fits a series of ordered pairs (x, y). The equation for a straight line, known as the linear equation, is given below:

ŷ is the expected value of y for a given value of x,

x is the independent variable,

a is a segment on the y-axis for a straight line,

b is the slope of the straight line.

The figure below illustrates this concept graphically:

The figure above shows the line described by the equation ŷ =2+0.5x. The y-intercept is the point at which the line intersects the y-axis; in our case, a = 2. The slope of the line, b, the ratio of the rise of the line to the length of the line, has a value of 0.5. A positive slope means the line rises from left to right. If b = 0, the line is horizontal, which means there is no relationship between the dependent and independent variables. In other words, changing the value of x does not affect the value of y.

ŷ and y are often confused. The graph shows 6 ordered pairs of points and a line, according to the given equation

This figure shows the point corresponding to the ordered pair x = 2 and y = 4. Note that the expected value of y according to the line at X= 2 is ŷ. We can confirm this with the following equation:

ŷ = 2 + 0.5х =2 +0.5(2) =3.

The y value represents the actual point and the ŷ value is the expected value of y using a linear equation for a given value of x.

The next step is to determine the linear equation that best matches the set of ordered pairs, we talked about this in the previous article, where we determined the type of equation by .

Using Excel to Define Linear Regression

In order to use the regression analysis tool built into Excel, you must activate the add-in Analysis package. You can find it by clicking on the tab File -> Options(2007+), in the dialog box that appears OptionsExcel go to the tab Add-ons. In field Control choose Add-onsExcel and click Go. In the window that appears, check the box next to Analysis package, click OK.

In the tab Data in Group Analysis a new button will appear Data analysis.

To demonstrate the work of the add-in, we will use data where a guy and a girl share a table in the bathroom. Enter the data from our bathtub example in Columns A and B of the blank sheet.

Go to the tab Data, in Group Analysis click Data analysis. In the window that appears Data analysis select Regression as shown in the figure and click OK.

Set the necessary regression parameters in the window Regression, as it shown on the picture:

Click OK. The figure below shows the results obtained:

These results are consistent with those we obtained by doing our own calculations in .

The construction of linear regression, evaluation of its parameters and their significance can be performed much faster when using the Excel analysis package (Regression). Let us consider the interpretation of the results obtained in the general case ( k explanatory variables) according to example 3.6.

In the table regression statistics the following values ​​are given:

Multiple R – multiple correlation coefficient;

R- square– coefficient of determination R 2 ;

Normalized R - square– adjusted R 2 adjusted for the number of degrees of freedom;

Standard error– regression standard error S;

Observations – number of observations n.

In the table Analysis of variance are given:

1. Column df - number of degrees of freedom equal to

for string Regression df = k;

for string Remainderdf = nk – 1;

for string Totaldf = n– 1.

2. Column SS – the sum of squared deviations equal to

for string Regression ;

for string Remainder ;

for string Total .

3. Column MS variances determined by the formula MS = SS/df:

for string Regression– factor dispersion;

for string Remainder– residual variance.

4. Column F – calculated value F-criterion calculated using the formula

F = MS(regression)/ MS(remainder).

5. Column Significance F – significance level value corresponding to the calculated F-statistics .

Significance F= FDIST( F- statistics, df(regression), df(remainder)).

If significance F < стандартного уровня значимости, то R 2 is statistically significant.

Odds Standard error t-statistics P-value Bottom 95% Top 95%
Y 65,92 11,74 5,61 0,00080 38,16 93,68
X 0,107 0,014 7,32 0,00016 0,0728 0,142

This table shows:

1. Odds– coefficient values a, b.

2. Standard error– standard errors of regression coefficients S a, Sb.



3. t- statistics– calculated values t -criteria calculated by the formula:

t-statistic = Coefficients/Standard error.

4.R-value (significance t) is the significance level value corresponding to the calculated t- statistics.

R-value = STUDIDIST(t-statistics, df(remainder)).

If R-meaning< стандартного уровня значимости, то соответствующий коэффициент статистически значим.

5. Bottom 95% and Top 95%– lower and upper limits of 95% confidence intervals for the coefficients of the theoretical linear regression equation.

WITHDRAWAL OF THE REST
Observation Predicted y Residues e
72,70 -29,70
82,91 -20,91
94,53 -4,53
105,72 5,27
117,56 12,44
129,70 19,29
144,22 20,77
166,49 24,50
268,13 -27,13

In the table WITHDRAWAL OF THE REST indicated:

in the column Observation– observation number;

in the column Foretold y – calculated values ​​of the dependent variable;

in the column Leftovers e – the difference between the observed and calculated values ​​of the dependent variable.

Example 3.6. There are data (conventional units) on food costs y and per capita income x for nine groups of families:

x
y

Using the results of the Excel analysis package (Regression), we will analyze the dependence of food costs on per capita income.

The results of regression analysis are usually written in the form:

where the standard errors of the regression coefficients are indicated in parentheses.

Regression coefficients A = 65,92 and b= 0.107. Direction of communication between y And x determines the sign of the regression coefficient b= 0.107, i.e. the connection is direct and positive. Coefficient b= 0.107 shows that with an increase in per capita income by 1 conventional. units food costs increase by 0.107 conventional units. units

Let us evaluate the significance of the coefficients of the resulting model. Significance of coefficients ( a, b) is checked by t-test:

P-value ( a) = 0,00080 < 0,01 < 0,05

P-value ( b) = 0,00016 < 0,01 < 0,05,

therefore, the coefficients ( a, b) are significant at the 1% level, and even more so at the 5% significance level. Thus, the regression coefficients are significant and the model is adequate to the original data.

The regression estimation results are compatible not only with the obtained values ​​of the regression coefficients, but also with a certain set of them (confidence interval). With a 95% probability, confidence intervals for the coefficients are (38.16 – 93.68) for a and (0.0728 – 0.142) for b.

The quality of the model is assessed by the coefficient of determination R 2 .

Magnitude R 2 = 0.884 means that the per capita income factor can explain 88.4% of the variation (scatter) in food expenses.

Significance R 2 is checked by F- test: significance F = 0,00016 < 0,01 < 0,05, следовательно, R 2 is significant at the 1% level, and even more so at the 5% significance level.

In the case of pairwise linear regression, the correlation coefficient can be defined as . The obtained value of the correlation coefficient indicates that the relationship between food costs and per capita income is very close.

Regression analysis is one of the most popular methods of statistical research. It can be used to establish the degree of influence of independent variables on the dependent variable. Microsoft Excel has tools designed to perform this type of analysis. Let's look at what they are and how to use them.

Connecting the analysis package

But, in order to use the function that allows you to perform regression analysis, you first need to activate the Analysis Package. Only then the tools necessary for this procedure will appear on the Excel ribbon.

  1. Move to the “File” tab.
  2. Go to the “Settings” section.
  3. The Excel Options window opens. Go to the “Add-ons” subsection.
  4. At the very bottom of the window that opens, move the switch in the “Management” block to the “Excel Add-ins” position, if it is in a different position. Click on the “Go” button.
  5. A window of available Excel add-ins opens. Check the box next to “Analysis Package”. Click on the “OK” button.

Now, when we go to the “Data” tab, on the ribbon in the “Analysis” tool block we will see a new button - “Data Analysis”.

Types of Regression Analysis

There are several types of regressions:

  • parabolic;
  • sedate;
  • logarithmic;
  • exponential;
  • demonstrative;
  • hyperbolic;
  • linear regression.

We will talk in more detail about performing the last type of regression analysis in Excel later.

Linear Regression in Excel

Below, as an example, is a table showing the average daily air temperature outside and the number of store customers for the corresponding working day. Let's find out using regression analysis exactly how weather conditions in the form of air temperature can affect the attendance of a retail establishment.

The general linear regression equation is as follows: Y = a0 + a1x1 +…+ akhk. In this formula, Y means the variable on which we are trying to study the influence of factors. In our case, this is the number of buyers. The value of x is the various factors that influence the variable. The parameters a are the regression coefficients. That is, they are the ones who determine the significance of a particular factor. The index k denotes the total number of these same factors.


Analysis results analysis

The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-squared. It indicates the quality of the model. In our case, this coefficient is 0.705 or about 70.5%. This is an acceptable level of quality. Dependency less than 0.5 is bad.

Another important indicator is located in the cell at the intersection of the “Y-intersection” row and the “Coefficients” column. This indicates what value Y will have, and in our case, this is the number of buyers, with all other factors equal to zero. In this table, this value is 58.04.

The value at the intersection of the columns “Variable X1” and “Coefficients” shows the level of dependence of Y on X. In our case, this is the level of dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly high influence indicator.

As you can see, using Microsoft Excel it is quite easy to create a regression analysis table. But only a trained person can work with the output data and understand its essence.

We are glad that we were able to help you solve the problem.

Ask your question in the comments, describing the essence of the problem in detail. Our specialists will try to answer as quickly as possible.

Did this article help you?

The linear regression method allows us to describe a straight line that best fits a series of ordered pairs (x, y). The equation for a straight line, known as the linear equation, is given below:

ŷ - expected value of y for a given value of x,

x - independent variable,

a - segment on the y-axis for a straight line,

b is the slope of the straight line.

The figure below illustrates this concept graphically:

The figure above shows the line described by the equation ŷ =2+0.5x. The y-intercept is the point at which the line intersects the y-axis; in our case, a = 2. The slope of the line, b, the ratio of the rise of the line to the length of the line, has a value of 0.5. A positive slope means the line rises from left to right. If b = 0, the line is horizontal, which means there is no relationship between the dependent and independent variables. In other words, changing the value of x does not affect the value of y.

ŷ and y are often confused. The graph shows 6 ordered pairs of points and a line, according to the given equation

This figure shows the point corresponding to the ordered pair x = 2 and y = 4. Note that the expected value of y according to the line at X= 2 is ŷ. We can confirm this with the following equation:

ŷ = 2 + 0.5х =2 +0.5(2) =3.

The y value represents the actual point and the ŷ value is the expected value of y using a linear equation for a given value of x.

The next step is to determine the linear equation that best matches the set of ordered pairs, we talked about this in the previous article, where we determined the form of the equation using the least squares method.

Using Excel to Define Linear Regression

In order to use the regression analysis tool built into Excel, you must activate the add-in Analysis package. You can find it by clicking on the tab File -> Options(2007+), in the dialog box that appears OptionsExcel go to the tab Add-ons. In field Control choose Add-onsExcel and click Go. In the window that appears, check the box next to Analysis package, click OK.

In the tab Data in Group Analysis a new button will appear Data analysis.

To demonstrate how the add-in works, let's use data from a previous article, where a guy and a girl share a table in the bathroom. Enter the data from our bathtub example in Columns A and B of the blank sheet.

Go to the tab Data, in Group Analysis click Data analysis. In the window that appears Data analysis select Regression as shown in the figure and click OK.

Set the necessary regression parameters in the window Regression, as it shown on the picture:

Click OK. The figure below shows the results obtained:

These results are consistent with those we obtained by doing our own calculations in the previous article.

Regression analysis is a statistical research method that allows you to show the dependence of a particular parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large volumes of data. Today, having learned how to build regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of Regression

This concept itself was introduced into mathematics by Francis Galton in 1886. Regression happens:

  • linear;
  • parabolic;
  • sedate;
  • exponential;
  • hyperbolic;
  • demonstrative;
  • logarithmic.

Example 1

Let's consider the problem of determining the dependence of the number of team members who quit on the average salary at 6 industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who quit voluntarily were analyzed. In tabular form we have:

For the task of determining the dependence of the number of quitting workers on the average salary at 6 enterprises, the regression model has the form of the equation Y = a0 + a1×1 +…+аkxk, where хi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the indicator of employees who quit, and the influencing factor is salary, which we denote by X.

Using the capabilities of the Excel spreadsheet processor

Regression analysis in Excel must be preceded by applying built-in functions to existing tabular data. However, for these purposes it is better to use the very useful “Analysis Pack” add-on. To activate it you need:

  • from the “File” tab go to the “Options” section;
  • in the window that opens, select the line “Add-ons”;
  • click on the “Go” button located below, to the right of the “Management” line;
  • check the box next to the name “Analysis package” and confirm your actions by clicking “Ok”.

If everything is done correctly, the required button will appear on the right side of the “Data” tab, located above the Excel worksheet.

Linear Regression in Excel

Now that we have all the necessary virtual tools at hand to carry out econometric calculations, we can begin to solve our problem. For this:

  • click on the “Data Analysis” button;
  • in the window that opens, click on the “Regression” button;
  • in the tab that appears, enter the range of values ​​for Y (the number of quitting employees) and for X (their salaries);
  • We confirm our actions by pressing the “Ok” button.

As a result, the program will automatically fill a new spreadsheet with regression analysis data. Note! Excel allows you to manually set the location you prefer for this purpose. For example, this could be the same sheet where the Y and X values ​​are located, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-squared

In Excel, the data obtained during processing of the data in the example under consideration has the form:

First of all, you should pay attention to the R-squared value. It represents the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the parameters under consideration by 75.5%. The higher the value of the coefficient of determination, the more suitable the selected model is for a specific task. It is considered to correctly describe the real situation when the R-square value is above 0.8. If R-squared is tcr, then the hypothesis about the insignificance of the free term of the linear equation is rejected.

In the problem under consideration for the free term, using Excel tools, it was obtained that t = 169.20903, and p = 2.89E-12, i.e., we have zero probability that the correct hypothesis about the insignificance of the free term will be rejected. For the coefficient for the unknown t=5.79405, and p=0.001158. In other words, the probability that the correct hypothesis about the insignificance of the coefficient for an unknown will be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the feasibility of purchasing a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Let's consider a specific application problem.

The management of the NNN company must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the package (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the shareholding according to such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover volume (VO);
  • accounts receivable (VD);
  • cost of fixed assets (COF).

In addition, the parameter of the enterprise's salary debt (V3 P) in thousands of US dollars is used.

Solution using Excel spreadsheet processor

First of all, you need to create a table of source data. It looks like this:

  • call the “Data Analysis” window;
  • select the “Regression” section;
  • In the “Input interval Y” box, enter the range of values ​​of the dependent variables from column G;
  • Click on the icon with a red arrow to the right of the “Input interval X” window and highlight the range of all values ​​​​from columns B, C, D, F on the sheet.

Mark the “New worksheet” item and click “Ok”.

Obtain a regression analysis for a given problem.

Study of results and conclusions

We “collect” the regression equation from the rounded data presented above on the Excel spreadsheet:

SP = 0.103*SOF + 0.541*VO – 0.031*VK +0.405*VD +0.691*VZP – 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 – 0.031*x3 +0.405*x4 +0.691*x5 – 265.844

Data for MMM JSC are presented in the table:

Substituting them into the regression equation, we get a figure of 64.72 million US dollars. This means that the shares of MMM JSC are not worth purchasing, since their value of 70 million US dollars is quite inflated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The Excel examples discussed above will help you solve practical problems in the field of econometrics.