Testing simple hypotheses using the Pearson chi-square test in MS Excel. The Pearson criterion. Testing the hypothesis of a normal distribution

Laboratory work No. 6. Testing the hypothesis about the normal distribution of the sample using the Pearson criterion.

The lab work is performed in Excel 2007.

The purpose of the work is to develop skills in primary data processing, constructing histograms, selecting a suitable distribution law and calculating its parameters, and checking the agreement between the empirical and hypothetical distribution laws with the Pearson chi-square test in Excel.

1. Generating a sample of normally distributed random numbers with given values of the mathematical expectation and standard deviation.

Data → Data Analysis → Random Number Generation → OK.

Fig. 1. The Data Analysis dialog box

In the Random Number Generation window that appears, enter:

Number of Variables: 1;

Number of Random Numbers: 100;

Distribution: Normal.

Parameters:

Mean = 15 (the mathematical expectation);

Standard Deviation = 2;

Random Seed: leave blank (or fill in as directed by the teacher);

Output Range: the address of the first cell of the random number column, $A$1. OK.

Fig. 2. The Random Number Generation dialog box with the input fields filled in

As a result of the Random Number Generation operation, the column $A$1:$A$100 will appear, containing 100 random numbers.

Fig. 3. Fragment of the Excel sheet with the first few random numbers from $A$1:$A$100.
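Outside Excel, the same step can be sketched with Python's standard library (a sketch, using the mean 15 and standard deviation 2 from the dialog; the seed value is arbitrary and plays the role of the Random Seed field):

```python
import random

# Draw 100 normally distributed values with mean 15 and standard
# deviation 2, mirroring the Random Number Generation dialog above.
# Fixing the seed makes the "column" reproducible.
random.seed(1)
sample = [random.gauss(15, 2) for _ in range(100)]
```

The list corresponds to the column $A$1:$A$100; its mean and spread should land near the requested 15 and 2.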

2. Determination of sampling parameters, descriptive statistics

From the Excel main menu, select: Data → Data Analysis → Descriptive Statistics → OK.

In the Descriptive Statistics window that appears, enter:

Input Range – the 100 random numbers in cells $A$1:$A$100;

Grouped By – Columns;

Output Range – the address of the cell where the Descriptive Statistics table begins, $C$1;

Summary Statistics – check the box. OK.

Fig. 4. The Descriptive Statistics dialog box with the input fields filled in.

A table named Column 1 will appear on the Excel sheet.

Fig. 5. The Column 1 table with the Descriptive Statistics output.

The table contains descriptive statistics, in particular:

Mean – an estimate of the mathematical expectation;

Standard Deviation – an estimate of the standard deviation;

Kurtosis and Skewness – estimates of the kurtosis and skewness.

If the kurtosis and skewness estimates are approximately zero, and the mean estimate is approximately equal to the median estimate, there are preliminary grounds for choosing as H0 the hypothesis that the distribution of the elements of the general population follows the normal law.

Range – the sample range;

Minimum – the minimum value of the random variable in the sample;

Maximum – the maximum value of the random variable in the sample.

Cell F15 holds the length of the partial interval h, calculated as follows.

The number of grouping intervals k is calculated in Excel automatically from the sample size n by the formula

k = [√n] + 1,

where the square brackets mean rounding down to the integer part of the number.

In the variant under consideration n = 100, hence k = [√100] + 1 = 11. Indeed, h is obtained by dividing the sample range by k − 1 = 10; this formula is entered in cell F15: =($D$13-$D$12)/10
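The interval count can be cross-checked with a short sketch (assuming the rule k = ⌊√n⌋ + 1, which is consistent with k = 11 at n = 100; the Maximum and Minimum values stand in for cells $D$13 and $D$12 and are hypothetical):

```python
import math

n = 100
k = math.floor(math.sqrt(n)) + 1   # number of grouping intervals

# length of the partial interval h, as in the cell formula =($D$13-$D$12)/10
vmax, vmin = 21.3, 9.1             # hypothetical Maximum and Minimum
h = (vmax - vmin) / (k - 1)
```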

The results of the Descriptive Statistics procedure will be needed later, when constructing the theoretical distribution law.
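The key rows of the Descriptive Statistics table can be reproduced with Python's standard statistics module (a sketch on ten hypothetical values; the real lab uses the 100 numbers in $A$1:$A$100):

```python
import statistics as st

sample = [14.2, 15.8, 16.1, 13.5, 15.0, 17.2, 14.8, 15.4, 12.9, 16.6]

mean = st.mean(sample)            # "Mean" row: estimate of the expectation
stdev = st.stdev(sample)          # "Standard Deviation" row (n - 1 estimate)
median = st.median(sample)        # compared with the mean as a normality hint
rng = max(sample) - min(sample)   # "Range" row
```

Approximate equality of mean and median here is the same preliminary normality check described above.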

LABORATORY WORK

CORRELATION ANALYSIS IN EXCEL

1.1 Correlation analysis in MS Excel

Correlation analysis consists in determining the degree of connection between two random variables X and Y. The correlation coefficient serves as the measure of this connection. It is estimated from a sample of n related pairs of observations (x_i, y_i) drawn from the joint population of X and Y. To assess the degree of relationship between values of X and Y measured on quantitative scales, the linear correlation coefficient (Pearson coefficient) is used; it assumes that the samples X and Y are normally distributed.

The correlation coefficient varies from −1 (a strict inverse linear relationship) to 1 (a strict direct proportional relationship). At 0 there is no linear relationship between the two samples.

General classification of correlations (according to Ivanter E.V., Korosov A.V., 1992):

There are several types of correlation coefficients, depending on the scales on which the variables X and Y are measured. It is this fact that determines the choice of the appropriate correlation coefficient (see Table 13):

In MS Excel, the special function CORREL(array1; array2) is used to calculate pairwise linear correlation coefficients,

where array1 is a reference to the cell range of the first sample (X), and array2 is a reference to the cell range of the second sample (Y).

Example 1. Ten schoolchildren were given tests of visual-figurative and verbal thinking. The average time for solving the test tasks was measured in seconds. The researcher is interested in whether there is a relationship between the times required to solve the two kinds of problems. Variable X denotes the average time for solving the visual-figurative tests, and variable Y the average time for solving the verbal test tasks.

Solution: To identify the degree of relationship, first enter the data into an MS Excel table (see the table in Fig. 1). Then calculate the value of the correlation coefficient: place the cursor in cell C1 and click the Insert Function (fx) button on the toolbar.

In the Function Wizard dialog box that appears, select the Statistical category and the CORREL function, then click OK. Using the mouse pointer, enter the X sample data range (A1:A10) in the array1 field. In the array2 field, enter the Y sample data range (B1:B10). Click OK. The value of the correlation coefficient, 0.54119, will appear in cell C1. Next, look at the absolute value of the correlation coefficient and determine the type of connection (close, weak, moderate, etc.)

Fig. 1. Results of calculating the correlation coefficient

Thus, a connection between the times for solving visual-figurative and verbal test tasks has not been demonstrated.
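The same computation can be written out in full, which makes clear what CORREL actually calculates (a sketch; the solving times below are hypothetical stand-ins, not the data of Fig. 1):

```python
import math

def correl(xs, ys):
    # Pearson linear correlation coefficient, the formula behind
    # Excel's CORREL(array1; array2)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [18, 22, 25, 20, 19, 27, 24, 21, 23, 26]   # visual-figurative, seconds
y = [30, 25, 35, 27, 29, 38, 30, 28, 33, 31]   # verbal, seconds
r = correl(x, y)
```

The absolute value of r is then interpreted exactly as in the example: close, moderate, or weak connection.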

Exercise 1. Data are available for 20 agricultural holdings. Find the correlation coefficient between the yield of grain crops and the quality of the land, and evaluate its significance. The data are shown in the table.

Table 2. Dependence of grain yield on land quality

Farm number

Land quality, score

Productivity, c/ha


Task 2. Determine whether there is a connection between the operating time of a sports fitness simulator (thousand hours) and the cost of its repair (thousand rubles):

Simulator operating time (thousand hours)

Cost of repairs (thousand rubles)

1.2 Multiple correlation in MS Excel

When the number of observations is large and correlation coefficients must be calculated sequentially for several samples, the resulting coefficients are, for convenience, summarized in tables called correlation matrices.

A correlation matrix is a square table in which the correlation coefficient between the corresponding parameters stands at the intersection of the corresponding row and column.

In MS Excel, the Correlation procedure from the Data Analysis package is used to calculate correlation matrices. The procedure produces a correlation matrix containing the correlation coefficients between the various parameters.

To implement the procedure you need:

1. execute the command Data → Data Analysis;

2. in the Analysis Tools list that appears, select the Correlation line and press OK;

3. in the dialog box that appears, specify the Input Range, that is, enter a reference to the cells containing the data to be analyzed. The input range must contain at least two columns;

4. in the Grouping section, set the switch according to how the data are entered (by columns or by rows);

5. specify the Output Range, that is, enter a reference to the cell from which the analysis results will be displayed. The size of the output range is determined automatically, and a message is displayed if the output range might overlap the source data. Press OK.

A correlation matrix will be written to the output range; at the intersection of each row and column it contains the correlation coefficient between the corresponding parameters. Cells of the output range with matching row and column coordinates contain the value 1, because each column of the input range correlates with itself perfectly.
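The Correlation tool's output can be imitated directly by computing r for every pair of parameter columns (a sketch with three hypothetical parameters):

```python
def correl(xs, ys):
    # Pearson linear correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5

params = [
    [2, 4, 5, 7, 8, 10],   # hypothetical parameter 1
    [1, 3, 6, 6, 9, 11],   # hypothetical parameter 2
    [9, 8, 7, 5, 4, 2],    # hypothetical parameter 3
]
# cell [i][j] holds r between parameter i and parameter j;
# the diagonal is always 1, and the matrix is symmetric
matrix = [[correl(p, q) for q in params] for p in params]
```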

Example 2. There are monthly observational data on weather conditions and attendance at museums and parks (see Table 3). It is necessary to determine whether there is a relationship between weather conditions and attendance at museums and parks.

Table 3. Observation results

Number of clear days

Number of museum visitors

Number of park visitors

Solution. To perform the correlation analysis, enter the source data into the range A1:C7 (Fig. 2). Then select Data → Data Analysis and the Correlation line. In the dialog box that appears, specify the Input Range (A2:C7). Indicate that the data are grouped by columns. Specify the output range (E1) and press OK.

Fig. 2 shows that the correlation between weather conditions and museum attendance is −0.92, between weather conditions and park attendance 0.97, and between park and museum attendance −0.92.

Thus, as a result of the analysis, dependencies were revealed: a strong degree of inverse linear relationship between museum attendance and the number of sunny days and an almost linear (very strong direct) relationship between park attendance and weather conditions. There is a strong inverse relationship between museum and park attendance.

Fig. 2. Results of calculating the correlation matrix from Example 2

Task 3. Ten managers were assessed by the method of expert assessment of the psychological characteristics of a manager's personality. Fifteen experts rated each psychological characteristic on a five-point scale (see Table 4). The psychologist is interested in the relationship between these characteristics of a leader.

Table 4. Study results

Subjects

tact

exactingness

criticality

Pearson's χ² test is a nonparametric method that allows one to assess the significance of differences between the actual (observed) number of outcomes or qualitative characteristics of the sample falling into each category and the theoretical number that would be expected in the studied groups if the null hypothesis were true. Put simply, the method allows one to estimate the statistical significance of differences between two or more relative indicators (frequencies, proportions).

1. History of the development of the χ² criterion

The chi-square test for analyzing contingency tables was developed and proposed in 1900 by Karl Pearson (1857-1936), the English mathematician, statistician, biologist, and philosopher, founder of mathematical statistics and one of the founders of biometrics.

2. Why is Pearson's χ² test used?

The chi-square test can be applied to the analysis of contingency tables containing information on the frequency of outcomes depending on the presence of a risk factor. For example, a four-field contingency table looks as follows:

                          Outcome present (1)   Outcome absent (0)   Total
Risk factor present (1)           A                     B            A+B
Risk factor absent (0)            C                     D            C+D
Total                            A+C                   B+D        A+B+C+D

How to fill out such a contingency table? Let's look at a small example.

A study is conducted of the effect of smoking on the risk of developing arterial hypertension. Two groups of subjects were selected: the first included 70 people who smoke at least one pack of cigarettes daily, the second 80 non-smokers of the same age. In the first group, 40 people had high blood pressure; in the second, arterial hypertension was observed in 32 people. Accordingly, blood pressure was normal in 30 people in the smoking group (70 − 40 = 30) and in 48 people in the non-smoking group (80 − 32 = 48).

We fill in the four-field contingency table with the initial data:

In the resulting contingency table, each row corresponds to a particular group of subjects, and the columns show the number of people with arterial hypertension or with normal blood pressure.
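Filling the table is simple arithmetic on the group totals; as a sketch with the numbers from the example:

```python
# Four-field table from the smoking study: rows are the groups,
# columns are hypertension / normal blood pressure.
smokers, nonsmokers = 70, 80
a = 40                 # smokers with hypertension
b = smokers - a        # smokers with normal pressure
c = 32                 # non-smokers with hypertension
d = nonsmokers - c     # non-smokers with normal pressure

table = [[a, b], [c, d]]
grand_total = a + b + c + d
```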

The researcher's task is: are there statistically significant differences between the frequencies of high blood pressure among smokers and non-smokers? The question is answered by calculating the Pearson chi-square statistic and comparing the resulting value with the critical one.

3. Conditions and limitations for using the Pearson chi-square test

  1. The indicators being compared must be measured on a nominal scale (for example, the patient's sex: male or female) or an ordinal scale (for example, the degree of arterial hypertension, taking values from 0 to 3).
  2. The method allows the analysis not only of four-field tables, where both the factor and the outcome are binary variables, that is, have only two possible values (for example, male or female sex, the presence or absence of a certain disease in the anamnesis...). The Pearson chi-square test can also be applied to multi-field tables, where the factor and (or) the outcome takes three or more values.
  3. The groups being compared must be independent; the chi-square test should not be used for before-after comparisons. In such cases the McNemar test (for comparing two related samples) or Cochran's Q test (for comparing three or more groups) is calculated instead.
  4. When analyzing four-field tables, the expected value in every cell must be at least 10. If in at least one cell the expected count lies between 5 and 9, the chi-square test must be calculated with Yates's correction. If in at least one cell the expected count is less than 5, Fisher's exact test must be used instead.
  5. When analyzing multifield tables, the expected number of observations should not be less than 5 in more than 20% of the cells.

4. How to calculate the Pearson chi-square test?

To calculate the chi-square statistic you need to:

1. Calculate the expected count for each cell: E = (row total × column total) / grand total.

2. Find the chi-square statistic as the sum over all cells of (O − E)²/E, where O is the observed count and E the expected count.

3. Determine the number of degrees of freedom f = (r − 1)(c − 1) and find the critical value from the table for the chosen significance level.

4. Compare the obtained value with the critical one.

This algorithm is applicable to both four-field and multi-field tables.

5. How to interpret the value of the Pearson chi-square test?

If the obtained value of the χ² statistic exceeds the critical value, we conclude that there is a statistical relationship between the studied risk factor and the outcome at the corresponding significance level.

6. Example of calculating the Pearson chi-square test

Let us determine the statistical significance of the influence of the smoking factor on the incidence of arterial hypertension using the table discussed above:

  1. We calculate the expected value for each cell as E = (row total × column total) / grand total: for example, for smokers with hypertension E = 70 × 72 / 150 = 33.6; similarly 36.4, 38.4 and 41.6 for the remaining cells.
  2. We find the value of the Pearson chi-square statistic:

    χ² = (40 − 33.6)²/33.6 + (30 − 36.4)²/36.4 + (32 − 38.4)²/38.4 + (48 − 41.6)²/41.6 = 4.396.

  3. The number of degrees of freedom f = (2-1)*(2-1) = 1. Using the table, we find the critical value of the Pearson chi-square test, which at the significance level p=0.05 and the number of degrees of freedom 1 is 3.841.
  4. We compare the obtained value of the chi-square test with the critical one: 4.396 > 3.841, therefore, the dependence of the incidence of arterial hypertension on the presence of smoking is statistically significant. The significance level of this relationship corresponds to p<0.05.
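The whole worked example fits in a few lines of Python (same data and steps as above):

```python
observed = [[40, 30],   # smokers: hypertension / normal blood pressure
            [32, 48]]   # non-smokers

rows = [sum(r) for r in observed]         # group totals: 70, 80
cols = [sum(c) for c in zip(*observed)]   # outcome totals: 72, 78
total = sum(rows)                         # grand total: 150

# expected counts: E = row total * column total / grand total
expected = [[rows[i] * cols[j] / total for j in range(2)] for i in range(2)]

# chi-square statistic: sum of (O - E)^2 / E over all four cells;
# it comes out near 4.396, above the critical 3.841 for df = 1
chi2 = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
           for i in range(2) for j in range(2))
```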

Assessment of compliance with normal distribution

This method is used to check the agreement between the experimental and theoretical distributions when the number of trials exceeds 100.

The essence of the method is to calculate the Pearson statistic (χ²) and then compare the obtained value with the theoretical one.

The procedure for determining the Pearson criterion:

1. The mean value and the standard deviation are determined. To calculate the Pearson statistic, a table is drawn up (Table 11).

2. For each class, the ratio t = |x_i − x̄| / S is determined, where x_i is the class midpoint, x̄ the mean, and S the standard deviation.

3. Using a special table (Table 12), the value of the distribution density Y0 is determined from t.


Table 11


Table 12

Values of the standard normal density Y0 = φ(t) for t from 0.00 to 3.99 in steps of 0.01 (fragment of the first column: 0.3989, 0.2420, 0.0544, 0.0044).

4. Calculate the theoretical frequency value

U_t = (n · k / S) · Y0, (40)

where n is the total number of trials;

k is the class interval;

S is the standard deviation.

5. Determine the differences between the actual and theoretical frequency distributions

y_i − U_t, (41)

and from them calculate

(y_i − U_t)² / U_t. (42)

6. Find the Pearson statistic

χ²_f = Σ (y_i − U_t)² / U_t. (43)

7. Determine the number of degrees of freedom

C = m − 3, (44)

where C is the number of degrees of freedom;

m is the number of classes (rows).

8. Setting the confidence probability q, determine the theoretical value χ²_t of the Pearson statistic.

9. Compare χ²_f with χ²_t. If χ²_f < χ²_t, then at the accepted confidence probability the hypothesis of agreement between the experimental and theoretical distributions is accepted; otherwise it is rejected.
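Steps 2-4 can be sketched in Python by computing the density φ(t) directly instead of looking it up in Table 12 (a sketch: the class data are hypothetical, and the theoretical frequency is taken in the standard form U_t = (n·k/S)·φ(t), an assumption about formula (40)):

```python
import math

def phi(t):
    # standard normal density, the quantity tabulated in Table 12
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

n, k, s = 200, 0.5, 1.8        # trials, class interval, std deviation
x_mid, mean = 10.9, 10.0       # hypothetical class midpoint and mean
t = abs(x_mid - mean) / s      # step 2: the ratio t
u_t = (n * k / s) * phi(t)     # step 4: theoretical frequency for the class
```

Repeating this for every class and summing (y_i − U_t)²/U_t gives χ²_f of formula (43).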

In Excel, the check is carried out with the CHITEST function (Fig. 22). CHITEST returns the p-value for the χ² distribution; the test is used to determine whether a hypothesis is supported by experiment.

Fig. 22. The CHITEST function

CHITEST(actual_range; expected_range)

Actual_range is the data range containing the observations to be compared against the expected values.

Expected_range is the data range containing the ratios of the products of row and column totals to the grand total.

If actual_range and expected_range have different numbers of data points, CHITEST returns the #N/A error value.

The χ² test first calculates the χ² statistic using the formula

χ² = Σ_i Σ_j (A_ij − E_ij)² / E_ij, (45)

where A_ij is the actual frequency in the i-th row and j-th column;

E_ij is the expected frequency in the i-th row and j-th column;

r is the number of rows;

c is the number of columns.

The value of the χ² statistic is an indicator of independence. As the formula shows, χ² is always positive or zero, and the latter is possible only when A_ij = E_ij for all i, j.

CHITEST returns the probability that, under independence, a value of the χ² statistic at least as large as the one obtained from the formula above could be observed. To calculate this probability, CHITEST uses the χ² distribution with the appropriate number of degrees of freedom (df). If r > 1 and c > 1, then df = (r − 1)(c − 1); if r = 1 and c > 1, then df = c − 1; if r > 1 and c = 1, then df = r − 1. The case r = c = 1 is not allowed, and the #N/A error value is returned.

The CHITEST function can be used when the hypothetical distribution is completely specified, that is, when not only the type of the hypothetical distribution law but also all of its parameters are given. Only in that case does the function use the correct number of degrees of freedom.
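For a four-field table (df = 1) the probability CHITEST returns can be reproduced with the standard library, since the one-tailed χ² probability for one degree of freedom reduces to erfc(√(x/2)) (a sketch for df = 1 only):

```python
import math

def chi2_tail_df1(x):
    # P(X > x) for a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(x / 2))

# statistic from the smoking example; p comes out near 0.036,
# below 0.05, matching the earlier conclusion
p = chi2_tail_df1(4.396)
```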

CHIDIST(x; degrees_freedom) (Fig. 23) returns the one-tailed probability of the chi-square distribution. The χ² distribution is associated with the χ² test, which is used to compare expected and observed values. For example, in a genetic experiment it may be hypothesized that the next generation of plants will have a certain color; by comparing the observed results with the expected ones, one can determine whether the original hypothesis was correct.

x is the value for which you want to calculate the distribution.

Degrees_freedom – number of degrees of freedom.

Fig. 23. The CHIDIST function

If either argument is not a number, CHIDIST returns the #VALUE! error value.

If x is negative, CHIDIST returns the #NUM! error value.

If degrees_freedom < 1 or degrees_freedom > 10^10, CHIDIST returns the #NUM! error value.

CHIDIST is calculated as CHIDIST = P(X > x), where X is a χ² random variable.

CHIINV(probability; degrees_freedom) (Fig. 24) returns the inverse of the one-tailed probability of the chi-square distribution: if probability = CHIDIST(x; …), then CHIINV(probability; …) = x. This function allows observed results to be compared with expected ones to determine whether the original hypothesis was correct.

Probability is the probability associated with the χ² (chi-square) distribution.

Degrees_freedom - number of degrees of freedom.

If either argument is not a number, CHIINV returns the #VALUE! error value.

Fig. 24. The CHIINV function

If probability < 0 or probability > 1, CHIINV returns the #NUM! error value.

If the value of the degrees_freedom argument is not an integer, it is truncated.

If degrees_freedom < 1 or degrees_freedom ≥ 10^10, CHIINV returns the #NUM! error value.

Given a probability value, CHIINV searches for the value x for which CHIDIST(x; degrees_freedom) = probability. The accuracy of CHIINV therefore depends on the accuracy of CHIDIST. CHIINV uses an iterative search; if the search has not converged after 100 iterations, the function returns the #N/A error value.
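The iterative search CHIINV performs can be imitated by bisection on the df = 1 tail probability erfc(√(x/2)) (an illustration for one degree of freedom only, not Excel's actual algorithm):

```python
import math

def chidist_df1(x):
    # one-tailed chi-square probability for 1 degree of freedom
    return math.erfc(math.sqrt(x / 2))

def chiinv_df1(probability, lo=0.0, hi=100.0):
    # bisection: chidist_df1 is decreasing in x, so move lo up while
    # the tail probability is still larger than the target
    for _ in range(100):
        mid = (lo + hi) / 2
        if chidist_df1(mid) > probability:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x_crit = chiinv_df1(0.05)   # recovers the critical value near 3.841
```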

The correlation coefficient reflects the degree of relationship between two indicators. It always takes a value from −1 to 1. If the coefficient is around 0, there is no linear connection between the variables.

If the value is close to one (from 0.9 upward, for example), there is a strong direct relationship between the observed quantities. If the coefficient is close to the other end of the range (−1), there is a strong inverse relationship. When the value lies somewhere between 0 and 1 or between 0 and −1, the connection is weak (direct or inverse, respectively). Such a relationship is usually not taken into account: it is considered to be absent.

Calculation of correlation coefficient in Excel

Let's look at an example of methods for calculating the correlation coefficient, features of direct and inverse relationships between variables.

Values ​​of indicators x and y:

Y is the independent variable and x is the dependent variable. It is necessary to find the strength (strong/weak) and direction (direct/inverse) of the connection between them. The correlation coefficient formula looks like this:

r_xy = Σ(x − x̄)(y − ȳ) / √( Σ(x − x̄)² · Σ(y − ȳ)² )
To make it easier to understand, let's break it down into several simple elements.

A strong direct relationship is determined between the variables.

The built-in CORREL function avoids cumbersome manual calculations. Let us calculate the pairwise correlation coefficient in Excel with it: call the Function Wizard, find CORREL, and supply as arguments the array of y values and the array of x values:

Let's show the values ​​of the variables on the graph:


A strong connection between y and x is visible, because the lines run almost parallel to each other: when y increases, x increases, and when y decreases, x decreases. The relationship is direct.



Pair correlation coefficient matrix in Excel

A correlation matrix is a table with the correlation coefficients between the corresponding values at the intersections of its rows and columns. It makes sense to build one for several variables.

The matrix of correlation coefficients in Excel is constructed using the “Correlation” tool from the “Data Analysis” package.


A strong direct relationship was found between the values of y and x1. Between x1 and x2 there is a strong inverse relationship. There is practically no connection with the values in column x3.