Statistical interpretation of data; Detection and handling of outlying observations in normal sample
National Standard of the People's Republic of China
Statistical interpretation of data: Detection and handling of outlying observations in normal sample
UDC 519.28
GB 4883-85
1 Reference
1.1 This standard specifies the general principles and implementation methods for judging and handling outliers in normal samples.

1.2 An outlier (or abnormal observation) is an individual value, or a group of values, in a sample that deviates significantly from the remaining observations of the sample to which it belongs.

An outlier may be an extreme manifestation of the inherent random variability of the population. Such an outlier belongs to the same population as the remaining observations in the sample. An outlier may also result from an accidental deviation from the test conditions or test methods, or from an error in observation, calculation, or recording. Such an outlier does not belong to the same population as the remaining observations in the sample.

1.3 For other statistical terms used in this standard, see the national standard GB3358-82 "Statistical Terms and Symbols".

1.4 Application conditions: apart from individual outliers, most of the observations in the sample under examination (the main body of the sample), or their values after a suitable function transformation, come from the same normal or approximately normal population. The judgment that the sample comes from a normal or approximately normal population can be based on physical and technical knowledge, or on a normality test of earlier data of the same nature as the data under examination. The principles and methods can be found in the national standard GB4882-85 "Statistical Processing and Interpretation of Data - Normality Test".

2 Statistical principles for judging outliers
2.1 This standard judges outliers in the sample in the following situations:

Upper-side situation: based on past experience, outliers are all high-end values.
Lower-side situation: based on past experience, outliers are all low-end values.
Two-sided situation: outliers are extreme values that may appear at either end.

Note: The upper-side situation and the lower-side situation are collectively referred to as the one-sided situation.

2.2 When implementing this standard, an upper limit on the number of outliers detected in the sample should be specified (a small proportion of the number of sample observations). When this limit is exceeded, the representativeness of the sample should be carefully studied and dealt with.

2.3 Test rule for judging a single outlier

Select an appropriate outlier test rule according to the actual situation (see Chapters 4, 5, and 6). Specify the significance level α of the statistical test for detecting outliers, referred to as the detection level, and determine the critical value of the statistic from α and the number of observations n. Substitute the observations into the statistic given by the test rule. If the value obtained exceeds the critical value, the extreme observation under examination is judged to be an outlier; otherwise the judgment is "no outlier". The detection level α should be 5% or 1% (or 10%).

2.4 Test rule for multiple outliers

When more than one outlier is allowed to be detected, the method specified in this standard is to apply the same single-outlier test rule repeatedly. First test all observations at the specified detection level using the rule in 2.3. If no outlier is detected, the whole test stops. If an outlier is detected, remove it and test the remaining observations at the same detection level with the same rule, and continue in this way until no outlier is detected or the number of detected outliers exceeds the upper limit.

Issued by the National Bureau of Standards on January 29, 1985
Implemented on October 1, 1985
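The repeated procedure of 2.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `nair_lower` and `detect_repeatedly` are hypothetical, the statistic is the lower-side Nair statistic of Chapter 4 (mean minus minimum, divided by the known standard deviation), and the data and the 5% critical values are those quoted in the worked example of 4.2.

```python
from statistics import mean


def nair_lower(values, sigma):
    """Lower-side Nair statistic: (sample mean - minimum) / known sigma."""
    return (mean(values) - min(values)) / sigma


def detect_repeatedly(values, statistic, extreme, critical, max_outliers):
    """Apply one single-outlier rule repeatedly (sketch of clause 2.4).

    statistic(values) -> test statistic for the current sample
    extreme(values)   -> the observation under suspicion
    critical[n]       -> critical value at the detection level for size n
    Returns (detected outliers, whether the upper limit was exceeded).
    """
    remaining = list(values)
    detected = []
    while True:
        if statistic(remaining) <= critical[len(remaining)]:
            return detected, False      # "no outlier": stop the whole test
        suspect = extreme(remaining)
        detected.append(suspect)
        remaining.remove(suspect)
        if len(detected) > max_outliers:
            return detected, True       # limit exceeded: re-examine the sample


# Dry shrinkage data (%), sigma and 5% critical values from the example in 4.2.
shrinkage = [3.13, 3.49, 4.01, 4.48, 4.61, 4.76, 4.98, 5.25, 5.32, 5.39,
             5.42, 5.57, 5.59, 5.59, 5.63, 5.63, 5.65, 5.66, 5.67, 5.69,
             5.71, 6.00, 6.03, 6.12, 6.76]
r95 = {25: 2.815, 24: 2.800, 23: 2.784}

detected, exceeded = detect_repeatedly(
    shrinkage, lambda v: nair_lower(v, 0.65), min, r95, max_outliers=3)
print(detected, exceeded)
```

The statistic, the suspect observation and the critical values are passed in as arguments, so the same loop could serve the Grubbs, Dixon or kurtosis rules, provided the caller supplies the matching table values.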
3 General rules for handling outliers
3.1 For detected outliers, the technical and physical causes should be sought as far as possible, as a basis for handling them.

3.2 The ways to handle outliers are:

Retain the outlier in the sample for subsequent data analysis.
Eliminate the outlier, that is, exclude it from the sample.
Eliminate the outlier and add appropriate observations to the sample.
Correct the outlier when its actual cause has been found.

3.3 The user of the standard should, according to the nature of the actual problem, weigh the cost of finding the cause of outliers, the benefit of correctly judging outliers, and the risk of wrongly eliminating normal observations, and decide to implement one of the following three rules.

a. No outlier shall be eliminated or corrected unless there are sufficient technical or physical reasons for its abnormality.

b. In addition to outliers whose abnormality is explained by sufficient technical or physical reasons, outliers that are statistically highly abnormal may also be eliminated or corrected. This means the following: specify the significance level α* of the statistical test for judging whether an outlier is highly abnormal, referred to as the elimination level, which is smaller than the detection level α.

In implementation, after the test in 2.3 has been carried out, the detected outlier is immediately tested again according to 2.3 with the elimination level α* in place of the detection level α. If the test is significant at the elimination level, the outlier is judged to be highly abnormal. When the same test rule is used repeatedly, each detected outlier must be tested again at the elimination level to see whether it is highly abnormal. If the outlier detected in a certain round is highly abnormal, this outlier and the outliers detected before it may be eliminated or corrected.

Except in special circumstances, the elimination level is generally 1% or less, and a value greater than 5% is not advisable. When an elimination level is used, the detection level can be 5% or larger.

c. All detected outliers may be eliminated or corrected.

3.4 The detected outliers, the observations that have been eliminated or corrected, and the reasons, should be recorded for future reference.

4 Rules for judging and handling outliers when the standard deviation is known

4.1 This chapter specifies the use of Nair's test or the repeated use of Nair's test.

4.1.1 Test method for the upper-side case
a. For the observations arranged in order of size, $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, calculate the statistic

$$R_n = \frac{x_{(n)} - \bar{x}}{\sigma}$$

where $\sigma$ is the known population standard deviation and $\bar{x}$ is the sample mean.

b. Determine the detection level $\alpha$ and find the critical value $R_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A1.

c. When $R_n > R_{1-\alpha}(n)$, judge the maximum value $x_{(n)}$ to be an outlier; otherwise judge "no outlier".

d. When the elimination level $\alpha^*$ is given, find the critical value $R_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A1. When $R_n > R_{1-\alpha^*}(n)$, judge $x_{(n)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

4.1.2 Test method for the lower-side case

The rule is the same as in 4.1.1, except that the statistic

$$R'_n = \frac{\bar{x} - x_{(1)}}{\sigma}$$

is used instead of $R_n$, and the minimum value $x_{(1)}$ is the one to be judged.

4.1.3 Test method for the two-sided case

a. Calculate the values of $R_n$ and $R'_n$.

b. Determine the detection level $\alpha$ and find the critical value $R_{1-\alpha/2}(n)$ corresponding to $n$ and $\alpha/2$ in Table A1.

c. When $R_n > R'_n$ and $R_n > R_{1-\alpha/2}(n)$, judge the maximum value $x_{(n)}$ to be an outlier; when $R'_n > R_n$ and $R'_n > R_{1-\alpha/2}(n)$, judge the minimum value $x_{(1)}$ to be an outlier; otherwise judge "no outlier".

d. Given the elimination level $\alpha^*$, find the critical value $R_{1-\alpha^*/2}(n)$ corresponding to $n$ and $\alpha^*/2$ in Table A1. When $R_n > R'_n$ and $R_n > R_{1-\alpha^*/2}(n)$, judge the maximum value $x_{(n)}$ to be highly abnormal; when $R'_n > R_n$ and $R'_n > R_{1-\alpha^*/2}(n)$, judge the minimum value $x_{(1)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

4.2 Example of using Nair's test
The dry shrinkage of a certain chemical fiber is examined, giving 25 independent observations (unit: %): 3.13, 3.49, 4.01, 4.48, 4.61, 4.76, 4.98, 5.25, 5.32, 5.39, 5.42, 5.57, 5.59, 5.59, 5.63, 5.63, 5.65, 5.66, 5.67, 5.69, 5.71, 6.00, 6.03, 6.12, 6.76. It is known that under normal conditions the measured quantity follows a normal distribution with $\sigma = 0.65$. Outliers on the lower side are to be examined. It is stipulated that at most three outliers may be detected, and rule b of 3.3 is adopted, with detection level $\alpha = 5\%$ and elimination level $\alpha^* = 1\%$.

For $n = 25$, $\bar{x} = 5.2856$ and

$$R'_{25} = \frac{\bar{x} - x_{(1)}}{\sigma} = \frac{5.2856 - 3.13}{0.65} = 3.316$$

From Table A1, $R_{0.95}(25) = 2.815$ and $R_{0.99}(25) = 3.282$. Since $R'_{25} > R_{0.99}(25)$, 3.13 is judged to be a highly abnormal outlier.

After removing 3.13, the mean of the remaining 24 observations is $\bar{x} = 5.375$ and the minimum value is 3.49, so $R'_{24} = (5.375 - 3.49)/0.65 = 2.90$. For $n = 24$, $R_{0.95}(24) = 2.800$ and $R_{0.99}(24) = 3.269$. Since $R'_{24} > R_{0.95}(24)$, 3.49 is judged to be an outlier; since $R'_{24} < R_{0.99}(24)$, it is not highly abnormal.

After removing 3.13 and 3.49, the mean of the remaining 23 observations is 5.457 and the minimum value is 4.01, so $R'_{23} = (5.457 - 4.01)/0.65 = 2.227$. For $n = 23$, $R_{0.95}(23) = 2.784$. Since $R'_{23} < R_{0.95}(23)$, the judgment is "no outlier".

Thus 3.13 and 3.49 are detected as outliers, of which 3.13 is highly abnormal and can be considered for elimination.

5 Rules for judging and handling outliers when the standard deviation is unknown (I): the number of detected outliers does not exceed 1
5.1 This chapter gives the Grubbs test and the Dixon test. Users of the standard can choose one of the two test methods according to actual requirements (refer to Appendix B).

5.2 Grubbs test

5.2.1 Test method for the upper-side case

a. For the observations $x_1, \dots, x_n$, calculate the statistic

$$G_n = \frac{x_{(n)} - \bar{x}}{s}$$

where $x_{(n)}$ is the maximum observation, and $\bar{x}$ and $s$ are the sample mean and the sample standard deviation, that is,

$$\bar{x} = \frac{x_1 + \dots + x_n}{n}, \qquad s = \sqrt{\frac{1}{n-1}\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)}$$

b. Determine the detection level $\alpha$ and find the critical value $G_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A2.

c. When $G_n > G_{1-\alpha}(n)$, judge the maximum value $x_{(n)}$ to be an outlier; otherwise judge "no outlier".

d. Given the elimination level $\alpha^*$, find the critical value $G_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A2. When $G_n > G_{1-\alpha^*}(n)$, judge $x_{(n)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

5.2.2 Test method for the lower-side case

The rule is the same as in 5.2.1, except that the statistic

$$G'_n = \frac{\bar{x} - x_{(1)}}{s}$$

is used instead of $G_n$, and the minimum observation $x_{(1)}$ is the one to be judged.

5.2.3 Test method for the two-sided case

a. Calculate the values of $G_n$ and $G'_n$.

b. Determine the detection level $\alpha$ and find the critical value $G_{1-\alpha/2}(n)$ corresponding to $n$ and $\alpha/2$ in Table A2.

c. When $G_n > G'_n$ and $G_n > G_{1-\alpha/2}(n)$, judge $x_{(n)}$ to be an outlier; when $G'_n > G_n$ and $G'_n > G_{1-\alpha/2}(n)$, judge $x_{(1)}$ to be an outlier; otherwise judge "no outlier".

d. Given the elimination level $\alpha^*$, find the critical value $G_{1-\alpha^*/2}(n)$ corresponding to $n$ and $\alpha^*/2$ in Table A2. When $G_n > G'_n$ and $G_n > G_{1-\alpha^*/2}(n)$, judge $x_{(n)}$ to be highly abnormal; when $G'_n > G_n$ and $G'_n > G_{1-\alpha^*/2}(n)$, judge $x_{(1)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

5.2.4 Example of using the Grubbs test

The compressive strengths of 10 samples from a delivery batch of bricks are (arranged from small to large; unit: MPa): 4.7, 5.4, 6.0, 6.5, 7.3, 7.7, 8.2, 9.0, 10.1, 14.0. Test whether the maximum value is an outlier, taking the detection level $\alpha = 5\%$.

Calculate

$$\bar{x} = (4.7 + 5.4 + 6.0 + 6.5 + 7.3 + 7.7 + 8.2 + 9.0 + 10.1 + 14.0)/10 = 7.89$$

$$s^2 = \left[(4.7-8)^2 + (5.4-8)^2 + (6.0-8)^2 + (6.5-8)^2 + (7.3-8)^2 + (7.7-8)^2 + (8.2-8)^2 + (9.0-8)^2 + (10.1-8)^2 + (14.0-8)^2 - 10(8-7.89)^2\right]/9 = 7.312$$

$$s = 2.704$$

(When calculating $s$, 8 is subtracted from each observation to simplify the calculation.)

$$G_{10} = \frac{x_{(10)} - \bar{x}}{s} = \frac{14 - 7.89}{2.704} = 2.260$$

For $n = 10$, $G_{0.95}(10) = 2.176$. Since $G_{10} > G_{0.95}(10)$, $x_{(10)} = 14$ is judged to be an outlier.

5.3 Dixon test
5.3.1 Test method for the one-sided case

a. For the observations arranged in order of size, $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, calculate the statistic for the relevant sample size:

n = 3 to 7: high-end test $D = r_{10} = \dfrac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}$; low-end test $D' = r'_{10} = \dfrac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}$

n = 8 to 10: high-end test $D = r_{11} = \dfrac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(2)}}$; low-end test $D' = r'_{11} = \dfrac{x_{(2)} - x_{(1)}}{x_{(n-1)} - x_{(1)}}$

n = 11 to 13: high-end test $D = r_{21} = \dfrac{x_{(n)} - x_{(n-2)}}{x_{(n)} - x_{(2)}}$; low-end test $D' = r'_{21} = \dfrac{x_{(3)} - x_{(1)}}{x_{(n-1)} - x_{(1)}}$

n = 14 to 30: high-end test $D = r_{22} = \dfrac{x_{(n)} - x_{(n-2)}}{x_{(n)} - x_{(3)}}$; low-end test $D' = r'_{22} = \dfrac{x_{(3)} - x_{(1)}}{x_{(n-2)} - x_{(1)}}$

b. Determine the detection level $\alpha$ and find the critical value $D_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A3.

c. When testing the high-end value, if $D > D_{1-\alpha}(n)$, judge $x_{(n)}$ to be an outlier; when testing the low-end value, if $D' > D_{1-\alpha}(n)$, judge $x_{(1)}$ to be an outlier; otherwise judge "no outlier".

d. Given the elimination level $\alpha^*$, find the critical value $D_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A3. When testing the high-end value, if $D > D_{1-\alpha^*}(n)$, judge $x_{(n)}$ to be highly abnormal; when testing the low-end value, if $D' > D_{1-\alpha^*}(n)$, judge $x_{(1)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

5.3.2 Test method for the two-sided case

a. Calculate the values of $D$ and $D'$, where $D$ and $D'$ are as given in a of 5.3.1.

b. Determine the detection level $\alpha$ and find the critical value $D_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A3'.

c. When $D > D'$ and $D > D_{1-\alpha}(n)$, judge $x_{(n)}$ to be an outlier; when $D' > D$ and $D' > D_{1-\alpha}(n)$, judge $x_{(1)}$ to be an outlier; otherwise judge "no outlier".

d. When the elimination level $\alpha^*$ is given, find the critical value $D_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A3'. When $D > D'$ and $D > D_{1-\alpha^*}(n)$, judge $x_{(n)}$ to be highly abnormal; when $D' > D$ and $D' > D_{1-\alpha^*}(n)$, judge $x_{(1)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

5.3.3 Example of using the Dixon test
Sixteen bullets are fired; the ranges (arranged from small to large; unit: m) are 1125, 1248, 1250, 1259, 1273, 1279, 1285, 1285, 1293, 1300, 1305, 1312, 1315, 1324, 1325, 1350.

a. Test whether the low-end value is an outlier, with $\alpha = 1\%$. For $n = 16$, use

$$D' = r'_{22} = \frac{x_{(3)} - x_{(1)}}{x_{(14)} - x_{(1)}} = \frac{1250 - 1125}{1324 - 1125} = 0.6281$$

From Table A3, $D_{0.99}(16) = 0.595$. Since $D' > D_{0.99}(16)$, the minimum value 1125 is judged to be an outlier.

b. Two-sided case. For $n = 16$, calculate $D' = r'_{22} = 0.6281$ and

$$D = r_{22} = \frac{x_{(16)} - x_{(14)}}{x_{(16)} - x_{(3)}} = \frac{1350 - 1324}{1350 - 1250} = 0.26$$

From Table A3', $D_{0.99}(16) = 0.627$. Since $r'_{22} > r_{22}$ and $r'_{22} > D_{0.99}(16)$, the minimum value 1125 is judged to be an outlier.

6 Rules for judging and handling outliers when the standard deviation is unknown (II): the upper limit of the number of detected outliers is greater than 1
6.1 This chapter gives the skewness-kurtosis test and the repeated use of the Dixon test. Users of the standard can choose one of the two test methods according to actual requirements (refer to Appendix B).

6.2 Skewness-kurtosis test

6.2.1 Conditions of use: examine the sample observations and confirm that the main body of the sample comes from a normal population, while the extreme values deviate markedly from the main body.

6.2.2 One-sided case: skewness test

a. For the observations $x_1, x_2, \dots, x_n$, calculate the skewness statistic

$$b_s = \frac{\sqrt{n}\,\sum_{i=1}^{n}(x_i - \bar{x})^3}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{3/2}} = \frac{\sqrt{n}\left[\sum x_i^3 - 3\bar{x}\sum x_i^2 + 2n\bar{x}^3\right]}{\left[\sum x_i^2 - n\bar{x}^2\right]^{3/2}}$$

b. Determine the detection level $\alpha$ and find the critical value $b'_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A4.

c. For the upper-side case, when $b_s > b'_{1-\alpha}(n)$, judge the maximum value $x_{(n)}$ to be an outlier; otherwise judge "no outlier". For the lower-side case, when $-b_s > b'_{1-\alpha}(n)$, judge the minimum value $x_{(1)}$ to be an outlier; otherwise judge "no outlier".

d. When the elimination level $\alpha^*$ is given, find the critical value $b'_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A4. For the upper-side case, when $b_s > b'_{1-\alpha^*}(n)$, judge $x_{(n)}$ to be highly abnormal; for the lower-side case, when $-b_s > b'_{1-\alpha^*}(n)$, judge $x_{(1)}$ to be highly abnormal; otherwise judge "no highly abnormal outlier".

6.2.3 Two-sided case: kurtosis test

a. For the observations $x_1, x_2, \dots, x_n$, calculate the kurtosis statistic

$$b_k = \frac{n\sum_{i=1}^{n}(x_i - \bar{x})^4}{\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right]^{2}} = \frac{n\left[\sum x_i^4 - 4\bar{x}\sum x_i^3 + 6\bar{x}^2\sum x_i^2 - 3n\bar{x}^4\right]}{\left[\sum x_i^2 - n\bar{x}^2\right]^{2}}$$

b. Determine the detection level $\alpha$ and find the critical value $b''_{1-\alpha}(n)$ corresponding to $n$ and $\alpha$ in Table A5.

c. When $b_k > b''_{1-\alpha}(n)$, judge the observation farthest from the mean to be an outlier; otherwise judge "no outlier".

d. When the elimination level $\alpha^*$ is given, find the critical value $b''_{1-\alpha^*}(n)$ corresponding to $n$ and $\alpha^*$ in Table A5. When $b_k > b''_{1-\alpha^*}(n)$, judge the observation farthest from the mean to be highly abnormal; otherwise judge "no highly abnormal outlier".
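The skewness and kurtosis statistics are built from sums of powers of the deviations from the mean, while hand calculation is easier with the raw sums of $x_i^2$, $x_i^3$ and $x_i^4$. The following is a sketch of the algebra linking the two, assuming that $b_s$ and $b_k$ are the usual moment ratios about the sample mean $\bar{x}$ and using only the identity $\sum x_i = n\bar{x}$:

```latex
\begin{aligned}
\sum_{i=1}^{n}(x_i-\bar{x})^2 &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2
  = \sum x_i^2 - n\bar{x}^2,\\
\sum_{i=1}^{n}(x_i-\bar{x})^3 &= \sum x_i^3 - 3\bar{x}\sum x_i^2 + 3\bar{x}^2\sum x_i - n\bar{x}^3
  = \sum x_i^3 - 3\bar{x}\sum x_i^2 + 2n\bar{x}^3,\\
\sum_{i=1}^{n}(x_i-\bar{x})^4 &= \sum x_i^4 - 4\bar{x}\sum x_i^3 + 6\bar{x}^2\sum x_i^2 - 4\bar{x}^3\sum x_i + n\bar{x}^4
  = \sum x_i^4 - 4\bar{x}\sum x_i^3 + 6\bar{x}^2\sum x_i^2 - 3n\bar{x}^4 .
\end{aligned}
```

The last line, multiplied by $n$, is the numerator evaluated in the worked example of 6.2.4, and the square of the first line is its denominator.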
6.2.4 Example of repeated use of the kurtosis test

A famous example from early research on the outlier problem (1883) concerns the residuals of 15 observations of the vertical radius of Venus (unit: second):

-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06, 0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01.

The question is whether -1.40 and 1.01 are outliers.

First, check the conditions of use with normal probability paper (for its use, see national standard GB4882-85 "Normality Test").

[Figure: normal probability paper]

Plot the points on the normal probability paper. The main body of the sample lies close to a straight line on the graph. After a suitable straight line is drawn, individual points at one or both ends of the sample clearly deviate outward, so the skewness-kurtosis test can be used. Calculation gives

$$\bar{x} = 0.27/15 = 0.018, \quad \sum x_i^2 = 4.2545, \quad \sum x_i^3 = -1.417671, \quad \sum x_i^4 = 5.17024805$$

$$b_k = \frac{15\left[5.17024805 + 4 \times 0.018 \times 1.417671 + 6 \times (0.018)^2 \times 4.2545 - 45(0.018)^4\right]}{\left[4.2545 - 15(0.018)^2\right]^2} = \frac{79.20879579}{18.05944013} = 4.3860$$

Taking $\alpha = 5\%$, the corresponding critical value for $n = 15$ is 4.13. Since $b_k = 4.3860 > 4.13$, the value -1.40, which is farthest from the mean 0.018, is judged to be an outlier.

After removing -1.40, for the remaining 14 values calculate

$$\bar{x} = 1.67/14 = 0.1193$$

$$\sum x_i^2 = 4.2545 - 1.96 = 2.2945, \quad \sum x_i^3 = -1.417671 + 2.744000 = 1.326329, \quad \sum x_i^4 = 5.17024805 - 3.84160000 = 1.32864805$$

$$b_k = \frac{14\left[1.32864805 - 4 \times 0.1193 \times 1.326329 + 6 \times (0.1193)^2 \times 2.2945 - 3 \times 14 \times (0.1193)^4\right]}{\left[2.2945 - 14 \times (0.1193)^2\right]^2} = \frac{12.36462926}{4.39025216} = 2.8164$$

For $\alpha = 5\%$ and $n = 14$, the corresponding critical value is about 4.11. Since $b_k < 4.11$, no further outlier can be detected. Only -1.40 is detected as an outlier.
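The arithmetic of this example can be re-checked numerically. The sketch below is illustrative only: it assumes the kurtosis statistic is n times the sum of fourth powers of the deviations from the mean, divided by the square of the sum of squared deviations; it takes the two critical values quoted in the example (4.13 and 4.11) as given; and it takes the seventh residual as -0.05, the value consistent with the sum 0.27 used in the example. The function names are hypothetical.

```python
def kurtosis_statistic(values):
    """b_k = n * sum((x - mean)^4) / (sum((x - mean)^2))^2."""
    n = len(values)
    m = sum(values) / n
    s2 = sum((x - m) ** 2 for x in values)
    s4 = sum((x - m) ** 4 for x in values)
    return n * s4 / s2 ** 2


def farthest_from_mean(values):
    """Observation with the largest absolute deviation from the mean."""
    m = sum(values) / len(values)
    return max(values, key=lambda x: abs(x - m))


residuals = [-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06,
             0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01]
critical_5pct = {15: 4.13, 14: 4.11}   # values quoted in the example

sample = list(residuals)
detected = []
history = []                            # (n, b_k) for each round
while len(sample) in critical_5pct:
    bk = kurtosis_statistic(sample)
    history.append((len(sample), bk))
    if bk <= critical_5pct[len(sample)]:
        break                           # "no outlier": stop
    suspect = farthest_from_mean(sample)
    detected.append(suspect)
    sample.remove(suspect)

# b_k is about 4.386 for n = 15 and about 2.816 for n = 14
print([(n, round(bk, 3)) for n, bk in history], detected)
```

Working from deviations rather than raw power sums avoids the long hand formula, and it agrees with the figures in the example to the precision shown there.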
6.3 Dixon test

6.3.1 See 5.3 for the rules of the Dixon test.

6.3.2 Example of repeated use of the Dixon test

The data are the same as in 6.2.4. For $n = 15$, calculate

$$r_{22} = \frac{x_{(15)} - x_{(13)}}{x_{(15)} - x_{(3)}} = \frac{1.01 - 0.48}{1.01 + 0.30} = 0.405$$

$$r'_{22} = \frac{x_{(3)} - x_{(1)}}{x_{(13)} - x_{(1)}} = \frac{-0.30 + 1.40}{0.48 + 1.40} = 0.585$$

Take $\alpha = 5\%$. For the two-sided problem, the critical value is $D_{0.95}(15) = 0.565$. Since $r'_{22} > r_{22}$ and $r'_{22} > D_{0.95}(15)$, the minimum value -1.40 is judged to be an outlier. After removing this observation, for the remaining 14 values ($n = 14$) calculate

$$r_{22} = \frac{x_{(14)} - x_{(12)}}{x_{(14)} - x_{(3)}} = \frac{1.01 - 0.48}{1.01 + 0.24} = 0.424$$

$$r'_{22} = \frac{x_{(3)} - x_{(1)}}{x_{(12)} - x_{(1)}} = \frac{-0.24 + 0.44}{0.48 + 0.44} = 0.217$$

For $\alpha = 5\%$, the critical value is $D_{0.95}(14) = 0.586$. Neither statistic exceeds it, so no further outlier can be detected, and only -1.40 is detected as an outlier.
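The choice of Dixon ratio by sample size can also be written as a short function. This is a sketch under an assumption: it uses the ratios r10, r11, r21 and r22 for n = 3 to 7, 8 to 10, 11 to 13 and 14 to 30, as indicated in 5.3.1. The function name `dixon_statistics` is hypothetical, and critical values still have to be read from Tables A3 and A3'.

```python
def dixon_statistics(values):
    """Return (D, D'): the high-end and low-end Dixon ratios for 3 <= n <= 30."""
    x = sorted(values)
    n = len(x)
    if not 3 <= n <= 30:
        raise ValueError("Dixon ratios are defined here for 3 <= n <= 30")
    if n <= 7:
        j, k = 1, 0      # r10
    elif n <= 10:
        j, k = 1, 1      # r11
    elif n <= 13:
        j, k = 2, 1      # r21
    else:
        j, k = 2, 2      # r22
    # j: how far inside the suspect end the numerator reaches
    # k: how many values are trimmed from the opposite end in the denominator
    high = (x[-1] - x[-1 - j]) / (x[-1] - x[k])
    low = (x[j] - x[0]) / (x[-1 - k] - x[0])
    return high, low


# Ranges of the 16 bullets from 5.3.3
ranges_m = [1125, 1248, 1250, 1259, 1273, 1279, 1285, 1285,
            1293, 1300, 1305, 1312, 1315, 1324, 1325, 1350]
# Venus residuals from 6.2.4 (seventh value taken as -0.05)
residuals = [-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06,
             0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01]

d16 = dixon_statistics(ranges_m)         # n = 16
d15 = dixon_statistics(residuals)        # n = 15
d14 = dixon_statistics(residuals[1:])    # n = 14, after removing -1.40
print(d16, d15, d14)
```

For the two-sided rule, the larger member of each returned pair is the one compared with the critical value from Table A3'.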
Appendix A
Statistical value tables
(reference)

Table A1 Critical values for Nair's test

Table A2 Critical values for the Grubbs test

Table A3 Critical values for the Dixon test
Statistic by sample size: $r_{10}$ or $r'_{10}$ (n = 3 to 7); $r_{11}$ or $r'_{11}$ (n = 8 to 10); $r_{21}$ or $r'_{21}$ (n = 11 to 13); $r_{22}$ or $r'_{22}$ (n = 14 to 30), as defined in 5.3.1.

Table A3' Critical values for the two-sided Dixon test
Statistic by sample size: the larger of $r_{10}$ and $r'_{10}$; the larger of $r_{11}$ and $r'_{11}$; the larger of $r_{21}$ and $r'_{21}$; the larger of $r_{22}$ and $r'_{22}$.

Table A4 Critical values for the skewness test

Table A5 Critical values for the kurtosis test