Precision of commercial soil testing practice for phosphorus fertilizer recommendations in Finland

Implementation of the Agri-Environmental Program in 1995 has emphasized the role of advisory soil testing in phosphorus (P) input planning and markedly expanded the market for commercial soil testing in Finland. A small precision experiment (5 laboratories) and a simulation study on soil sampling were conducted to evaluate the current precision of the soil testing practice for P. The observed values of reproducibility (95% probability) of soil P determination were 42–61% of the mean P concentration for three soils. This corresponds approximately to a maximum error of one P class in a seven-step classification system. Soil texture and organic matter content are used as secondary variables in P fertilization planning. In commercial soil testing both are determined by finger assessment, and the results have significant errors in most laboratories. Erroneous texture determinations are more likely to lead to errors in P fertilizer recommendations than soil P analysis itself. In this study the largest deviation from a correct P fertilization recommendation was +10 kg ha⁻¹. In the soil sampling simulation, stratified random sampling in areas of differing texture gave results most consistent with the geostatistical analysis of the soil test data, as compared with random, systematic, and judgment sampling strategies.


Introduction
The Finnish Agri-Environmental Program was launched in 1995 with the major objective of reducing the diffuse nutrient load (phosphorus (P) and nitrogen (N)) from agriculture to the environment (Maa-ja metsätalousministeriö 1998).
Measures used to achieve this goal include reducing nutrient inputs for crops and, subsequently, reducing the number of fields high in P. Farmers committed to the Agri-Environmental Program are compensated for possible losses in production, and about 90% of the farmers signed on during the first year because of the economic benefits (Grönroos et al. 1997). In the program,

Peltovuori, T. Precision of commercial soil testing in Finland
regular soil testing for every field is required to target P inputs. This has expanded the market for commercial soil testing in Finland and increased the number of laboratories offering their services.
Vol. 8 (1999): 299-308.
ity classes according to texture and OM content are as large as 30 mg dm⁻³ at the high end of the scale. Valid soil test data (P, texture, and OM) are required to obtain full advantage from the detailed soil P classification system in P input planning. Two major factors affecting the validity of these data are soil sampling and laboratory analyses of the samples. The variation in the final results as affected by the soil test method, different soil testing laboratories, and soil sampling is discussed in this paper. Assessment of the precision of soil P testing practice is important because the results are used as an administrative control method in the Agri-Environmental Program.

Material and methods
The precision of the soil testing method was investigated in a precision experiment with three different soils in five laboratories. Soil test P of soil 1 was approximately 30% lower than, that of soil 2 equal to, and that of soil 3 approximately six times higher than the average result in Finland in the mid-1990s (unpublished data from Soil Analysis Service Ltd., PO Box 500, FIN-50101 Mikkeli). The soils were ground to pass through a 2 mm sieve, homogenized in a plastic container, and divided into 20 subsamples. Four subsamples of each of the three soils, labeled as 12 distinct samples, were sent to each of four commercial laboratories, referred to as A, B, C, and D in this study. It was understood that the laboratories used the established soil test method described by Vuorinen and Mäkitie (1955). The samples were also analyzed at the University of Helsinki as four replicates. The methods used were the same as in the commercial laboratories, with the exception of OM (analyzed with a Leco® CHN 900 analyzer; soil OM content was assumed to be 1.9 × the organic carbon content) and soil texture (pipette method, Elonen 1971). The laboratory of the University is referred to as laboratory E.
The precision of soil P determination was quantified by intra-laboratory repeatability (r) and inter-laboratory reproducibility (R), both with 95% probability, according to an international standard (Suomen standardisoimisliitto 1988). The values of r and R were calculated for each soil from s_r and s_R, where s_r is an estimate of the pooled standard deviation of replicate P determinations in the five laboratories, and s_R is an estimate of the combined standard deviation within and between the laboratories (Suomen standardisoimisliitto 1988). The values of r and R give an estimate of the largest expected difference between two single measurements on the same sample within one laboratory, and the largest expected difference between two measurements on the same sample in any laboratory, respectively. The variability introduced by the heterogeneity among the subsamples was considered to be insignificant. Prior to calculation of r and R, the presence of outliers in the original results, which could suggest gross errors, was checked with Dixon's test for outliers, and the homogeneity of variances was tested with Cochran's maximum variance test (Caulcutt and Boddy 1983). The soil P test results were analyzed with a two-way analysis of variance to detect laboratories with biased results. The analysis of variance and the subsequent pairwise comparisons with Tukey's test were carried out using log-transformed values because of unequal variances at the three P levels of the soils.
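The defining equation for r and R is not reproduced in this copy of the text. As a rough illustration only, the sketch below assumes the usual one-way variance decomposition for a balanced inter-laboratory experiment and a multiplier of about 2.8 (approximately 1.96 × √2) for the 95% limits. Both the multiplier and the function name `precision_limits` are assumptions, not taken from the cited standard, and the numbers in the usage example are invented.

```python
import math


def precision_limits(results, factor=2.8):
    """Estimate s_r, s_R, r and R from a balanced inter-laboratory design.

    results: dict mapping laboratory name -> list of replicate values.
    factor:  multiplier for the 95 % limits (ASSUMED to be about 2.8;
             the standard cited in the paper may use a different value).
    """
    labs = list(results.values())
    p = len(labs)              # number of laboratories
    n = len(labs[0])           # replicates per laboratory
    if any(len(x) != n for x in labs):
        raise ValueError("a balanced design is required")

    lab_means = [sum(x) / n for x in labs]
    grand_mean = sum(lab_means) / p

    # Pooled within-laboratory variance (repeatability variance).
    ss_within = sum((v - m) ** 2 for x, m in zip(labs, lab_means) for v in x)
    var_r = ss_within / (p * (n - 1))

    # Variance of the laboratory means around the grand mean.
    var_means = sum((m - grand_mean) ** 2 for m in lab_means) / (p - 1)

    # Between-laboratory component, truncated at zero if negative.
    var_L = max(var_means - var_r / n, 0.0)

    s_r = math.sqrt(var_r)
    s_R = math.sqrt(var_r + var_L)
    return {"mean": grand_mean, "s_r": s_r, "s_R": s_R,
            "r": factor * s_r, "R": factor * s_R}


if __name__ == "__main__":
    # Invented soil P values (mg dm^-3) for one soil in five laboratories.
    demo = {"A": [11.0, 12.0, 11.5, 11.5], "B": [12.0, 11.5, 12.5, 12.0],
            "C": [9.0, 9.5, 9.0, 9.5], "D": [8.0, 8.5, 8.0, 8.5],
            "E": [10.0, 10.5, 11.0, 10.5]}
    out = precision_limits(demo)
    print({k: round(v, 2) for k, v in out.items()})
```

Expressing `R` as a percentage of `mean` gives a figure comparable in kind to the relative reproducibility values reported in the paper.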
The variability of P fertilizer recommendations based on the soil test results of the precision experiment was studied by determining a P fertilizer recommendation for a barley crop with an expected yield of 4 Mg ha⁻¹. According to the regulations of the Agri-Environmental Program, the fertilization rates for soil P classes 1, 2, 3, 4, 5, and 6 or 7 are 43, 33, 28, 18, 13, and 0 kg ha⁻¹, respectively (Viljavuuspalvelu 1998, Maa-ja metsätalousministeriö 1998).
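The class-to-rate rule stated above can be written as a small lookup, which also makes the effect of a misclassification easy to see. This is a minimal sketch: the rates are those quoted in the text for barley at 4 Mg ha⁻¹, while the function names are hypothetical. Assigning the P class itself requires the soil P, texture and OM thresholds of the classification table, which is not reproduced in this section and is therefore not modeled here.

```python
# P fertilizer rate (kg ha^-1) per soil P class for barley, 4 Mg ha^-1 yield,
# as quoted in the text; classes 6 and 7 both receive no P.
P_RATE_KG_HA = {1: 43, 2: 33, 3: 28, 4: 18, 5: 13, 6: 0, 7: 0}


def barley_p_rate(p_class):
    """Return the P rate for a soil P class (1-7)."""
    if p_class not in P_RATE_KG_HA:
        raise ValueError(f"unknown soil P class: {p_class!r}")
    return P_RATE_KG_HA[p_class]


def recommendation_error(reported_class, reference_class):
    """Error in the P rate caused by reporting the wrong class.

    Positive values mean more P is recommended than the reference class allows.
    """
    return barley_p_rate(reported_class) - barley_p_rate(reference_class)


if __name__ == "__main__":
    # A soil that belongs to class 4 but is reported as class 3.
    print(recommendation_error(reported_class=3, reference_class=4))
```

For example, a one-class shift from class 4 to class 3 raises the rate from 18 to 28 kg ha⁻¹, whereas a shift from class 5 to class 4 raises it by only 5 kg ha⁻¹.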
The variability of soil test results introduced by sampling was examined by simulating different sampling strategies on a dataset of 221 soil test results. The dataset was obtained previously using systematic sampling on a 1.44 ha test field that had three distinct areas differing in soil texture. All analyses for the dataset had been carried out in one laboratory. Four different strategies were simulated (Petersen and Calvin 1996): 1) simple random sampling in the whole field, 2) stratified random sampling in the three areas of differing texture, 3) systematic sampling along longitudinal lines across the field, and 4) a single random sample from each of the three areas, representing an example of judgment sampling. Deviation of simulated random sampling (pre-selected sampling units) from true random sampling was compensated by the high original sampling density. In cases 1, 2, and 3, the results of ten original soil samples were chosen according to the respective strategy, and a soil test result was calculated for an imaginary composite sample as the average of the ten soil P results and the mode of the ten OM and texture results obtained. In case 4 the single figures obtained were used. The simulations were repeated ten times and the average result and repeatability (95% probability) were reported. Fertilizer recommendations were calculated as above.
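The simulation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the original analysis: the field data are invented (only the total of 221 sampling units and the three strata follow the text), the stratum sizes, P values and the east-side texture are made up, and all function names are hypothetical. Systematic sampling along lines is omitted because it needs the grid coordinates, which are not given here.

```python
import random
import statistics

# Invented stand-in for the 221-point dataset (values are NOT from the paper).
FIELD = (
    [{"stratum": "west", "texture": "fine sand", "p": 8.0 + 0.2 * i}
     for i in range(90)]
    + [{"stratum": "clay patch", "texture": "clay", "p": 6.0 + 0.05 * i}
       for i in range(41)]
    + [{"stratum": "east", "texture": "very fine sand", "p": 10.0 + 0.02 * i}
       for i in range(90)]
)


def composite(samples):
    """Imaginary composite sample: mean soil P and modal texture."""
    p_values = [s["p"] for s in samples]
    textures = [s["texture"] for s in samples]
    return {"p": statistics.mean(p_values), "texture": statistics.mode(textures)}


def simple_random(field, k, rng):
    """Strategy 1: k units drawn at random from the whole field."""
    return composite(rng.sample(field, k))


def stratified_random(field, k, rng):
    """Strategy 2: k units drawn at random within each texture area.

    With k=1 this mimics strategy 4 (one judgment sample per area).
    """
    strata = {}
    for unit in field:
        strata.setdefault(unit["stratum"], []).append(unit)
    return {name: composite(rng.sample(units, k))
            for name, units in strata.items()}


def repeat(strategy, field, k=10, runs=10, seed=1):
    """Repeat a strategy `runs` times with a reproducible random stream."""
    rng = random.Random(seed)
    return [strategy(field, k, rng) for _ in range(runs)]


if __name__ == "__main__":
    whole = [run["p"] for run in repeat(simple_random, FIELD)]
    print("whole field: mean", round(statistics.mean(whole), 1),
          "sd", round(statistics.stdev(whole), 2))
    west = [run["west"]["p"] for run in repeat(stratified_random, FIELD)]
    print("west stratum: mean", round(statistics.mean(west), 1),
          "sd", round(statistics.stdev(west), 2))
```

The spread of the ten repeated results per strategy is what the repeatability values of the sampling simulation summarize.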
The original soil test data obtained by grid sampling were analyzed with the GEO-EAS 1.2.1 geostatistical software (Englund and Sparks 1991) to illustrate the variability within the experimental field (Fig. 2).

Laboratory analyses
The soil P test method was precise when assessed by variability within a laboratory but considerably less precise when the variability between the laboratories was also taken into account (Table 2). The relative standard deviation of the four replicates exceeded 10% in only one case out of fifteen (laboratory D, soil 2), and the values of repeatability were less than 20% of the average soil P concentrations at all three P levels. Only three of the 60 soil P results were identified as outliers among the four replicates (P<0.05), and their deviations from the replicate average were relatively small. According to Cochran's test (P<0.05), no laboratory could be considered less precise than the others.
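The two within-laboratory checks mentioned above, the relative standard deviation of replicates and Cochran's maximum variance statistic, are simple to compute. The sketch below uses hypothetical function names and invented replicate values. It returns only the test statistic C (largest variance divided by the sum of variances); the critical value for a given number of laboratories and replicates has to be taken from published tables and is deliberately not hard-coded here.

```python
import statistics


def relative_sd(values):
    """Relative standard deviation of replicates, in percent of their mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def cochran_c(groups):
    """Cochran's C: largest within-group variance over the sum of variances.

    groups: list of replicate lists, one per laboratory (equal sizes assumed).
    A laboratory is suspected of lower precision when C exceeds the tabulated
    critical value for the given number of groups and replicates.
    """
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / sum(variances)


if __name__ == "__main__":
    # Invented replicate results (mg dm^-3) for one soil in five laboratories.
    labs = [[11.0, 12.0, 11.5, 11.5], [12.0, 11.5, 12.5, 12.0],
            [9.0, 9.5, 9.0, 9.5], [8.0, 9.5, 8.0, 8.5],
            [10.0, 10.5, 11.0, 10.5]]
    print([round(relative_sd(g), 1) for g in labs])
    print(round(cochran_c(labs), 3))
```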
The values of reproducibility were markedly higher than those of repeatability: 56, 42, and 61% of the mean soil P concentration for soils 1, 2, and 3, respectively (Table 2). The poor reproducibility reflects the large differences in the detected P levels between the laboratories. For soil 3, the results from laboratories C and D were as much as 30% lower than the results from the other laboratories. For every soil, the results could be divided into two groups of statistically differing P levels (Table 2). The grouping was not the same for every soil, but for all of them laboratory D measured low P results and laboratories A and B high results. According to the analysis of variance of the P results, the interaction between soil and laboratory was highly significant (P<0.01), which implies that the possible bias of the laboratories is not systematically structured throughout the P range.
Only laboratory B reported acceptable soil texture results for all samples as compared with the results obtained by a quantitative method (laboratory E) (Table 3). The other three commercial laboratories, A, C, and D, had gross errors in texture determination, with the exception of soil 3. In most cases the clay soils 1 and 2, containing 48 and 44% clay, respectively, were mistaken for coarser soils. Each laboratory was nevertheless surprisingly consistent across its replicates in texture determination: only laboratory D reported differing textural classes for the four replicates of one soil. In soil OM determination, the laboratories were usually within one category (see Viljavuuspalvelu 1998) of the values determined by laboratory E. Laboratory D reported consistently lower OM contents than laboratory E.
Fertilizer recommendations based on single soil test results had some variability because of the inaccurate results of texture and OM content (Table 4). When the original texture and OM results of the commercial laboratories were replaced by the more accurate values of laboratory E, the fertilizer recommendations obtained were 18 kg ha⁻¹ for soil 1 and 13 kg ha⁻¹ for soil 2, with one exception: the lowest P result for soil 2 by laboratory D led to a recommendation of 18 kg ha⁻¹. The fertilizer rate calculated for all individual samples of soil 3 was zero, irrespective of the texture and OM results used. All errors in P recommendations were positive.

Soil sampling simulation
Separate sampling of areas differing in soil texture gave the most accurate soil P test results (Table 5, Fig. 2). Simulated simple random sampling and simulated systematic sampling covering the whole test field gave similar results in both P level and variability. The average soil P test result of an inhomogeneous field does not, however, accurately describe any part of the field. Simulated stratified random sampling and single random sampling representing judgment sampling of the three areas also gave similar results, but the variability of judgment sampling was considerably higher in the west side of the field, where the soil P test values varied most. In the fairly homogeneous east side of the field both stratified random and judgment sampling produced a satisfactory result. The results obtained with simulated stratified random sampling agree well with the results of the geostatistical analysis of the test field. The fertilizer recommendation for the field, calculated using the results of all simulated simple random and systematic samplings, was 18 kg ha⁻¹. The P fertilizer recommendation for the clay patch and for the east side of the field was also 18 kg ha⁻¹, calculated using all single samples of simulated stratified random and judgment samplings. In stratified random sampling, the calculated rate for the fine sand in the west side was 18 kg ha⁻¹ for five samples and 13 kg ha⁻¹ for the other five samples. In simulated judgment sampling, the calculated rate for the same area was 13 kg ha⁻¹ for three samples, 18 kg ha⁻¹ for five samples, and 28 kg ha⁻¹ for two.

Table 3. Soil texture and OM content class (%) reported for the four replicates of each soil. Columns follow the extracted order, taken here to be laboratories A, B, C, and D; the reference result of laboratory E (texture by the pipette method, measured OM %) is given once per soil.

Soil  Rep.  A                  B                        C                        D                       E (reference)
1     1     loam, 3-5.9        (muddy) clay, 6-11.9     fine sand, 3-5.9         silt, <3                sandy clay, 4.7
1     2     loam, 3-5.9        (muddy) clay, 6-11.9     fine sand, 3-5.9         silt, 3-5.9
1     3     loam, 3-5.9        (muddy) clay, 3-5.9      fine sand, 3-5.9         silt, <3
1     4     loam, 3-5.9        (muddy) clay, 3-5.9      fine sand, 3-5.9         silt, <3
2     1     loam, 3-5.9        (muddy) clay, 6-11.9     fine sand, 3-5.9         silt, 3-5.9             clay loam, 7.4
2     2     loam, 3-5.9        (muddy) clay, 3-5.9      fine sand, 6-11.9        till, <3
2     3     loam, 3-5.9        (muddy) clay, 3-5.9      fine sand, 6-11.9        silt, 3-5.9
2     4     loam, 3-5.9        (muddy) clay, 3-5.9      fine sand, 3-5.9         till, 3-5.9
3     1     fine sand, 6-11.9  very fine sand, 6-11.9   very fine sand, 6-11.9   fine sand, 3-5.9        fine sand, 6.8
3     2     fine sand, 3-5.9   very fine sand, 6-11.9   very fine sand, 3-5.9    very fine sand, 3-5.9
3     3     fine sand, 3-5.9   very fine sand, 6-11.9   very fine sand, 6-11.9   very fine sand, 3-5.9
3     4     fine sand, 3-5.9   very fine sand, 6-11.9   very fine sand, 6-11.9   very fine sand, 3-5.9

Discussion
The calculated values of reproducibility of soil P analysis roughly correspond to the ranges of the soil P classification system at the respective P levels. This indicates that the maximum error in classification based on the test result is not likely to be more than one soil P class. The precision of soil P analysis is at best, as defined by the values of repeatability, of the same magnitude as the fine adjustments made to the soil P classes according to the OM content of soil. It must be stressed, however, that the concepts of repeatability and reproducibility are based on the assumption of normally distributed random error in the respective conditions, and do not entirely exclude larger deviations between replicates due to occasional gross errors in the laboratory. In this study there were three P test results that differed from the other replicates (Table 2). They were included in the calculations because of the small sample size of the experiment, and because Cochran's test failed to indicate an unusually large variance for these replicates. In practice, soil test results are reported as single numbers and no reliable estimate of error can be derived from them. Lakanen (1960) has introduced formulas for estimating the average difference of duplicate soil P determinations made in one laboratory. They yield an average difference of 1.0, 1.2, and 4.0 mg dm⁻³ for soils 1, 2, and 3, respectively, roughly half of the values of repeatability in this study. This suggests variability of the same magnitude. On the other hand, Sippola and Tares (1978) have reported a relative standard deviation of 15.2% for soil P analysis, which is considerably higher than the average relative standard deviation of 5.2% in this study. The obvious correlation of repeatability and reproducibility with P concentration suggests that the precision of the method is concentration-dependent, but the nature of this dependency could not be quantified in this study because of the low number of observations. Lower precision at high P concentration is not a serious defect in practice because the ranges of soil P classes also get wider at the high end of the scale.
Laboratory D seems to have a constant negative bias in P results, although conclusive assessment of the accuracy of the laboratories would require a larger study. The Finnish soil testing method is not standardized, and part of the differences in P levels detected between laboratories may be due to minor local adjustments of the method. However, interpretation of soil test results is meaningful only if the original method described by Vuorinen and Mäkitie (1955) is used, and all laboratories doing commercial soil testing are expected to use exactly the same method. It is comforting to notice that, in spite of the variability, the very high P concentration of soil 3 was detected in all samples, since soils with excessive P pose the highest risk for the environment (Yli-Halla et al. 1995).
A phosphorus test result alone is of little value in P input planning because the fertilizer recommendations are also based on soil texture and OM content. The existing soil P classification system and P fertilizer recommendations are based on long-term field experiments carried out by the Agricultural Research Centre of Finland (Saarela et al. 1995). The texture and OM content of the soils in these experiments have been analyzed thoroughly to define the P requirements of crops on different soils. Commercial soil testing does not reach the same level of accuracy in determining soil type, and the detailed classification system for P input management cannot be utilized in practice. In this study only one of the 48 analyses in the commercial laboratories gave the same texture and OM results as the reference laboratory E. On mineral soils an error in soil texture and OM determination may correspond to an error as large as 15 mg dm⁻³ in soil P analysis (Table 1).
Calculation of fertilizer recommendations according to the soil test results of the precision experiment leveled off some of the differences between laboratories, but the remaining difference of 10 kg ha⁻¹ in the results for soil 1 is unacceptable. The Agri-Environmental Program requires soil testing at least every seven years when growing small grains (Maa-ja metsätalousministeriö 1998): a positive error of 10 kg ha⁻¹ in the annual P recommendation could thus lead, over seven years, to a total P input 70 kg ha⁻¹ higher than considered appropriate by the existing recommendations. It has to be acknowledged, however, that the variability of P status within a field, lack of information on this variability, and the fertilization technology in use would probably prevent the requirements of the Agri-Environmental Program from being fulfilled comprehensively even with accurate soil test data. Special attention should be paid to the fact that practically all variability in the calculated fertilizer recommendations was caused by inaccurate soil texture determination.
The recommended sampling density for soil testing given by Viljavuuspalvelu (1998) is one composite sample per 1-2 ha. Any sample from the test field used in the sampling simulation fulfills this requirement, but no information on the variability of soil P concentration is obtained. Sampling the three areas of differing texture separately by judgment samples revealed the variability of P concentration in the field, and the use of composite samples in simulated stratified random sampling further improved the precision and accuracy of sampling. The precision of this strategy was very high, taking into consideration that the variability of the laboratory analyses is included in the values of repeatability (Table 5). Considerable variation remained in the P results only in the west side of the field where the soil P gradients were steepest. No indication of this variability is observed during the sampling, and it is difficult with reasonable costs to improve sampling beyond stratified random sampling. Jokinen (1983) has suggested 10-16 samples per hectare to be an adequate sampling density for determination of the plant nutrients P, K, and Mg. She used entire fields (3.6-17.2 ha) for delineation of areas to be sampled and the central limit theorem in order to determine the average value (±10%) of the respective nutrient for the whole area. Soil test results from a single field commonly have a wide range and the average result may represent only a small fraction of the field. In the sampling simulation of this study, delineation of areas within a field according to texture improved the precision and accuracy of sampling. It also reduces the sampling density required for a given accuracy by reducing the variability within the sampled area, but the material of this study is insufficient for any sampling density recommendations.
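The link between within-area variability and the number of samples needed can be illustrated with the common normal-approximation formula n = (z · CV / E)², where CV is the coefficient of variation of the nutrient in the sampled area and E the accepted relative error of the mean. This is a generic sketch and not a reconstruction of Jokinen's calculation; the function name, the default z = 1.96 (95% confidence) and the CV values in the usage example are assumptions.

```python
import math


def samples_needed(cv_percent, rel_error_percent=10.0, z=1.96):
    """Number of samples so that the mean lies within +/- rel_error_percent.

    cv_percent:        coefficient of variation within the sampled area (%).
    rel_error_percent: accepted relative error of the mean (%).
    z:                 normal quantile (ASSUMED 1.96 for 95 % confidence).
    """
    if cv_percent < 0 or rel_error_percent <= 0:
        raise ValueError("cv must be >= 0 and the accepted error > 0")
    return max(1, math.ceil((z * cv_percent / rel_error_percent) ** 2))


if __name__ == "__main__":
    # Invented CVs: a whole heterogeneous field versus one texture stratum.
    for label, cv in [("whole field", 30.0), ("single stratum", 10.0)]:
        print(label, samples_needed(cv))
```

Because the required number grows with the square of the CV, delineating more homogeneous strata lowers the number of samples needed per area for the same accuracy, which is the effect described in the paragraph above.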

Conclusions
The maximum probable error in soil P classification induced by the variability of P analysis is one P class, if no gross errors are present. Errors in soil texture and OM determinations are common, and they are likely to lead to errors in soil P classification and in P input planning. In particular, the accuracy of texture determination in commercial soil testing requires improvement. If more accurate soil test data do not become available, the highly detailed P classification system should be simplified. Furthermore, the variability in P levels between laboratories should be reduced to ensure consistent results for customers of different laboratories. This is also important for the success of the Agri-Environmental Program.
The farmer can improve the quality of his soil test results by dividing the field to be sampled into sampling strata homogeneous in soil texture and cultivation history. Soil testing in these areas should be done by composite samples and the soil test results of each area should be monitored over a long time-span to detect possible gross errors in the results.
Despite its limited precision, commercial soil P testing serves well the function of detecting high and very high P levels in soil. From an environmental standpoint this can be considered more important than pinpointing the fertilizer rates to be used.