Precision Agriculture is a data-driven discipline: data are collected to measure, describe, quantify, understand, or analyze agrosystems. A wide variety of measurement systems have been developed to capture agronomic parameters of interest, from plant vegetation status to crop yield, including weed detection and soil physico-chemical parameters. These increasingly sophisticated systems make it possible to acquire information on production systems at ever finer resolutions. Data resolution is often put forward as a criterion of quality or performance in the services offered, but what does working with ever higher resolution data imply? This post is quite short, but it raises a number of ideas worth keeping in mind.
Autocorrelation / Cross-correlation
Precision Agriculture data have correlation structures in space and/or time. This is a fairly intuitive phenomenon. For spatial correlation, two soil samples collected very close together in a plot are more likely to share common characteristics than two soil samples taken at opposite ends of the plot. For temporal correlation, in the same way, two vegetation images acquired by UAVs will be more similar the closer together in time they are taken. Correlation can also be spatio-temporal when the data are structured both in space and time. Some variables are also correlated with each other (this is called cross-correlation). Why do these forms of correlation raise questions?
- The fact that the data are correlated violates the assumption of independence of observations made by some statistical tests. For example, when trying to assess the strength of a linear relationship between two agronomic variables, one must assume that the samples are independent (which is no longer the case when correlation structures are present). If this phenomenon is not taken into account, the relationship between these variables will tend to be overestimated, because correlated samples carry less independent information than their number suggests. This risk is particularly high when working with very high resolution data, and especially when working with many variables at the same time (multi-factorial or multi-dimensional).
- When you are trying to set up a prediction model of an agronomic variable of interest, you often build a model on a training dataset and validate it on a validation dataset, which is supposed to be totally independent of the training dataset so that you can be sure the model is able to make predictions on new data (it would be useless to train and validate a model on the same dataset…). In the vast majority of cases, one acquires a large dataset and splits it into a training dataset and a validation dataset (more or less randomly). In the presence of autocorrelation/cross-correlation in the data, the training and validation sets must be selected much more carefully, at the risk of overestimating the predictive capacity of the model (if your validation data are correlated with your training data, you are more likely to find good results for your model, which is a bit biased…). This can be the case for yield prediction, for example, when the training and validation sets contain data collected under fairly similar climatic conditions, or in plots that are very close in space.
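The second point above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the field is synthetic (a smooth made-up signal on a 30 x 30 grid plus a little noise), the "model" is a deliberately naive nearest-neighbour predictor, and the function names (`simulate_field`, `random_split`, `blocked_split`, `nearest_neighbour_rmse`) are hypothetical helpers written for this example. It compares a random split with a split that holds out whole spatial blocks.

```python
import math
import random


def simulate_field(n_side=30, noise_sd=0.05, seed=0):
    """Toy plot: smooth spatial signal plus a little noise (made-up data)."""
    rng = random.Random(seed)
    return [
        (i, j, math.sin(i / 4.0) + math.cos(j / 4.0) + rng.gauss(0.0, noise_sd))
        for i in range(n_side)
        for j in range(n_side)
    ]


def random_split(points, val_fraction=0.2, seed=1):
    """Classic split: validation points are scattered among training points."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


def blocked_split(points, block_size=10, n_val_blocks=2, seed=1):
    """Spatial split: whole square blocks are held out for validation."""
    rng = random.Random(seed)

    def block_of(point):
        return (point[0] // block_size, point[1] // block_size)

    all_blocks = sorted({block_of(p) for p in points})
    val_blocks = set(rng.sample(all_blocks, n_val_blocks))
    train = [p for p in points if block_of(p) not in val_blocks]
    val = [p for p in points if block_of(p) in val_blocks]
    return train, val


def nearest_neighbour_rmse(train, val):
    """Predict each validation value with the value of the closest training point."""
    squared_errors = []
    for vi, vj, vz in val:
        nearest = min(train, key=lambda p: (p[0] - vi) ** 2 + (p[1] - vj) ** 2)
        squared_errors.append((vz - nearest[2]) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


field = simulate_field()
rmse_random = nearest_neighbour_rmse(*random_split(field))
rmse_blocked = nearest_neighbour_rmse(*blocked_split(field))
print(f"random split RMSE:  {rmse_random:.3f}")
print(f"blocked split RMSE: {rmse_blocked:.3f}")
```

With a random split, almost every validation point has a training point right next to it, so the error looks small. With the blocked split, the same predictor has to extrapolate across a gap and its error increases, which is probably closer to what would happen on a genuinely new plot. The block size is a free parameter here; in practice it would need to be chosen with the range of the spatial correlation in mind.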
Noise
High-resolution data in Precision Agriculture are noisy: this is the “salt and pepper” effect, or lack of color continuity (when the data are mapped with a color scale), that you may observe in your datasets. This noise may be due to the natural variability of the plants, the accuracy of the sensor, or the acquisition conditions. Why does this noise matter?
- Noise can interfere with the reading of a map, in the sense that it can be difficult to identify the major spatial and/or temporal trends in the data. To facilitate map reading and/or to propose operational variable-rate maps, some people try to delimit homogeneous zones of the variable of interest within the plots. This operation is not straightforward, firstly because the definition of a zone is not clear, and secondly because the presence of noise complicates the methods to be implemented. Others will seek to degrade the initial information onto interpolation grids of varying sizes (whatever the interpolation method: averaging the data within a grid cell remains a form of interpolation). The question then arises as to the size of the grid cells: how can the resolution be degraded without losing too much information, while remaining consistent with an operational application in the field?
- In addition to disturbing the reading of a map, noise can affect the correlations between agronomic variables. Comparing data collected precisely at specific locations with high-resolution, noisy data at the same locations can be misleading. Sometimes it is better to look for correlations at coarser scales (e.g. area scale rather than vine plant scale) so that the important trends remain visible.
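To make the two points above more concrete, here is a minimal sketch with synthetic data. Two hypothetical variables (`var_a` and `var_b`) share the same smooth spatial trend but carry independent noise. Their correlation is computed at the finest resolution, then again after averaging each variable over 5 x 5 cells. The grid size, noise level, and cell size are arbitrary choices made for the illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def block_average(grid, cell):
    """Average a square 2D list over non-overlapping cell x cell windows."""
    n = len(grid)
    means = []
    for i0 in range(0, n - cell + 1, cell):
        for j0 in range(0, n - cell + 1, cell):
            values = [grid[i][j]
                      for i in range(i0, i0 + cell)
                      for j in range(j0, j0 + cell)]
            means.append(sum(values) / len(values))
    return means


def flatten(grid):
    """Turn a 2D list into a flat list, row by row."""
    return [value for row in grid for value in row]


rng = random.Random(42)
n_side, cell = 40, 5

# Shared smooth trend (made up) plus independent noise for each variable.
trend = [[math.sin(i / 8.0) + math.cos(j / 8.0) for j in range(n_side)]
         for i in range(n_side)]
var_a = [[t + rng.gauss(0.0, 1.0) for t in row] for row in trend]
var_b = [[t + rng.gauss(0.0, 1.0) for t in row] for row in trend]

r_fine = pearson(flatten(var_a), flatten(var_b))
r_coarse = pearson(block_average(var_a, cell), block_average(var_b, cell))
print(f"correlation at point scale:      {r_fine:.2f}")
print(f"correlation on {cell}x{cell} cell averages: {r_coarse:.2f}")
```

Averaging reduces the independent noise much faster than it smooths the shared trend, so the relationship between the two variables is easier to see at the coarser scale. The trade-off is the one raised above: the coarser correlation describes cell-scale behaviour, not plant-scale behaviour, and cells that are too large would end up erasing the trend as well.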
Noise and autocorrelation
Agronomic models, whether complex or not (yield prediction models, water stress models, plant development models, etc.), are often built from data collected under experimental or laboratory conditions and acquired very carefully. These data are generally few in number: let’s be clear, by saying “few”, I am not implying that these data are not sufficient to build a model, I am simply saying that they are few in number compared to what can be acquired under operational conditions in the field. Acquiring very high-resolution data (noisy, auto-correlated, etc.) to refine these models raises questions, because the models have often not been developed with this kind of data. Can correlations established with very precise data still be relied upon with noisy and uncertain data? How will the autocorrelation of the data affect the model relationships? None of these questions has an obvious answer, and all of them deserve attention.
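One part of the first question can be explored with a toy example. The sketch below assumes a simple linear relationship with a slope of 2 (an arbitrary value), measured once with a precise predictor and once with the same predictor plus independent measurement noise. It illustrates the classical effect known as regression dilution (attenuation bias): random noise on the predictor pulls the fitted slope towards zero by a factor of var(x) / (var(x) + var(noise)). All values are synthetic, and autocorrelation is deliberately left out of this example.

```python
import random


def ols_slope(x, y):
    """Slope of the ordinary least squares line of y on x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov / var_x


rng = random.Random(7)
n = 5000
true_slope = 2.0  # arbitrary value chosen for the illustration

# "Laboratory" predictor: measured precisely (variance 1).
x_precise = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * x + rng.gauss(0.0, 0.2) for x in x_precise]

# "Field" predictor: same quantity, plus sensor noise of variance 1.
x_field = [x + rng.gauss(0.0, 1.0) for x in x_precise]

slope_precise = ols_slope(x_precise, y)
slope_field = ols_slope(x_field, y)
expected_attenuation = 1.0 / (1.0 + 1.0)  # var(x) / (var(x) + var(noise))

print(f"slope with precise predictor: {slope_precise:.2f}")
print(f"slope with noisy predictor:   {slope_field:.2f}")
print(f"expected with attenuation:    {true_slope * expected_attenuation:.2f}")
```

In this setting, a relationship calibrated on careful measurements would not be recovered from the noisy field data: the fitted slope is roughly halved. Real sensors rarely have noise this simple, so this should be read as an illustration of why the question matters rather than as a correction recipe.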
Reliability and data quality
Talking about reliability and data quality may seem surprising, since these characteristics tend to be taken for granted. In other words, we do not necessarily question a data point or a result when we receive it. However, as one develops more complex models and works with ever higher spatial and temporal resolutions, these issues of reliability and data quality become paramount! How can we imagine agronomic applications at the centimetre or metre scale if the data are noisy? Is the sensor used sensitive to the variations I am trying to detect? What compromise can be found between the quality of the data used and a working scale that remains operational?
A few words of conclusion
One of the main risks in working with all kinds of data, and with high-resolution data in particular, is to have the impression that we are going to be able to understand everything about the agrosystem under study. Care must be taken to account for the characteristics of the data collected before concluding that a correlation exists or not, not to confuse correlation with causation, not to look for correlations between anything and everything, to be aware of the data and/or factors that were not measured, and above all to bring agronomy and expertise back into the reasoning! Data and numerical tools must be at the service of agronomy and expertise; they must not, and above all cannot, replace them.
To conclude, I want to stress the importance of considering energy and environmental aspects in the development and use of digital solutions. Since digital technology as a whole is estimated to be responsible for 10% of world electricity consumption, or about 4% of global greenhouse gas emissions (with extremely rapid growth, estimated at 5 to 10% per year), digital tools must be used sparingly and responsibly. These energy and environmental aspects must be at the heart of any digital project in agriculture. Do I really need all this data? What spatial/temporal resolution of data do I really need? Why acquire such high spatial resolution information if it is going to be degraded so much afterwards? How can I measure my parameter of interest as simply as possible? Are my data acquired during a machine run that I have to perform anyway, or do I need additional runs?