Four Criteria for Collecting Data
By Andy Oram, Editor, O’Reilly Media
Everyone collects data nowadays, and the uses to which it can be put are mind-boggling. For instance, by recording the positions of cell phones in a city, researchers have been able to predict traffic jams. Other data of unexpected value is discussed regularly on O’Reilly Media’s Internet of Things website.
But what to collect? Certainly, you need to ask your staff what data they would find useful. But four general criteria can help your programmers and engineers decide what data will help them do their jobs: quality, integrity, granularity, and timeliness.
All manufacturers care about quality, but it is normally difficult to define and measure. One popular movement from the 1980s defined quality as "conformance to specifications," which had drawbacks of its own. But with data, quality should be easy to pin down. Compare the output of sensors or other data collection techniques with a sample set of data collected by hand, which you know to be correct. If the automated data collection falls close enough to the true data for your needs, your data has quality.
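The comparison described above can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of mean absolute error as the measure of closeness, the tolerance, and the temperature readings are all assumptions made for the example, not part of any particular product or standard.

```python
from statistics import mean


def mean_absolute_error(automated, reference):
    """Average absolute gap between automated readings and hand-collected ones."""
    if len(automated) != len(reference) or not reference:
        raise ValueError("need two non-empty series of equal length")
    return mean(abs(a - r) for a, r in zip(automated, reference))


def has_quality(automated, reference, tolerance):
    """True if the automated data falls close enough to the reference data."""
    return mean_absolute_error(automated, reference) <= tolerance


# Hypothetical temperature readings in degrees Celsius.
sensor_readings = [20.5, 21.0, 22.5, 24.0]   # collected automatically
hand_measured = [20.0, 21.0, 22.0, 24.0]     # collected by hand, trusted

error = mean_absolute_error(sensor_readings, hand_measured)
print(f"mean absolute error: {error} degrees")
print("good enough:", has_quality(sensor_readings, hand_measured, tolerance=0.5))
```

What counts as "close enough" is a business decision, which is why the tolerance is a parameter rather than a fixed number.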
All sorts of hurdles get in the way of quality. For instance, if you don't have enough sensors to go around, the statistics collected in one room may not reflect what's going on in another room, and the statistics collected near the window may not reflect what's going on in dark corners.
Real life rarely hands you the exact information you want on a silver platter. Instead, you need proxy data, which stands in for the "real" data. As a simple example, the number of defective products returned by customers can stand in as a measure of the number of defective products manufactured. But it's not a perfect measure, because some customers live with their defective products, while others might damage a product in transit and then return it.
Finally, you have to take into account that sensors routinely fail. If your sensors work only within a narrow temperature range, and you need to determine whether your equipment is getting too hot or too cold, you had better not rely on the sensors--unless you respond to every routine failure as an urgent event.
Integrity, in this context, has nothing to do with ethical behavior. It is a term used in the data security field to indicate that no one has tampered with the data. You seal a paper letter in an envelope so no one can scribble on it in transit; that's a way of preserving integrity in real life. For digital data, cryptography generally does the job, typically through a message authentication code or digital signature used alongside encryption.
Furthermore, before sealing a letter you affix a signature, so that a letter claiming to be from Professor Wen Li actually was written by Professor Li. This is an element of integrity called provenance. Some transactions go further and require Professor Li to provide credentials proving her expertise. This is equivalent to a doctor or lawyer hanging her license on the wall of her office.
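As a rough digital counterpart to the sealed and signed letter, a sender can attach a keyed hash (an HMAC) to each message, and the receiver can recompute it to confirm that the message was not altered and came from someone holding the key. The sketch below uses Python's standard hmac and hashlib modules. The key, the message format, and the function names are hypothetical, and a real deployment would also need a safe way to distribute and store the key.

```python
import hashlib
import hmac


def sign(message: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag that travels with the message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_untampered(message: bytes, tag: str, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign(message, key)
    return hmac.compare_digest(expected, tag)


# Hypothetical shared secret and sensor message, for illustration only.
shared_key = b"example-key-not-for-production"
reading = b"sensor=7;temp_c=21.5"
tag = sign(reading, shared_key)

print(is_untampered(reading, tag, shared_key))                  # original message
print(is_untampered(b"sensor=7;temp_c=99.9", tag, shared_key))  # altered message
```

A shared-key tag like this covers tampering and a basic form of provenance. Proving authorship to a third party, as a handwritten signature does, calls for a public-key digital signature instead.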
On the Internet of Things, where a bent wire can lead to wildly misleading data, having some way of assuring integrity contributes to your trust in the data. Often, statistical analytics can flag suspicious input.
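One simple form such a statistical check might take is to flag readings that sit far from the median of a batch. The sketch below uses the modified z-score based on the median absolute deviation. The technique, the conventional 3.5 cutoff, the function name, and the sample readings are choices made for this illustration; other methods may suit your data better.

```python
from statistics import median


def flag_suspicious(readings, threshold=3.5):
    """Return indices of readings far from the median (modified z-score)."""
    center = median(readings)
    mad = median(abs(x - center) for x in readings)
    if mad == 0:
        # More than half the readings are identical, so this method has
        # no spread to judge against; a real system would need a fallback.
        return []
    return [
        i
        for i, x in enumerate(readings)
        if 0.6745 * abs(x - center) / mad > threshold
    ]


# Hypothetical temperature readings; the fifth might come from a bent wire.
readings = [21.0, 21.5, 20.5, 21.0, 85.0, 21.5, 20.5]
print(flag_suspicious(readings))
```

A flagged reading is only a candidate for inspection. The equipment may really be overheating, so the flag should prompt a check rather than silent deletion.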
Granularity refers to how densely data samples are collected. When you shop for an Oriental rug, you will prefer one that has 16 knots per square inch to one with only 12. The density leads to richer colors as well as better durability. The same goes for printing: a glossy magazine prints more dots per square inch than a daily newspaper, so both the text and the pictures look crisper.
In a factory setting, collecting data once an hour is probably too coarse a granularity to alert you to oncoming problems. Better to collect once a second, or even more often. The technology should not be hard: after all, a consumer fitness device you may be wearing on your wrist can measure your heart rate and blood pressure through similar fine-grained sensors.
But fine-grained data collection has a drawback: by increasing the amount of data collected, it increases the demands on the devices, depletes the battery, increases your storage needs and bandwidth needs for data transfer, and even makes it harder to deliver information quickly--the theme of our next criterion.
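A quick back-of-the-envelope calculation shows how fast the volume grows. The figures below (100 sensors, 16 bytes per sample) are invented for illustration, and the function name is hypothetical; substitute your own numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(sensors, interval_seconds, bytes_per_sample):
    """Raw data volume produced in one day, before compression or overhead."""
    samples_per_sensor = SECONDS_PER_DAY // interval_seconds
    return sensors * samples_per_sensor * bytes_per_sample


# Assumed setup: 100 sensors, 16 bytes per sample.
hourly = bytes_per_day(sensors=100, interval_seconds=3600, bytes_per_sample=16)
per_second = bytes_per_day(sensors=100, interval_seconds=1, bytes_per_sample=16)

print(f"once an hour:   {hourly:,} bytes per day")
print(f"once a second:  {per_second:,} bytes per day")
print(f"growth factor:  {per_second // hourly}x")
```

Under these assumptions, moving from hourly to per-second collection multiplies the raw volume by 3,600, from roughly 38 kilobytes to roughly 138 megabytes a day.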
Timeliness matters most for real-time data, which is defined as data that is of little or no use unless it comes at the right time. Your engineers must take into account the time required to analyze data to determine whether results will arrive in time to act on them.
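The check amounts to a simple time budget: add up how long it takes to collect, transfer, and analyze the data, and compare the total with how soon someone has to act. The sketch below assumes those three stages and uses made-up timings in seconds; the function name and the five-second deadline are hypothetical.

```python
def arrives_in_time(collect_s, transfer_s, analyze_s, deadline_s):
    """True if the result is ready before the deadline for acting on it."""
    return collect_s + transfer_s + analyze_s <= deadline_s


# Assumed scenario: an overheating warning must be acted on within 5 seconds.
fast_analysis = arrives_in_time(collect_s=1, transfer_s=0.5, analyze_s=2, deadline_s=5)
slow_analysis = arrives_in_time(collect_s=1, transfer_s=0.5, analyze_s=6, deadline_s=5)

print("fast analysis in time:", fast_analysis)
print("slow analysis in time:", slow_analysis)
```

If the budget is blown, the options are to simplify the analysis, move it closer to the sensor, or collect less data per cycle.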
Many data sets are valuable, often in ways that were unanticipated when they were collected. But the four general criteria in this article always need to be considered when you decide what to collect and how to use it.