Garbage in, garbage out
Between school and university I had a gap year (nine months, in fact, because I had to sit university entrance examinations in November). I spent most of that time in a low-temperature physics research laboratory, where my job was to write computer programs to help the research team.
The laboratory was served by a computer centre, and the only language available was Algol 60. (Hands up if you have ever heard of Algol! And a gold star if you know the differences between Algol 60, Algol 68 and Algol R!) I still think about programming in the structured way that I learnt in those very distant days. One of the questionable joys of Algol 60 was that the language specification did not define input and output routines, so these depended on the computer centre or the computer manufacturer. The staff at our computer centre had written their own input/output routines, and because there were numerical analysts in the team, they had implemented input in a way which honoured their calling and expertise. Real numbers could be read either by the command "READ" or by the command "EXACTREAD". The latter corresponds to every input routine that I have used since: it treated the input number as exact, so 10. meant 10.000000000000 and not somewhere between 9.5 and 10.5, and 10.0 meant the same. But if you used "READ", the internal representation of 10. was less precise than that of 10.0, which in turn was less precise than that of 10.000. I guess each was stored with a different number of mantissa bits. (The implementation of this is left as an exercise for the reader.)
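The details of the computer centre's implementation are long lost, but here is a minimal sketch in Python of the idea as I understood it. The rule of roughly log2(10), about 3.3, mantissa bits per significant figure typed is my assumption, not theirs, and the function names simply echo the Algol commands.

```python
import math

def exactread(text: str) -> float:
    # EXACTREAD: treat the typed number as exact, so "10.", "10.0"
    # and "10.000" all give the same full-precision value.
    return float(text)

def read(text: str) -> float:
    # READ (my reconstruction): keep only as many mantissa bits as
    # the significant figures typed can justify -- roughly 3.32 bits
    # per decimal digit.
    digits = len(text.replace("-", "").replace(".", "").lstrip("0")) or 1
    bits = math.ceil(digits * math.log2(10))
    x = float(text)
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e, 0.5 <= |m| < 1
    m = round(m * 2 ** bits) / 2 ** bits  # round mantissa to `bits` bits
    return math.ldexp(m, e)

print(read("3.1"), read("3.14159"))  # 3.09375 versus (nearly) 3.14159
print(exactread("3.1"))              # full double precision: 3.1
```

The more digits you typed, the more precision you were granted; type few digits and the degradation became visible in subsequent arithmetic.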
The reason for this was to help the scientists design ways of analysing their experimental data that were robust with respect to the accuracy of the calculations. If you made measurements to three significant figures on the laboratory instruments, then you could not expect more than three significant figures in the output of your programs. And if you were careless and did arithmetic which lost some of those significant figures, then the compiler would punish you, and you would see that you had lost accuracy.
I learnt this the hard way when I constructed some theoretical models to support the physics team. Why, I wondered, did my model give the same results when one key parameter was 6. as when it was 7.? Answer: I had input that parameter with "READ". In the course of the model's calculations I had lost accuracy, and the result was garbage.
It was an important lesson for me. Ever since then, I have been conscious of the dangers of squeezing too much accuracy out of an imprecise piece of data, and of the need to design computer programs which do not waste significant digits. I hope most O.R. people are aware of the errors that poor program design can introduce. In Excel 2003 (which I have on my computer), standard deviations are computed the "lazy way". Try it. Find the SD of 1, 0, -1 (you get the answer 1). Now find the SD of 10^15+1, 10^15 and 10^15-1 (I get 16777216).
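By the "lazy way" I mean, I assume, the one-pass textbook formula s^2 = (sum of x^2 - n * mean^2) / (n - 1), which subtracts two nearly equal large numbers and so suffers catastrophic cancellation; the two-pass formula centres the data first and does not. A sketch in Python of the contrast (my reconstruction of the behaviour, not Excel's actual source):

```python
import math

def sd_one_pass(xs):
    # "Lazy" textbook formula: s^2 = (sum(x^2) - n*mean^2) / (n - 1).
    # Subtracting two nearly equal huge numbers destroys the answer.
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return math.sqrt(max(ss - s * s / n, 0.0) / (n - 1))

def sd_two_pass(xs):
    # Two-pass formula: centre the data first, then sum the squares.
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

small = [1.0, 0.0, -1.0]
big = [1e15 + 1, 1e15, 1e15 - 1]
print(sd_one_pass(small), sd_two_pass(small))  # both print 1.0
print(sd_one_pass(big), sd_two_pass(big))      # one-pass is garbage; two-pass prints 1.0
```

Whatever Excel does internally, the lesson is the same: choose numerically stable formulas, or the significant figures quietly vanish.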
When does this matter? In my postgraduate LP course we were warned about the accuracy of inputs to LP models. It may be apocryphal, but the story goes that an oil company O.R. team wanted to know how accurate the measurement of viscosity was. Where did the figure come from? From a man who dipped his thumb and forefinger into the latest batch, rubbed them together, and pronounced a number for the viscosity. It meant that efforts to make the LP model more accurate were limited by the accuracy of that one input measurement.
If one of your research papers has crossed my desk for refereeing, I may have included comments about the spurious accuracy of your tables of results. Even the highest-ranked journals let through papers whose tables quote six or more significant figures, computed from parameters given to two. Even the best computer models cannot generate more accuracy than is present in the data. And anyway, why should anyone really want to know that your model took 131.4159 seconds to converge rather than 130 seconds?
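One defensive habit (my own practice, not any journal's rule) is to round every reported figure to the number of significant figures that the weakest input justifies. A small helper along these lines:

```python
from math import floor, log10

def round_sig(x: float, sig: int = 2) -> float:
    # Round x to `sig` significant figures, so results are reported
    # no more precisely than the inputs justify.
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))

# With parameters given to two significant figures, the honest
# report of that convergence time is:
print(round_sig(131.4159, 2))   # 130.0
print(round_sig(0.0031459, 2))  # 0.0031
```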
Elsewhere I have commented on the habit of recipe publishers of giving the number of calories per serving to three significant figures when the recipes include such variables as "one onion", "about 50 grammes of hard cheese" and so on. One of my colleagues used to lecture about this accuracy (or lack of it) under the title "How much does a kilogramme of bananas weigh?"
One can level the same criticism at the statisticians who, month after month, announce that the average price of houses sold in the UK in month X was £Y, where Y is given to six significant figures.
All of this came to mind when I was cornered and asked to complete a survey on tourism. Having once written a paper in which I took a sideswipe at a university report on tourism, based on a self-selecting group of respondents who were about 2% of the number attending the event, I was curious to know what would be asked. My sideswipe in that case was aimed at the amazing precision the university team claimed for the revenue the event generated for the tourist economy: slightly less than £1 million, stated to the nearest pound. (Six significant figures, on that sort of sample!) The survey I was taking part in was done on a tablet computer, which meant that inappropriate questions could be skipped. When it came to finance (how much have you spent on food, transport, accommodation, outings?) all the entries were in broad bands. Perhaps some clever statistician knows that when respondents say they spent between £101 and £250 on food, the average for such respondents is exactly £132. Perhaps not. And the interviewer knew the categories and guided me towards what she considered the likely answers. Anyway, it is difficult to say exactly how much you have spent on food in one day on holiday, let alone to recall the previous few days and estimate what you will spend for the rest of the holiday. So, once again, I will view the output with suspicion.
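For what it is worth, any single figure extracted from banded answers rests on a distributional assumption. The usual fallback, assuming respondents are spread uniformly within each band so that each band contributes its midpoint, is sketched below; the bands and counts are invented for illustration.

```python
# Estimate mean spend from banded survey answers, assuming
# (heroically) a uniform spread within each band, so each band
# contributes its midpoint. Bands and counts are invented.
bands = {(0, 100): 40, (101, 250): 35, (251, 500): 20, (501, 1000): 5}

respondents = sum(bands.values())
mean = sum((lo + hi) / 2 * n for (lo, hi), n in bands.items()) / respondents
print(f"midpoint of the £101-£250 band: £{(101 + 250) / 2:.2f}")  # £175.50
print(f"estimated mean spend: £{mean:.2f}")
```

Note that the midpoint of £101-£250 is £175.50, not £132; any more "exact" figure is an artefact of the assumption, not of the data.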
The sad thing is that numbers from all these suspect sources may end up in somebody's model; I just hope that the modeller knows what he or she is doing when they are included.
Otherwise, as the title says, Garbage In, Garbage Out!
When I taught statistics, I began one lecture in each course with a quote from one of your countrymen: "Stamp's Law" (http://en.m.wikipedia.org/wiki/Josiah_Stamp,_1st_Baron_Stamp, first quote).