Messy problems and messy data
There are times when those of us involved in operational research give the impression that the problems that we deal with are neat and tidy. This problem is a queueing problem, that one needs an integer programming model, the third one requires an off-the-shelf forecasting model. Or we can build a simulation model, or arrange a workshop using problem-structuring techniques. Life is not that simple. The better text-books used by students admit that their examples and exercises are simplifications of the real world, and some go on to explain the messiness of practical operational research. I have even seen a few illustrations which resemble dust storms to drive the lesson home. Yes, real world problems are generally messy.
The latest issue of ORMS Today, the journal of INFORMS (US OR Society) [cover date June 2013], has Anne Robinson, INFORMS president, writing about one source of messiness in our work - the data we use. She writes, "Data is ugly ... really, really ugly". (I will forgive her using data as a singular noun; pedants have lost that particular battle.) She goes on to set out five ways to measure the quality of data, using five "C"s. They are: Completeness, Correctness, Consistency, Currency, Collaborativeness.
(I have used nouns, where Anne used a mix of nouns and adjectives.) So she suggests that we spend time before model-building to run through some checks on the data, to see if the numbers and descriptors meet these standards. Sometimes such checks lead to the need to edit the data carefully, making assumptions that can be justified afterwards.
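To make that concrete, here is a minimal sketch of such pre-modelling checks in Python with pandas. The data-set, column names and thresholds are hypothetical examples of mine, not Anne's; a real project would tailor each check to its own data.

```python
import pandas as pd

# A small made-up data-set with deliberate blemishes in it.
df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "quantity": [10, -3, 5, None],
    "order_date": ["2013-05-01", "2013-06-15", "2013-06-15", "2013-06-40"],
    "region": ["North", "north", "South", "East"],
})

# Completeness: how many values are missing in each column?
print(df.isna().sum())

# Correctness: values outside a plausible range (a quantity should not be negative).
print(df[df["quantity"] < 0])

# Consistency: the same category spelled two ways, and duplicated identifiers.
print(df["region"].str.lower().value_counts())
print(df[df.duplicated("order_id", keep=False)])

# Currency: dates that will not parse (or that fall outside the expected period).
dates = pd.to_datetime(df["order_date"], errors="coerce")
print(df[dates.isna()])
```

None of these checks fixes anything by itself; they simply surface the questions that have to be answered, and the assumptions justified, before any model is built.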
Educators in OR, analytics and statistics need to do as those better text-books do: describe methods for analysing high-quality data, and then lead into what happens when the data fail in one respect or more. A few years ago, I was working with a student who had successfully completed a postgraduate module in multivariate statistical analysis. Her project involved 800,000 records, each of which had a dozen items of data. I asked her about the largest data-set she had met in that module; it comprised 200 records. The first stage of her project was given over to exploring the transition from that number of records to one 4000 times larger, learning how to check the data for Completeness, Correctness and Consistency. Currency and Collaborativeness were - for the most part - less significant. I hope that the lessons learned helped her when she joined a large international OR group, where she would be working with millions of records.
And I believe that one of the joys of operational research is the challenge of messy problems and messy data.
I've read various estimates putting data cleaning (or data acquisition, conversion and cleaning) at 60-70% of a "typical" analytics project. A fair bit of that can be attributed to Stamp's Law (http://en.wikipedia.org/wiki/Josiah_Stamp,_1st_Baron_Stamp#Quotes), something I liked to quote to my students.