Settling the Settled Data Issue
Bill Inmon concludes a recent article, "Settled Data is Best for the Data Warehouse," with this statement:
The most accepted school of thought should be, that when transactions occur, that those transactions should not be rushed to the data warehouse but should be allowed to "settle" in the operational environment.
I think this conclusion says more about Bill Inmon's view of a data warehouse than it offers in practical advice. Data volatility is an issue that needs to be tackled head-on, not ignored until the data is settled.
Mr. Inmon suggests that pulling data into a data warehouse in real time leads to messy complications that could be avoided if we waited anywhere from 24 hours to a month or more before transforming each transaction. For some strategic applications this might be reasonable, but for more and more applications, data loses value as time goes by.
Mr. Inmon's example is an incorrect phone bill, where a call to India was identified by the customer weeks after the bill was issued. By waiting a month or more, these issues are supposed to settle. Even if this were a legitimate approach to handling data volatility, how would you decide when the data was settled? Perhaps the customer had already paid the bill, then found the error. Perhaps the customer didn't find the error for three months. If the correction came after the settlement period, wouldn't we still have a mess?
Suppose we can wait for transaction information to settle before analyzing long-term payment patterns. It would be reckless to use the same settlement rules to manage call volumes. Having different settlement times for different applications of the same call transaction adds another level of complexity.
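To make that added complexity concrete, here is a minimal sketch in Python of what per-application settlement windows might look like. Everything in it is an assumption made for illustration: the application names, the window lengths, and the CallTransaction type are hypothetical and do not come from Mr. Inmon's article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical settlement windows per application (illustrative values only).
SETTLEMENT_WINDOWS = {
    "call_volume": timedelta(0),             # needed in near real time
    "payment_patterns": timedelta(days=30),  # can wait out a billing cycle
}


@dataclass(frozen=True)
class CallTransaction:
    call_id: str
    occurred_at: datetime


def is_settled(txn, application, now):
    """True once the transaction is at least as old as the application's window."""
    return now - txn.occurred_at >= SETTLEMENT_WINDOWS[application]


def visible_to(txns, application, now):
    """Select the transactions a given application would treat as settled."""
    return [t for t in txns if is_settled(t, application, now)]


now = datetime(2024, 3, 1)
calls = [
    CallTransaction("a", datetime(2024, 1, 15)),
    CallTransaction("b", datetime(2024, 2, 25)),
]

# The same two calls produce different views depending on who is asking.
volume_view = visible_to(calls, "call_volume", now)
payment_view = visible_to(calls, "payment_patterns", now)
```

Here the call-volume view sees both calls while the payment-pattern view sees only the older one, so two reports built from the same transactions can disagree. Note also that the sketch says nothing about a correction that arrives after the window has closed, which is exactly the mess described above.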
We need to stand back and assess the data volatility in the context of our business requirements. If our business needs require near real-time analysis, then we must either accept added volatility or added complexity. If our business needs require an absolutely static view of the data, we need to accept that these are historic snapshots and may contain errors that have since been corrected elsewhere. In either case, the business requirements must come first.
