Not so long ago, Gartner defined the Logical Data Warehouse (Mark Beyer claims paternity, but I won’t try to rebuild the Data Warehouse family tree: too many fathers and version numbers out there…).
To put it simply, the premise of the Logical Data Warehouse is that not all data needs to be physically moved into the Data Warehouse. When appropriate, it can stay in place in its “owner” application or database, as long as a logical layer provides transparent access to it (through a Data Services approach, or virtual data federation/EII like our friends at Composite Cisco do). I know I am oversimplifying here, and I am probably in trouble with Mark, but if you were looking for real expert advice you would be reading his research anyway, not my ramblings!
The same rule of not moving everything for the sake of moving everything applies to your big data projects. Too many organizations looking for ways to break down data silos bring all the data together in one central place and, sure enough, Hadoop is an excellent storage resource for large amounts of data. Hadoop distro vendors will absolutely love it when your Hadoop cluster grows and they can sell you more maintenance and support (and for the record, Teradata and Oracle are the same, just with a price tag 100 times higher and fully proprietary stuff).
You need to think “data distribution” beyond Hadoop. It’s not always necessary to duplicate and replicate everything. Some data is already readily available in the enterprise data warehouse, with fast, random access through highly optimized schemas and indexing. You may need to bring in a subset of this data to perform lookups or joins, but it does not need to reside permanently in Hadoop. Other data sets might be better off staying where they are produced.
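To make the "bring in a subset only when needed" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: an in-memory SQLite database stands in for the enterprise data warehouse, a plain list of dictionaries stands in for the records sitting in your big data platform, and the `customers` table and the function names are hypothetical. The point is only the pattern: fetch the handful of reference rows the job actually needs, join them in memory, and keep nothing afterwards.

```python
import sqlite3


def build_demo_warehouse():
    """Create an in-memory SQLite database standing in for the data warehouse."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, segment TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, "Acme", "enterprise"), (2, "Globex", "mid-market"), (3, "Initech", "smb")],
    )
    conn.commit()
    return conn


def fetch_lookup_subset(conn, customer_ids):
    """Pull only the warehouse rows whose ids appear in the current batch."""
    ids = sorted(set(customer_ids))
    if not ids:
        return {}
    # One "?" placeholder per id keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, name, segment FROM customers WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    return {row[0]: {"name": row[1], "segment": row[2]} for row in rows}


def enrich_events(conn, events):
    """Join events with the on-demand lookup; the subset is never persisted."""
    lookup = fetch_lookup_subset(conn, (e["customer_id"] for e in events))
    enriched = []
    for event in events:
        customer = lookup.get(event["customer_id"], {})
        # Unknown customers get a segment of None rather than failing the job.
        enriched.append({**event, "segment": customer.get("segment")})
    return enriched


warehouse = build_demo_warehouse()
events = [
    {"customer_id": 1, "clicks": 12},
    {"customer_id": 3, "clicks": 4},
    {"customer_id": 9, "clicks": 1},  # not present in the warehouse
]
enriched = enrich_events(warehouse, events)
```

In a real deployment the lookup would go through whatever connector or federation layer you already have, and the subset might be cached for the duration of a job. The design choice worth noticing is that the customer table itself never gets copied wholesale into the cluster.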
As with every data management project, proper architecture design is essential. And so is having a proper integration infrastructure. In the Logical Data Warehouse model, seamless access to distributed data is key. The same is true for big data. Only with this infrastructure will you be able to build this Logical Big Data Warehouse.