An overview of data warehousing

In the good old days of “data processing”, database experts and programmers focused their attention on storing and retrieving data in the best way to do real-time transactions. Being able to store and retrieve data to record a sale, to record a bank deposit and so on. Relational databases became the best thing since sliced-bread. These are built through a complex process of entity relationship modelling and are unintelligible to all but the technically inspired.

As relational databases became established and systems became more stable and reliable, management of organisations running these huge and complex systems began to realise that the databases contained more than data – they contained information. “Data Processing” became “Information Technology”. The demand for information grew. IT departments became inundated with requests for management information reports of various types. They couldn’t cope. Delivery of these reports was always a week too late. The extraction process was typically very complex and slow requiring data from a wide variety of tables. To make matters worse, the extraction of information for urgent reports often impacted negatively on the performance of the entire system.

A few bright sparks in the industry began to realise that accessing information for management reports required something other than a relational database. Something much simpler and more accessible was needed – something that is able to reflect the business rather than the technology. The operational database was not the ideal place from which to compile executive reports.

The data warehouse was born. It would contain the kind of information sought by management, organised in a way that was simple to understand and to easy to access.

One of the simplest and most amenable data warehouse model is the “Star Schema”. “Facts” are kept in a Fact Table and “Dimensions” – the characteristics by which the facts are accessed – in a Dimension Table. For instance, Customer may be a fact file, while region, age, gender would be dimensions. The popular multi-dimensional cube works in a similar way.

Data for the warehouse is extracted from the operational database and undergoes a number of processes – extraction, cleaning, aggregating and loading. The extraction takes place at specific times – daily, weekly or monthly – and provides a “snapshot” of the organisation at a particular time and date.

Data for each Fact Table may be aggregated (or summarised)at a number of different levels. For example, the warehouse may store total sales, sales per region, sales by product, and sales by outlet. “Front-End” packages allow the user to “cut and slice” through the information, to “drill-down” where necessary to more detailed information that makes up the summary. Drilling-down is sometimes known as data-mining.

Something that relational databases handle with difficulty is historical data. The warehouse manages this much more effectively. Time is another dimension thus allowing for trend analysis or time-based change.

The design of the databases used for warehousing is typically one that allows for the extraction and summation of huge amounts of data at an incredible speed. Even when resource demands are high, the operational database is unaffected. A special language called SQL – Structured Query Language – is the preferred medium for extracting information. Most warehouse technology comes with a front-end that allows for information to be extracted without detailed knowledge of SQL or of the source relational database.

Business Intelligence (BI) is somewhat related to data warehousing. Data warehousing is generally designed to manage more routine management reporting, BI is more dynamic and complex – but often built on the same foundation.

In summary, data warehousing provides access to management and other decision makers in a fast and simple way. The information is a snapshot of the organisation’s data at a particular time and date. Information is readily available in summary form – known as aggregations – allowing for drilling-down to more detailed information. The data warehouse can provide historical data more easily that the operational system and may be used as the basis for a BI system.