The Difference Between a Data Warehouse and a Database
Posted On August 5, 2019
It is time to talk about the difference between a data warehouse and a database. Semantics aside, a database it typically referred as a collection of tables as is a data warehouse, but that is where the similarities should end. Most data warehouse engineers struggle to see the differences between the two leading to a less than optimal performant data warehouse.
What is a Data Warehouse?
A data warehouse is a system that pulls together data from many different sources within an organization for reporting and analysis. The reports created from complex queries within a data warehouse are used to make business decisions.
Think of a data warehouse as a consolidated view of data collected from various systems. The primary focus of a data warehouse is to provide a correlation between data in multiple systems with the focus of this collected data being report ready and performant. Find out more about our data warehouse consulting services.
What is a Database?
A database contains information organized in columns, rows, and tables that is typically indexed to make accessing relevant information more accessible. A database management system (DBMS) is a tool in which teams can create and manage databases. Organizations most often will use databases for online transaction processing (OLTP). In other words, databases need to provide easy access to information through fast querying and quick transactions.
Both data warehouses and databases tend to be relational data systems where tables relate to other using keys that relate one set of data to another. As an example think about a table that stores orders placed at an online store and another that stores the products in that store. There is a reference to the product in the orders table, but the record of truth for that product is stored in the products table.
Both data warehouses and databases are typically relational, but they are built to serve different purposes. A data warehouse is built to store large quantities of historical data with the intent to be retrieved for online analytical processing (OLAP). But, databases tend to store current transactions for ongoing business processes and are retrieved for the purpose of online transactional processes (OLTP).
Most databases are normalized data structures meaning that data is organized so that no redundant data exists and all related data is stored together. In the example above, we talk about all of the products being together in the same table as opposed to all of the product information being stored in a table meant to store orders. Normalizing data ensures that data takes up minimal disk space and response times to retrieve that data are maximized.
A data warehouse, while using relational tables, can be built for speed and may have redundant data in the same table because the report that is using the data will use the same data often. This means that a database may have to use more complex queries to get the same data since it has to join against many tables to get all of the needed data. But, a data warehouse could have one aggregated table having all of the report data in one place when it makes more sense.
Difference Between a Data Warehouse and a Database – Data Analysis
One of the biggest difference between a data warehouse and a database comes down to how data is analyzed. Databases are typically used for transactional purposes. But, that does not mean that the data is not able to be queried for analysis. Doing analysis on a database still can be done, but users must recognize that databases are typically not as performant as data warehouses.
Data warehouses are made to be performant for data analysis. For the most part, the breadth of all of the historical data is in the data warehouse. This means that there is typically more data ready for analysis at the warehouse level. The structure of a data warehouse lends itself to be able to be queried easier. Not as much knowledge is needed in order to extract data from a warehouse for the most part. This is because there are typically less relationships to manage when retrieving queries.
An OLTP database needs to support many more concurrent connections as a transactional data store. And, an OLAP system can typically support less concurrent connections. As a result, there are less simultaneous users retrieving data from a data warehouse in most use cases.
Database Use Cases
Speaking of use cases, let’s take a look at some typical database use cases. Database use cases are typically highly transactional systems being written to often. The content is typical of day-to-day activities.
- Online booking systems
- Bank account transactions
- Online ordering systems
Data Warehouse Use Cases
The focus of data warehouses are on providing high-level reporting and analysis capabilities for an organization.
- Data modeling
- Data mining
- Reporting tool to research business decisions
Data Warehouse Elements
There are some elements that are typical of data warehouses.
Also, in a fact and dimensional model, there are dimension tables. These tables contain categorizations of data that support the fact tables. An example would be a products table.
A fact table is a table consisting of metrics (facts) and their associated dimensions. Fact tables tend to be denormalized and could contain redundancies. Fact tables contain foreign keys to their dimensional tables. An example would be a reporting table that lists sales by product linking back to the product dimension table.
Difference Between a Data Warehouse and a Database – Columnar Storage
One last topic to discuss when diving into a data warehouse is the storage type being used. Databases are typically relational schemas stored in a row-based format. Data warehouses have an option to use columnar storage. Instead of storing records in rows, the system stores records in columns. Records can span across multiple rows in this format. The benefits to columnar storage is retrieving data can be made faster. The downside is that writing records in a little slower. But, a data warehouse is meant to be write once read many times making columnar storage more appealing.