100 Days of Data Engineering Day 10A warehousing
Chapter 1: Introduction to Data Warehousing
Dimensional Modeling Primer**
- Definition: A database modeling technique that prioritizes data analysis performance and ease of use. It structures data around “fact” tables and “dimension” tables.
- Fact Tables: Contain the primary numerical measurements or metrics of a business process (e.g., sales revenue, units sold, inventory levels).
- Dimension Tables: Provide descriptive context to the facts in the fact table. Examples include date/time, product details, customer information, and geographic locations.
- Key Benefits:
- Fast query performance: Optimized for aggregating and filtering data, enabling quicker analysis.
- Intuitive for business users: The star schema design aligns with how business users think about their data.
- Adaptable to changing requirements: Adding new dimensions or changing hierarchies is relatively straightforward.
Different Information Worlds
- Operational Systems: Designed to support day-to-day transactions and business operations (e.g., order processing, CRM systems, manufacturing software). They are optimized for fast updates and inserts of individual records.
- Data Warehouse: Designed for historical analysis, trend identification, and strategic decision-making. Data warehouses prioritize efficient retrieval of large quantities of data.
Goals of a Data Warehouse
- Centralized Data: Provides a single, trusted source of information for all departments within an organization, eliminating inconsistencies.
- Informed Decisions: Empowers business users to analyze historical data, identify trends, and make data-driven decisions that improve overall performance.
- Improved Consistency: Enforces data standardization, ensuring that reports and analyses are consistent across the enterprise, regardless of the specific tool used.
- Adapting to Change: Dimensional modeling, with its emphasis on flexible structures, can more easily accommodate new business requirements or changing data sources.
The Publishing Metaphor
- Source Systems: The raw, unedited information sources, similar to the manuscripts and notes generated by authors.
- Data Staging: The process of cleansing, transforming, and conforming data, ensuring it’s ready for ‘publication’ in the warehouse. Think of it as the editing and proofreading stages.
- Data Presentation: The final, structured format of the data within the warehouse. This is the published material, available in a variety of forms for diverse audiences.
- Data Access Tools: The mechanisms users employ to consume the information. These could be reports, dashboards, visualization tools, or direct query interfaces, much like libraries and bookstores provide access to the finished publications.
4 Components of a Data Warehouse
- Operational Source Systems: The original data sources, such as transactional databases, ERP systems, log files, or external data feeds.
- Data Staging Area: A temporary storage area. Here, data is:
- Cleansed: Removing errors and inconsistencies.
- Transformed: Restructured to align with the warehouse’s data model.
- Loaded: Prepared for transfer into the main data warehouse.
- Data Presentation Area: The core of the warehouse. Data is organized according to a dimensional model, with fact and dimension tables.
- Data Access Tools: Provide user-friendly ways to query, analyze, and visualize the data in the warehouse. This includes reporting tools, dashboards, OLAP (Online Analytical Processing) tools, and ad-hoc query interfaces.
Additional Considerations
- Metadata: Critical for understanding and managing the warehouse. Metadata describes the structure, meaning, lineage, and quality of the data.
- Security: Data warehouses often house sensitive or confidential business information. Strong access controls, encryption, and auditing are necessary.
- Scalability: Warehouses should be designed to accommodate future growth in data volume, number of users, and the complexity of analytical needs.
Dimensional Modeling Vocabulary
- Fact Table (Expanded)
- Holds the numerical measures (facts) like sales figures, inventory counts, or website clicks.
- Typically contains foreign keys that link to corresponding rows in dimension tables.
- Fact tables are often additive- numeric values can be summed or averaged across dimensions.
- Dimension Table (Expanded)
- Contains descriptive attributes that add context (who, what, where, when, etc.) for the facts.
- Includes a primary key that uniquely identifies each dimension record.
Bringing Together Facts and Dimensions
- Star Schema: The classic dimensional modeling design. A central fact table is surrounded by dimension tables, forming a star-like shape.
- Snowflake Schema: A variation where dimensions are further normalized. Some hierarchies are broken into separate, more detailed tables.
Dimensional Modeling Myths and Common Pitfalls to Avoid
- Myth: Data warehouses are just for reporting. Modern warehouses support advanced analytics, dashboards, and even self-service data exploration.
- Pitfall: Trying to model everything upfront. Start with a core set of business questions and build out iteratively.
- Pitfall: Overly complex models. Balance usability with detail.