Data Blueprint — Data Layers
In a modern data architecture, data is managed within data layers. These layers function like sequential stages of a production line: raw data is transformed into semi-finished data products and, ultimately, refined into consumer-facing data products. The data layer design establishes a systematic methodology for organizing data logically. Understanding which layer(s) a data product belongs to clarifies its purpose and helps determine how it will eventually interact with other data products.
The standard data layers are listed in the following sections, each characterized by specific objectives and constraints. These objectives and constraints are articulated through facets such as data organization, data history, data format, data loading patterns, data transformations, and security measures. The list of layers aims to cover all possible scenarios and is not intended to be adopted without modification: every layer in the design adds an extra hop of data movement, which increases processing cost and time.
You can adapt these layers and their objectives to align with the specific goals and expectations of your organization’s data platform, including factors related to cost, performance, security, and consumption. This adaptation involves consolidating the layers listed below and adjusting the anticipated goals and constraints accordingly. Your organization should strive to thoroughly define these data layer standards, achieve organizational alignment, rename the layers to match its own definitions and objectives, and communicate them comprehensively to prevent confusion, as many data professionals arrive with preconceived notions about what each layer is for. Consistency and stability of these data layer guidelines should be a priority, requiring diligent oversight and maintenance.
Some use cases, such as MDM (Master Data Management), real-time data stores, and real-time analytical systems, do not integrate well with the data layer design and should be developed separately from the data management platform. These systems typically require custom solutions because of their transactional processing nature and specialized workflows that go beyond direct data transformations between layers. Solutions developed for these use cases can then serve as sources feeding into the layered data management platform.
Landing Layer (Bronze)
The primary function of the landing layer is to capture data directly from its source. Here, data is stored temporarily before being transferred to subsequent data layers. From a security standpoint, the landing layer facilitates collaboration between the data source team and the data management team by providing a shared storage location where data can be added or removed. In all subsequent data layers, however, control is restricted solely to the data management team; the data source team has no direct data management capabilities. A minimal capture sketch follows the attribute list below.
Organization → Source: The data layer is segregated based on the data source system.
History → No: This data layer is transient; it does not retain any historical data.
Format → Raw: The data format is determined by the extraction process or how the source data system provides the data. The schema here is highly flexible because the data undergoes no processing. The data model is designed for efficient writes, improving the ingestion process.
Load Patterns → Append: Data in this layer is never updated; it is only appended.
Transforms → None: Data within this layer remains unaltered; no modifications are made.
Security → Sharing: The security model at this layer facilitates collaboration between the data source team and the data management team. Access is restricted to those authorized users or teams only.
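As an illustration of the landing layer’s append-only, unprocessed capture, here is a minimal Python sketch that copies an extracted file into a landing path partitioned by source system and batch. The path layout, the `land_file` helper, and the batch ID format are illustrative assumptions rather than a prescribed standard.

```python
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path

def land_file(extracted_file: str, source_system: str, landing_root: str) -> Path:
    """Append a raw extract into the landing layer, byte for byte.

    Nothing is parsed or modified here; each batch lands in its own
    folder so both the data source team and the data management team
    can see exactly what was delivered.
    """
    batch_id = f"{datetime.now(timezone.utc):%Y%m%dT%H%M%S}_{uuid.uuid4().hex[:8]}"
    target_dir = Path(landing_root) / source_system / batch_id
    target_dir.mkdir(parents=True, exist_ok=True)
    target_path = target_dir / Path(extracted_file).name
    shutil.copy2(extracted_file, target_path)  # copy only; no transformation
    return target_path

# e.g. land_file("/tmp/customers.csv", "crm", "/data/landing")
# -> /data/landing/crm/20240101T120000_ab12cd34/customers.csv
```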
Raw Layer or Archived Layer (Bronze)
The raw layer (or archived layer) serves to preserve data from the source in its original, unaltered format along with its complete history. Subsequent layers will inevitably involve some form of data modification. Establishing this layer ensures zero data loss and provides a reliable foundation for data integrity.
Organization → Source: The data layer is segregated based on the data source system.
History → Yes: Data history is preserved in accordance with retention policies.
Format → Raw: The data format is determined by the extraction process or how the source data system provides the data. The schema here is highly flexible because the data undergoes no processing. The data model is designed for efficient writes, improving the ingestion process.
Load Patterns → Append: Data in this layer is never updated; it is only appended.
Transforms → None: Data within this layer remains unaltered; no modifications are made.
Security → Data Platform Team: While granular security controls can be implemented, the typical approach allows broad access to data sets limited to specialized data engineering and data science teams. Your organization’s data security protocols should guide specific requirements.
Formatted Layer (Bronze)
The formatted layer serves to store all data from the source in an open, tabular data format with its complete history. Establishing this layer ensures that the full data sets, in terms of both columns and history, are available in a consistent format. The layer is very similar to the raw layer; the differences are the technical data quality validations applied here and the consistency of the data format. A parsing sketch follows the attribute list below.
Organization → Source: The data layer is segregated based on the data source system.
History → Yes: Data history is preserved in accordance with retention policies.
Format → Tabular Open Data Format: Data should adhere to an open, tabular format defined by the source schema, including column names, data types, and additional system fields. The data model is designed for efficient writes, improving the ingestion process.
Load Patterns → Append: Data in this layer is never updated; it is only appended.
Transforms → Parsed: Data is parsed to conform to the requirements of the tabular open data format, including conversion to appropriate data types suitable for binary serialization.
Security → Data Platform Team: While granular security controls can be implemented, the typical approach allows broad access to data sets limited to specialized data engineering and data science teams. Your organization’s data security protocols should guide specific requirements.
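As a sketch of the parse step described above, the following PySpark code reads a raw CSV extract, casts it to an explicit schema, adds system fields, and appends it in Parquet (one possible open tabular format; Delta or Iceberg would serve equally well). The schema, paths, and system field names (`_source_system`, `_batch_id`, `_load_ts`) are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, LongType, StringType, DateType

spark = SparkSession.builder.appName("formatted_layer").getOrCreate()

# Explicit schema drawn from the source system's table definition (assumed).
schema = StructType([
    StructField("cust_id", LongType()),
    StructField("cust_name", StringType()),
    StructField("signup_date", DateType()),
])

df = (
    spark.read.schema(schema)      # parse: cast raw text into typed columns
    .option("mode", "PERMISSIVE")  # stay lenient; malformed fields become null
    .csv("/data/raw/crm/customers/")
    .withColumn("_source_system", F.lit("crm"))        # system fields
    .withColumn("_batch_id", F.lit("20240101T120000"))
    .withColumn("_load_ts", F.current_timestamp())
)

# Append only: records are never updated in this layer, so history accumulates.
df.write.mode("append").parquet("/data/formatted/crm/customers/")
```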
Source Copy Layer (Bronze)
The purpose of the source copy layer is to store necessary data from the source in an open, tabular data format. If the source data originates from a relational table, this layer should result in a snapshot of the source table. In this layer, the schema of the source data is established, which may incorporate additional system fields such as source system ID, process ID, batch ID, creation date, update date, and logical deletion date.
Inline technical data validations are performed here, encompassing tasks such as data type casting, null primary key checks, duplicate record checks, and duplicate primary key checks. Additional data checks, such as ensuring referential integrity, can be implemented, but the objective should be to remain as lenient as possible while ensuring data integrity (rejecting records only when data loss is imminent). Errors encountered during data parsing must be effectively managed and reported to the source teams for resolution.
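The lenient rejection policy above can be sketched in PySpark as follows: only rows that would break integrity (null or duplicate primary keys) are set aside, and the rejects are written out for reporting back to the source team. The paths and the `cust_id` and `_load_ts` columns carry over as assumptions from the earlier sketches.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("source_copy_checks").getOrCreate()

df = spark.read.parquet("/data/formatted/crm/customers/")

# Reject only what would break integrity: rows with no primary key.
rejects = df.filter(F.col("cust_id").isNull())
valid = df.filter(F.col("cust_id").isNotNull())

# Duplicate primary keys: keep the most recently loaded record per key.
latest_first = Window.partitionBy("cust_id").orderBy(F.col("_load_ts").desc())
deduped = (
    valid.withColumn("_rn", F.row_number().over(latest_first))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)

# Surface rejects to the source team instead of silently dropping them.
rejects.write.mode("append").parquet("/data/source_copy/_rejects/crm/customers/")
```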
Organization → Source: The data layer is segregated based on the data source system.
History → Yes: Data history is maintained in accordance with established retention policies.
Format → Tabular Open Data Format: Data should adhere to an open, tabular format defined by the source schema, including column names, data types, and additional system fields. The data model is designed for efficient updates, improving the ingestion process.
Load Patterns → Merge/Truncate-Reload: The load pattern ensures the existence of only one version of each data record per primary key. The primary approach is merge (sketched after this list), with occasional use of truncate-reload or partial truncate-reload as necessary.
Transforms → Parsed: Data is parsed to conform to the requirements of the tabular open data format, including conversion to appropriate data types suitable for binary serialization.
Security → Data Platform Team: While granular security controls can be implemented, the typical approach allows broad access to data sets limited to specialized data engineering and data science teams. Your organization’s data security protocols should guide specific requirements.
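Continuing the validation sketch above, the merge pattern can be expressed with Delta Lake’s Python API, assuming a Spark session configured for Delta; Iceberg and Hudi offer equivalent merge semantics. The table path and `cust_id` key remain illustrative assumptions.

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/source_copy/crm/customers/")

# The merge guarantees exactly one version of each record per primary key:
# matched keys are updated in place, new keys are inserted.
(
    target.alias("tgt")
    .merge(deduped.alias("src"), "tgt.cust_id = src.cust_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```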
Shared Modeled Layer (Silver)
The purpose of the shared modeled layer is to store data in a standardized format that can be accessed by multiple data products. Your organization should strive to clearly define the objectives of the shared modeled layer. It is crucial to establish comprehensive modeling standards, achieve organizational alignment, and communicate them clearly. Consistency and stability of these modeling guidelines should be a priority, requiring diligent oversight and maintenance. The process of constructing the shared data model can encompass one or more of the following facets, each requiring dedicated resources and time to develop. Do not underestimate the time required to achieve each facet; be realistic about what the organization can deliver. When defining goals and standards, carefully consider these facets and their delivery implications.
Common Nomenclature — Establish consistency by standardizing terms used to represent data entities and columns. For instance, renaming CUSTOMER_SURROGATE and CUST_REF to CUST_ID promotes uniformity. Common nomenclature can be achieved with just views.
Common Schema — Achieve consistency in the Entity-Relationship Model (ERM) by standardizing the normalization and de-normalization of tables into a uniform layout across all sources. Do not underestimate the difficulty of this data harmonization; conforming the attributes and handling the related errors can be challenging.
Common Master Data — Establish a consistent association between master and transactional data. To ensure uniform reporting across systems, master data cross-references are usually required to link master data records together. Ideally, a dedicated MDM solution addresses most of these challenges within your enterprise, but integrating this data into the shared data model remains a necessary task.
Common Code Values — Establish consistency in data values, particularly concerning code values. For instance, while priority customers may be flagged as “423” in one source and “HIGH” in another, they can be standardized into a unified enterprise code value such as “PRIOR_CUST” (see the sketch following this list).
Business Quality Checks — Now that technical quality checks are implemented in the source copy layer, you can introduce more stringent business quality checks in this layer. These checks do not need to prioritize leniency. Additionally, you can designate data records with quality flags using inline quality fields (system columns). These fields can define the reliability of the data, enabling detailed information to be accessed at the time of consumption.
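As a sketch of the common code values and business quality check facets, the following PySpark code applies a cross-reference table to standardize source-specific priority codes into the enterprise value, and adds an inline quality field instead of rejecting records. The crosswalk contents, the `priority_code` and `cust_name` columns, and the `_quality_flag` name are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shared_model").getOrCreate()

# Cross-reference: source-specific priority codes -> enterprise code value.
xref = spark.createDataFrame(
    [("crm", "423", "PRIOR_CUST"), ("billing", "HIGH", "PRIOR_CUST")],
    ["source_system", "source_code", "enterprise_code"],
)
crm_xref = xref.filter(F.col("source_system") == "crm")

customers = spark.read.parquet("/data/source_copy/crm/customers/")

modeled = (
    customers
    .join(crm_xref, customers["priority_code"] == crm_xref["source_code"], "left")
    # Fall back to a sentinel when no mapping exists rather than rejecting.
    .withColumn("priority", F.coalesce(F.col("enterprise_code"), F.lit("UNMAPPED")))
    # Inline quality field: flag the record and decide at consumption time.
    .withColumn(
        "_quality_flag",
        F.when(F.col("cust_name").isNull(), F.lit("MISSING_NAME")).otherwise(F.lit("OK")),
    )
    .drop("source_system", "source_code", "enterprise_code")
)
```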
The shared modeled layer establishes a unified representation of data that remains consistent across various data sources and facilitates the sharing of derived data sets, minimizing inconsistencies and redundant development efforts. Additionally, it offers a centralized query interface that enables new users to access data without needing to comprehend the intricacies of each individual data source.
Organization → Subject Area/Domain: The data layer is structured according to subject areas or domains, such as sales, finance, etc.
History → Yes: Data history is maintained in accordance with established retention policies.
Format → Tabular Open Data Format: Data is formatted in an open tabular format aligned with the shared model schema. The granularity is closely aligned with the source system to retain as much detail as possible at this layer, with aggregations deferred to later stages. The data model is designed for efficient reads, enhancing the consumption process.
Load Patterns → Merge/Truncate-Reload: The load pattern ensures the existence of only one version of each data record per primary key. The primary approach is merge, with occasional use of truncate-reload or partial truncate-reload as necessary.
Transforms → Joins/Splits/Transforms: Source data is split, joined, and mapped to the shared model. Cross-reference transformations for common code values are applied, and additional derived columns are created from business logic using the granular data record fields. Business data quality checks are applied as well. Aggregations are avoided at this stage, as downstream layers will have varying requirements for the “group by” columns.
Security → Data Platform Team & Ad-Hoc Users: Access to this layer is essential for specialized data engineering and data science teams. Anticipate demand from additional users seeking live ad-hoc access. A well-defined and organized data layer will attract users who require insights without full production solution design and deployment. Security processes should encompass request, approval, provisioning, authentication, and authorization procedures. Additional security measures such as row-level security and data masking for specific columns may be necessary.
Solution Layer (Gold)
The purpose of the solution layer is to structure data for the final analytical consumption, ensuring appropriate security and performance. Standards in this layer are flexible, allowing for adjustments based on analytical solution requirements. Typically, data products in this layer involve denormalization, aggregations, and the creation of custom derived metrics as needed.
Organization → Application: The data layer is tailored for specific analytical application purposes.
History → Yes: Data history is preserved according to the application’s retention requirements.
Format → Any: The data format is chosen based on application needs. For applications intending to share data with others, an open format facilitates data exchange. Otherwise, the format can be diverse, including relational, cubed, in-memory, etc., to meet specific application requirements. The data model is designed for efficient reads, enhancing the consumption process.
Load Patterns → Merge/Truncate-Rebuild: The load pattern is flexible. The preferred method may be truncate-rebuild (sketched at the end of this section), but alternative strategies are considered based on factors such as data volume, cost, and SLAs.
Transforms → Any: Various processing methods are employed to achieve application objectives.
Security → Business Community: The data solution is designed for general user consumption within the business community or for other downstream data products. Security processes should be established, including request, approval, provisioning, authentication, and authorization procedures. Additional security measures like row-level security and data masking for specific columns may also be required.
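Finally, a sketch of a solution-layer build under the truncate-rebuild pattern: the shared model is aggregated into an application-specific table, and the previous version is replaced in full. The grouping columns, metric names, and output path are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("solution_layer").getOrCreate()

orders = spark.read.parquet("/data/shared/sales/orders/")

# Denormalize and aggregate for the consuming application; the granular
# detail stays behind in the shared modeled layer.
monthly_sales = (
    orders
    .groupBy(F.date_trunc("month", F.col("order_ts")).alias("order_month"), "region")
    .agg(
        F.sum("order_amount").alias("total_sales"),
        F.countDistinct("cust_id").alias("active_customers"),
    )
)

# Truncate-rebuild: overwrite replaces the prior version wholesale.
monthly_sales.write.mode("overwrite").parquet("/data/solution/sales_dashboard/monthly_sales/")
```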