Data Blueprint — Data Management Services

Michael Monschke
6 min read · Jul 2, 2024


Data management systems provide services for writing and reading organizational data. A range of complementary data services, however, combines with the data management system to form your organization’s data management platform. Some of these services will be common across all data products within the platform, while others will be targeted at specific data product deliveries. It is crucial to clearly define how these common and targeted services interact. Your organizational data blueprint should detail how each service logically supports your operations and how it is technically implemented. The following list of services aims to cover the primary aspects of a data management platform at a high level, with future articles delving deeper into each one.

[Figure: conceptual data management platform diagram]

Extract, Load, and Transform (ELT) Services

The first targeted service within the data management platform is the ELT service. Unlike the traditional ETL approach, a modern data management platform leverages ELT: its processes extract (or consume) data from sources outside the data management system, load it into the system, and only then transform it as it moves between data layers. The first data layer should aim to capture the full data sets unaltered, with all transformations occurring within the data management system itself.

While the ELT computational system can be considered part of the data management system, the code behind each ELT process is customized to the specific data layers and data products it affects. The ELT service itself is therefore not a common service. However, the standards, tools, and modules used to build it can be treated as common practice, ensuring consistency across all data products.
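
To make this concrete, here is a minimal ELT sketch in Python. It is illustrative only: pandas and a local SQLite file stand in for the data management system, and the source file, table, and column names (orders_export.csv, raw_orders, curated_orders) are hypothetical, not a prescribed layout.

```python
# Minimal ELT sketch: extract a source file, load it unaltered into a raw
# layer, then transform it inside the data management system itself.
# SQLite stands in for the data management system; names are hypothetical.
import sqlite3
import pandas as pd

con = sqlite3.connect("warehouse.db")

# Extract: consume the full data set from a source outside the system.
orders = pd.read_csv("orders_export.csv")  # hypothetical source extract

# Load: land the data unaltered in the first (raw) data layer.
orders.to_sql("raw_orders", con, if_exists="replace", index=False)

# Transform: derive the next data layer inside the system, not in transit.
con.execute("DROP TABLE IF EXISTS curated_orders")
con.execute("""
    CREATE TABLE curated_orders AS
    SELECT order_id,
           customer_id,
           CAST(order_total AS REAL) AS order_total
    FROM raw_orders
    WHERE order_id IS NOT NULL
""")
con.commit()
```

Note the order of operations: the raw layer keeps the full, unaltered data set, so a transformation bug can always be fixed and replayed without going back to the source.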

Data Process Orchestration Management & Monitoring Service

A crucial common service that must exist across all data products is the orchestration and monitoring of data processes, such as ELT jobs. Data layers and data products require data to move between them in a sequence, similar to stages on a production line. Therefore, an independent system external to the data products must oversee these sequential data processes to provide a comprehensive understanding of the organization’s data operations. A data blueprint must ensure the data management platform can identify the standard daily workload, assess the health of daily data operations, and quickly find and resolve any issues that might arise. Attempting to add this capability after delivering your data products is a mistake.

Process orchestration is primarily responsible for understanding the interdependencies between data processes and initiating them at the appropriate times, accounting for planned downtime or other events. While the orchestrator might start a sequence of activities based on a schedule, it will usually use event-based triggering. Process monitoring, on the other hand, is responsible for overseeing the status of all actively running data processes, essentially serving as a heartbeat monitor for these processes.

Once a data process starts, the platform should immediately begin its monitoring. Therefore, it makes sense for the data process orchestration management and monitoring service to be part of the same technology suite, although this is not strictly necessary. Process orchestration can be driven by event-based technologies to support dependency management, but it is important to ensure that a data process is not responsible for reporting its own failures, as critical system failures might prevent alerts from being sent.
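
As a rough illustration, the sketch below uses Apache Airflow, one common orchestration choice; the blueprint does not prescribe a tool, and the DAG, task names, and callables here are hypothetical. The point is that dependencies, scheduling, and success/failure tracking live in the scheduler, external to the data processes themselves, so a crashed process is never responsible for reporting its own death.

```python
# Hypothetical orchestration sketch using Apache Airflow. The scheduler
# (not the tasks) records each task's state, giving the external
# "heartbeat monitor" role described above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_and_load():
    ...  # land source data unaltered in the raw layer

def transform():
    ...  # move data from the raw layer to the curated layer

with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # Airflow 2.4+ keyword; None = event/externally triggered
) as dag:
    load = PythonOperator(task_id="extract_and_load",
                          python_callable=extract_and_load)
    curate = PythonOperator(task_id="transform",
                            python_callable=transform)

    # Dependency management: transform runs only after the load succeeds.
    load >> curate
```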

Data Quality Services

Another essential set of services within the data management platform is data quality services. While some data quality services are integrated within other targeted services, their metrics and reporting can be centralized. Your organization must define the core data quality metrics to be captured in your data blueprint. Data quality services fall into three primary capabilities, illustrated in the sketch after this list:

· Inline Technical Validations — Basic checks necessary to load data into an open tabular format, such as data type verification and primary key checks. These validations occur as part of the ELT targeted services and must be handled inline with the data load processes.

· Business Data Validations — Data quality as defined by business rules, typically including referential integrity checks. This service can be a common service.

· Reconciliation Verification — Ensures the data matches between data sources and data layers. This service can also be a common service.
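
Here is a minimal sketch of the three capabilities using pandas. The column names, sample data, and pass/fail semantics are hypothetical, and a real platform would persist these metrics centrally rather than print them.

```python
# Hedged sketch of the three data quality capabilities on pandas DataFrames.
import pandas as pd

def inline_technical_validation(df: pd.DataFrame, key: str) -> bool:
    """Primary-key style check: key column present, non-null, and unique."""
    return key in df.columns and df[key].notna().all() and df[key].is_unique

def business_validation(orders: pd.DataFrame,
                        customers: pd.DataFrame) -> bool:
    """Referential integrity: every order references a known customer."""
    return orders["customer_id"].isin(customers["customer_id"]).all()

def reconciliation(source_rows: int, layer: pd.DataFrame) -> bool:
    """Reconciliation: the loaded layer carries the full source row count."""
    return len(layer) == source_rows

orders = pd.DataFrame({"order_id": [1, 2], "customer_id": [10, 11]})
customers = pd.DataFrame({"customer_id": [10, 11]})

print(inline_technical_validation(orders, "order_id"))  # True
print(business_validation(orders, customers))           # True
print(reconciliation(2, orders))                        # True
```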

Metadata Management

A common service within the data management platform is responsible for metadata management. Implementing metadata management effectively requires resources, so your organizational data blueprint should prioritize the metadata that is most important. The primary types of metadata include business metadata, technical metadata, and operational metadata:

· Business Metadata — Describes your data in business terms. The technology responsible for managing this metadata is called the Enterprise Data Catalog.

· Technical Metadata — Describes your data in technical terms. The technology responsible for managing this metadata in a shared-disk management system is called the metastore.

· Operational Metadata — Varies depending on use cases, including:

o Data Management Metadata — Provides information about data loading, such as creation/update time, the process used, and the data source. This metadata is usually included inline with the actual data records (see the sketch after this list).

o Data Access Metadata — Provides information about data access, including who accessed the data, when, and under what authorization rule. The level of data access metadata to capture is typically defined by your organization’s security requirements.

o Data Process Metadata — Provides information about data processes executing within your data management platform. This includes a master definition of the data processes and their executions, and it is part of the data process orchestration common service.

o Data Lineage Metadata — Data lineage metadata comes in two forms. Data process lineage defines the relationship between the processes that move data between data sets in the data layers, which is part of the data process orchestration common service. Data set lineage defines the relationship between the data sets themselves in the data layers, which is part of the ELT services. Although data process lineage and data set lineage are closely related, they do not have a one-to-one relationship. Capturing and maintaining this data effectively is often a challenge for organizations.

o Data Quality Metadata — The data quality as managed by the data quality services.

o Platform & System Metadata — Telemetry metadata from the platforms within your organization’s infrastructure.
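
As an illustration of inline data management metadata, the sketch below stamps metadata columns onto records at load time. The column names (_loaded_at, _load_process, _source_system) are an assumed convention, not a standard.

```python
# Minimal sketch of inline data management metadata: each record is stamped
# with load time, the loading process, and its source system.
# Column naming convention is hypothetical.
from datetime import datetime, timezone

import pandas as pd

def stamp_load_metadata(df: pd.DataFrame,
                        process_id: str,
                        source: str) -> pd.DataFrame:
    stamped = df.copy()
    stamped["_loaded_at"] = datetime.now(timezone.utc).isoformat()
    stamped["_load_process"] = process_id  # links to data process metadata
    stamped["_source_system"] = source     # supports data set lineage
    return stamped

raw = pd.DataFrame({"order_id": [1, 2]})
print(stamp_load_metadata(raw, process_id="elt_orders_v1",
                          source="erp_export"))
```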

Security

A common service within the data management platform is responsible for user and data security. The security portion of your data blueprint should be heavily influenced by your organization’s security policies. The primary components of security span infrastructure and user/data access:

· Infrastructure — Secure the technology platforms and network according to security policies.

· Data Encryption — Apply data encryption in line with your organization’s security policies.

· Roles/Permissions/Rules — Define permissions to modify and access data, grouping them into roles. Establish additional rules for more granular security, such as row-level security, attribute-level security, or time-based controls.

· Access Requests — A standard process for users to request access to data sets.

· Access Approvals — A standard process for data stewards to approve access requests.

· Authentication — A service to verify an identity (user/service) before granting access to data.

· Authorization — A service to verify an identity (user/service) has permission to access the requested data. While this may occur at a role level, your organization may require more complex rules for authorization.

· Data Masking/Tokenization — A service to hide or obfuscate sensitive data (see the sketch after this list).
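
A minimal sketch of masking and deterministic tokenization follows. The keyed-hash approach and the mask format are illustrative assumptions; real key management belongs in a secrets vault, not in code.

```python
# Hedged tokenization/masking sketch: tokenize() replaces a sensitive value
# with a deterministic token (joins still work, no reversal); mask_email()
# hides the value for display. Key handling here is illustrative only.
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical; use a vault

def tokenize(value: str) -> str:
    """Deterministic keyed hash: same input, same token."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def mask_email(email: str) -> str:
    """Simple display mask: keep the first character and the domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"

print(tokenize("jane.doe@example.com"))
print(mask_email("jane.doe@example.com"))  # j***@example.com
```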

Alerting & Incident Management

The process monitoring, data quality, and security common services must report data process failures and data quality issues to the relevant teams for resolution. A common service should be implemented to capture these alerts, offer a data operations dashboard, and track each issue until it is resolved or closed.
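
A minimal sketch of such a central alert capture service is below; the Alert fields and status values are hypothetical, not a prescribed schema.

```python
# Hypothetical central alert capture: common services push alerts into one
# place, and each alert is tracked until resolved or closed.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alert:
    source: str   # e.g. "process_monitoring", "data_quality", "security"
    message: str
    raised_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    status: str = "open"  # open -> acknowledged -> resolved/closed

class AlertService:
    def __init__(self) -> None:
        self.alerts: list[Alert] = []

    def capture(self, source: str, message: str) -> Alert:
        alert = Alert(source=source, message=message)
        self.alerts.append(alert)
        return alert

    def open_alerts(self) -> list[Alert]:
        """Feed for a data operations dashboard."""
        return [a for a in self.alerts if a.status == "open"]

svc = AlertService()
svc.capture("data_quality", "referential integrity failed for raw_orders")
print(len(svc.open_alerts()))  # 1
```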

Infrastructure

The foundational technology common service is infrastructure, encompassing platforms, servers, and networking. Your organization’s infrastructure forms the basis for all other core common services. The ability to deploy rapidly (via SaaS & IaC) in a secure manner (with identities & network segmentation) can significantly impact the success of your data management platform initiative.
