Data Blueprint — Data Products
As discussed in the Data Management Systems article, the core data management system uses a shared-disk architecture, which supports distributing data processing from a physical design standpoint. Having moved away from a monolithic shared-nothing physical architecture, organizations should also logically decouple the delivery of data solutions. This decoupling can be achieved through the data product design pattern.
Data products adhere to the software engineering principles of high cohesion and loose coupling, striking a balance between autonomy and dependency. A data product comprises three phases: data ingestion, data processing, and data consumption. It is highly cohesive, with a clear purpose and well-defined output expectations (data available for consumption) based on its input (ingested data with defined timeliness and accuracy expectations). Data products are loosely coupled, depending only on the data interfaces of other data sources or products, never their implementation details. However, they must integrate effectively with the data management common services to support a holistic, cohesive data platform. Each data product will have a designated owner and sustainment team.
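To illustrate these principles, the following sketch (in Python, with hypothetical names) shows one way a data product's contract could be expressed: the output contract is the only surface other products depend on, while the three phases remain internal to the product.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputContract:
    """What a data product promises to consumers (its loose-coupling interface)."""
    dataset_name: str
    freshness_sla_hours: int          # timeliness expectation
    quality_checks: tuple[str, ...]   # accuracy expectations, e.g. ("not_null:order_id",)


class DataProduct(ABC):
    """A highly cohesive unit that owns its ingestion, processing, and consumable output."""

    contract: OutputContract  # the only thing downstream products depend on

    @abstractmethod
    def ingest(self) -> None:
        """Pull data from upstream interfaces (never from their internals)."""

    @abstractmethod
    def process(self) -> None:
        """Apply the product's transformations to meet its stated purpose."""

    @abstractmethod
    def publish(self) -> None:
        """Make data available for consumption per the output contract."""
```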
Data Layers & Data Products
Data products must have a clearly defined purpose to achieve high cohesion. Therefore, data layer blueprint design decisions should guide how each data product's purpose is defined. A data product should focus on implementing a single aspect of the data layer design. If a data product attempts to implement the bronze, silver, and gold aspects all at once, its purpose becomes overloaded, reducing its ability to deliver quickly against a focused purpose with high cohesion.
Organizations should create well-defined data product “classes” as part of their blueprint design. A data product “class” abstractly defines a type of data product the organization will implement and its intended purpose. For example, a Source Data Product class would implement the organizational data layer standards of the bronze layer for a given OLTP system. The primary purposes of this class could include providing an audit trail of data changes, serving as a secondary backup of source data, and enabling downstream data products to consume source data. A Data Domain Data Product class (silver layer) might integrate OLTP and MDM datasets for consumption by multiple downstream data products. Gold-layer data product classes will vary more widely; for instance, an ML Data Product class could offer predictive/prescriptive insights and support a shared feature store used across data products.
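Continuing the earlier sketch, these data product classes could be expressed as specializations of the base contract; the class names, SLAs, and quality checks below are illustrative assumptions, not prescribed standards.

```python
class SourceDataProduct(DataProduct):
    """Bronze: audit trail, secondary source backup, and raw feed for one OLTP system."""

    def __init__(self, source_system: str):
        self.contract = OutputContract(
            dataset_name=f"bronze.{source_system}",
            freshness_sla_hours=24,                       # illustrative SLA
            quality_checks=("row_count_matches_source",),
        )

    def ingest(self) -> None: ...    # e.g., change data capture from the OLTP system
    def process(self) -> None: ...   # minimal standardization only; no business logic
    def publish(self) -> None: ...   # append-only history tables on the shared-disk system


class DataDomainDataProduct(DataProduct):
    """Silver: integrates OLTP and MDM datasets for one subject area."""


class MLDataProduct(DataProduct):
    """Gold: predictive/prescriptive outputs plus entries in a shared feature store."""
```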
Data Management Systems & Data Products
The central shared-disk data management system will be the primary means of sharing data across your data products. Because each data product has its own resources for data ingestion and processing, it can meet its own performance goals and manage its own costs. The consumable interface for a data product can be as simple as the query patterns for accessing the most recent data and instructions for requesting secure access to it. Downstream data products then use their own resources to consume the data, benefiting from direct access and greater data fidelity.
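As one possible illustration, that consumable interface could be published as a small, declarative record; the table path, query pattern, and access URL below are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsumableInterface:
    """Published details a downstream product needs; compute and internals stay private."""
    table_path: str          # location on the shared-disk system
    latest_data_query: str   # recommended pattern for reading the most recent data
    access_request_url: str  # where to request secure, read-only access


# Hypothetical example for a silver-layer orders dataset.
orders_interface = ConsumableInterface(
    table_path="catalog.silver.orders",
    latest_data_query=(
        "SELECT * FROM catalog.silver.orders "
        "WHERE load_date = (SELECT MAX(load_date) FROM catalog.silver.orders)"
    ),
    access_request_url="https://example.internal/access-requests/silver-orders",
)
```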
Data Management Services & Data Products
While data products are highly cohesive with targeted purposes and loosely coupled with well-designed interfaces to interlink them, the data platform should still function as a holistic, unified solution. Your organization can achieve this by integrating all data products with common data management services. These services include infrastructure, security, orchestration and monitoring, data quality, the metastore and enterprise data catalog, operational metadata, alerting, and incident management.
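One illustrative way to keep every product integrated with these common services is to run each product through a thin platform wrapper that handles catalog registration, run monitoring, and alerting hooks. The sketch below assumes the DataProduct contract from the earlier sketch and uses hypothetical, protocol-style service interfaces rather than any specific tool.

```python
import datetime
from dataclasses import dataclass
from typing import Protocol


class DataCatalog(Protocol):
    def register(self, dataset_name: str, owner: str) -> None: ...


class Monitoring(Protocol):
    def record_run(self, dataset_name: str, status: str,
                   finished_at: datetime.datetime) -> None: ...


@dataclass
class CommonServices:
    """Shared services every data product plugs into, regardless of its purpose."""
    catalog: DataCatalog
    monitoring: Monitoring


def run_product(product: "DataProduct", services: CommonServices, owner: str) -> None:
    """Execute a product's phases while reporting into the platform-wide services."""
    services.catalog.register(product.contract.dataset_name, owner)
    status = "failed"  # assume failure until all phases complete
    try:
        product.ingest()
        product.process()
        product.publish()
        status = "success"
    finally:
        # a failed status would also feed alerting and incident management
        services.monitoring.record_run(
            product.contract.dataset_name,
            status,
            datetime.datetime.now(datetime.timezone.utc),
        )
```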
The list of common data management services is extensive and complex, not to mention the targeted services that need to be built into each data product. Implementing a modern data platform involves significant architectural planning and tooling, which should be approached with care. A future data blueprint article will outline a logical architecture for a data product container design, detailing mandatory features and optional features that can be added later based on the organizational data blueprint decisions.
Organizational/Team Management
Apart from architectural planning and tooling, another key challenge in implementing a data product design is effectively organizing teams and resources across the different data products. Considering the number of bronze data products that must be developed (one per source system) and the number of silver data products (one per subject area), the resource requirements for this architecture can quickly become daunting, and that is before development of gold data products even begins. A future data blueprint article will discuss the leadership decisions necessary to make this data product design approach feasible for an organization. As a preview, the organizational design will need to allow certain roles to develop across multiple data products and enable some roles to manage multiple data products.
Closing Word
The data management platform discussed thus far assumes a batch or micro-batch architecture. However, organizations can also incorporate streaming services into the design. A streaming architecture introduces complexities and potentially higher costs, so it must be carefully planned and thoroughly tested. While this streaming design approach is beyond the scope of this article, to demonstrate the complexity, here are a few key questions that must be addressed for a streaming architecture:
· When streaming data is transformed by a process, should the process fork the data into both a data layer landing area and another real-time topic? (A minimal sketch of this fork pattern follows the list.)
· For the introduction of a new data product, how will the product capture stored historical data versus streaming new data? How can the complexity of managing two different interfaces be reduced?
· How should denormalization (joins) between data sets be handled if data arrives late?
· How can transactional updates be accounted for while keeping the tracking of aggregations performant?
· How are data quality failures reported and handled in the streaming system?
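To illustrate only the first question, the sketch below uses hypothetical, protocol-style interfaces (not any specific streaming library) to show a transform that forks each processed record to both a data layer landing area and a downstream topic; the dual write is exactly the kind of complexity that must be carefully planned and tested.

```python
from typing import Protocol


class TopicProducer(Protocol):
    def send(self, topic: str, record: dict) -> None: ...


class LandingArea(Protocol):
    def append(self, table: str, record: dict) -> None: ...


def transform_and_fork(record: dict, producer: TopicProducer, landing: LandingArea) -> None:
    """Apply the product's transformation, then write the result to both destinations.

    Forking keeps the batch data layer and the real-time topic in step, at the cost
    of a dual write whose partial failures must be detected and reconciled.
    """
    enriched = {**record, "processed": True}      # placeholder transformation
    landing.append("bronze.events", enriched)     # data layer landing area
    producer.send("events.enriched", enriched)    # downstream real-time topic
```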