Data Blueprint — Data Management Systems

Michael Monschke
Jun 25, 2024


For the purposes of this article, unstructured and semi-structured data are not discussed.

Data platforms typically operate on a single, central core data management system. Historically, these systems employed a shared-nothing architecture. However, most modern core data management systems utilize cloud technologies that implement a shared-disk architecture. Understanding the fundamental concepts behind these two architectures is essential to grasping the evolution and challenges of modern data architectures.

Shared Nothing Architecture

The shared-nothing architecture treats the database as a single large system: a monolith that supports data management. It divides each table into smaller segments called shards, with each shard holding a subset of the table's data. For instance, if a table has a million rows and there are 100 compute shards, each shard processes only 10,000 records per query. The compute shards can operate in parallel because each has its own dedicated CPU, memory, and disk, and they coordinate with one another as needed. The term "shared-nothing" reflects the fact that CPU, memory, and disk resources are not shared between shards; each shard manages its resources independently.
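To make sharding concrete, here is a minimal sketch of hash-based shard assignment (the shard count, key column, and hashing scheme are illustrative; real systems use more sophisticated distribution strategies):

import hashlib

NUM_SHARDS = 100  # illustrative shard count

def shard_for(key: str) -> int:
    """Assign a row to a shard by hashing its distribution key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Each row lands on exactly one shard, so a full-table query lets every
# shard process roughly 1/NUM_SHARDS of the rows in parallel.
rows = [{"customer_id": f"C{i:06d}", "amount": i % 500} for i in range(1_000_000)]
shards: dict[int, list] = {}
for row in rows:
    shards.setdefault(shard_for(row["customer_id"]), []).append(row)

print(len(shards), "shards, roughly", len(rows) // NUM_SHARDS, "rows each")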

The shared-nothing architecture requires data to be exchanged across compute shards to support certain queries, such as joins. As a result, the architecture performs best when supported by a dedicated high-speed back-end network between the compute shards, making it more suitable for appliances and less valuable in a cloud environment. Additionally, because all queries contend for the same per-shard resources, a workload manager is needed to prioritize queries based on predefined rules, such as running smaller queries before larger ones.
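As a hedged illustration of the "smaller queries first" rule, here is a toy workload manager built on a priority queue (the cost model, based on estimated row counts, is a hypothetical simplification):

import heapq

class WorkloadManager:
    """Toy scheduler: cheaper queries are dispatched before expensive ones."""

    def __init__(self) -> None:
        self._queue: list = []
        self._counter = 0  # tie-breaker so equal-cost queries keep arrival order

    def submit(self, sql: str, estimated_rows: int) -> None:
        # Lower estimated cost means higher priority under the "small first" rule.
        heapq.heappush(self._queue, (estimated_rows, self._counter, sql))
        self._counter += 1

    def next_query(self) -> str:
        _, _, sql = heapq.heappop(self._queue)
        return sql

wm = WorkloadManager()
wm.submit("SELECT * FROM sales s JOIN customers c ON s.cid = c.cid", 50_000_000)
wm.submit("SELECT count(*) FROM regions", 200)
print(wm.next_query())  # the small query runs first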

The shared-nothing architecture is typically implemented by a single technology vendor, which keeps the platform's internals proprietary because of the complexity of its data management. This unified monolithic platform also addresses various technical concerns, such as object organization, object schemas, security measures, performance, and access interfaces, by bundling them into a single capability set.

Shared Disk Architecture

The shared-disk architecture separates the disk from the CPU and memory, allowing multiple distributed compute engines (or clusters), each with its own CPU and memory, to share the data at the disk level. In modern data architectures, the data stored on disk is maintained in an open table format, enabling different compute engines to access it directly. The leading open table formats are Delta Lake, Apache Iceberg, and Apache Hudi.
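For example, here is a hedged sketch of two different engines reading the same Delta Lake table directly from shared storage. It assumes the delta-spark and deltalake Python packages are installed and that a Delta table already exists at the (hypothetical) path shown:

from pyspark.sql import SparkSession

# Engine 1: Spark (JVM-based), configured with the Delta Lake extension.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)
spark.read.format("delta").load("/data/lake/sales").show()  # hypothetical path

# Engine 2: delta-rs (Rust-based, no JVM) reads the very same files on disk.
from deltalake import DeltaTable

print(DeltaTable("/data/lake/sales").to_pandas().head())

Because both engines interpret the same open table format, neither needs to go through the other to access the data; the disk is the shared contract.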

The shared-disk architecture, however, comes with its own set of challenges. One primary issue is managing concurrency control. Popular open table formats use optimistic concurrency control, where data readers access the latest snapshot, and data writers operate under the assumption that their writes won’t conflict with others. Writers verify the absence of conflicts before committing changes. This method works because data blocks are never updated directly; instead, they are read, updated, and rewritten, with the old blocks being retired. This design also enables a feature called time travel, which is beyond the scope of this article.
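In spirit, the writer's commit protocol looks like this toy single-process sketch (real table formats implement the conflict check with atomic operations on a transaction log):

class Table:
    """Toy table with versioned snapshots and optimistic commits."""

    def __init__(self) -> None:
        self.version = 0
        self.snapshots = {0: []}  # version -> list of data files

    def read_snapshot(self):
        # Readers always see the latest committed snapshot.
        return self.version, list(self.snapshots[self.version])

    def try_commit(self, base_version: int, new_files: list) -> bool:
        # The writer verifies nothing was committed since it read its snapshot.
        if base_version != self.version:
            return False  # conflict: the caller must re-read and retry
        # Copy-on-write: old blocks are retired, never updated in place,
        # which is also what makes time travel possible.
        self.version += 1
        self.snapshots[self.version] = self.snapshots[base_version] + new_files
        return True

table = Table()
base, _ = table.read_snapshot()
assert table.try_commit(base, ["part-0001.parquet"])      # succeeds
assert not table.try_commit(base, ["part-0002.parquet"])  # stale base: retry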

Another significant challenge for the shared-disk architecture is its aim to support multiple compute engines. The open table format is only one piece of the puzzle; compute engines must also share a metastore that ideally provides details on object organization, object schemas, and additional security information such as roles, permissions, data-masking rules, and row-level security rules. Ensuring that every compute engine fully implements the features of the metastore is a technological challenge in terms of both delivery and maintenance.

This challenge can be mitigated by limiting the number of compute engines your organization supports. For instance, an organization generally needs only one compute engine technology to provide a SQL-based interface for accessing its data, and BI reporting engines should access data via this SQL interface rather than requiring their own compute clusters. There are, however, benefits to using different compute engines for specific scenarios, such as machine learning compute clusters. For organizational efficiency, it is also preferable to avoid maintaining multiple metastores. The diagram below illustrates the complexity that can arise when managing multiple compute technologies in a shared-disk architecture; the industry is actively working to make this architecture more open and easier to deploy.

[Diagram: a shared-disk architecture with multiple compute technologies]
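To make the metastore's role concrete, here is a hedged sketch of the kind of record every compute engine would need to honor (all field names are hypothetical):

from dataclasses import dataclass, field

@dataclass
class TableEntry:
    """Hypothetical metastore record shared by every compute engine."""
    name: str                    # object organization: namespace.table
    location: str                # where the open-table-format files live
    schema: dict                 # column name -> column type
    grants: dict = field(default_factory=dict)         # role -> permissions
    masking_rules: dict = field(default_factory=dict)  # column -> masking rule
    row_filters: list = field(default_factory=list)    # row-level security rules

orders = TableEntry(
    name="sales.orders",
    location="/data/lake/sales/orders",  # hypothetical path
    schema={"order_id": "bigint", "customer_email": "string", "amount": "decimal(10,2)"},
    grants={"analyst": ["SELECT"]},
    masking_rules={"customer_email": "mask_email"},
    row_filters=["region = current_user_region()"],
)

If even one engine ignores the masking rules or row filters, the security model breaks, which is why limiting the number of supported engines matters.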

Conclusion

In a modern data architecture, you can incorporate a mix of data management platform designs, including cube and graph systems, but the centralized core data management system for the initial data layers will typically be a shared-disk architecture. This architecture facilitates data sharing both internally, among segregated data products, and externally. Decisions about which data layers, if any, should be split off from the centralized core shared-disk system must be outlined in your organization's data blueprint. Developing that blueprint requires understanding the key data management and consumption technologies within your organization and how they relate to the defined data layers.

The shared-nothing and shared-disk architectures will form the foundation of data marts, data warehouses, data lakes, data lakehouses, and data meshes. While data virtualization technologies can help simplify the complexity of these data ecosystems, that topic will be covered in a future article.

Written by Michael Monschke

Senior data professional with extensive experience spanning software development, security, data management, cloud computing, and artificial intelligence.
