Data Blueprint — Data Security
In a modern data architecture, transitioning from a monolithic to a distributed data management system increases complexity, making data platform security a top priority in your organization’s data blueprint design. Security should cover infrastructure, encryption, access requests and approvals, authentication and authorization, data masking, data sharing, and auditing.
Infrastructure & Encryption
Your organization’s data platform should adhere to security policies for platform administration and networking, including network perimeters and segmentation. It must also comply with all standards for data encryption, both at rest and in motion. Beyond acknowledging that your data platform is distributed, the blueprint need not define unique infrastructure or encryption requirements; existing organizational standards apply. That distributed nature matters, however: if you secure the ingestion and transformation processes but then introduce a new analytic technology that bypasses those controls, your overall data platform security is compromised. The security of your data platform is only as strong as its weakest link.
As your data platform technology stack expands, ensure all components follow well-defined security standards. Just as a single leak can sink a ship, one weak point can jeopardize the entire system. The organizational data blueprint must designate an overall security data platform owner and establish the standards and processes to follow before integrating any new technology into the data platform. Given the rush to introduce new analytic capabilities, balancing delivery and security can be challenging. However, adherence to security standards and processes is essential.
Access Controls
Access controls define who can access data and under what conditions through roles, permissions, and rules. These controls can govern both direct access to datasets and indirect access through analytic interfaces like user reports or machine learning APIs. While the granularity of roles and complexity of rules are usually established after the data blueprint is created, expectations for the security level at each data layer or role type can be set early on.
In a shared-disk architecture, multiple compute engines with broad access to data storage are responsible for implementing access controls for users and entities. As a result, organizations must support multiple technologies that are qualified as trusted systems. The data blueprint must specify where access control definitions (roles, permissions, rules) are established and ensure all technologies adhere to these standards. Organizations can centralize access control definitions or distribute them across tools. However, the more distributed the access controls, the more complex the processes for access requests, approvals, authorization, and auditing will become.
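The idea of centralizing access control definitions that every trusted compute engine consults can be sketched in a few lines. This is an illustrative model only; the names (`Role`, `Permission`, `can_access`) are assumptions, not the API of any particular product.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Permission:
    dataset: str   # e.g. "sales.orders"
    action: str    # e.g. "read", "write"

@dataclass
class Role:
    name: str
    permissions: set[Permission] = field(default_factory=set)

def can_access(user_roles: list[Role], dataset: str, action: str) -> bool:
    """Return True if any of the user's roles grants the requested action."""
    return any(
        Permission(dataset, action) in role.permissions
        for role in user_roles
    )

# A read-only analyst role defined once, enforceable by every compute engine
analyst = Role("analyst", {Permission("sales.orders", "read")})
print(can_access([analyst], "sales.orders", "read"))   # True
print(can_access([analyst], "sales.orders", "write"))  # False
```

The key design point is that the definitions live in one place: whether each engine enforces them natively or through a shared policy service, they must all evaluate the same roles, permissions, and rules.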

Access Requests and Approvals
The access request and approval process defines how users or entities request access to data or analytical interfaces and identifies the gatekeepers who approve these requests. Together, these processes support the provisioning of access controls. The organization’s security identity and access management (IAM) tools and capabilities will likely be used to support this function.
A key item to define during the data blueprint phase is the data stewardship procedures for granting access. Does the organization’s IAM system have well-defined roles that can manage auto-provisioning of access, or is it a manual request process? If it is a manual process, what is the plan for delegating the approval process to trusted individuals who must have well-defined authorization procedures and enough insight to know when to deny access requests?
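A manual, delegated approval process like the one described above can be modeled as a small workflow. The structure below is a hypothetical sketch — the domain-to-approver mapping and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"

@dataclass
class AccessRequest:
    requester: str
    dataset: str
    justification: str
    status: Status = Status.PENDING

# Delegated gatekeepers: one trusted approver per data domain (illustrative)
APPROVERS = {"sales": "alice", "hr": "bob"}

def review(request: AccessRequest, reviewer: str, approve: bool) -> AccessRequest:
    """Only the designated approver for a dataset's domain may decide."""
    domain = request.dataset.split(".")[0]
    if APPROVERS.get(domain) != reviewer:
        raise PermissionError(f"{reviewer} is not the approver for {domain}")
    request.status = Status.APPROVED if approve else Status.DENIED
    return request

req = AccessRequest("carol", "sales.orders", "quarterly revenue report")
review(req, "alice", approve=True)
print(req.status)  # Status.APPROVED
```

Note that the approver must have both the authority and the context to deny: requiring a justification on every request is one simple way to give gatekeepers something concrete to evaluate.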
Authentication & Authorization
Authentication and authorization capabilities will be largely determined by the organization’s IAM system and the compute engines that comprise the data platform. The organizational data blueprint should specify the primary mechanisms for authentication and authorization, such as SAML, OAuth, or Kerberos, along with any approved alternatives. This identification can then serve as a checklist for evaluating new technologies to be added to the data platform.
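The checklist use described here can be made concrete: the blueprint's approved mechanisms become a fixed set that any candidate technology is evaluated against. The mechanism names (SAML, OAuth, Kerberos) come from the text; the function and its report format are illustrative assumptions.

```python
# Approved authentication mechanisms, as defined in the data blueprint
APPROVED_AUTHN = {"SAML", "OAuth", "Kerberos"}

def evaluate_candidate(name: str, supported_authn: set[str]) -> dict:
    """Check a candidate technology's authentication support against the blueprint."""
    overlap = supported_authn & APPROVED_AUTHN
    return {
        "technology": name,
        "compliant": bool(overlap),          # at least one approved mechanism
        "usable_mechanisms": sorted(overlap),
    }

report = evaluate_candidate("new-query-engine", {"OAuth", "BasicAuth"})
print(report)
# {'technology': 'new-query-engine', 'compliant': True, 'usable_mechanisms': ['OAuth']}
```

A technology that supports only unapproved mechanisms fails the checklist outright, which turns a subjective integration debate into a quick, repeatable gate.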
Environments, Data Masking & Tokenization, Auditing
A crucial aspect of data platform security is managing development, testing, pre-production, and production environments — particularly how to develop, test, and deploy data product changes securely. This presents several challenges. For instance, machine learning often requires access to production data to create accurate models, which blurs the boundaries between environments. The data blueprint should establish best practices for handling these environment challenges, and doing so will involve tough decisions that balance real-world needs against optimal security. Data masking, tokenization, and auditing are processes that can help mitigate these challenges and enhance data security.
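To make the distinction between the two techniques concrete, here is a minimal sketch: masking irreversibly obscures a value for non-production use, while tokenization substitutes a random token that can be reversed through a secured vault. Real platforms use vetted libraries or built-in features for this; the functions and vault below are illustrative assumptions only.

```python
import secrets

def mask_email(email: str) -> str:
    """Irreversibly obscure an email for use outside production."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain

# token -> original value; in practice this vault lives in a hardened service
_token_vault: dict[str, str] = {}

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random token; reversible via the vault."""
    token = secrets.token_hex(8)
    _token_vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value — only callers with vault access can do this."""
    return _token_vault[token]

print(mask_email("jane.doe@example.com"))  # j***@example.com
card_token = tokenize("4111-1111-1111-1111")
assert detokenize(card_token) == "4111-1111-1111-1111"
```

Masked copies are appropriate for development and testing environments; tokenization suits cases like model training, where referential integrity must survive but the raw values must not leave production.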
Data Sharing
Data sharing is often associated with exchanging data between parties outside the organization, and there are various technical methods to implement this. It’s important to distinguish whether the data is simply being copied (e.g., via FTP file transfer) or whether a shared-disk technology is used to read the data in place, which keeps consumers on a single authoritative copy and so improves data fidelity.
Another aspect of data sharing that is less commonly discussed in the industry is its application within the organization. Data sharing can also refer to in-house processes, where data products within a network segmentation design may “share” data with other data products, each within its own security domain. Clear boundaries between data products can enhance data security. Interactions between these data products should leverage all relevant security features, including infrastructure, access controls, request and approval processes, authentication and authorization, data masking, and auditing. At a minimum, your organization’s data blueprint should define how all these security aspects will work together cohesively.
