Breaking Down Data Silos with Cross-Account Collaboration
Data sharing drives innovation, growth, and collaboration across industries. According to Gartner, organizations that promote data sharing outperform their peers on most business value metrics. However, managing cross-account permissions and discovering the right data across accounts remain significant challenges.
Amazon DataZone addresses these challenges. It is a fully managed data management service that helps you catalog, discover, share, and govern data stored across AWS accounts.
Solution Overview
This cross-account data collaboration solution uses Amazon DataZone domain association to maintain security and governance while enabling seamless data sharing. The solution involves:
- A producer account that contains and shares data assets
- A consumer account that accesses the shared data
- An Amazon DataZone domain created in the producer account and associated with the consumer account
The domain is shared through AWS Resource Access Manager (AWS RAM). When both accounts belong to the same organization in AWS Organizations, the domain association happens automatically. When the accounts belong to different organizations, AWS RAM sends the consumer account an invitation, which must be accepted or rejected before the resource grant takes effect.
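For the cross-organization case, accepting the invitation on the consumer side could be scripted roughly as follows. This is a minimal sketch under stated assumptions, not the method used in the original walkthrough: `ram_client` is assumed to be a `boto3.client("ram")` created with consumer-account credentials, and the helper name `accept_pending_invitations` and the `producer_account_id` parameter are hypothetical.

```python
def accept_pending_invitations(ram_client, producer_account_id):
    """Accept every PENDING AWS RAM invitation sent by the producer account.

    Assumption: `ram_client` is boto3.client("ram") in the consumer account.
    Returns the ARNs of the invitations that were accepted.
    """
    accepted = []
    kwargs = {}
    while True:
        page = ram_client.get_resource_share_invitations(**kwargs)
        for invitation in page.get("resourceShareInvitations", []):
            is_pending = invitation.get("status") == "PENDING"
            from_producer = invitation.get("senderAccountId") == producer_account_id
            if is_pending and from_producer:
                arn = invitation["resourceShareInvitationArn"]
                ram_client.accept_resource_share_invitation(
                    resourceShareInvitationArn=arn
                )
                accepted.append(arn)
        # Follow pagination until the service stops returning a token.
        token = page.get("nextToken")
        if not token:
            return accepted
        kwargs = {"nextToken": token}
```

Filtering on the sender account keeps the script from accepting unrelated shares that happen to be pending in the same account.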
Key User Personas
- Data Administrators: Account owners responsible for creating domains and for requesting and accepting domain associations
- Data Publishers: Users in producer accounts who create publish projects and environments, produce data assets, and accept subscription requests
- Data Subscribers: Users in consumer accounts who create subscribe projects, search for and subscribe to data assets, and query data
Implementation Walkthrough
The solution follows these high-level steps:
1. Create an Amazon DataZone domain in the producer account
2. Request a domain association from the producer account to the consumer account
3. Accept domain association in the consumer account
4. Add data users to the domain
5. Create publish projects for AWS Glue and Amazon Redshift
6. Set up environments to publish data assets
7. Create and run data sources to publish assets into the business catalog
8. Create subscribe projects
9. Configure environment profiles and environments
10. Subscribe to and consume the shared data
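Steps 1 and 5 can also be driven through the AWS SDK instead of the console. The sketch below is illustrative only: it assumes `datazone_client` is a `boto3.client("datazone")` in the producer account, that a domain execution role already exists, and the project names `glue-publish` and `redshift-publish` are hypothetical placeholders. The remaining steps (association, environments, data sources) are not covered here.

```python
import time


def create_domain_and_publish_projects(
    datazone_client,
    domain_name,
    execution_role_arn,
    project_names=("glue-publish", "redshift-publish"),  # hypothetical names
    poll_seconds=5,
    max_polls=60,
):
    """Create a domain, wait until it is AVAILABLE, then create publish projects.

    Assumption: `datazone_client` is boto3.client("datazone") in the producer account.
    Returns (domain_id, {project_name: project_id}).
    """
    domain = datazone_client.create_domain(
        name=domain_name,
        domainExecutionRole=execution_role_arn,
    )
    domain_id = domain["id"]
    status = domain.get("status")

    # Domain creation is asynchronous, so poll before creating projects in it.
    for _ in range(max_polls):
        if status != "CREATING":
            break
        time.sleep(poll_seconds)
        status = datazone_client.get_domain(identifier=domain_id)["status"]
    if status != "AVAILABLE":
        raise RuntimeError(f"domain {domain_id} is in state {status}")

    project_ids = {}
    for name in project_names:
        project = datazone_client.create_project(
            domainIdentifier=domain_id,
            name=name,
        )
        project_ids[name] = project["id"]
    return domain_id, project_ids
```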
Technical Considerations
Amazon DataZone uses Amazon Redshift datashares for cross-account data sharing, which imposes specific requirements:
- Both producer and consumer clusters must be encrypted
- Data sharing is supported only on provisioned RA3 clusters and on Amazon Redshift Serverless
- Proper IAM roles and permissions must be configured
- Database credentials must be stored in AWS Secrets Manager, with specific tags on the secret for access control
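The first two requirements are easy to check before attempting to share data. The following is a minimal sketch: it assumes `cluster` is a dictionary shaped like one entry of `Clusters` in the response from the Redshift `describe_clusters` API (with `NodeType` and `Encrypted` keys), and the function name `datashare_blockers` is hypothetical. It applies to provisioned clusters only; Amazon Redshift Serverless workgroups are described through a different API and are not handled here.

```python
def datashare_blockers(cluster):
    """Return the reasons a provisioned cluster cannot take part in data sharing.

    Assumption: `cluster` looks like an entry of describe_clusters()["Clusters"].
    An empty list means neither of the two checks failed.
    """
    problems = []
    if not cluster.get("Encrypted", False):
        problems.append("cluster is not encrypted")
    # RA3 node types are named like "ra3.xlplus" or "ra3.4xlarge".
    if not cluster.get("NodeType", "").lower().startswith("ra3."):
        problems.append("node type is not RA3")
    return problems


producer = {"ClusterIdentifier": "producer", "NodeType": "ra3.xlplus", "Encrypted": True}
legacy = {"ClusterIdentifier": "legacy", "NodeType": "dc2.large", "Encrypted": False}

print(datashare_blockers(producer))  # []
print(datashare_blockers(legacy))    # ['cluster is not encrypted', 'node type is not RA3']
```

Run the check against both the producer and the consumer cluster, since the encryption requirement applies to each side.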
Data Publishing Process
The data publishing workflow involves:
1. Creating data sources that connect to AWS Glue and Amazon Redshift
2. Running these data sources to ingest metadata into Amazon DataZone
3. Reviewing and publishing the assets to the business data catalog
4. Making the assets discoverable and accessible to authorized users
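Step 2 of this workflow can be sketched with the AWS SDK as shown below. The code assumes `datazone_client` is a `boto3.client("datazone")`, that the data source already exists, and that `domain_id` and `data_source_id` are its identifiers; the helper name `run_data_source` is hypothetical. Reviewing and publishing the ingested assets (step 3) is left to the data publisher.

```python
import time


def run_data_source(datazone_client, domain_id, data_source_id,
                    poll_seconds=10, max_polls=60):
    """Start a data source run and poll until it leaves the in-progress states.

    Assumption: `datazone_client` is boto3.client("datazone").
    Returns (run_id, final_status); the status may still be in progress
    if `max_polls` is exhausted.
    """
    run = datazone_client.start_data_source_run(
        domainIdentifier=domain_id,
        dataSourceIdentifier=data_source_id,
    )
    run_id = run["id"]
    status = run["status"]

    for _ in range(max_polls):
        if status not in ("REQUESTED", "RUNNING"):
            break
        time.sleep(poll_seconds)
        status = datazone_client.get_data_source_run(
            domainIdentifier=domain_id,
            identifier=run_id,
        )["status"]
    return run_id, status
```

A caller would typically treat any final status other than `SUCCESS` as a reason to inspect the run before publishing.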
Data Consumption Process
The data consumption workflow includes:
1. Searching for published assets in the catalog
2. Requesting subscription with proper justification
3. Getting approval from data publishers
4. Accessing and querying the data with analytics tools such as Amazon Athena and the Amazon Redshift query editor
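Steps 1 and 2 of this workflow could look roughly like this through the AWS SDK. This is a sketch under assumptions, not the article's procedure: `datazone_client` is assumed to be a `boto3.client("datazone")` used by a data subscriber, `project_id` is the subscribe project, and the helper name `request_subscription` is hypothetical. It simply takes the first matching asset listing, which a real workflow would replace with a deliberate choice.

```python
def request_subscription(datazone_client, domain_id, project_id, search_text, reason):
    """Search the catalog and request a subscription to the first asset listing.

    Assumption: `datazone_client` is boto3.client("datazone").
    Returns the subscription request ID, or None if no asset listing matched.
    """
    results = datazone_client.search_listings(
        domainIdentifier=domain_id,
        searchText=search_text,
    )
    # Each result item is a union; only asset listings are considered here.
    asset_listings = [
        item["assetListing"]
        for item in results.get("items", [])
        if "assetListing" in item
    ]
    if not asset_listings:
        return None

    request = datazone_client.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,  # the justification shown to the data publisher
        subscribedListings=[{"identifier": asset_listings[0]["listingId"]}],
        subscribedPrincipals=[{"project": {"identifier": project_id}}],
    )
    return request["id"]
```

The request stays pending until a data publisher approves it, which matches step 3 above.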
Security and Governance
Throughout the process, Amazon DataZone maintains security and governance by:
- Using AWS RAM for secure resource sharing
- Implementing proper IAM roles and policies
- Requiring explicit approval for subscription requests
- Supporting AWS Lake Formation access monitoring and AWS CloudTrail for auditing
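As an illustration of the auditing point, recent Amazon DataZone API activity can be pulled from AWS CloudTrail. The sketch below assumes `cloudtrail_client` is a `boto3.client("cloudtrail")` and that DataZone management events are recorded under the event source `datazone.amazonaws.com`; both the event source value and the helper name `recent_datazone_events` are assumptions rather than details taken from the walkthrough.

```python
def recent_datazone_events(cloudtrail_client, start_time, end_time):
    """List (time, event name, user) tuples for DataZone events in a time window.

    Assumptions: `cloudtrail_client` is boto3.client("cloudtrail"), and the
    event source string below is the one CloudTrail uses for DataZone.
    """
    kwargs = {
        "LookupAttributes": [
            {"AttributeKey": "EventSource", "AttributeValue": "datazone.amazonaws.com"}
        ],
        "StartTime": start_time,
        "EndTime": end_time,
    }
    events = []
    while True:
        page = cloudtrail_client.lookup_events(**kwargs)
        for event in page.get("Events", []):
            events.append(
                (event.get("EventTime"), event.get("EventName"), event.get("Username"))
            )
        # Follow pagination until no further token is returned.
        token = page.get("NextToken")
        if not token:
            return events
        kwargs["NextToken"] = token
```

Filtering the result on event names such as subscription approvals gives a simple audit trail of who granted access and when.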
This comprehensive solution enables organizations to overcome the challenges of cross-account data sharing while maintaining robust security, governance, and discoverability.