Data Governance: definition, stakes, and how to implement it

Data governance is the invisible foundation on which all successful data projects rest. Without it, Data Lakes become swamps, KPIs contradict each other, and GDPR compliance becomes a permanent source of risk. Understanding its fundamentals is essential for every data professional.

10 min read · Data Governance · Intermediate

What you will learn

  • The exact definition of Data Governance and what it covers
  • Key roles: CDO, Data Owner, Data Steward, Data Custodian
  • The reference framework DAMA-DMBOK and its 11 domains
  • Market tools: Collibra, Alation, Atlan, Apache Atlas
  • How to start a governance initiative without blocking data teams

Foundations

What is Data Governance?

Data Governance is the set of policies, processes, roles, and standards that define how an organization collects, manages, uses, and protects its data. It is not a tool or technology: it is an organizational and operational framework.

Governance answers fundamental questions: who has the right to modify this data? What is the official definition of 'active customer' in our company? Where is our users' personal data stored? Who is accountable if a critical data point is incorrect?

The scope of governance

Data governance typically covers six domains: data quality, metadata management, master data management, data lineage (end-to-end traceability), security and privacy (GDPR, compliance), and the data catalog (inventory and documentation of data assets).

Data vs information

Governance applies to data (raw facts) but also to information (interpreted data) and metadata (data about data). A good data catalog governs all three levels.
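As a rough illustration of what governing all three levels could look like, here is a minimal sketch of catalog entries in plain Python. The `CatalogEntry` class, the `Level` enum, the `undocumented` helper, and the asset names are all hypothetical; real catalog tools have their own, much richer models.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    DATA = "data"                # raw facts, e.g. a table column
    INFORMATION = "information"  # interpreted data, e.g. a KPI
    METADATA = "metadata"        # data about data, e.g. a schema


@dataclass
class CatalogEntry:
    name: str
    level: Level
    business_definition: str
    owner: str  # the accountable Data Owner
    upstream: list[str] = field(default_factory=list)  # lineage: direct sources


def undocumented(entries: list[CatalogEntry]) -> list[str]:
    """Return the names of entries that lack a definition or an owner."""
    return [
        e.name
        for e in entries
        if not e.business_definition.strip() or not e.owner.strip()
    ]


# Hypothetical assets, one per level.
catalog = [
    CatalogEntry("crm.customers.last_order_at", Level.DATA,
                 "Timestamp of the customer's most recent order", "Head of Sales"),
    CatalogEntry("active_customer_count", Level.INFORMATION,
                 "", "Head of Sales", upstream=["crm.customers.last_order_at"]),
    CatalogEntry("crm.customers.schema", Level.METADATA,
                 "Column names and types of the customers table", ""),
]

print(undocumented(catalog))
# ['active_customer_count', 'crm.customers.schema']
```

A check like this is one simple way to spot gaps: the KPI has no agreed definition and the schema has no owner, which are exactly the questions governance is meant to answer.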

Stakes

Why governance has become critical

Three forces have made Data Governance unavoidable in modern organizations: data proliferation, regulation, and a growing reliance on data for decision-making.

Data quality as a business issue

A widely cited IBM estimate puts the cost of poor-quality data to US businesses at $3.1 trillion per year. Contradictory dashboards, mailings sent to non-existent customers, ML models trained on erroneous data: the business consequences of poor data quality are direct and quantifiable.

Regulatory compliance: GDPR and beyond

The GDPR (applicable in the European Union since May 2018) made the governance of personal data mandatory. Knowing where personal data is stored, who accesses it, how long it is retained, and how to delete it on request are legal obligations, not optional best practices. For the most serious infringements, GDPR fines can reach EUR 20 million or 4% of global annual turnover, whichever is higher.
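To make the retention obligation concrete, here is a minimal sketch of a retention check. The categories, the retention periods in `RETENTION_DAYS`, and the sample records are illustrative assumptions only; actual periods depend on your legal basis and should come from your DPO or legal team.

```python
from datetime import date, timedelta

# Hypothetical retention policy, in days, per data category.
RETENTION_DAYS = {"marketing_contact": 3 * 365, "support_ticket": 5 * 365}


def is_past_retention(category, collected_on, today):
    """Return True if a record has been kept longer than its category allows."""
    limit = timedelta(days=RETENTION_DAYS[category])
    return today - collected_on > limit


# Hypothetical records holding personal data.
records = [
    {"id": 1, "category": "marketing_contact", "collected_on": date(2019, 1, 15)},
    {"id": 2, "category": "marketing_contact", "collected_on": date(2023, 6, 1)},
    {"id": 3, "category": "support_ticket", "collected_on": date(2021, 3, 10)},
]

today = date(2024, 1, 1)
to_delete = [
    r["id"]
    for r in records
    if is_past_retention(r["category"], r["collected_on"], today)
]
print(to_delete)  # [1]
```

The check only works if every record carries a category and a collection date, which is itself a governance decision: without that metadata, retention cannot be enforced systematically.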

Generative AI amplifies the need for governance

Generative AI models trained on ungoverned data can reproduce biases, disclose confidential information, or produce inconsistent responses. Data Governance is becoming a prerequisite for deploying AI responsibly and compliantly.

Key figure

According to IDC, the global data volume will reach 175 zettabytes by 2025. Without adequate governance, this exponential growth makes management and compliance increasingly complex.

IDC DataSphere, 2020

Organization

Key roles in data governance

Data governance is not the responsibility of a single person or team. It distributes responsibilities according to a model of defined roles.

Chief Data Officer (CDO)

The CDO is the executive responsible for data strategy and governance at the organizational level. They define the vision, allocate resources, arbitrate data ownership conflicts, and champion the value of data on the executive committee. A rare role in 2015, it is now present in 65% of large companies according to Gartner.

Data Owner

The Data Owner is a business role: the business leader responsible for a specific data domain (e.g., the CFO for financial data, the CHRO for HR data). They decide who has access to the data, validate business definitions, and are accountable for the quality and correct use of data within their scope.

Data Steward

The Data Steward is an operational and cross-functional role. They implement the policies defined by the Data Owner: documenting metadata in the data catalog, monitoring data quality, resolving quality issues, and training users on best practices. This is often a part-time role, held by business or data experts.

Data Custodian (or Data Engineer)

The Data Custodian is the technical guardian of data. They manage the infrastructure (databases, data lake, pipelines), apply access controls defined by the Data Owner, and ensure the physical and logical security of data. This is typically the Data Engineer or database administrator.
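The split of responsibilities described above can be sketched as a small policy object: the Data Owner approves access, and the Data Custodian enforces whatever was approved. The `DomainPolicy` class and the role names are hypothetical; in practice this logic usually lives in database grants or an IAM system.

```python
from dataclasses import dataclass, field


@dataclass
class DomainPolicy:
    domain: str
    owner: str      # Data Owner: decides who may access the domain
    steward: str    # Data Steward: documents metadata and monitors quality
    custodian: str  # Data Custodian: enforces the policy technically
    readers: set = field(default_factory=set)  # roles approved by the owner

    def grant(self, requested_by, role):
        """Add a reader role; only the Data Owner may approve it."""
        if requested_by != self.owner:
            raise PermissionError(
                f"{requested_by} is not the owner of {self.domain}"
            )
        self.readers.add(role)

    def can_read(self, role):
        """The check a custodian would enforce at the database or lake level."""
        return role in self.readers


finance = DomainPolicy(
    "finance", owner="CFO", steward="finance_analyst", custodian="data_engineer"
)
finance.grant("CFO", "controller")

print(finance.can_read("controller"))  # True
print(finance.can_read("marketing"))   # False
```

Note that the custodian cannot grant access on their own in this sketch: calling `finance.grant("data_engineer", "marketing")` raises `PermissionError`, which mirrors the rule that access decisions belong to the business, not to the infrastructure team.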

Framework

The DAMA-DMBOK framework: the global reference

The DAMA-DMBOK (Data Management Body of Knowledge), published by DAMA International, is the most widely used reference for structuring data governance. It breaks data management into 11 knowledge domains.

The 11 DAMA-DMBOK domains

The domains covered are: Data Governance (overall steering), Data Architecture, Data Modeling and Design, Data Storage and Operations, Data Security, Data Integration and Interoperability, Document and Content Management, Reference and Master Data, Data Warehousing and Business Intelligence, Metadata Management, and Data Quality. These domains are organized around a central core: governance, which steers all the others.

Academic reference

The DAMA-DMBOK 2nd edition (2017) is the bible of data governance professionals. It forms the basis of the internationally recognized CDMP (Certified Data Management Professional) certification.

DAMA International, DAMA-DMBOK 2nd Edition, 2017

Tooling

Data governance tools

Data governance relies on two main categories of tools: data catalogs (inventory and documentation) and data quality tools (monitoring and remediation).

Data Catalogs: Collibra, Alation, Atlan, Apache Atlas

Collibra is the historical enterprise market leader, with advanced stewardship, lineage, and GDPR compliance features. Alation stands out for its collaborative approach and recommendation engine based on usage patterns. Atlan is the modern cloud-native solution, very popular in young data-driven organizations. Apache Atlas is the open-source alternative, natively integrated into the Hadoop ecosystem.

Data Quality: Great Expectations, Monte Carlo, dbt tests

Great Expectations is one of the most widely used open-source Python libraries for writing declarative data quality tests. Monte Carlo is a leading Data Observability platform that automatically detects anomalies in pipelines. dbt offers built-in generic tests (not_null, unique, accepted_values, relationships) declared alongside the SQL models they check.
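To show what these four checks actually verify, here is a plain-Python sketch of their logic. It is not how dbt or Great Expectations implement them (dbt compiles its tests to SQL); the function names simply mirror the dbt test names, and the `customers` and `orders` rows are made-up sample data.

```python
def not_null(rows, column):
    """Every row has a non-null value in `column`."""
    return all(row.get(column) is not None for row in rows)


def unique(rows, column):
    """No value of `column` appears more than once."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def accepted_values(rows, column, allowed):
    """Every value of `column` belongs to the allowed set."""
    return all(row[column] in allowed for row in rows)


def relationships(rows, column, parent_rows, parent_column):
    """Every value of `column` exists in the parent table (referential integrity)."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return all(row[column] in parent_keys for row in rows)


# Made-up sample data: order 12 points to a customer that does not exist.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"id": 10, "customer_id": 1, "status": "shipped"},
    {"id": 11, "customer_id": 2, "status": "pending"},
    {"id": 12, "customer_id": 3, "status": "shipped"},
]

results = {
    "not_null(orders.id)": not_null(orders, "id"),
    "unique(orders.id)": unique(orders, "id"),
    "accepted_values(orders.status)": accepted_values(
        orders, "status", {"pending", "shipped", "returned"}
    ),
    "relationships(orders.customer_id -> customers.id)": relationships(
        orders, "customer_id", customers, "id"
    ),
}

for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

Here the first three checks pass and the relationships check fails because of the orphan order, the kind of issue that otherwise surfaces later as a contradictory dashboard.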

Implementation

How to start a governance initiative

Data governance is not implemented in a 6-month project. It is a continuous program that starts with quick wins and expands progressively.

Key steps to get started

Start by identifying your Critical Data Elements (CDEs): the 20% of data that generates 80% of business value. Assign Data Owners to these domains, create a shared business glossary in a minimal data catalog, and implement quality tests on CDEs. The goal for the first 90 days: demonstrate concrete value, not deploy a complete framework.
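One possible way to shortlist CDEs is a simple Pareto cut: rank data elements by an estimated business value and keep the top ones until roughly 80% of the total is covered. The `select_cdes` helper and the scores below are hypothetical; in practice the scores would come from workshops with the business, not from a formula.

```python
def select_cdes(value_by_element, threshold_pct=80):
    """Pick the highest-value elements until `threshold_pct` of total value is covered."""
    total = sum(value_by_element.values())
    ranked = sorted(value_by_element.items(), key=lambda kv: kv[1], reverse=True)

    selected, covered = [], 0
    for name, value in ranked:
        # Stop as soon as the elements already selected reach the threshold.
        if covered * 100 >= threshold_pct * total:
            break
        selected.append(name)
        covered += value
    return selected


# Hypothetical business-value scores per data element.
scores = {
    "customer_id": 40,
    "order_amount": 30,
    "product_sku": 15,
    "newsletter_opt_in": 8,
    "fax_number": 4,
    "middle_name": 3,
}

print(select_cdes(scores))
# ['customer_id', 'order_amount', 'product_sku']
```

With these scores, three of six elements cover 85% of the estimated value, so they are the natural first candidates for owners, glossary entries, and quality tests.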

Common pitfall

The most common mistake is starting with the tool rather than with processes and roles. Buying Collibra without appointing Data Stewards or defining policies produces an empty data catalog that is abandoned within 6 months.


Frequently asked questions about Data Governance

What is Data Governance?

Data Governance is the set of policies, processes, roles, and standards that define how an organization collects, manages, uses, and protects its data. It answers the questions: who owns which data, who can access it, how it is defined, and how its quality is assured.

What is the difference between Data Owner and Data Steward?

The Data Owner is a business leader who holds authority over a data domain: they decide on access policies and validate definitions. The Data Steward is an operational role that implements these policies: documenting metadata, monitoring quality, and resolving issues on a daily basis.

What is the DAMA-DMBOK?

The DAMA-DMBOK (Data Management Body of Knowledge) is the global reference for data management, published by DAMA International. It structures data management into 11 knowledge domains, with governance as the central domain that steers all others.

How is Data Governance related to GDPR?

GDPR imposes specific governance obligations for personal data: knowing where it is stored, who accesses it, how long it is retained, and how to delete or transfer it on request. Data Governance is the organizational framework that enables these obligations to be met systematically.

What is a data catalog?

A data catalog is a centralized, searchable inventory of an organization's data assets: tables, columns, files, reports, ML models. It documents metadata (business definition, owner, sensitivity, lineage) and enables teams to discover and understand available data. Collibra, Alation, and Atlan are the leading tools.

What is Master Data Management (MDM)?

Master Data Management is the management of reference data shared between multiple systems: customers, products, suppliers, employees. The goal is to maintain a single, reliable version of these entities (the 'golden record') to avoid inconsistencies between systems (e.g., a customer that exists under 3 different IDs in 3 systems).
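The golden record idea can be sketched in a few lines. This example assumes, for illustration only, that a normalized email address is a reliable match key and that the most recently updated non-empty value wins; real MDM tools use far more elaborate matching and survivorship rules. The records, system names, and the `golden_records` helper are hypothetical.

```python
from collections import defaultdict

# The same customer, known under 3 different IDs in 3 systems.
records = [
    {"system": "crm", "id": "C-001", "email": "Ana.Lopez@example.com",
     "name": "Ana Lopez", "phone": None, "updated": "2024-03-01"},
    {"system": "billing", "id": "B-774", "email": "ana.lopez@example.com",
     "name": "A. Lopez", "phone": "+1-555-0100", "updated": "2023-11-12"},
    {"system": "support", "id": "S-19", "email": "ana.lopez@example.com ",
     "name": "Ana Lopez", "phone": None, "updated": "2022-06-30"},
]


def match_key(record):
    """Assumed matching rule: same email after trimming and lowercasing."""
    return record["email"].strip().lower()


def golden_records(records):
    """Group records by match key and merge each group into one golden record."""
    groups = defaultdict(list)
    for r in records:
        groups[match_key(r)].append(r)

    golden = {}
    for key, group in groups.items():
        # ISO dates sort correctly as strings; newest record first.
        newest_first = sorted(group, key=lambda r: r["updated"], reverse=True)
        merged = {
            "email": key,
            "source_ids": {r["system"]: r["id"] for r in newest_first},
        }
        # Survivorship rule: take the newest non-empty value for each attribute.
        for attr in ("name", "phone"):
            merged[attr] = next((r[attr] for r in newest_first if r[attr]), None)
        golden[key] = merged
    return golden


golden = golden_records(records)
print(len(golden))  # 1
print(golden["ana.lopez@example.com"]["phone"])  # +1-555-0100
```

The three source records collapse into one entity that keeps a cross-reference to each system's ID, takes the name from the most recent record, and recovers the phone number that only the billing system had.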

Where to start a Data Governance initiative?

Start by identifying your Critical Data Elements (CDEs), the data most critical to the business. Assign Data Owners to these domains, create a shared business glossary, and implement quality tests. Prioritize quick wins on a restricted scope rather than an 18-month global program.
