Why Data Quality Sucks
The structural reasons data quality fails — and why the economics are finally changing.
Torin DiSalvo · March 9, 2026 · 10 min read

If we agree on nothing else, we can all agree that "Data Quality" sucks. It sucks to build, it sucks to maintain, and it usually sucks at its principal objective — making sure the systems downstream don't break.
As a consultant building analytics tools for enterprise clients for the better part of a decade, I spent more than a quarter of my time addressing data quality issues that gated the delivery of valuable insights. Across the industry, billions have been spent on tooling, frameworks and dedicated teams to address data quality issues, and yet most data issues are still discovered reactively. A dashboard breaks. A report is wrong. A customer complains. The ensuing fire drill is a ritual that practitioners know all too well.
Despite near-universal acknowledgement of this, the problem is only getting worse. Enterprises are integrating more data sources, building multi-system analytics, and feeding data into models that amplify every upstream flaw. The surface area for failure is expanding faster than any team can cover.
Data Quality Is Painful to Build
Issues are Needles in the Haystack
The principal reason data quality is so painful is that the problem space is enormous. Every column, every row, every relationship between tables is a potential failure point. A dataset with 200 columns and 10 million rows presents an almost infinite number of ways things can go wrong. Even knowing where to start requires expertise that most teams don't have, and the effort to systematically assess that space is prohibitive at any meaningful scale.
This is what makes writing good rules so difficult. The rules that catch real problems are rarely obvious. "Null checks" and "value in set" constraints are table stakes. The issues that actually cause damage are subtle.
Example
Knowing that a purchase order to Shenzhen should not have a promised delivery date on Chinese New Year, or that return quantities can plausibly make a sales value negative, requires the kind of domain expertise that schema definitions don't capture.
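As a rough illustration, checks like these two can be encoded once someone with that expertise spells them out. This is a hedged sketch: the function names, the holiday set, and the column semantics are all assumptions for the example, not a real schema.

```python
import datetime

# Illustrative only: real facility calendars would come from a
# reference table maintained by the business, not a hard-coded set.
NON_OPERATING_DAYS = {
    "shenzhen": {datetime.date(2026, 2, 17)},  # e.g. Chinese New Year 2026
}

def delivery_date_is_plausible(destination, promised_date):
    """A promised delivery date should not fall on a known
    non-operating day for the receiving location."""
    return promised_date not in NON_OPERATING_DAYS.get(destination, set())

def sales_value_is_plausible(quantity, sales_value):
    """Negative sales values are legitimate when driven by returns
    (negative quantity); otherwise they are suspect."""
    return sales_value >= 0 or quantity < 0

print(delivery_date_is_plausible("shenzhen", datetime.date(2026, 2, 17)))  # False
print(sales_value_is_plausible(-2, -59.98))                                # True
```

The code is trivial; the hard part is knowing that these two conditions matter at all, which is exactly the knowledge schema definitions don't capture.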
And yet stakeholders tend to treat data quality as binary — the data is good or it's bad. And most often, it's bad. In reality, it's a continuous optimization exercise: teams are trying to maximize coverage of a sprawling attack surface with limited resources. Framing data quality as a pass/fail exercise creates a fundamental mismatch that drives chronic underinvestment and perpetual disappointment. Under these conditions, data quality teams inevitably default to shallow checks that give the illusion of effectiveness while leaving the most damaging issues completely uncovered.
Long Cycles to Implement Basic Checks
Even simple validation rules require a surprising amount of coordination.
Example
Take the rule: "the total order value should equal the sum of its line items after discounts and tax adjustments". This is an intuitive check that sounds straightforward until you consider the number of participants involved in implementing it. Someone with domain knowledge has to decide the validation matters and define the conditions under which it should be applied. A business systems analyst has to translate that into formal requirements, mapping the terms the business uses back to the data schema. A data engineer has to implement the code required to execute the check. A platform team has to deploy it into the pipeline, define alerting thresholds, and configure notifications. Four teams coordinating for one rule.
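Once everything upstream is settled, the check itself is only a few lines. Here is a hypothetical sketch of what the engineer might ultimately write; the field names (`total`, `amount`, `discount`, `tax`) and the one-cent tolerance are assumptions, not a real schema.

```python
from decimal import Decimal

def order_total_is_consistent(order, line_items, tolerance=Decimal("0.01")):
    """Check that an order's stated total equals the sum of its
    line items after discounts and tax adjustments."""
    expected = sum(
        (item["amount"] - item["discount"] + item["tax"] for item in line_items),
        Decimal("0"),
    )
    return abs(order["total"] - expected) <= tolerance

order = {"order_id": "PO-1001", "total": Decimal("108.90")}
lines = [
    {"amount": Decimal("50.00"), "discount": Decimal("5.00"), "tax": Decimal("4.50")},
    {"amount": Decimal("60.00"), "discount": Decimal("6.00"), "tax": Decimal("5.40")},
]
print(order_total_is_consistent(order, lines))  # True
```

A dozen lines of code, preceded by weeks of coordination to agree on what those lines should say.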
In smaller companies, this may seem extreme. But in large enterprises, organizational complexity, compliance requirements, and separation of duties make this level of coordination not just common but mandatory.
Now multiply that by the hundreds or thousands of checks a mature data estate actually needs. The backlog is infinite, the throughput is slow, and every rule carries an ongoing maintenance burden that compounds over time.
Thankless, Under-Resourced Work
There's a fundamental economic problem that prevents organizations from addressing this head on: success in data quality is invisible. When everything works, nothing breaks. There's no dashboard that shows the "problems that didn't happen." The ROI case is preventing a negative, and nobody gets promoted for preventing a fire that never started.
Given that reality, the economics are a tough sell in fast-moving, value-driven organizations. It's high-cost to implement, hard to measure returns, and a single data quality incident can undermine the credibility of the entire exercise. Data quality teams end up trapped in an impossible position — doing difficult, thankless work where success is invisible and failure is catastrophic. That kind of environment burns people out, drives turnover, and produces the mediocre outcomes everyone complains about. The people doing this work know it matters. They just can't prove it in a language the business responds to.
Takeaway
Data quality is painful because the problem space is massive, the cycle time to implement checks is slow, and the economics leave the work critically under-resourced in large organizations.
Traditional Data Quality Is Ineffective
Even when organizations invest heavily, the results are often underwhelming. The reasons are structural, not a matter of effort or competence.
The Business-IT Disconnect
Business users know what "good data" looks like.
Example
They can tell you that a supplier's quoted lead time shouldn't deviate by more than 30% from historical averages, that inbound shipment quantities should reconcile against advance ship notices within a tolerance band, that a bill of materials shouldn't reference components flagged as discontinued by the vendor.
They carry deep institutional knowledge about what the data means and what patterns signal a problem. But they can't express that knowledge in code.
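For instance, the lead-time expectation above, once handed to an engineer, might come out as something like the following sketch. The signature, the data shapes, and the 30% default are illustrative assumptions.

```python
from statistics import mean

def lead_time_within_tolerance(quoted_days, historical_days, max_deviation=0.30):
    """Flag quoted lead times that deviate more than 30% from the
    supplier's historical average, per the business rule above."""
    if not historical_days:
        return True  # no history to compare against; nothing to flag
    baseline = mean(historical_days)
    return abs(quoted_days - baseline) <= max_deviation * baseline

print(lead_time_within_tolerance(12, [10, 11, 9, 10]))  # 20% deviation -> True
print(lead_time_within_tolerance(14, [10, 11, 9, 10]))  # 40% deviation -> False
```

The business user could have stated this rule in one sentence; what they could not do is produce these ten lines, decide how to handle missing history, or wire the check into a pipeline.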
Engineers, on the other hand, can write the checks. They can build sophisticated validation frameworks and deploy them at scale. What they lack is the domain context to know what actually matters.
Example
Which of the 200 columns are business-critical? What constitutes an anomaly versus normal seasonal variation? What's a real problem versus an acceptable edge case? The translation layer between "inbound quantities should match advance shipment notices within tolerance" and a multi-join query that reconciles shipment records against advance ship notices across three systems is where quality goes to die.
The business expert describes the intent. A systems analyst translates it to requirements. The engineer interprets it. Somewhere in that game of telephone, the original meaning is lost. The result is rules that are either too naive (they miss the real problems) or too strict (they cry wolf until everyone stops listening).
Rules Rot
Data quality is not a set-it-and-forget-it exercise, but it almost always ends up being treated like one. Schemas evolve, business logic shifts, and new systems are onboarded. A rule that was perfectly calibrated six months ago may be flagging false positives today — or worse, silently missing real issues because the data moved out from under it.
The reasoning behind a rule (why it exists, what it's protecting against, and what business context made it necessary) lives in someone's head, not in the system. When that person moves on, changes roles, or simply forgets, the rule becomes an orphaned artifact. It still runs. It still alerts. But nobody knows whether it's still relevant, and nobody wants to be the one to turn it off. Alert fatigue sets in, notifications pile up, and teams start ignoring quality signals entirely. This makes the system worse than useless. It creates a false sense of coverage.
Fixing Is Harder Than Finding
Detection without remediation is just noise. And this is where most data quality tooling stops. It tells you something is wrong, but not what to do about it. Today, remediation happens through manual interventions and expensive "war rooms" once issues are escalated.
Example
Engineers, analysts, and business owners are pulled into ad hoc firefighting sessions to diagnose and patch problems one at a time. Most of the people in that room are waiting to be called on, while a handful try to trace the root cause through a chain of systems, burning expensive hours just to get to the starting line of an actual fix.
Finding a problem is cheap. Triage and repair are expensive. Executives don't consider data quality work complete until the fix is delivered and downstream systems are whole again, yet today that remediation cycle is almost entirely manual. Delivering the fixes that actually resolve issues is what connects quality work to the value drivers the business uses to justify it.
Takeaway
Traditional data quality often underperforms because of the disconnect between business knowledge and technical implementation, the set-it-and-forget-it nature of existing approaches, and the lack of follow-through when issues are found.
What a Better World Looks Like
For the first time, advances in large language models and agentic AI systems have fundamentally altered the underlying economics of data quality management. What makes this moment different is not incremental improvement to existing approaches, but a structural shift in how data quality artifacts are created, maintained, and enforced.
The Cost Curve Is Inverting
Today, the question every data-consuming organization faces is: "Can we afford to design, implement, and maintain enforcement for this?" Every rule has a real cost — engineering time to build, cross-team coordination to scope, ongoing effort to maintain. That cost forces triage. Teams cover the obvious risks and handle everything else reactively.
When the marginal cost of creating a rule drops to near-zero, the question inverts: "Why not check this?"
Example
If someone thinks about a quality concern once, mentions it in a meeting, or notices it in a report, it can be encoded and made part of the data quality suite with almost no additional effort or friction.
The ROI equation that has trapped data quality teams for decades finally flips. Data quality becomes justifiable not by proving the absence of failure, but by the sheer breadth of coverage relative to cost. It's the same inflection that transformed software testing. Once the cost of writing tests dropped far enough, coverage became an expectation rather than a luxury. Data quality is approaching that same threshold.
The Ownership Model Changes
When domain experts can express quality expectations in natural language and have the machine generate the implementation, ownership fundamentally shifts. The business-IT translation layer (where quality has gone to die for decades) collapses.
Rules become readable, auditable, and maintainable by the people who actually understand what the data should look like. A business analyst can define an expectation and see it become an executable check without writing a line of code. They can review the generated logic, approve it, and modify the intent later when business conditions change.
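One hedged sketch of what such a reviewable artifact could look like: the natural-language intent travels alongside the generated logic, so the analyst can audit one against the other. The structure here is an assumption for illustration, not any particular product's format.

```python
# A rule as a reviewable artifact: stated intent plus generated check.
# Field names and the dict layout are hypothetical.
rule = {
    "intent": "Completed transactions should never show negative revenue",
    "owner": "finance-analytics",
    "check": lambda row: row["status"] != "completed" or row["revenue"] >= 0,
}

rows = [
    {"status": "completed", "revenue": 120.0},
    {"status": "completed", "revenue": -35.0},  # should be flagged
    {"status": "returned", "revenue": -35.0},   # legitimate negative
]
violations = [r for r in rows if not rule["check"](r)]
print(len(violations))  # 1
```

Because the intent is stored with the check, a later reviewer can judge whether the logic still matches the business expectation, rather than reverse-engineering an orphaned query.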
Example
A finance analyst reviewing a quarterly report notices that several line items show negative revenue — numbers that should never appear for completed transactions. She flags it in the system. The platform traces the anomaly back to its source: a batch of unit prices updated this quarter came through as negative values from the upstream ERP. The system surfaces the root cause, shows the affected records, and offers a remediation path: substitute last quarter's validated unit prices for the corrupted entries so downstream reporting can proceed while the source issue is escalated to procurement. No war room. The person who spotted the problem is the same person who approves the fix.
This doesn't diminish the role of IT — it elevates it. Engineers stop spending their days translating business intent into boilerplate validation logic and triaging alert escalations. Instead, they focus on the infrastructure, pipeline reliability, and systems architecture work that actually delivers differentiated value to the business. IT moves out of escalation hell and into the work they were hired to do.
Quality Becomes a Living System
Static rulesets decay from the moment they're written. A living quality system does the opposite. It compounds in value over time.
Agentic LLM systems can analyze data and operational artifacts to generate rules proactively, scanning the whole haystack rather than hoping an engineer finds the needle. They can re-evaluate rules against changing schemas and distributions, flagging when a check is going stale before it becomes a blind spot.
Example
Consider the delivery date rule from earlier — promised delivery dates should never fall on a non-operating day for the receiving facility. That rule was written when the warehouse operated Monday through Friday. This quarter, demand surged and the facility moved to seven-day operations. Someone mentions it in a planning meeting. In the old world, technical teams are engaged to relax the constraint. In a living system, that context update is captured when it's surfaced and propagated to every rule it affects. The delivery date check is updated, the false positives stop, and the dozens of other rules that reference facility operating schedules can adapt.
Domain knowledge, the institutional understanding of what data should look like and why, finally gets encoded into the system in a durable, evolving way. Rules stop rotting because the context that created them is no longer trapped in someone's head. It's part of the system itself.
Takeaway
An AI-native approach to data quality creates a structural shift in data quality economics, allowing quality to scale in lockstep with the growth of the enterprise data estate and drive outcomes more directly tied to real business value.
The Widening Gap
Organizations that capitalize on this shift stand to gain a decisive and durable advantage.
The technology is not perfect, and the approaches to delivering on this vision are still being refined. But those things are not the bottleneck. The organizations that begin now, even imperfectly, will build an institutional asset that grows more valuable with every rule written, every schema change absorbed, and every remediation cycle completed. Each month of accumulated intelligence makes the next month's coverage broader and the system harder for competitors to replicate. Those that start encoding their domain knowledge now will see this compounding benefit earliest and fastest.
Teams that don't will stay trapped in the old economics: expensive to build, impossible to maintain, and perpetually one schema change away from a production fire. The automation and AI initiatives that sit on top of the data foundation — the very programs leadership is betting on — will never deliver on their promise if the underlying data can't be trusted. The gap between these two trajectories widens fast. And for one side of that gap, data quality continues to suck.
Takeaway
The new data quality paradigm has a compounding effect. Starting early builds durable advantage and avoids a far more painful transition later when it is no longer optional.
Still have questions?
Schedule a talk with our team of experts to discover how Dataplane can help solve your data management challenges.