Business analysts live in dashboards and ad-hoc SQL. 99 times out of 100 that’s fine but the 100th time the query hits production, joins explode, and leadership wonders why yesterday’s “active customers” is 17 percent lower than last week.
Below we dissect the most frequent modeling mistakes we see in real-world analytics stacks, show the SQL anti-patterns that cause them, and rewrite each query the way a seasoned analytics engineer would. Use this as a gut-check before you publish another “quick” report.
1. The “Mystery Join” (Implicit Cartesian Growth)
Anti-pattern
sql
CopyEdit
SELECT o.order_id,
c.customer_name,
p.product_name
FROM orders o, customers c, products p --!! implicit joins
WHERE o.order_date >= '2025-01-01';
What goes wrong
The comma-separated table list performs a Cartesian join; every row is multiplied by every other row unless you remember to add predicates. A filter on order_date does nothing to stop millions of phantom combinations.
Rewrite
sql
CopyEdit
SELECT o.order_id,
c.customer_name,
p.product_name
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
JOIN products AS p ON o.product_id = p.product_id
WHERE o.order_date >= '2025-01-01';
Ask yourself: “Have I explicitly declared every join key? Could this query ever return more rows than exist in orders?
2. “Group-By-Everything” (Unstable Aggregations)
Anti-pattern
sql
CopyEdit
SELECT region,
SUM(revenue) AS revenue
FROM sales
GROUP BY 1, 2; -- region *and* revenue? makes no sense
A forgotten column position or drag-and-drop mistake means you’re grouping on the metric itself, generating nonsense buckets and bloated result sets.
Rewrite
sql
CopyEdit
SELECT region,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY region;
Ask yourself: “Does each column in my GROUP BY represent a dimension? Could the metric accidentally sneak in?”
3. “Time-Travel Leakage” (Future Data in Historical Reports)
Anti-pattern
sql
CopyEdit
SELECT user_id,
MAX(subscription_end_date) AS latest_end_date
FROM subscriptions
WHERE created_at <= '2025-06-01'
GROUP BY user_id;
created_at filters when the row was inserted, not when the subscription ended. Back-filled rows added after June 1 will still appear, leaking future information into a past snapshot.
Rewrite
sql
CopyEdit
SELECT user_id,
MAX(subscription_end_date) AS latest_end_date
FROM subscriptions
WHERE subscription_end_date <= '2025-06-01' -- time-consistent filter
GROUP BY user_id;
Ask yourself: “Am I filtering on the business event date or merely the load timestamp?
4. “Left Join Null-Trap” (Phantom Growth in KPIs)
Anti-pattern
sql
CopyEdit
SELECT a.account_id,
b.plan_type
FROM accounts a
LEFT JOIN billing b ON a.account_id = b.account_id
WHERE b.plan_type <> 'enterprise';
The WHERE clause null-filters the very rows you meant to preserve with the LEFT JOIN, silently turning it into an inner join and under-counting free-tier users.
Rewrite
sql
CopyEdit
SELECT a.account_id,
COALESCE(b.plan_type, 'none') AS plan_type
FROM accounts a
LEFT JOIN billing b ON a.account_id = b.account_id
WHERE COALESCE(b.plan_type, 'none') <> 'enterprise';
Ask yourself: “Did my filter undo the join I purposely loosened? Should I COALESCE instead?
5. “The Untested UDF” (Hidden Logic in Functions)
Saving clever business rules in a database function looks clean—but hides logic from code review, lineage tools, and often the analyst who wrote yesterday’s dashboard.
sql
CopyEdit
CREATE OR REPLACE FUNCTION is_high_value(customer_id INT) RETURNS BOOLEAN AS $$
SELECT spend_to_date > 10000 AND last_login >= CURRENT_DATE - INTERVAL '30 days';
$$ LANGUAGE SQL;
Remedy
Inline critical logic or document and version-control UDFs in the same repository as your models. Expose the function’s assumptions in code review.
Ask yourself: “If this function changed tomorrow, would downstream models fail silently?”
Five Questions Analysts Should Ask Before Shipping Any Model
- Cardinality: Could this join increase row counts beyond the fact table?
- Temporal Integrity: Are all filters anchored to the correct business dates?
- Null Safety: What happens to metrics when a dimension is missing?
- Lineage: Where does every transformation live—SQL, Excel, BI tool, or UDF?
- Reproducibility: Can a teammate rerun the query tomorrow and get the same answer?
The Bottom Line
Bad data modeling is rarely about ignorance, it’s about hidden complexity and untested assumptions. By treating every query like production code, explicit joins, time-consistent filters, and peer review, you turn reporting chaos into maintainable, shareable knowledge.
Clean models aren't optional—they’re foundational.
Analysts fluent in modeling patterns produce insights stakeholders trust. DASCA programs are designed to formalize these fundamentals—because every great model starts with getting the basics right.
