5. How would you handle a situation where a query is returning duplicate records in SQL?

Overview

Duplicate records in a query result usually mean one of two things: the table itself contains repeated data, or the query (often through a join) produces the same row more than once. Developers and DBAs need to be able to identify and eliminate both kinds to keep data accurate and analysis reliable.

Key Concepts

DISTINCT Keyword: Removes duplicate rows from the result set.
GROUP BY Clause: Groups rows by the chosen columns; combined with HAVING, it identifies values that occur more than once.
Window Functions: Number or rank rows within a group, which is useful when you need to keep exactly one record out of each set of duplicates.

Common Interview Questions

Basic Level

What is the purpose of the DISTINCT keyword in SQL?
How can you identify duplicate rows in a table?

Intermediate Level

How would you delete duplicate rows, keeping one instance of the duplicate?

Advanced Level

Can you explain how to use window functions to handle duplicate records while maintaining data integrity?

Detailed Answers

1. What is the purpose of the DISTINCT keyword in SQL?

Answer: The DISTINCT keyword removes duplicate rows from a result set. It applies to the entire select list, not to an individual column: two rows count as duplicates only if every selected column matches.

Key Points:
- DISTINCT can add noticeable cost on large datasets, because the database must sort or hash the rows to find duplicates.
- It is written immediately after SELECT and applies to all columns in the select list.
- Selecting multiple columns with DISTINCT returns unique combinations of values, so a value can still repeat within a single column.

Example:

// Assuming we have a table 'employees' with columns 'name' and 'department'
// This query returns unique department names
string query = @"
SELECT DISTINCT department 
FROM employees;
";

2. How can you identify duplicate rows in a table?

Answer: To identify duplicate rows, group by the columns that define a duplicate and use HAVING COUNT(*) > 1 to keep only the groups that contain more than one row. Each row returned represents a value (or combination of values) that appears more than once.

Key Points:
- GROUP BY aggregates the data based on specified columns.
- HAVING clause filters the groups created by GROUP BY.
- This approach is useful for finding duplicates without deleting or altering them.

Example:

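// Assuming we have a table 'orders' with columns 'order_id' and 'customer_id'
// This query lists each 'customer_id' that appears in more than one row, with its row count.
// To find fully duplicated rows, list every column that defines a duplicate in both SELECT and GROUP BY.
string query = @"
SELECT customer_id, COUNT(*) AS occurrences
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
";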

3. How would you delete duplicate rows, keeping one instance of the duplicate?

Answer: To delete duplicate rows while keeping one instance, you can use a common table expression (CTE) together with the ROW_NUMBER() window function. ROW_NUMBER() numbers the rows within each partition starting at 1, where the partition is made up of the columns that define a duplicate. The row numbered 1 in each partition is kept, and every row with a higher number is deleted.

Key Points:
- CTEs provide a temporary named result set that can be referenced within a single SQL statement.
- ROW_NUMBER() assigns a sequential integer, starting at 1, to the rows within each partition; the ORDER BY inside OVER determines which row gets 1 and is therefore kept.
- A WHERE clause on the row number (rn > 1) selects only the extra copies for deletion.

Example:

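// Assuming we have a table 'products' with columns 'product_id' and 'product_name'
// Deleting through a CTE as shown here is SQL Server (T-SQL) syntax; PostgreSQL, MySQL,
// and SQLite do not allow it and need the DELETE to target the table itself.
// Within each product_name, the row with the lowest product_id gets rn = 1 and is kept.
string query = @"
WITH CTE AS (
   SELECT product_name,
          ROW_NUMBER() OVER (PARTITION BY product_name ORDER BY product_id) AS rn
   FROM products
)
DELETE FROM CTE WHERE rn > 1;
";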

4. Can you explain how to use window functions to handle duplicate records while maintaining data integrity?

Answer: Window functions compute a value for each row from a set of related rows (the window) without collapsing those rows the way GROUP BY does. To handle duplicates, use ROW_NUMBER(), RANK(), or DENSE_RANK() inside a CTE or subquery, partitioned by the columns that define a duplicate, and then filter on the result. ROW_NUMBER() gives every row in a partition a different number, so filtering on 1 keeps exactly one row; RANK() and DENSE_RANK() give tied rows the same value, so the same filter can keep several. A row with no duplicates is always numbered 1, so it is never filtered out.

Key Points:
- Window functions add a computed column without collapsing rows, so every original row and all of its columns stay available for inspection.
- ROW_NUMBER() is particularly useful for duplicates because it assigns a distinct sequence number to each row within a partition, even when rows tie on the ORDER BY columns.
- The ORDER BY inside OVER decides which row is kept, so choose it deliberately (for example, the most recent date) and add a tiebreaker column if ties are possible.

Example:

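// Assuming we have a table 'customer_orders' with columns 'order_id', 'customer_id', and 'order_date'
// order_id is added as a tiebreaker so the result is deterministic when a customer
// has several orders on the same date
string query = @"
WITH RankedOrders AS (
   SELECT order_id, customer_id, order_date,
          ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC, order_id DESC) AS rn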
   FROM customer_orders
)
SELECT order_id, customer_id, order_date
FROM RankedOrders
WHERE rn = 1;
";

This query selects the most recent order for each customer. Where a customer has several rows, only the one numbered 1 within that customer's partition is returned, and the underlying table is left unchanged.
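
The difference between ROW_NUMBER() and RANK() matters when rows tie on the ordering column. The sketch below illustrates it under a few assumptions: SQLite 3.25 or later, Python's sqlite3 module, and invented sample rows in which customer 1 has two orders on the same most recent date. The order_id tiebreaker in the first query is a choice made for this illustration.

```python
import sqlite3

# Hypothetical in-memory copy of 'customer_orders'; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_orders ("
    "order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO customer_orders (order_id, customer_id, order_date) "
    "VALUES (?, ?, ?)",
    [
        (1, 1, "2024-01-05"),
        (2, 1, "2024-02-10"),
        (3, 1, "2024-02-10"),  # same date as order 2: a tie
        (4, 2, "2024-01-20"),
    ],
)

# ROW_NUMBER() with a tiebreaker: exactly one row per customer.
one_per_customer = conn.execute(
    """
    WITH RankedOrders AS (
        SELECT order_id, customer_id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY order_date DESC, order_id DESC
               ) AS rn
        FROM customer_orders
    )
    SELECT order_id, customer_id FROM RankedOrders
    WHERE rn = 1
    ORDER BY customer_id, order_id
    """
).fetchall()

# RANK() without a tiebreaker: tied rows share rank 1, so both are returned.
with_ties = conn.execute(
    """
    WITH RankedOrders AS (
        SELECT order_id, customer_id,
               RANK() OVER (
                   PARTITION BY customer_id ORDER BY order_date DESC
               ) AS rk
        FROM customer_orders
    )
    SELECT order_id, customer_id FROM RankedOrders
    WHERE rk = 1
    ORDER BY customer_id, order_id
    """
).fetchall()

print(one_per_customer)
print(with_ties)
```

ROW_NUMBER() fits when exactly one row per group must survive; RANK() fits when every tied row should be kept for review.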