Delete Duplicate Records in SQL
Duplicate entries in a database waste space and lead to inaccurate results. This article shows how to delete duplicate records in SQL to ensure accurate data retrieval and optimize memory usage. Duplicate rows are those with identical data values.
To understand the whole scenario, let us start by creating a table and inserting duplicate records into it. We will first find the duplicate rows in the table and then remove them so that the table is free of duplicate records.
Step 1: Creating a table named “INFORMATION”
Query:
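The original statement is not reproduced here, so what follows is only a minimal sketch. It runs through Python's built-in sqlite3 module to stay self-contained. The columns name and city are assumptions chosen for illustration, and because SQLite has no GENERATED ALWAYS AS IDENTITY clause, the sketch uses INTEGER PRIMARY KEY AUTOINCREMENT as a stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# On engines that support identity columns, the id column could instead be
# declared as: id INT GENERATED ALWAYS AS IDENTITY
conn.execute("""
    CREATE TABLE information (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- serial number, assigned automatically
        name TEXT NOT NULL,                      -- hypothetical column
        city TEXT NOT NULL                       -- hypothetical column
    )
""")

# List the columns that were created.
columns = [row[1] for row in conn.execute("PRAGMA table_info(information)")]
print(columns)
```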
Explanation: Using the CREATE TABLE syntax, we create a table named "INFORMATION" and define its columns along with their datatypes and constraints. GENERATED ALWAYS AS IDENTITY makes the database assign an automatically incrementing serial number to that column.
Step 2: Inserting values or data in the INFORMATION table
Query:
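As a hedged illustration of this step, the sketch below fills the table with made-up rows, using Python's sqlite3 module as the host. Only the names Vanya, Kunal and Mohit and their duplicate counts come from this article; the city column, the city values and the unique entry Asha are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE information ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT)"
)

# Vanya appears three times, Kunal and Mohit twice each, Asha once.
# The id column is omitted so that the database assigns it.
conn.execute("""
    INSERT INTO information (name, city) VALUES
        ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
        ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai')
""")

rows = conn.execute("SELECT id, name, city FROM information ORDER BY id").fetchall()
for row in rows:
    print(row)
```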
The INFORMATION table now contains both duplicate and unique records.
[IMAGE 1]
How to Delete Duplicate Rows in SQL Using the GROUP BY and HAVING Clauses
This approach uses the SQL GROUP BY clause to identify duplicate rows. Grouping the data on the columns that define a duplicate and applying the COUNT function gives the number of occurrences of each combination, and a HAVING clause keeps only the groups that occur more than once. For instance, executing the query below retrieves records from the INFORMATION table with occurrences greater than 1.
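A minimal sketch of such a query, run through Python's sqlite3 module against illustrative data. The name and city columns and the sample rows are assumptions, not the article's original table.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Group on the columns that define a duplicate and keep groups seen more than once.
duplicates = conn.execute("""
    SELECT name, city, COUNT(*) AS occurrences
    FROM information
    GROUP BY name, city
    HAVING COUNT(*) > 1
    ORDER BY name
""").fetchall()
print(duplicates)
```

The id column is deliberately left out of the GROUP BY list: it is unique for every row, so grouping on it would hide the duplicates.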
Output:
[IMAGE 2]
The output above shows that the table contains duplicate records, which we need to remove. Vanya has 3 duplicate records, while Kunal and Mohit have 2 each. We should keep one record for each of them and delete the rest.
We use the SQL MIN function to find the minimum id within each group of duplicates, keep that row, and delete the other rows in the group.
First, we run a command that shows the minimum id for each group of entries.
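One way this command could look is sketched below, again hosted in Python's sqlite3 module. The table layout and the rows are illustrative assumptions.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# One row per distinct (name, city) pair, carrying the smallest id of that group.
keepers = conn.execute("""
    SELECT MIN(id) AS keep_id, name, city
    FROM information
    GROUP BY name, city
    ORDER BY keep_id
""").fetchall()
for row in keepers:
    print(row)
```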
[IMAGE 3]
We have to keep only the records above, so we run a DELETE query on every id that is not present in the table above.
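A hedged sketch of that DELETE, using Python's sqlite3 module and made-up rows (the name and city columns are assumptions). The subquery returns the smallest id of every group, and every row outside that set is removed.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Keep the row with the minimum id in each group; delete all other rows.
conn.execute("""
    DELETE FROM information
    WHERE id NOT IN (
        SELECT MIN(id) FROM information GROUP BY name, city
    )
""")

remaining = conn.execute("SELECT id, name, city FROM information ORDER BY id").fetchall()
print(remaining)
```

On a real table it is safer to run the inner SELECT on its own first and to wrap the DELETE in a transaction, so the change can be rolled back if the wrong rows are selected.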
Output: [IMAGE 4]
Hence, the duplicate rows are deleted from the table.
How to Delete Duplicate Rows in SQL Using Common Table Expressions (CTE)
We can use a CTE to delete duplicate records in SQL by applying the window function row_number() and partitioning the data on the columns that define a duplicate.
See the code below:
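The sketch below shows one possible form of such a CTE, executed with Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table contents, the CTE name numbered and the alias row_num are assumptions made for illustration.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Number the rows inside each (name, city) partition, lowest id first.
numbered = conn.execute("""
    WITH numbered AS (
        SELECT id, name,
               ROW_NUMBER() OVER (PARTITION BY name, city ORDER BY id) AS row_num
        FROM information
    )
    SELECT id, name, row_num FROM numbered ORDER BY id
""").fetchall()
for row in numbered:
    print(row)
```

The first row of every partition gets row_num 1, so any row with a larger number is a repeat of an earlier one.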
The output will be: [IMAGE 5]
Now we can delete every row whose row number within its group of duplicates is greater than 1.
Code:
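A minimal sketch of the deletion, assuming illustrative data and the hypothetical names numbered and row_num. It is written in a form SQLite accepts, where the DELETE filters on the ids selected from the CTE; some engines, such as SQL Server, also allow deleting through the CTE name directly.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Delete every row that is not the first one in its (name, city) partition.
conn.execute("""
    WITH numbered AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, city ORDER BY id) AS row_num
        FROM information
    )
    DELETE FROM information
    WHERE id IN (SELECT id FROM numbered WHERE row_num > 1)
""")

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM information ORDER BY id")]
print(remaining_ids)
```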
Output:
[IMAGE 5]
Table after deletion:
[IMAGE 6]
Using the RANK Function to Delete Duplicate Rows in SQL
Another method to delete duplicate records in SQL uses the RANK() function together with the PARTITION BY clause and an INNER JOIN.
Code:
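As a hedged example of this combination, the sketch below joins the table to a subquery that ranks the rows of each partition. It runs through Python's sqlite3 module (SQLite 3.25 or newer), and the data, the aliases i and r, and the column name rnk are assumptions.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Rank rows within each (name, city) partition and join the rank back by id.
ranked = conn.execute("""
    SELECT i.id, i.name, r.rnk
    FROM information AS i
    INNER JOIN (
        SELECT id, RANK() OVER (PARTITION BY name, city ORDER BY id) AS rnk
        FROM information
    ) AS r ON i.id = r.id
    ORDER BY i.id
""").fetchall()
for row in ranked:
    print(row)
```

Because id is unique, ordering by it produces no ties, so RANK() numbers the rows of a partition 1, 2, 3 just as ROW_NUMBER() would.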
Output: [IMAGE 6]
We have to remove the rows whose rank is greater than 1.
Code:
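A minimal sketch of the deletion, under the same illustrative assumptions (made-up rows, hypothetical aliases i, r and rnk). SQLite has no DELETE with JOIN form, so the join sits inside a subquery here; engines that support joining in a DELETE statement, such as SQL Server, can express it directly.

```python
import sqlite3

# Hypothetical table and rows, used only for illustration.
SETUP = """
CREATE TABLE information (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, city TEXT);
INSERT INTO information (name, city) VALUES
    ('Vanya', 'Delhi'), ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'),
    ('Kunal', 'Pune'), ('Vanya', 'Delhi'), ('Mohit', 'Jaipur'), ('Asha', 'Mumbai');
"""
conn = sqlite3.connect(":memory:")
conn.executescript(SETUP)

# Delete the rows whose rank inside their (name, city) partition exceeds 1.
conn.execute("""
    DELETE FROM information
    WHERE id IN (
        SELECT i.id
        FROM information AS i
        INNER JOIN (
            SELECT id, RANK() OVER (PARTITION BY name, city ORDER BY id) AS rnk
            FROM information
        ) AS r ON i.id = r.id
        WHERE r.rnk > 1
    )
""")

remaining_ids = [r[0] for r in conn.execute("SELECT id FROM information ORDER BY id")]
print(remaining_ids)
```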
Output: [IMAGE 7]
Using an SSIS Package to Delete Duplicate Rows in SQL
To use an SSIS (SQL Server Integration Services) package to delete duplicate records in SQL, you can follow these general steps:
- Create an SSIS Package: Launch SQL Server Data Tools (SSDT), create a new Integration Services project, and then create a new SSIS package within the project.
- Add Data Flow Task: Place a Data Flow Task onto the SSIS package's Control Flow design surface by dragging and dropping it.
- Configure Data Flow Task: Double-click the Data Flow Task to enter the Data Flow design surface. Inside the Data Flow, add a source component to fetch data from the SQL Server table containing duplicate rows. Configure this source component accordingly.
- Add Sort Transformation: Add a Sort Transformation after the source component. Configure the Sort Transformation to sort the data based on the columns you want to use to identify duplicates.
- Add Row Count Transformation: Insert a Row Count Transformation after the Sort Transformation. This transformation will count the number of rows processed.
- Add Conditional Split Transformation: Add a Conditional Split Transformation after the Row Count Transformation. Use this transformation to filter out duplicate rows based on your criteria.
- Add Destination Component: Add a destination component (e.g., OLE DB Destination) to write the non-duplicate rows back to the SQL Server table.
- Execute the Package: Execute the SSIS package to delete the duplicate rows from the SQL Server table.
Conclusion
- Eliminating duplicate rows from a database table is essential for maintaining accurate data and optimizing memory usage.
- SQL offers several ways to identify and delete duplicate records, including the GROUP BY and HAVING clauses, Common Table Expressions (CTE), the RANK function, and SSIS packages.
- The GROUP BY and HAVING clauses are effective for identifying duplicate rows based on specified columns and their occurrence counts.
- Common Table Expressions (CTE) with the row_number() function partition data to identify duplicates and facilitate their deletion.
- The RANK function, combined with the PARTITION BY clause and INNER JOIN, offers another approach to identify and delete duplicate records in SQL.
- Using SSIS packages in SQL Server Integration Services provides a comprehensive solution for deleting duplicate rows by incorporating various data flow tasks and transformations.