Category : Database | Date : 12/18/2017 1:47:00 PM | ID : BLOG632150

How to Remove Duplicates from a Table in SQL Server



Duplicates of data in an Excel file you receive is an everyday problem. It is okay if we have 10 records of data in a file we receive and only 2 of them are duplicates. It will be a small manual work to remove those two duplicate records in the file and then ETL (Extract, Transform, and Load) the data into a SQL Server database using SSIS. Click Here  for a step-by-step process, if you don't know how to ETL data using SSIS. For understanding and working on this task better, let us create a table named Persons and some sample data in it with duplicates.
CREATE TABLE Persons(
    Name varchar(50) NOT NULL,
    City varchar(30) NOT NULL,
    [State] char(2) NOT NULL
)
 
INSERT INTO Persons(Name, City, [State]) VALUES('John', 'Dallas','TX')
INSERT INTO Persons(Name, City, [State]) VALUES('Mark', 'Seattle','WA')
INSERT INTO Persons(Name, City, [State]) VALUES('Nick', 'Phoenix','AZ')
INSERT INTO Persons(Name, City, [State]) VALUES('Laila', 'San Jose','CA')
INSERT INTO Persons(Name, City, [State]) VALUES('Samantha', 'Tulsa','OK')
INSERT INTO Persons(Name, City, [State]) VALUES('Bella', 'San Antonio','TX')
INSERT INTO Persons(Name, City, [State]) VALUES('John', 'Dallas','TX')
INSERT INTO Persons(Name, City, [State]) VALUES('John', 'Dallas','TX')
INSERT INTO Persons(Name, City, [State]) VALUES('Mark', 'Seattle','WA')
INSERT INTO Persons(Name, City, [State]) VALUES('Nick', 'Tempe','FL')
INSERT INTO Persons(Name, City, [State]) VALUES('John', 'Dallas','TX')
 
SELECT * FROM Persons 

If you look at the data, there are considerable duplicates in it. John and Mark's names got repeated many times in the data. Now, let us write a query that uses the ROW NUMBER function and gives any duplicate values the rank of 2, 3 and so on. Click Here  to read more about ranking functions.
SELECT Name
    , City
    , [State]
    , ROW_NUMBER() OVER(PARTITION BY Name, City, [State] ORDER BY [Name]) AS Rnum
FROM Persons
Remember that an ORDER BY clause is mandatory in a Ranking function. Result set after the usage of ranking function will look as below with all ranks defined 1 for unique values and all duplicates with values greater than 1.
 

Now let us write a query which will delete all duplicate data in one shot. We will use a CTE (Common Table Expression) for this purpose. We will read in future posts what a CTE is and why it is used. On a lighter note, CTE's can be imagined as equivalent to temporary result sets that can be used only in an underlying SELECT, INSERT, UPDATE, DELETE or CREATE VIEW statement.
;WITH CTE AS
(
SELECT Name
    , City
    , [State]
    , ROW_NUMBER() OVER(PARTITION BY Name, City, [State] ORDER BY [Name]) AS Rnum
FROM Persons
)
DELETE FROM CTE WHERE Rnum <> 1
In the code by saying WHERE Rnum <> 1, we are asking SQL Server to keep all the records with Rank 1, which are not duplicates, and delete any other record. After executing this query in SQL Server Management Studio, you will end up with no duplicates in your table. To confirm that just run a simple query against your table.
SELECT * FROM Persons