Removing duplicates in T-SQL


The need to deduplicate data is common, especially when addressing data quality issues in environments where duplication has arisen due to a lack of constraints to ensure data uniqueness. To demonstrate, use the following code to prepare an example of data with duplicate orders in a table named MyOrders:

IF OBJECT_ID(N'Sales.MyOrders', N'U') IS NOT NULL DROP TABLE Sales.MyOrders;
GO
SELECT * INTO Sales.MyOrders FROM Sales.Orders
UNION ALL
SELECT * FROM Sales.Orders
UNION ALL
SELECT * FROM Sales.Orders;

Imagine that you need to eliminate the duplicates, keeping only one instance of each unique orderid value. You can mark the duplicates with the ROW_NUMBER function, partitioning by the supposedly unique column (orderid in our case) and using arbitrary ordering if you don't care which row to keep and which to remove. Here is code in which ROW_NUMBER marks the duplicates:

SELECT orderid,
  ROW_NUMBER() OVER(PARTITION BY orderid ORDER BY (SELECT NULL)) AS n
FROM Sales.MyOrders;

Then you need to consider different strategies depending on how many rows must be deleted, what proportion of the table they represent, how busy the production environment is, and other circumstances. When the number of rows to delete is small, it is usually sufficient to use a fully logged DELETE that removes all instances with a row number greater than one.
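That fully logged DELETE can be issued directly against a CTE based on the ROW_NUMBER query shown above; a minimal sketch, using the same table and column names:

```sql
WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY orderid ORDER BY (SELECT NULL)) AS n
  FROM Sales.MyOrders
)
DELETE FROM C
WHERE n > 1;
```

T-SQL allows deleting through a CTE; the rows the DELETE removes are the underlying table rows that the CTE numbered 2 and higher within each orderid group.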

But if the number of rows to delete is large, especially when it constitutes a large proportion of the table's rows, a fully logged delete will be too slow. In this case, you might want to consider a minimally logged operation such as SELECT INTO to copy the unique rows (those numbered 1) into another table. After that, you drop the original table, rename the new table to the original table's name, and re-create the constraints, indexes, and triggers. Here is the code for the complete solution:

WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY orderid ORDER BY (SELECT NULL)) AS n
  FROM Sales.MyOrders
)
SELECT orderid, custid, empid, orderdate, requireddate, shippeddate,
  shipperid, freight, shipname, shipaddress, shipcity, shipregion,
  shippostalcode, shipcountry
INTO Sales.OrdersTmp
FROM C
WHERE n = 1;

DROP TABLE Sales.MyOrders;
EXEC sp_rename 'Sales.OrdersTmp', 'MyOrders';
-- re-create indexes, constraints, and triggers

For simplicity, I have not added any transaction control here, but remember that multiple users may be working with the data at the same time. When implementing this method in a production environment, follow this sequence:

    Open transaction.

    Obtain a table lock.

    Execute the SELECT INTO statement.

    Delete and rename objects.

    Recreate indexes, constraints and triggers.

    Commit the transaction.
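The sequence above can be sketched as follows. This is only an outline, not a definitive implementation: the lock hint and the exact index/constraint re-creation steps depend on your environment, and the table is assumed to be the same Sales.MyOrders used earlier.

```sql
-- 1. open a transaction
BEGIN TRAN;

-- 2. obtain an exclusive table lock, held until the end of the transaction
SELECT TOP (0) * FROM Sales.MyOrders WITH (TABLOCKX, HOLDLOCK);

-- 3. copy the unique rows (numbered 1) into a new table
WITH C AS
(
  SELECT *,
    ROW_NUMBER() OVER(PARTITION BY orderid ORDER BY (SELECT NULL)) AS n
  FROM Sales.MyOrders
)
SELECT orderid, custid, empid, orderdate, requireddate, shippeddate,
  shipperid, freight, shipname, shipaddress, shipcity, shipregion,
  shippostalcode, shipcountry
INTO Sales.OrdersTmp
FROM C
WHERE n = 1;

-- 4. drop the original table and rename the new one
DROP TABLE Sales.MyOrders;
EXEC sp_rename 'Sales.OrdersTmp', 'MyOrders';

-- 5. re-create indexes, constraints, and triggers here

-- 6. commit
COMMIT TRAN;
```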

There is another option: filter only the unique, or only the non-unique, rows. Compute both ROW_NUMBER and RANK based on orderid, like this:

SELECT orderid,
  ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
  RANK() OVER(ORDER BY orderid) AS rnk
FROM Sales.MyOrders;

Notice that in the results, the row number and the rank match for only one row of each unique orderid value. So, for example, if you need to remove only a small portion of the data, you can encapsulate the previous query in a CTE definition and, in the outer query, delete the rows whose row number and rank differ.
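A sketch of such a delete, wrapping the query just shown in a CTE (the rows where rownum and rnk differ are the extra copies):

```sql
WITH C AS
(
  SELECT orderid,
    ROW_NUMBER() OVER(ORDER BY orderid) AS rownum,
    RANK() OVER(ORDER BY orderid) AS rnk
  FROM Sales.MyOrders
)
DELETE FROM C
WHERE rownum <> rnk;
```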

(25-07-2009)

In the previous article, we looked at solving the duplicate problem caused by a missing primary key. Let us now consider a more difficult case, when a key seems to be there, but it is synthetic; if designed incorrectly, it can also allow duplicates from the domain's point of view.

It's strange, but however much I talk in lectures about the shortcomings of synthetic keys, I constantly find that students invariably use them in their first database projects. Apparently, people have a genetic need to number everything, and only a psychotherapist can help here. :-)

So, let's say we have a table with a primary key id and a column name that, according to the domain rules, must contain unique values. However, if you define the table structure as follows,

CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY, name VARCHAR(50));

then nothing prevents the appearance of duplicates. The following table structure should be used:

CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY, name VARCHAR(50) UNIQUE);

Everyone knows the right thing to do, but you often have to deal with a "legacy" structure and data that violate domain constraints. Here's an example:

id  name
1   John
2   Smith
3   John
4   Smith
5   Smith
6   Tom

You may ask: "How is this problem different from the previous one? There is an even simpler solution here: from each group of rows with the same value in the name column, simply remove all rows except the one with the minimum (or maximum) id value. For example, like this:"

DELETE FROM T_pk WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);

That's right, but I haven't told you everything yet. :-) Imagine that we have a child table T_details linked to T_pk by a foreign key:

CREATE TABLE T_details (
  id_pk INT FOREIGN KEY REFERENCES T_pk ON DELETE CASCADE,
  color VARCHAR(10),
  PRIMARY KEY (id_pk, color)
);

This table may contain the following data:

id_pk  color
1      blue
1      red
2      green
2      red
3      red
4      blue
6      red

For greater clarity, let's use the query

SELECT id, name, color FROM T_pk JOIN T_details ON id = id_pk;

to see names:

id  name   color
1   John   blue
1   John   red
2   Smith  green
2   Smith  red
3   John   red
4   Smith  blue
6   Tom    red

Thus, it turns out that data actually relating to one person was mistakenly attached to different parent records. In addition, this produced duplicates:

1  John  red
3  John  red

Obviously, such data will lead to erroneous analyses and reports. Moreover, cascading deletion will result in data loss. For example, if we keep only the rows with the minimum id in each group of the T_pk table, we will lose the row

4 Smith blue

in the T_details table. Therefore, we must take both tables into account when eliminating duplicates.

The data “cleaning” procedure can be carried out in two stages:

  1. Update the T_details table by assigning data related to one name to the id with the minimum number in the group.
  2. Remove duplicates from the T_pk table, leaving only rows with the minimum id in each group with the same value in the name column.

Updating the T_details table

SELECT id_pk, name, color,
  RANK() OVER (PARTITION BY name, color ORDER BY name, color, id_pk) dup,
  (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
FROM T_pk X JOIN T_details ON id = id_pk;

detects the presence of duplicates (dup value > 1) and the minimum id value in each group of identical names (min_id). Here is the result of running this query:

id_pk  name   color  dup  min_id
1      John   blue   1    1
1      John   red    1    1
3      John   red    2    1
4      Smith  blue   1    2
2      Smith  green  1    2
2      Smith  red    1    2
6      Tom    red    1    6

Now we need to replace the id_pk value with the min_id value for all rows except the third one, because that row is a duplicate of the second, as indicated by dup = 2. The update query can be written like this:

UPDATE T_d
SET id_pk = min_id
FROM T_details T_d
JOIN (
  SELECT id_pk, name, color,
    RANK() OVER (PARTITION BY name, color ORDER BY name, color, id_pk) dup,
    (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
  FROM T_pk X JOIN T_details ON id = id_pk
) Y ON Y.id_pk = T_d.id_pk AND Y.color = T_d.color
WHERE dup = 1;
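Stage 2 then removes the duplicates from T_pk itself. A sketch, reusing the DELETE shown earlier: because of the ON DELETE CASCADE foreign key, deleting the redundant parent rows also removes the leftover duplicate detail rows (such as 3, John, red), which were deliberately not updated in stage 1.

```sql
DELETE FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);
```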

When the task of optimizing a database arises, or its structure changes, a related task sometimes appears: cleaning up the data that has already accumulated. It is good if the tables were brought to normal form during development and the entire system is organized so that it does not accumulate redundant duplicate information. If this is not the case, then when reworking such a system you will want to get rid of all the redundant data and do it as cleanly as possible.

In this article we consider the task of removing duplicate rows from a database table. Note right away that we are talking about truly redundant rows. For example, records in an orders table with the fields "order code", "product code", "customer code", and "order date" may legitimately differ only in the order code, since one customer can order the same product several times on the same day. The main indicator that everything is correct here is the presence of a key field.

If we see a table full of duplicate rows, with no clear need for each entry, then this is exactly what needs to be fixed.

An example of a clearly redundant table:

Now let's look at how we can solve this problem. Several methods can be applied here.


1. You can write a function that iterates over all the data and compares it. This takes a long time, and you do not always want to write code for one-time use.


2. Another solution is to create a select query that groups the data so that only unique rows are returned:

SELECT country_id, city_name
FROM mytable
GROUP BY country_id, city_name

We get the following result set:

Then we write the resulting data set into another table.
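Writing the grouped result into another table can be sketched as follows; the target table name mytable_unique is an assumption for illustration, and it must be created beforehand with the same columns:

```sql
INSERT INTO mytable_unique (country_id, city_name)
SELECT country_id, city_name
FROM mytable
GROUP BY country_id, city_name;
```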


3. The solutions above involve either extra program code or extra tables. It would be more convenient to do everything with SQL queries alone, without auxiliary tables. Here is an example of such a solution:

DELETE a.* FROM mytable a,
(SELECT
  b.country_id, b.city_name, MIN(b.id) mid
FROM mytable b
GROUP BY b.country_id, b.city_name
) c
WHERE
  a.country_id = c.country_id
  AND a.city_name = c.city_name
  AND a.id > c.mid

After executing such a query, only unique records will remain in the table:

Now let's take a closer look at how this works. In a DELETE query you must specify a condition indicating which data to delete and which to keep. We need to remove all non-unique records. That is, if there are several identical records (identical meaning they have equal country_id and city_name values), we take one of the rows, remember its code, and delete all records with the same country_id and city_name values but a different code (id).

SQL query string:

DELETE a.* FROM mytable a,

indicates that the deletion will be performed from the mytable table.

The SELECT subquery then generates an auxiliary table in which the records are grouped so that each one is unique:

(SELECT
b.country_id, b.city_name, MIN(b.id) mid
FROM mytable b
GROUP BY b.country_id, b.city_name
) c

MIN(b.id) mid forms the column mid (short for "min id"), which contains the minimum id value within each subgroup.

The result is a table containing unique records and the first row id for each group of duplicate records.

Now we have two tables. One general one containing all records. Extra lines will be removed from it. The second contains information about the rows that need to be saved.

All that remains is the condition, which says: delete all rows where the country_id and city_name fields match but the id does not. In this case the minimum id value is kept, so all records whose id is greater than the one selected into the derived table are deleted.


It is also worth noting that the described operation requires a key field in the table. If you encounter a table without a unique identifier, simply add one:

ALTER TABLE `mytable` ADD `id` INT(11) NOT NULL AUTO_INCREMENT, ADD PRIMARY KEY (`id`);

After executing this query, we get an additional column filled with unique numeric values for each row of the table.

We then carry out all the necessary steps. Once the operation of clearing the table of duplicate records is complete, this field can be dropped again.
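Dropping the helper column afterwards is a one-liner, a sketch for the same mytable example:

```sql
ALTER TABLE `mytable` DROP COLUMN `id`;
```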