This question came up recently and there was some discussion about it so I thought I’d summarise here…
Problem
How best to delete a large amount of data from a table. The database in question was Sybase so we’ll talk about that. The concepts are the same for all major databases though the syntax may differ.
Solution
As came out in the conversation there are several things you need to think about.
- How many rows are there in the table?
- How much data needs to be deleted?
- What indexes are there on the table?
- Is this a production database?
- Is this a one off task?
Lets look at the effect each of these questions has on how you approach this task.
How many rows are there in the table?
This is the first question you need to answer.
SELECT count(*) from MyTable
< 5,000 - Probably not going to be any problems here.
< 5,000,000 - Pick your query carefully and take some time.
5,000,000 + – In the realms of large tables, this needs planning carefully.
How much data needs to be deleted?
This is the second question you need to answer.
All – Think about using Truncate, this is a very effient way of deleting all the rows in the table since it simply moves the “next row marker”.
Truncate has several side effects though:-
- It’s not logged, can cause problems with replication
- No space is reclaimed because no rows have actually been deleted
You could also consider dropping and re-creating the table.
Lots – (30%+) If you’re planning to delete the majority of the table it’s worth considering moving the rows you need to a temporary table, clearing the old table (using truncate or dropping it) and moving the rows you need back again.
Some – (<30%) This is where you should be thinking about deleting the rows you don’t want (carefully)
Those percentages are just a guide, please use your judgement when deciding what to do. You also need to combine how much data you’re deleting with the size of the table. Deleting most of the rows from a 10 million row table is not a small undertaking
What indexes are there on the table?
Having indexes on the table during a large delete operation will slow the delete enormously. Typically it is quicker to drop all the indexes, delete the data and then recreate the indexes.
Once the delete is complete it is vital that the statistics for the table and index(es) involved are updated. If the query optimiser has incorrect information it will not pick the best plan. It’s entirely possible a catastrophically bad plan could be chosen.
Is this a production database?
If the delete needs to be carried out on a production database it should be carried out by a DBA and out of hours. It’s probably worth having a DBA on hand even if it’s a development database just in case something goes wrong
Is this a one off task?
If this is a one off task then the simplest approach is best.
- Fire up your trusty SQL Execution tool
- Execute the following SQL until no rows are deleted
SET ROWCOUNT 100 — limits the number of rows affected
DELETE FROM MyTable WHERE [Clause]
SET ROWCOUNT 0 — removes the previous limit
Clearly this isn’t going to work well if you need to delete thousands of rows (you’ll be running it all day)
If this is a task that needs to be carried out frequently it should be wrapped up in a stored procedure
A better plan…
Of course you can avoid the vast majority of all these problems with the simple (but often overlooked) step of developing and implementing an archiving strategy during the development phase.






2 Responses
Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.
I am not a database expert, but I don’t understand why this should be a big deal. Is 5,000,000 really such a big number?
<p>
What are the risks? A performance hit? Irreversible damage to the data?
There are quite a lot of potential issues, the exact numbers are just for example really. It depends alot on the precise set up of your DB and the data it contains.
The problem you’re most likely to hit is filling the transaction log (since each deleted row needs to be logged and the log space is limited) this could stop everybody else on the DB running anything.
If you work around this (by chunking your deletes) then the statistics the query planner uses to determine the best way to get your data will be wrong. This might mean it forces a table scan where an index lookup would be better. Bad queries can cause havoc to a big production DB.
Irreversible damage to data is impossible (in theory) because all DBs conform to the ACID requirements. Obviously nothing is perfect but the risk is incredibly low