March 2005
M T W T F S S
« Feb   Apr »
 123456
78910111213
14151617181920
21222324252627
28293031  

Recent Posts

Recent Comments

Ads


« Weird London Weather | Main | Albums you should listen to before you die »

Deleting lots of data…

By dave | March 13, 2005

This question came up recently and there was some discussion about it so I thought I’d summarise here…

Problem
How best to delete a large amount of data from a table. The database in question was Sybase so we’ll talk about that. The concepts are the same for all major databases though the syntax may differ.

Solution
As came out in the conversation there are several things you need to think about.

Lets look at the effect each of these questions has on how you approach this task.

How many rows are there in the table?
This is the first question you need to answer.
SELECT count(*) from MyTable
< 5,000 - Probably not going to be any problems here.

< 5,000,000 - Pick your query carefully and take some time.

5,000,000 + - In the realms of large tables, this needs planning carefully.

How much data needs to be deleted?
This is the second question you need to answer.
All - Think about using Truncate, this is a very effient way of deleting all the rows in the table since it simply moves the “next row marker”.
Truncate has several side effects though:-

  1. It’s not logged, can cause problems with replication
  2. No space is reclaimed because no rows have actually been deleted

You could also consider dropping and re-creating the table.
Lots - (30%+) If you’re planning to delete the majority of the table it’s worth considering moving the rows you need to a temporary table, clearing the old table (using truncate or dropping it) and moving the rows you need back again.
Some - (<30%) This is where you should be thinking about deleting the rows you don’t want (carefully)
Those percentages are just a guide, please use your judgement when deciding what to do. You also need to combine how much data you’re deleting with the size of the table. Deleting most of the rows from a 10 million row table is not a small undertaking

What indexes are there on the table?
Having indexes on the table during a large delete operation will slow the delete enormously. Typically it is quicker to drop all the indexes, delete the data and then recreate the indexes.
Once the delete is complete it is vital that the statistics for the table and index(es) involved are updated. If the query optimiser has incorrect information it will not pick the best plan. It’s entirely possible a catastrophically bad plan could be chosen.

Is this a production database?
If the delete needs to be carried out on a production database it should be carried out by a DBA and out of hours. It’s probably worth having a DBA on hand even if it’s a development database just in case something goes wrong

Is this a one off task?
If this is a one off task then the simplest approach is best.

  1. Fire up your trusty SQL Execution tool
  2. Execute the following SQL until no rows are deleted

SET ROWCOUNT 100 — limits the number of rows affected
DELETE FROM MyTable WHERE [Clause]
SET ROWCOUNT 0 — removes the previous limit
Clearly this isn’t going to work well if you need to delete thousands of rows (you’ll be running it all day)
If this is a task that needs to be carried out frequently it should be wrapped up in a stored procedure

A better plan…
Of course you can avoid the vast majority of all these problems with the simple (but often overlooked) step of developing and implementing an archiving strategy during the development phase.

Tags: , ,
Topics: /Databases, /Technology |

2 Responses to “Deleting lots of data…”

  1. John H Says:
    March 17th, 2005 at 9:38 am

    I am not a database expert, but I don’t understand why this should be a big deal. Is 5,000,000 really such a big number?
    <p>
    What are the risks? A performance hit? Irreversible damage to the data?

  2. David Hayes Says:
    March 18th, 2005 at 2:11 am

    There are quite a lot of potential issues, the exact numbers are just for example really. It depends alot on the precise set up of your DB and the data it contains.
    The problem you’re most likely to hit is filling the transaction log (since each deleted row needs to be logged and the log space is limited) this could stop everybody else on the DB running anything.
    If you work around this (by chunking your deletes) then the statistics the query planner uses to determine the best way to get your data will be wrong. This might mean it forces a table scan where an index lookup would be better. Bad queries can cause havoc to a big production DB.
    Irreversible damage to data is impossible (in theory) because all DBs conform to the ACID requirements. Obviously nothing is perfect but the risk is incredibly low

Comments