Saturday, February 19, 2011

How do you add a NOT NULL Column to a large table in SQL Server?

To add a NOT NULL column to a table with many records, a DEFAULT constraint needs to be applied at the same time. This constraint causes the entire ALTER TABLE command to take a long time to run if the table is very large (a sketch of that statement appears after the assumptions below). This is because:

Assumptions:

  1. The DEFAULT constraint modifies existing records. This means that the database needs to increase the size of each record, which forces it to shift records on full data pages to other data pages, and that takes time.
  2. The DEFAULT update executes as an atomic transaction. This means that the transaction log will need to grow so that a rollback can be executed if necessary.
  3. The transaction log keeps track of the entire record. Therefore, even though only a single field is modified, the space needed by the log will be based on the size of the entire record multiplied by the # of existing records. This means that adding a column to a table with small records will be faster than adding a column to a table with large records, even if the total # of records is the same for both tables.
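
For reference, the single statement being described is something like the following (MyTable, MyNewColumn, and the constraint name are placeholders, not part of the original question):

    -- One-shot approach: the column, the default, and NOT NULL are added in a
    -- single ALTER TABLE, so SQL Server has to write the default value into
    -- every existing row before the statement can complete.
    ALTER TABLE MyTable
        ADD MyNewColumn INT NOT NULL
        CONSTRAINT DF_MyTable_MyNewColumn DEFAULT (0);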

Possible solutions:

  1. Suck it up and wait for the process to complete. Just make sure to set the timeout period to be very long. The problem with this is that it may take hours or days depending on the # of records.
  2. Add the column but allow NULL. Afterward, run an UPDATE query to set the DEFAULT value for existing rows. Do not update every row in a single statement; update batches of records at a time or you'll end up with the same problem as solution #1. The problem with this approach is that you end up with a column that allows NULL when you know that this is an unnecessary option. I believe there are best-practice documents out there that say you should not have columns that allow NULL unless it's necessary.
  3. Create a new table with the same schema. Add the column to that schema. Transfer the data over from the original table. Drop the original table and rename the new table. I'm not certain how this is any better than #1.

Question:

1) Are my assumptions correct? 2) Are these my only solutions? If so, which one is the best? If not, what else could I do?

From Stack Overflow
  • I think this depends on the SQL flavor you are using, but what if you took option 2, and at the very end altered the column to NOT NULL with the default value?

    Would it be fast, since it sees all the values are not null?
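
    If that flavor is SQL Server, the last step would presumably be something like this (MyTable and MyNewColumn are placeholder names):

    -- After the backfill, only the nullability is changed; the default
    -- constraint added earlier stays attached to the column.
    ALTER TABLE MyTable ALTER COLUMN MyNewColumn INT NOT NULL;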

  • Vertically segment the table. This means you will have two tables, with the same primary key and exactly the same number of records... One will be the one you already have; the other will have just the key and the new non-null column (with its default value). Modify all insert, update, and delete code so they keep the two tables in sync... If you want, you can create a view that "joins" the two tables together so that client SELECT statements see a single logical table (a sketch of this layout follows the comments below)...

    Vinko Vrsalovic : Did you mean vertically?
    Charles Bretana : ahh yes, missed a t... I'm unfortunately a 2-finger typer...
    rmeador : This seems like a very dangerous solution. Do it the right way. If you do attempt this solution, give the new view that joins the two tables the name of the original table (so no client code needs to be changed).
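
    A minimal sketch of that layout, with hypothetical names since the answer gives no schema (per rmeador's suggestion, the view could even be given the original table's name once the base table is renamed):

    -- Side table holding only the key and the new NOT NULL column
    CREATE TABLE MyTable_Extra (
        MyKey       INT NOT NULL PRIMARY KEY
            REFERENCES MyTable (MyKey),
        MyNewColumn INT NOT NULL CONSTRAINT DF_MyTable_Extra DEFAULT (0)
    );
    GO

    -- View presenting the two tables as one logical table for SELECT statements;
    -- the application's INSERT/UPDATE/DELETE code must keep both tables in sync.
    CREATE VIEW MyTable_Combined AS
    SELECT t.*, e.MyNewColumn
    FROM MyTable AS t
    JOIN MyTable_Extra AS e ON e.MyKey = t.MyKey;
    GO
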
  • Here's what I would try:

    • Do a full backup of the database.
    • Add the new column, allowing nulls - don't set a default.
    • Set SIMPLE recovery, which lets the transaction log be truncated at each checkpoint as the batches are committed.
    • The SQL is: ALTER DATABASE XXX SET RECOVERY SIMPLE
    • Run the update in batches as you discussed above, committing after each one.
    • Reset the new column to no longer allow nulls.
    • Go back to the normal FULL recovery.
    • The SQL is: ALTER DATABASE XXX SET RECOVERY FULL
    • Backup the database again.

    The use of the SIMPLE recovery model doesn't stop logging, but it significantly reduces its impact, because the log space used by each committed batch can be reclaimed at the next checkpoint rather than accumulating until a log backup. (A combined sketch of these steps follows the comment below.)

    Lamar : Depending on size of the table, make sure you have enough log space. Resetting the column to no longer allow nulls is log intensive.
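
    Pulled together, those steps might look roughly like this; MyDb, MyTable, MyNewColumn, and the batch size of 10,000 are placeholder choices, not from the answer:

    ALTER DATABASE MyDb SET RECOVERY SIMPLE;
    GO

    -- Add the column as nullable, with no default yet
    ALTER TABLE MyTable ADD MyNewColumn INT NULL;
    GO

    -- Backfill in batches; each batch commits on its own, and CHECKPOINT lets
    -- SIMPLE recovery reuse the log space of batches that have already committed
    WHILE 1 = 1
    BEGIN
        UPDATE TOP (10000) MyTable
        SET MyNewColumn = 0
        WHERE MyNewColumn IS NULL;

        IF @@ROWCOUNT = 0 BREAK;

        CHECKPOINT;
    END
    GO

    -- Tighten the column, switch back to FULL recovery, then take a full
    -- backup to restart the log chain
    ALTER TABLE MyTable ALTER COLUMN MyNewColumn INT NOT NULL;
    ALTER DATABASE MyDb SET RECOVERY FULL;
    GO
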
  • I would use a CURSOR instead of UPDATE. The cursor will work through all matching records one at a time -- it takes time, but it does not lock the table.

    If you want to avoid lock contention, use WAITFOR between the updates.

    Also, I am not sure that a DEFAULT constraint by itself changes existing rows. It is probably the NOT NULL constraint used together with DEFAULT that causes the case described by the author.

    If it does, add it at the end. The pseudocode will look like this:

    -- [table], [key], and new_column are placeholder names
    -- Add the column without the NOT NULL constraint -- we will add it at the end
    ALTER TABLE [table] ADD new_column INT DEFAULT 0

    DECLARE fillNullColumn CURSOR LOCAL FAST_FORWARD FOR
        SELECT
            [key]
        FROM
            [table] WITH (NOLOCK)
        WHERE
            new_column IS NULL

    OPEN fillNullColumn

    DECLARE
        @key INT

    FETCH NEXT FROM fillNullColumn INTO @key

    WHILE @@FETCH_STATUS = 0 BEGIN
         UPDATE
             [table] WITH (ROWLOCK)
         SET
             new_column = 0 -- default value
         WHERE
             [key] = @key

         -- wait 5 seconds; keep in mind this means updating only 12 rows per minute
         WAITFOR DELAY '00:00:05'

         FETCH NEXT FROM fillNullColumn INTO @key
    END

    CLOSE fillNullColumn
    DEALLOCATE fillNullColumn

    -- Now that no NULLs remain, make the column NOT NULL
    ALTER TABLE [table] ALTER COLUMN new_column INT NOT NULL
    

    There may still be syntax errors, but I hope this helps to solve your problem.

    Good luck!

  • If you want the column in the same table, you'll just have to do it. Now, option 3 is potentially the best for this because you can still have the database "live" while this operation is going on. If you use option 1, the table is locked while the operation happens and then you're really stuck.

    If you don't really care if the column is in the same table, then I suppose a segmented approach is the next best. Though I really try to avoid that (to the point that I don't do it), because then, as Charles Bretana says, you'll have to find all the places that update/insert that table and modify them. Ugh!

  • You could:

    1. Start a transaction.
    2. Grab a write lock on your original table so no one writes to it.
    3. Create a shadow table with the new schema.
    4. Transfer all the data from the original table.
    5. Execute sp_rename to rename the old table out.
    6. Execute sp_rename to rename the new table in.
    7. Finally, commit the transaction.

    The advantage of this approach is that your readers will be able to access the table during the long process and that you can perform any kind of schema change in the background.
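
    A rough sketch of the swap, with hypothetical table and column names (the lock hint is one way to block writers while leaving readers unblocked; indexes, constraints, triggers, and permissions on the original table would also need to be recreated on the new one):

    BEGIN TRANSACTION;

    -- Hold a shared table lock for the whole transaction: writers are blocked,
    -- but readers can still query the old table while the copy runs
    SELECT TOP (1) 1 FROM MyTable WITH (TABLOCK, HOLDLOCK);

    -- Shadow table with the new NOT NULL column already in place
    CREATE TABLE MyTable_New (
        MyKey       INT          NOT NULL PRIMARY KEY,
        SomeColumn  VARCHAR(100) NULL,
        MyNewColumn INT          NOT NULL CONSTRAINT DF_MyTable_New DEFAULT (0)
    );

    INSERT INTO MyTable_New (MyKey, SomeColumn, MyNewColumn)
    SELECT MyKey, SomeColumn, 0
    FROM MyTable;

    -- Swap the names; the old table sticks around until the change is verified
    EXEC sp_rename 'MyTable', 'MyTable_Old';
    EXEC sp_rename 'MyTable_New', 'MyTable';

    COMMIT TRANSACTION;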

  • I had a similar problem, and went for your option #2. It takes 20 minutes this way, as opposed to 32 hours the other way!!! Huge difference, thanks for the tip. I wrote a full blog entry about it, but here's the important SQL:

    -- Add the column as nullable, with a default for rows inserted from now on
    ALTER TABLE MyTable
    ADD MyNewColumn char(10) NULL DEFAULT '?';
    GO

    -- Backfill the existing rows in batches by primary-key range
    UPDATE MyTable SET MyNewColumn = '?' WHERE MyPrimaryKey BETWEEN 0 AND 1000000
    GO
    UPDATE MyTable SET MyNewColumn = '?' WHERE MyPrimaryKey BETWEEN 1000000 AND 2000000
    GO
    UPDATE MyTable SET MyNewColumn = '?' WHERE MyPrimaryKey BETWEEN 2000000 AND 3000000
    GO
    ..etc..

    -- Once every row has a value, tighten the column to NOT NULL
    ALTER TABLE MyTable
    ALTER COLUMN MyNewColumn char(10) NOT NULL;
    

    And the blog entry if you're interested: http://splinter.com.au/blog/?p=116

  • I ran into this problem at my work also, and my solution is along the lines of #2.

    Here are my steps (I am using SQL Server 2005)

    1) ALTER TABLE MyTable ADD MyColumn varchar(40) DEFAULT('')

    2) Add a NOT NULL CHECK constraint with the NOCHECK option. With NOCHECK, existing values are not checked when the constraint is added.

    ALTER TABLE MyTable WITH NOCHECK ADD CONSTRAINT MyColumn_NOTNULL CHECK (MyColumn IS NOT NULL)

    3) UPDATE TOP(3000) MyTable SET MyColumn = '' WHERE MyColumn IS NULL
       GO 1000

    • The UPDATE statement will only update a maximum of 3000 records at a time, which lets each chunk of data be committed on its own. I have to use "MyColumn IS NULL" because my table does not have a sequential primary key.

    • GO 1000 will execute the previous statement 1000 times, which will update 3 million records; if you need more, just increase this number. I continue to run it until SQL Server reports 0 affected records for the UPDATE statement.
