How to deploy a database

When I'm building an application that stores data, there are a few things I want to be able to do when it comes to managing the lifecycle of that storage. I'm going to focus on SQL Server, but most of this applies to other relational databases too.

What's in a database?

Even outside of production, a database is more than schema. A database "definition" consists of:

  • The database schema (tables, views, procedures, schemas)
  • Reference data (ZIP codes, salutations, other things I expect to always "be there")
  • Sample master/transactional data (for developers/testers)
  • Security (roles, permissions, users)

It may even include management job definitions (backup, shrink), though those tend to be left to DBAs.

Transitions, not States

As each iteration progresses, we'll make changes to our database.

One approach to upgrading existing databases is to "diff" the old state of the database with our new state, and then to manually modify this diff until we have something that works. We then employ the "hope and pray" strategy of managing our database, which generally means "leave it to the DBA" for production. Tools like Red Gate SQL Compare and Visual Studio Database Edition encourage this approach.

I never understood the state- or model-driven approach, and I'll explain why. Here's a simple example of some T-SQL:

create table dbo.Customer (
    Id int not null identity(1,1) constraint PK_CustomerId primary key, 
    FullName nvarchar(200) not null
 );

See the "create table"? That's not a definition - it's an instruction. If I need to change my table, I don't change the statement above. I just write:

alter table dbo.Customer 
    add IsPreferred bit not null 
        constraint DF_Customer_IsPreferred default(0);

Again, an instruction. A transition. I don't tell the database what the state is - I tell it how to get there. I can provide a lot more context, and I have a lot more power.

You'll note that I used a default constraint above - that's because my table might have had data already. Since I was thinking about transitions, I was forced to think about these issues.

Our databases are designed for transitions; attempting to bolt a state-based approach on them is about as dumb as bolting a stateful model on top of the web (and Visual Studio Database Edition is about as much fun as ViewState).

Keep in mind that making changes to databases can be complicated. Here are some things we might do:

  • Add a column to an existing table (what will the default values be?)
  • Split a column into two columns (how will you deal with data in the existing column?)
  • Move a column from one table onto another (remember to move it, not to drop and create the column and lose the data)
  • Duplicate data from a column on one table into a column on another (to reduce joins) (don't just create the empty column - figure out how to get the data there)
  • Rename a column (don't just create a new one and delete the old one)
  • Change the type of a column (how will you convert the old data? What if some rows won't convert?)
  • Change the data within a column (maybe all order #'s need to be prefixed with the customer code?)

You can see how performing a "diff" on the old and new state can miss some of the intricacies of real-life data management.
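
For example, two of the changes above might be scripted roughly like this. This is only a sketch: the CustomerOrder table and its columns, and the CustomerCode column on Customer, are made up purely for illustration.

-- Rename a column in place, rather than dropping it and losing the data
exec sp_rename 'dbo.CustomerOrder.ShipAddress', 'ShippingAddress', 'COLUMN';

-- Change the data within a column: prefix existing order numbers with the customer code
update o
set o.OrderNumber = c.CustomerCode + '-' + o.OrderNumber
from dbo.CustomerOrder o
    inner join dbo.Customer c on c.Id = o.CustomerId
where o.OrderNumber not like c.CustomerCode + '-%';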

Successful database management

Here are some things I want from my database deployment strategy:

  1. Source controlled
    Your database isn't in source control? You don't deserve one. Go use Excel.
  2. Testability
    I want to be able to write an integration test that takes a backup of the old state, performs the upgrade to the current state, and verifies that the data wasn't corrupted.
  3. Continuous integration
    I want those tests run by my build server, every time I check in. I'd like a CI build that takes a production backup, restores it, and runs and tests any upgrades nightly.
  4. No shared databases
    Every developer should be able to have a copy of the database on their own machine. Deploying that database - with sample data - should be one click.
  5. Dogfooding upgrades
    If Susan makes a change to the database, Harry should be able to execute her transitions on his own database. If he has different test data to hers, he might find bugs she didn't. Harry shouldn't just blow away his database and start again.

The benefits to this are enormous. By testing my transitions every single day - on my own dev test data, in my integration tests, on my build server, against production backups - I'm going to be confident that my changes will work in production.

Versions table

There should be something that I can query to know which transitions have been run against my database. The simplest way is with a Versions table, which tells me the scripts that were run, when they were run, and who they were run by.

When it comes to upgrading, I can query this table, skip the transitions that have been run, and execute the ones that haven't.
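
As a rough sketch (the column names and the example script name are only an illustration, not a prescription), the Versions table and the check an upgrade runner performs might look something like this:

create table dbo.Versions (
    Id int not null identity(1,1) constraint PK_Versions primary key,
    ScriptName nvarchar(255) not null,
    AppliedOn datetime not null constraint DF_Versions_AppliedOn default(getdate()),
    AppliedBy nvarchar(100) not null constraint DF_Versions_AppliedBy default(suser_sname())
);

-- The upgrade process only runs a script if it hasn't been recorded yet,
-- then records it (ideally in the same transaction as the script itself)
if not exists (select 1 from dbo.Versions where ScriptName = '0042_AddCustomerIsPreferred.sql')
begin
    -- ...execute the transition script here...
    insert into dbo.Versions (ScriptName) values ('0042_AddCustomerIsPreferred.sql');
end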

Sample data

Development teams often need access to a good set of sample data, ideally lots of it. Again, these should be transition scripts that can be optionally run (since I might not want sample data in production), and in source control.
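
For instance (the customer names below are obviously made up), a sample data script can be just another transition, written so it's safe to re-run and simply skipped for production deployments:

-- Optional sample data: a re-runnable transition that we choose not to run in production
if not exists (select 1 from dbo.Customer where FullName = 'Jane Citizen')
begin
    insert into dbo.Customer (FullName, IsPreferred) values ('Jane Citizen', 1);
    insert into dbo.Customer (FullName, IsPreferred) values ('John Example', 0);
end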

Document Databases

Most of these principles apply to document databases too. In fact, in some ways the problems are harder. While you don't have a fixed schema, you're probably mapping your documents to objects - what if the structure of those objects changes? You may need to run transition scripts over the document database to manipulate the existing documents. You may also need to re-define your indexes. You want those deployment scenarios to be testable and trusted.

Migration libraries

Rails popularized the approach of using a DSL to describe data migrations. There are a number of .NET ports of this concept, like Fluent Migrator and Machine.Migrations.

Personally, I find T-SQL a perfectly good DSL for describing data migrations. I love my ORMs, but for schema work, T-SQL is perfectly adequate.

Again, these libraries focus on transitions (create table, add column), not states, so they're useful, unlike Visual Studio Database Edition, which isn't.

Welcome, my name is Paul Stovell. I live in Brisbane and work full time bootstrapping my own product company around Octopus Deploy, an automated deployment tool for .NET applications.

Prior to Octopus Deploy, I worked for an investment bank in London building WPF applications, and before that I worked for Readify, an Australian .NET consulting firm, where I was lucky enough to work with some of the best in the business. I also worked on a number of open source projects and was an active user group presenter. I've been a Microsoft MVP for WPF since 2006.

Stacy
02 Dec 2010

Very topical for me at the moment.

Our latest project is very database-heavy.

We're going to be using tarantino to run state transitions (but we are using SQL Compare to generate the first hack of state change).

We tried DB Pro for about a day. It failed us with strange compile issues (even though the db schema is valid and creates a DB) and generated an 8MB xml file.

Rich Fantozzi
02 Dec 2010

Good article, I agree with a lot of it. I wonder if you extend the versioning from the database to the code that is being run against it. For example, in our CI there are tests that look at the code version vs. the database version and ensure that they can be run against each other. We found that you can have code that builds and tests that pass, but something in the db is forgotten. So each code version has to be mapped to a valid db version. While it is not foolproof, it has greatly limited the number of misses.

02 Dec 2010

It may even include management job definitions (backup, shrink), though they tend to be left to DBA's

And if it were, I'd hope any DBA worth their salt would know never to shrink the database (the same goes for Auto-Shrink too).

-dave

04 Dec 2010

I largely agree with the approach taken by Paul, and have used it successfully on some very large systems.

Actions like keeping your database scripts under version control, having isolated development environments (including the databases), and having automated builds and deployments of the changes are the minimum of what I would recommend.

One distinction I do make is between the schema (mostly tables) and other (executable code) elements.

For any table that holds data, it is usually a non-trivial exercise to move from v1 -> v2 -> v3. Migrations need to be made in order and you cannot simply dump v3 on top of v1.

On the other hand, a stored procedure is much more similar to your client application code. You can easily jump from v1 -> v3, or even then jump back to v2 without the complications of potential data loss.

You can still have problems if the version doesn't match the schema -- the same problems as when client code doesn't match -- but this is easier to fix than data loss.

I've also used differencing tools (DB Pro) and had some success, but did find scenarios where the comparison failed. Without a trustworthy tool I found myself using comparison to generate a change script which I then had to manually verify and update.

A differencing tool can fail if you make too many changes, e.g. if I change the name, type (size), and position of a column. Following on from the above, if I change the FullName column to DisplayName:

create table dbo.Customer (
  DisplayName nvarchar(100) not null, 
  Id int not null identity(1,1) constraint PK_CustomerId primary key,
  Address nvarchar(200) null
);
Antony P Joseph
08 Dec 2010

Paul, can you share any examples of the Versions table? What I mean is, some sample scripts, and how/when you insert data into the table, etc.

Regards, Antony.

Ujjaval Suthar
09 Dec 2010

Hey Paul,

Great post.

Just checking: when you say 'source control the database', do you mean source controlling just the scripts that define the database, or actually putting the database itself, with all the master/test data, into source control?

I can imagine that would be a really expensive approach if it's on top of the DBAs' regular backup strategies for databases.

Cheers, Ujjaval

15 Dec 2010

So, if you don't have a shared database, exactly what are the integration tests (on the build server) running against? Surely you don't want them to point to your local DB; they should point to a central dev DB?

31 Dec 2010

silky,

Good question. Ideally my integration tests would look like this:

[Test]
public void CanSaveCustomers() 
{
    using (var database = new TemporarySqlDatabase())
    using (var session = database.SessionFactory.OpenSession())
    {
        database.Deploy();

        // Do stuff with session
    }   
}

Where database.Deploy() creates a new database (with a GUID for the name) on the local SQL instance, deploys the schema/sample data using scripts, and does all the other things from above. database.SessionFactory would create and configure an NHibernate session factory connected to that database. Finally, database.Dispose() would delete the SQL database.

These integration tests would run on my dev machine (so long as I have a local SQL Server and the right permissions), and on the build server. The tests could also be run in any order.

Using a configuration option, I might be able to make database.Deploy act differently - in one configuration, it might create a new, empty database - in another, it might restore a scrubbed production database from backup and upgrade it. That way the same test suite could be run in a "new install" vs. "upgrade" scenario.

15 Jan 2011

That seems really inefficient though (i.e. each test does that?) And surely it means trouble if someone else has a test testing something slightly different?

I'll be honest; I don't think that is quite ... "ideal". I'd be intrigued to see if you find this system actually works (legitimately interested); seems too "heavy" to be practical. But I don't know; I haven't given it much thought since I posted my response, which is so long ago now that I've basically forgotten what I was thinking at the time :P

Anton Gogolev
21 Jan 2011

Take a peek at octalforty Wizardby. This is my open-source project which actually provides a platform-independent framework for database versioning & schema migration.