If there's one thing I've learned from my experiences with women, it's that they secretly can't control themselves if a man is there to sweep them off their feet with tales of legacy data migration suavity. (If you're wondering why I didn't title this in the reverse, it's simply because I didn't feel that it was any particular secret that every man is most attracted to legacy data conversions.)
So, you're a few weeks away from launch day and you have a fair amount of data in an older database structure and you want it accessible to your new app. The possibilities here are really quite endless, and it depends entirely on a multitude of questions: do you need access to the old data in the old app? Is speed an issue? Can you fill records after the fact, or do you need them all immediately in the new format pre-launch? Is this a one-off or a regular thing? Will anyone notice if you just add a couple hundred thousand rows of quotes from Dr. Quinn, Medicine Woman instead of bothering with a possibly delicate transfer? These are all equally important questions. I recently did one such transfer, and here's my process.
I actually chose a few different routes. A lot of the data was going to be in a similar structure in the new system and the quicker the better, please. For those cases, you're best off doing everything via database dumps and raw SQL commands. By far, that's going to be the fastest way of doing things. Dropping out into Ruby land is going to be painfully slow. But in some cases, that's a tradeoff worth making. In this post, I'm going to detail the latter process. The benefit of pushing data through Rails is that you have access to model-specific validations, callbacks, and relationships. If you're dealing with complex data structures spread over multiple models and relationships (both in the old and new apps), sometimes the longer processing time is a decent tradeoff, particularly if you can kick off the process and slowly migrate post-deploy.
The first thing to do is make the data accessible to your new app. I've found it's easiest to create a temporary table that you can nuke when you're done. If we're migrating Users between totally different codebases, I'd use mysqldump to dump your old users table and import it with a prefix: old_and_busted_users, for example.
To actually use this data I've seen a few different approaches. The full-blown route is to script/generate model OldAndBustedUser, which gives you an ActiveRecord model backed by the old_and_busted_users table. That's fine, and it offers you some additional freedom in terms of testing and relationships, but I've found that in a number of cases you don't really need to scaffold yourself out a full-fledged model. For the most part, you just need access to the table and to some relationships. Besides which, generating an entirely new model for your app means you're more likely to keep this old code polluting your new code, which really kind of sucks if you're only going to be doing this once.
The trick I've used is to define all of your models inline. For example, if you're running this as a rake task:
namespace :import do
  desc "Import our old and busted users into the new hotness."
  task :users => :environment do
    class OldAndBustedUser < ActiveRecord::Base; end
    class OldAndBustedProfile < ActiveRecord::Base; end
  end
end
It's a little funky defining classes like that, but for most scenarios you just don't need much scaffolding. If your task balloons to hundreds of complex, intricate lines, definitely think about refactoring quite a bit more. The model names should correspond to your temporary table names; if they don't, you can point a model at the right table with set_table_name.
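For instance, if your dump landed in a table whose name doesn't match the model (the table name below is just a placeholder), a minimal sketch looks like:

```ruby
class OldAndBustedUser < ActiveRecord::Base
  set_table_name 'legacy_users'  # placeholder: whatever your imported table is called
end
```

(set_table_name is the Rails 2-era spelling; newer ActiveRecord spells it self.table_name =.)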
Once you have your temporary models set up, you can use them as you'd expect:
task :users => :environment do
  class OldAndBustedUser < ActiveRecord::Base; end
  class OldAndBustedProfile < ActiveRecord::Base; end

  OldAndBustedUser.find_each(:batch_size => 2000) do |old_and_busted_user|
    User.create(:name => old_and_busted_user.name, :email => old_and_busted_user.email)
  end
end
find_each being important since we're in Ruby Land, where instantiating huge object arrays willy-nilly will deplete your available memory faster than the drinks disappeared on your 21st birthday. You can also set up relationships as needed:
task :users => :environment do
  class OldAndBustedUser < ActiveRecord::Base
    has_one :profile, :class_name => 'OldAndBustedProfile', :foreign_key => 'profile_id'
  end

  class OldAndBustedProfile < ActiveRecord::Base
    belongs_to :user, :class_name => 'OldAndBustedUser'
  end

  OldAndBustedUser.find_each(:batch_size => 2000) do |old_and_busted_user|
    User.create(:name => old_and_busted_user.name,
                :email => old_and_busted_user.email,
                :twitter_handle => old_and_busted_user.profile.twitter)
  end
end
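If the Ruby route is still too slow for comfort, one optional tweak (my addition here, not part of the original process) is to wrap each batch in a transaction, so you pay for one commit per batch instead of one per row. find_in_batches hands you the whole array at once, which makes that easy:

```ruby
OldAndBustedUser.find_in_batches(:batch_size => 2000) do |batch|
  ActiveRecord::Base.transaction do
    batch.each do |old_user|
      User.create(:name => old_user.name, :email => old_user.email)
    end
  end
end
```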
Now for the Pro Tips: if you're doing a lot of data importing, you're bound to have to wipe your database and repopulate a gaggle of times. I found a lot of the manual steps in dumping data to be a pain, particularly in terms of importing into the temporary table. So I'd suggest making use of sed for some fun file replacement (which sure beats picking through a multi-hundred-megabyte db dump in vi every time). Assuming you're using a regular mysqldump file:
sed -i '' -e 's/`users`/`old_and_busted_users`/g' multi-hundred-megabyte-db-dump.sql
Your mileage may vary on that particular sed command, though. OS X ships BSD sed while Linux ships GNU sed, and the two disagree on the -i flag (which denotes an in-place substitution): BSD sed expects the backup suffix as a separate argument, so you pass -i '' for no backup, while GNU sed expects it attached to the flag with no space. In-place editing is usually safe enough here, considering you can always re-dump from MySQL if the file gets corrupted.
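If the sed portability dance gets old, the same substitution is trivial in plain Ruby, which behaves the same everywhere (a sketch; file names are just placeholders):

```ruby
# Rename a table inside a SQL dump; returns the rewritten SQL.
# For a genuinely huge dump you'd want to stream it line by line rather
# than slurping the whole file, but the substitution itself is the same.
def rename_table_in_dump(sql, old_name, new_name)
  sql.gsub("`#{old_name}`", "`#{new_name}`")
end

# Usage (placeholder file names):
#   dump = File.read('multi-hundred-megabyte-db-dump.sql')
#   File.open('prefixed-dump.sql', 'w') do |f|
#     f.write(rename_table_in_dump(dump, 'users', 'old_and_busted_users'))
#   end
```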
Since we're dealing with Rails, we get all the fun magic methods, but they're not all fun and games: they'll overwrite your created_at and updated_at fields. Have no fear; record_timestamps is here:
ActiveRecord::Base.record_timestamps = false
method_that_does_crazy_zany_stuff
ActiveRecord::Base.record_timestamps = true
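One small hardening step on top of that (my addition, not part of the original snippet): wrap the toggle in begin/ensure so the flag gets flipped back even if the import blows up halfway through:

```ruby
ActiveRecord::Base.record_timestamps = false
begin
  method_that_does_crazy_zany_stuff  # your import, whatever it is
ensure
  # Restore timestamping even if the import raises.
  ActiveRecord::Base.record_timestamps = true
end
```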
You'll also want to think about turning validations off (with user.save(false)). Validations are good for ensuring the validity of your data, but they're another point of slowness, and if you're importing legacy data there's a good chance it breaks validations anyway.
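In the import loop, that looks something like this (a sketch; save(false) is the Rails 2-era way to skip validations, and note that callbacks still run):

```ruby
OldAndBustedUser.find_each(:batch_size => 2000) do |old_user|
  user = User.new(:name => old_user.name, :email => old_user.email)
  user.save(false)  # skips validations; returns true/false rather than raising
end
```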
If all else fails, IMDb maintains a startlingly large repository of Dr. Quinn quotes.