[OT] Database synchronisation

I have a database called LOCAL_DB on a local computer. This DB is managed
by third-party software, so I have no control over it. I have another
database called INET_DB on an Internet server. I design and maintain this
one (it's an extranet written in PHP).

I need to write an app that dumps certain fields and tables from LOCAL_DB
into INET_DB every night over a DSL connection, so all changes made in
the office during the workday are available on the extranet the following
day. The two DBMSs are different, so I'm accessing the data through ODBC
drivers.

The first simple, brute-force algorithm I can think of is:

1)    Empty INET_DB
2)    Get records from LOCAL_DB
3)    For each record : Insert record into INET_DB

Its pitfalls are pretty obvious:

1)    It will take years to execute (128Kbps, dude)
2)    During the update process, the extranet is useless
3)    If my app crashes... well, go figure

My second attempt would be:

1)    Get primary keys from INET_DB
2)    Get records from LOCAL_DB
3)    For each record:
    3.1)    If PK exists in INET_DB: Update record in INET_DB
    3.2)    Else: Insert record into INET_DB
4)    Remove records from INET_DB where PK no longer exists in LOCAL_DB
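The compare-and-sync steps above can be sketched like this. This is a rough Python sketch using two in-memory SQLite databases as stand-ins for the ODBC connections to LOCAL_DB and INET_DB; the `items` table and its `id`/`name` columns are made up for the demo:

```python
import sqlite3

def sync(local, inet):
    # Step 1: get the primary keys already present on the remote side
    remote_pks = {row[0] for row in inet.execute("SELECT id FROM items")}
    local_pks = set()
    # Steps 2-3: walk the local records, updating or inserting remotely
    for pk, name in local.execute("SELECT id, name FROM items"):
        local_pks.add(pk)
        if pk in remote_pks:
            inet.execute("UPDATE items SET name = ? WHERE id = ?", (name, pk))
        else:
            inet.execute("INSERT INTO items (id, name) VALUES (?, ?)", (pk, name))
    # Step 4: remove remote rows whose PK no longer exists locally
    for pk in remote_pks - local_pks:
        inet.execute("DELETE FROM items WHERE id = ?", (pk,))
    inet.commit()

local = sqlite3.connect(":memory:")
inet = sqlite3.connect(":memory:")
for db in (local, inet):
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
local.executemany("INSERT INTO items VALUES (?, ?)", [(1, "a"), (2, "b")])
inet.executemany("INSERT INTO items VALUES (?, ?)", [(2, "old"), (3, "gone")])
sync(local, inet)
print(sorted(inet.execute("SELECT id, name FROM items")))  # [(1, 'a'), (2, 'b')]
```

Note that this still pulls every local record across the wire each night; it only saves the blanket delete-and-reload.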

Now, will someone stop me before I reinvent the square wheel? I'm still not
sure this is the right approach. I've been googling in search of an
algorithm, but I could only find solutions for very specific situations.
I'd much appreciate any hint or link.

-+ http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
++ Mi sitio sobre programación web: http://bits.demogracia.com
+- Mi web de humor con rayos UVA: http://www.demogracia.com

Re: Database synchronisation

Alvaro G. Vicario wrote:

We do a similar thing here at my work.  We have a variety of data
imports to run and summary tables to build on a nightly basis.  Now, as
you indicated, you can get as fancy and efficient as you want by
comparing PKs and such, but we went with the pure brute-force method
you mentioned first.  Granted, this is all happening on our local
network, so we weren't concerned with data transfer rates, which could
turn out to be a problem for you depending on the quantity of data you
have to move.

As you mentioned, the obvious problem with the brute force method is
that between dumping the contents of the table and re-inserting, the
table (and anything that references it) is useless.  We got around this
by using temporary tables to do the processing, then moving them into
place seamlessly when finished.

For example (in pseudo-MySQL, so you'll need to adjust for your DB),
assuming 'mytable' is the table you're trying to update:

DROP TABLE IF EXISTS mytable_new;  -- just in case it still exists
CREATE TABLE mytable_new LIKE mytable;
-- do whatever you need to do for inserting into 'mytable_new'
DROP TABLE IF EXISTS mytable_old;  -- just in case it still exists
ALTER TABLE mytable RENAME TO mytable_old;
ALTER TABLE mytable_new RENAME TO mytable;
DROP TABLE mytable_old;

As you can see, we create a temporary table just like the original and
do our inserts into that one.  Then, when finished, move the old one
out of the way and put the new one in.  Depending on your preference,
you can keep the _old table around to have a revolving backup, or just
toss it.
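The swap-into-place trick can be demonstrated end to end in Python with SQLite (which also spells the rename `ALTER TABLE ... RENAME TO ...`; SQLite has no `CREATE TABLE ... LIKE`, so the schema is repeated by hand). Table names follow the post's example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO mytable VALUES (1, 'stale')")

# Build the replacement off to the side; readers keep using 'mytable'.
db.execute("DROP TABLE IF EXISTS mytable_new")
db.execute("CREATE TABLE mytable_new (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO mytable_new VALUES (1, 'fresh')")

# The swap itself: two quick renames, then drop the old copy.
db.execute("DROP TABLE IF EXISTS mytable_old")
db.execute("ALTER TABLE mytable RENAME TO mytable_old")
db.execute("ALTER TABLE mytable_new RENAME TO mytable")
db.execute("DROP TABLE mytable_old")

print(db.execute("SELECT name FROM mytable").fetchone()[0])  # fresh
```

The window during which 'mytable' doesn't exist is just the instant between the two renames (in MySQL, a single `RENAME TABLE a TO b, c TO a` makes even that atomic).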

This has the effect of solving two of your potential issues: 1) the
extranet is not 'useless' while the import is taking place because the
old data doesn't leave until the new data is ready, and 2) if your
import crashes mid-execution, the only thing that should be affected is
the _new table and you can just drop it and start the import again
later if that happens.  And it might even solve the issue of the speed
of data transfer, depending on how necessary it is to have yesterday's
data by the start of business the next day.  If that isn't essential, then
since the data import is invisible to the end user, it doesn't really
matter how long it takes (as long as it's still < 24 hours, of course).

Hope that gives you some ideas,

Re: Database synchronisation

mootmail-googlegroups@yahoo.com wrote:

This is a good idea, using temporary tables.

Since you are somewhat bandwidth limited, it may be more efficient if
you dump the database to a text file (either comma-separated or SQL)
and load it locally on the remote machine.  This way you can compress
it before transferring (and the plain text will compress very nicely),
and it will execute much faster on the remote machine.
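The dump-and-compress idea can be sketched quickly in Python; the row data here is invented just to show that repetitive text dumps shrink well under gzip:

```python
import csv
import gzip
import io

# A fake dump: 1000 rows of repetitive, comma-separated text.
rows = [(i, "customer-%d" % i, "Burgos") for i in range(1000)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
plain = buf.getvalue().encode("utf-8")

# Compress before sending it over the 128 Kbps link.
packed = gzip.compress(plain)

print(len(packed) < len(plain))  # True: the dump shrinks considerably
```

On the remote end you'd decompress and bulk-load the file locally (e.g. with the DBMS's own loader), which is far faster than per-row inserts through ODBC.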

Also, building on rsync from another post, another option would be to
diff the old dump and the new dump, then send that to the remote.  This
would be useful if, even after compression, you are still transferring
a lot of data.

In either event, you should be able to dump an SQL file and send that.
It'd be much faster than reading a row from local, inserting to remote,
rinse, repeat.

Re: [OT] Database synchronisation

Alvaro G. Vicario wrote:


There's a thing on PHPClasses which claims to replicate MySQL schemas (you
didn't say what DBMS), but it doesn't understand indices and doesn't
replicate the data.

Assuming we're talking MySQL and you are just publishing the data, then the
best way to do it would be to slave the remote DB from the local one - but
that assumes you can reconfigure the remote DB and connect to it over the
MySQL port.

An alternative approach would be to run rsync - which is supposed to
update only the parts of a large file that have changed. But you'd need to
shut down the DBMS while it ran.

Failing that, why not just shut down the DBMS and transfer the files
across - it'll probably be a lot quicker than rebuilding the table one row
at a time.

The best solution, though, would be a custom script which copies over only
the changes. However, you need to make sure that:

1) you never delete data (well, you can: you just mark it as inactive, then
run a housekeeping job to delete anything that has been inactive for, say,
2 successful update cycles);

2) every table has a timestamp and a primary key field.
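The timestamp-driven incremental copy described above can be sketched like this (a rough Python/SQLite sketch; the `items` table, its columns, and the timestamps are invented for the demo):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE items
              (id INTEGER PRIMARY KEY, name TEXT,
               active INTEGER DEFAULT 1, updated_at INTEGER)""")
db.executemany("INSERT INTO items VALUES (?, ?, 1, ?)",
               [(1, "a", 100), (2, "b", 250), (3, "c", 300)])

last_sync = 200  # timestamp recorded after the previous successful run

# "Delete" row 3 by flagging it inactive and bumping its timestamp,
# so the deletion replicates like any other change.
db.execute("UPDATE items SET active = 0, updated_at = 400 WHERE id = 3")

# The nightly job only ships rows touched since the last successful run.
changed = db.execute(
    "SELECT id, name, active FROM items WHERE updated_at > ? ORDER BY id",
    (last_sync,)).fetchall()
print(changed)  # [(2, 'b', 1), (3, 'c', 0)]
```

Since only the changed rows cross the DSL link, the nightly transfer stays proportional to the day's edits rather than the table size.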

(BTW - why bother to check whether a record exists before you try to
insert it? If it's already there and you've got a primary key, the DBMS
will tell you at insertion time.)
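That insert-first idea looks like this in a rough Python/SQLite sketch (table and column names invented for the demo): try the INSERT and fall back to an UPDATE when the primary key complains, instead of querying for the key beforehand.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO items VALUES (1, 'old')")

def upsert(db, pk, name):
    try:
        db.execute("INSERT INTO items (id, name) VALUES (?, ?)", (pk, name))
    except sqlite3.IntegrityError:
        # The PK was already there - the DBMS told us, no lookup needed.
        db.execute("UPDATE items SET name = ? WHERE id = ?", (name, pk))

upsert(db, 1, "new")   # collides, becomes an UPDATE
upsert(db, 2, "b")     # clean INSERT
print(sorted(db.execute("SELECT id, name FROM items")))  # [(1, 'new'), (2, 'b')]
```

This saves the round trip of a SELECT per row; many DBMSs also offer a single-statement form (e.g. MySQL's INSERT ... ON DUPLICATE KEY UPDATE).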


