Hi,
I am looking for some input, please. I am designing a multi-user, multi-vendor, n-tier database application.
Each local installation of this system must replicate its records to a single central server. This server is not directly available to the users of the installed database app, and the Internet is the only route to it. The central db is only used for reporting, so replication only needs to be one-way to the central server.
The primary objective is to get the records to the central db as fast as possible to ensure a “pseudo real-time” snapshot.
Here’s what I’ve got so far:
1. Use a web server on the Internet (SOAP service) to upload compressed/encrypted files (XML data). The “central” db server polls the web server to collect and import the data at regular intervals. (A rough sketch of points 1, 3 and 5 follows this list.)
2. Local db systems do not delete records, but mark them as deleted.
3. Each table in the database has a primary key using a GUID-like surrogate key (i.e. no auto-inc fields) to prevent key violations on the central db.
4. I have discounted using the client’s db server to run the extraction and upload process, as there will be precious little IT support for this system at customers’ sites; if the upload fails (due to local firewall/proxy server issues), I want the user to know a.s.a.p. so they can get it sorted.
5. The extraction process need only be run by one client application at a time, as it will query the whole local database to extract changes.
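For what it’s worth, here is a very rough Python sketch of how I picture the client-side extraction/upload step. Everything in it is an assumption on my part: the endpoint URL, the table/column names (including a last_modified column), and the XML layout are made up, the SOAP envelope is replaced by a plain HTTP POST for brevity, and encryption is left to HTTPS or a separate layer.

```python
# Rough sketch of the single-client extraction/upload step (points 1, 3 and 5).
# Assumes a DB-API connection to the local database; the upload URL, table
# layout and "last_modified" column are placeholders, not a final design.
import uuid
import zlib
import urllib.request
from xml.etree import ElementTree as ET

UPLOAD_URL = "https://example.com/replication/upload"  # placeholder endpoint


def new_surrogate_key() -> str:
    """GUID-style surrogate key (point 3) so local and central keys never clash."""
    return str(uuid.uuid4())


def extract_changes(conn, table: str, last_run_utc: str) -> ET.Element:
    """Pull rows changed since the last successful extraction into an XML fragment."""
    root = ET.Element(table)
    cur = conn.cursor()
    # Column name and "?" paramstyle are assumptions; both depend on the driver.
    cur.execute(f"SELECT * FROM {table} WHERE last_modified > ?", (last_run_utc,))
    columns = [d[0] for d in cur.description]
    for row in cur.fetchall():
        rec = ET.SubElement(root, "record")
        for col, val in zip(columns, row):
            ET.SubElement(rec, col).text = "" if val is None else str(val)
    return root


def upload(xml_root: ET.Element) -> None:
    """Compress the XML and POST it to the web service (SOAP envelope omitted)."""
    payload = zlib.compress(ET.tostring(xml_root, encoding="utf-8"))
    req = urllib.request.Request(
        UPLOAD_URL,
        data=payload,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        if resp.status != 200:
            raise RuntimeError(f"Upload failed: HTTP {resp.status}")
```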
The bit I am struggling with is the database structure needed to enable efficient (and accurate) extraction of recently changed records. This will also happen whilst users are working, i.e. I need to be careful about a user having a record open for editing while the extraction is going on, and the extraction needs to run “in the background” so as not to interrupt the user too much.
I.e. do I use a date & time based approach? Do I use some kind of status field, or even a separate change-log table that can be queried and then cleared?
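To make the first option concrete, this is roughly how I picture the date & time based extraction, assuming every table carries a last_modified column and a “bookmark” table records the last successful run (both of which are assumptions on my part). My worry is the caveat in the comments below.

```python
# Sketch of the timestamp ("high-water mark") option. Table and column names
# are illustrative only; "bookmark" stores the last successful run time.
def changed_since_last_run(conn, table: str):
    cur = conn.cursor()
    cur.execute("SELECT last_run_utc FROM bookmark WHERE table_name = ?", (table,))
    (last_run,) = cur.fetchone()
    # Caveat: a record still open in an uncommitted transaction (e.g. a user
    # mid-edit) can commit later with a last_modified earlier than the new
    # bookmark, and would then never be picked up by this query.
    cur.execute(f"SELECT * FROM {table} WHERE last_modified > ?", (last_run,))
    return cur.fetchall()
```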
Apologies for the long post, but I’d value any experience anyone has with developing this kind of solution.
TIA,
Stuart
-
You might want to take a look at how Slony for PostgreSQL does it.
Effectively, instead of querying the actual tables for changes, it uses triggers on the tables to record the changed data into a separate logging table (be they inserts, updates or deletes). With an increasing field in the logging table (transaction id, timestamp, whatever), it then becomes straightforward to fetch just the changes since last time and to delete the log entries once all interested parties have them. Note that this also removes your point #2: you no longer need to modify the original tables (and clients thereof) to understand rows that are present but logically deleted.
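A minimal sketch of the consuming side, assuming a log table along the lines of (log_id, table_name, row_id, operation, changed_at) populated by per-table triggers. The schema, names and the single-consumer simplification are mine for illustration, not Slony’s:

```python
# Draining a trigger-populated change-log table: fetch everything after the
# last confirmed log_id, then purge entries once they have been consumed.
def fetch_new_changes(conn, last_seen_id: int):
    """Return log rows added since the last confirmed log_id, oldest first."""
    cur = conn.cursor()
    cur.execute(
        "SELECT log_id, table_name, row_id, operation, changed_at "
        "FROM change_log WHERE log_id > ? ORDER BY log_id",
        (last_seen_id,),
    )
    return cur.fetchall()


def purge_confirmed(conn, confirmed_up_to: int) -> None:
    """Delete log entries once all interested parties have confirmed them."""
    cur = conn.cursor()
    cur.execute("DELETE FROM change_log WHERE log_id <= ?", (confirmed_up_to,))
    conn.commit()
```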
Of course there are several details that can complicate matters depending on your database engine. The original Slony design document does a good job of covering the issues and why it does things the way it does. Even if you're not using PostgreSQL it's probably worth a read.