[jifty-devel] Data partitioning (ie multiple databases)

Edmund von der Burg evdb at ecclestoad.co.uk
Sun Dec 31 14:14:31 EST 2006


Hello,

Jifty allows you to create database-backed web apps. However, it
assumes that all users of a web app want to share the same
database, which in my opinion is a bad assumption. I have some ideas
here that I hope would make it trivial to partition the database
for each user or group of users.



Why would you want to use separate databases?

Many apps make it a feature that all users have access to the same
database. For example, on a wine review site being able to read other
people's reviews is a plus, as is being able to write reviews of wines
that others have added.

Some sites though never want to share the data between users. For
example an invoice management system for small businesses should never
let employees of one business see data belonging to another business.
Effectively each company's data should be partitioned off.

It is possible to achieve this in Jifty at the moment: you'd add
a company_id field to each table and ensure that it always matches the
company of the current user. This approach has several drawbacks though:

  * it is cumbersome - you need an extra field in each table

  * it is not fail-safe - if you forget to check that the company_id
matches up then data could leak through.

  * you are stuck with a single database for data that is unrelated -
this prevents easy use of several database servers for scaling.
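
To make the first two drawbacks concrete, here is a rough sketch in
Python with SQLite (purely for illustration - Jifty itself is Perl, and
all table and function names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, company_id INTEGER, total INTEGER)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                 [(1, 100, 50), (2, 100, 75), (3, 200, 99)])

def invoices_for(company_id):
    # Safe: every single query must remember to filter on company_id.
    return conn.execute(
        "SELECT id, total FROM invoices WHERE company_id = ? ORDER BY id",
        (company_id,)).fetchall()

def invoices_leaky():
    # Unsafe: forget the WHERE clause once and company 200's data leaks.
    return conn.execute("SELECT id, total FROM invoices").fetchall()

print(invoices_for(100))       # only company 100's rows
print(len(invoices_leaky()))   # all three rows, including company 200's
```

The point is that safety depends on never forgetting that extra clause,
in every query, forever.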

The ideal solution would be some way to design the whole app as if
there were only one company using it. Once the app works well for one
company, it should be possible to change the app in one place and
suddenly have it work for as many companies as you want, without
having to worry about data leaking across from one company to another.

This would be excellent as it would keep things simple during
development and it would allow you to easily take something that
initially was just an in-house app and change it so that it could be
used by others.

I think that this can be done, and I think that it might actually be
really easy to achieve.

Before suggesting my approach I'd like to add that running several
instances of Jifty is not, in my opinion, a solution. Firstly, it is
wasteful of processes. It is also more complicated to administer, and
it is difficult to set up so that new instances can be created for
new users on the fly.

Basically what is needed is some way to select the datasource to use
at the start of the request and then that datasource is used for the
rest of the request. At the end of the request the datasource choice
is nulled and the next request needs to make the choice again. Note
that I say datasource and not database - more on this later.


Which datasource to use could be decided on one of many factors:

  * cookie - the datasource is added to the session cookie. This is
suitably signed to prevent tampering.

  * domain name - foo.example.com uses a different datasource to
bar.example.com. (I personally hate using subdomains for this sort of
thing but it is a valid approach)

  * username - bob_foo vs bob_bar. Probably a bad idea.

  * make the user select or specify the datasource when they log in and
stick it in a separate cookie. Clunky for the user, and it could leak
information about who else is using the system. It would allow anonymous
users though.

My inclination is to put the datasource into the session cookie. Apps
that need this sort of feature will probably require a login so there
will be no anonymous users (although this is probably a bad
assumption). There could then be a master datasource with a table that
links users to the correct datasource. This table would only be
accessed at login time to work out where the user should be. The
actual auth data would be in the specific datasource.
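
A sketch of that master lookup table (Python/SQLite for illustration;
the table and helper names are invented):

```python
import sqlite3

# Hypothetical master datasource: maps users to their datasource,
# consulted only once, at login time.
master = sqlite3.connect(":memory:")
master.execute(
    "CREATE TABLE user_datasource (username TEXT PRIMARY KEY, datasource TEXT)")
master.executemany("INSERT INTO user_datasource VALUES (?, ?)",
                   [("alice", "company_foo"), ("bob", "company_bar")])

def datasource_for(username):
    row = master.execute(
        "SELECT datasource FROM user_datasource WHERE username = ?",
        (username,)).fetchone()
    return row[0] if row else None

# At login, look the user up once and stash the result in the signed
# session cookie; the actual auth data lives in that datasource.
print(datasource_for("alice"))
```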


I've used 'datasource' because I don't think that having many separate
databases would necessarily be the best approach. These are the
possibilities that I see for the datasources:


__Several databases__ - each datasource is actually a separate database.

This leads to very clean partitioning of the data and is conceptually
easy to deal with. The downsides are that the number of possible
database connections grows quickly and could easily become a resource
drain. It is also harder to administer, as each server will quickly
end up with many databases on it.

The advantage is that it is really easy to implement. The databases
could be spread out over different physical servers leading to easy
scaling. Database connections cannot be cached for long but it might
not be an issue to open the connection afresh for each request.

Interestingly SQLite may not suffer from the drawbacks mentioned here
and so this could be the ideal approach for that database.
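
With SQLite a datasource really can just be its own file, as this
sketch shows (invented names throughout):

```python
import sqlite3, tempfile, os

# One SQLite file per datasource: the cleanest partitioning, and with
# SQLite the connection-count drawback largely disappears.
tmp = tempfile.mkdtemp()

def connect(datasource):
    # Hypothetical helper: each datasource is simply its own database file.
    return sqlite3.connect(os.path.join(tmp, datasource + ".db"))

for name, total in [("company_foo", 50), ("company_bar", 99)]:
    db = connect(name)
    db.execute("CREATE TABLE invoices (id INTEGER, total INTEGER)")
    db.execute("INSERT INTO invoices VALUES (1, ?)", (total,))
    db.commit()
    db.close()

# Identical queries, fully partitioned data - no company_id needed.
foo_total = connect("company_foo").execute("SELECT total FROM invoices").fetchone()
bar_total = connect("company_bar").execute("SELECT total FROM invoices").fetchone()
print(foo_total, bar_total)
```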


__Table name prefix__ - each table has a prefix that says which
datasource it belongs to.

This would allow a single database to be used for all the datasources,
and so the connection-caching issue goes away. The db abstraction code
would need the smarts to rewrite the table names in queries so that
the correct one is used, e.g. 'invoices' might need to become
'foo1234_invoices'.

The drawbacks are that it is tricky to code - the schema would need to
change for each datasource so that the references between tables are
correctly created. It would also be awkward to inspect the tables
manually, as you'd need to remember to add the prefix to the start of
the table names. Any hand-written SQL would need the prefixes too.

This is doable but I think would be a bad approach - there is too much
that needs to be remembered and that could go wrong.
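
A sketch of the name-rewriting this would require (Python, invented
names); note how every path that builds SQL has to consult the mapping:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo1234_invoices (id INTEGER, total INTEGER)")
conn.execute("INSERT INTO foo1234_invoices VALUES (1, 50)")

LOGICAL_TABLES = {"invoices", "clients"}  # tables the app knows about

def physical_name(table, datasource):
    # Map the logical table name the app uses to the prefixed physical one.
    if table not in LOGICAL_TABLES:
        raise ValueError("unknown table: " + table)
    return datasource + "_" + table

# Every query-building path must go through this mapping - including any
# hand-written SQL, which is exactly what makes the approach fragile.
table = physical_name("invoices", "foo1234")
row = conn.execute("SELECT total FROM " + table).fetchone()
print(row)
```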


__Separate schemas within a database__.

The SQL standard supports this to a certain extent and Postgres has
implemented it well. The idea is that inside a single database the
tables can belong to a specific schema, so instead of doing 'select *
from foo' you do 'select * from schema_name.foo'. This means that you
can have several tables with the same name as long as they are in
separate schemas.

This is highly analogous to how the shell finds the binary you want to
run. When you run a query the database searches through its search
path for that table and the first one it finds is used. This is like
the shell working through $PATH, trying '~/bin', then '/bin', then
'/usr/bin' etc.

In Postgres (which is the database I tend to use and so interests me
most) you can set the 'search_path' for tables for each connection.
This means that you need not use the fully qualified table name
'schema_name.foo' but can just use the table name 'foo'. It is
possible to change the search path during a connection, and that change
is sticky. When a new connection is opened the search_path is reset to
the default. (There may be some transaction-related issues that I've
not considered.)

This means that what I wish to implement becomes easy to do. At the
start of the request the correct schema to use would be determined and
the database connection would be toggled over to having it in the
search path ('SET search_path TO schema_name' in postgres). Now all
the queries will use tables in that schema. Hence the rest of the code
can be blissfully ignorant of the fact that it is talking to a
specific schema. Also the database connection can be cached and
repurposed by changing the search_path - which is much faster than
creating a new connection.
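
In Postgres this is literally 'SET search_path TO schema_name'. To show
the shape of the idea in something runnable without a Postgres server,
here is an SQLite sketch where ATTACH gives each datasource a
schema-like namespace on one cached connection (SQLite has no
search_path, so the 'current schema' is tracked by hand; all names are
invented):

```python
import sqlite3, tempfile, os

tmp = tempfile.mkdtemp()
conn = sqlite3.connect(":memory:")  # one cached, reusable connection

# Attach one database per datasource; each behaves like a named schema.
for schema, total in [("company_foo", 50), ("company_bar", 99)]:
    conn.execute("ATTACH ? AS " + schema, (os.path.join(tmp, schema + ".db"),))
    conn.execute("CREATE TABLE %s.invoices (id INTEGER, total INTEGER)" % schema)
    conn.execute("INSERT INTO %s.invoices VALUES (1, ?)" % schema, (total,))

current_schema = None  # stands in for Postgres's SET search_path

def query(sql):
    # Qualify the table name with the request's schema; the rest of the
    # app writes plain 'invoices' and stays ignorant of the partitioning.
    return conn.execute(sql.replace("invoices", current_schema + ".invoices"))

current_schema = "company_foo"   # start of request: pick the schema
print(query("SELECT total FROM invoices").fetchone())
current_schema = "company_bar"   # same connection, repurposed
print(query("SELECT total FROM invoices").fetchone())
```

The naive string replace is only to keep the sketch short; in Postgres
the query text would not change at all, only the search_path.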


Hybrid approaches

Having thought about this a bit more I am coming to think that it
would also be useful to support a sort of hybrid approach, where some
tables are global and some are limited access. Take for example the
invoicing system - some tables like the actual invoice and clients
should be limited to the individual company. Other data, like sales
tax data for US states, would be best shared. Perhaps the way to deal
with this is to add an attribute to the schema that says whether it is
global (the default) or specific to some group.

(Caveats: referential integrity across databases may not be possible,
and there would need to be some way to access the master database
holding the global data, perhaps via a different handle.)
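
In Postgres the hybrid falls out naturally from the search path: put
the shared tables in one schema (say 'public') and list it after the
company schema, so company-specific tables are found first and the
shared ones are still visible. A minimal sketch of building that
statement (schema names cannot be bound parameters, so they are
validated by hand; the names are invented):

```python
import re

SHARED_SCHEMA = "public"  # assumed home of the global tables (e.g. sales tax)

def search_path_sql(company_schema):
    # Validate strictly before interpolating into the SET statement,
    # since schema names cannot be passed as bound parameters.
    if not re.fullmatch(r"[a-z_][a-z0-9_]*", company_schema):
        raise ValueError("bad schema name: " + company_schema)
    # Tables are looked up in the company schema first, then the shared one.
    return "SET search_path TO %s, %s" % (company_schema, SHARED_SCHEMA)

print(search_path_sql("company_foo"))
```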


Summary

This has mainly been a proposal of something that I think would be
nice to have in Jifty. I know of several projects where being able to
partition the database for different users or groups of users would be
essential and would make the whole process much easier. If this also
would allow the separate datasources to be spread across servers then
so much the better.

I like postgres and so think that the 'many schemas in one database'
approach is best, although I don't know if it is possible to do that
in other RDBMSs. If not, I would probably go with a separate database
per datasource rather than some prefixing of table names. This would
allow this abstraction to reside in one place in the code and leave
the rest unchanged. It could then be something that gets enabled as
needed, and if it is intended to be used in production then postgres
could be recommended as the best approach, although other dbs (like
SQLite for testing) would also work.

There are plenty of issues that are valid and that I've not addressed.
There would need to be extra magic to create a new datasource as
needed. Also there is the need to map a user to the correct
datasource. There is the problem of upgrading each datasource whenever
the schema changes (although this could now be done in a more
controlled manner - rather than all at once if required). There is the
issue of creating lots of tables for users who may never return -
would need a way to reap them as required. How users and logins are
managed would need tweaking.

There are also some nice bonuses though. It would be possible to move
a datasource from one physical server to another which provides a way
to scale the database end easily. Datasources could be backed up one
by one and stored separately, which might be an advantage. Inactive
datasources could be easily archived and then resurrected as/if
needed. The database schema is much simpler as it need not take into
account multiple instances. There is intrinsic security that helps
keep private data private. Querying the database might be faster as
the tables and indexes would be smaller.

I hope that I've given food for thought, if not convinced you
entirely. Please let me know what you think.

Cheers,
  Edmund.


-- 
In reality I'm evdb at ecclestoad.co.uk - http://ecclestoad.co.uk

