Sharding

I did temporarily give up on anti-together behaviour to study AR better. And took another task: sharding support. Result is attached as ZIP (several classes extending AR). Description following.

Main thing about sharding support was to give AR ability to decide on what connection to use dynamically. Depending on some object property or any environment condition. So we can switch to db-replica or concrete DB-node (where required entity is located). Unfortunately Yii can’t read our minds so we need a tool to let it know which connection (which server) we are going to use. Following Yii AR interface look choose() method was implemented.

To end-user It looks like:


$users = User::model()->choose($shard)->findAll();

To make it work we need logic to decide which connection to use depending on concrete $shard var (what’s inside is different in different cases). This logic is situated in DbConnectionManager. Sharding support library (or patch) includes CDbConnectionManagerAbstract defining it’s interface. In most cases you just need to implement


abstract protected function _createDbConnection($key);

That’s it: this function will get $shard as $key parameter. And supposed to return CDbConnection. To make it work DbConnectionManager should be defined in config next to ‘db’:




'dbConnectionManager' => array(

    'class' => 'application.tests.classes.ConcreteDbConnectionManager',

)



Usage of described extensions doesn’t break standard way Yii works either. We only use DbConnectionManager if choose() called. And it’s object-visible. Next model you create will start from default connection from config.

Besides the BC it makes possible creating "System tables" which is common requirement for any sharding implementation.

Everything I described is way more wider described in unit-tests attached to implementation. Tests include some test Models and CShardingTestCase is full of examples of what you get in your code to work with it.

One more thing about it: it’s impossible to make Sharding work out-of-box in most cases. And our implementation is way more complex then default one we can suppose. BUT! Nothing stops us from creating default bindings to implement most basic and completely automatic sharding in extension to our extension. And that’s my current vector.

Could you please check all this stuff and tell me what’s the best way to make this lib accessible to community and improve it further?

Thanks.

Absolutely devastating work!

Did you add separate component because extending CDbConnection would had been more difficult?

As I’m totally inexperienced with multi-server environments, I can’t confirm whether this might deserve considering or not:




'db'=>array(

  'class'=>'path.to.extendedDbConnection',


  'shards'=>array(

    'BlogEntries'=>array( // shard name can be specific model name

      'entry_id'=>array(

        '1'=>connestionString-0, // specific value

        '2-20'=>connectionString-1, // closed range

        '20-'=>connectionString-2, // open range

      ),

    ),


    // shard name can be something else, so multiple column watches can be placed into action

    'ForumPostsSharding'=>array(

      connectionString-3=>array(

        'post_id'=>'1-50000',

        'board_id'=>'1-100',

      ),

      connectionString-4=>array( // otherwise default conn

      ),



Is it worth considering?

By the way, to engage community as much as possible you should share this as an extension.

Separate component was created cause its semantically correct: it’s a thing that manages connections, not connection itself in fact.

Your config example can be used to implement sharding which works out of the box, based on my implementation in fact. Sharding which we can just install, setup with 2-3 lines of config is my dream. Current implementation makes sharding possible. Next step will be exactly what you described. So nice idea, thank you.

I hope I could help. As of components, I think that CDbConnection is not the connection itself either, but the manager layer on top of that.

But if another component feels more appropriate to you, it is just fine with me.

Great work! I’ve read through your code and also some articles about sharding and database partitioning in general.

It seems to me the overall requirement is to let AR to be capable of using multiple DB connections either manually or automatically. I like your idea of introducing the choose() method. I am not sure if we need CDbConnectionManagerAbstract, however.

Based on your choose() idea, I am thinking of the following use cases (note I replaced "choose" with "use" because it is shorter and easier):




Post::model()->use('db2')->findAll();

Post::model()->use('db2', 'db3')->findAll();



Here ‘db2’ and ‘db3’ refer to the DB connection IDs declared in app config. There is no connection manager involved or needed. What do you think? Perhaps instead of connection manager, we should have an AR manager which sets some global properties to govern the behavior of AR.

In your code, I saw you use transactions to ensure data integrity when multiple connections are involved during insertion and update. That is good.

Another thing needed perhaps is load balancing, i.e., to let AR to automatically choose a DB connection to perform read queries.

When multiple connections are used, how do we merge the data together when calling findAll()?

How to deal with relational queries?

This database scaling screencast (apart from rails-specific tips) contains introductory information about sharding map techniques.

AR models should know which related data is in the same database, and in those cases one can use eager loading. Otherwise the newly introduced anti-together lazy loading option should be used.

Edit:

If sharding takes effect, it is developers’ task to ensure that unaccessible tables are not involved (i.e. determining foreign keys before findAll).

I agree, use looks much better. It’s either reserved word in most cases so I wasn’t sure if you have plans on it. If you don’t, great, let’s use it.

We need ConnectionManager not to just find connection by it’s keyword but to resolve connection by any formula. It can be something like this:




Post::model()->use( $user->id )->findAll();



And inside:




$connection_id = $id % 2 ? 'db1' : 'db2';



And that’s simple case. One of most difficult thing about sharding is a way you spread your data. Cause it makes load on concrete shard higher or lower. Different users generate different load. You can have one node constantly going down with 100 000 active users and other having 0.5 LA with 1 000 000.

However simple selection like you proposed works well with straight replica or horizontal scaling (per column). So I suggest we make default connection manager not abstract but as simple as "return $connections[$shard]". It will make us one step closer to "out-of-box" solution and will not affect extension capabilities to use in heavy-sharded environment.

And this can be another method of ConnectionManager having "return array_rand($connections)" i.e…

I thought about it. It’s a stick with two ends like we say here in Russia. We never want to take data from all shards in production. If we want to have aggregation (like “Last Posts over all Blogs”) we use system table or Sphinx Search Engine (which works like a charm and is our choice) or DB is dead. If we want to have all posts of one user we use just one shard. However I can imagine some cases we could query all shards with search by primary key or query just two or three shards. Again, i can imagine. I did a lot of heavy loaded projects but never got in need to do that.

It may found use in console admin tools (or maybe admin web panel) or in very-very special case. So we better have it than not. But due to it’s rare use it should be separated and do not mess with default stuff. That’s why I suggest to implement special method with this behaviour: query given in use() shards and just array_merge result sets.

In my implementation we use only first connection in read queries. Multiple connections are in use to duplicate data on several shards with inserts and updates.

You mean joins etc? Sharding is supposed to have full DB schema on each server. We shard only data, not table entities. So we still can use JOINs and they work… but with local shard data. I guess we can decide on algo to emulate cross-server join behaviour, but I’m sure it gonna be useless. Universal algo will be dramatically slow. And sharding is all about heavy load. Sharding in it’s nature is about to have all things linked to sharded entity on same server. So you don’t split each table, only central one. And you store all associates on same shard.

It’s a thing application developer should care about. I can’t see correct way to help him with this except documenting.

Do you have some extension to use Sphinx Search Engine with Yii ?

Wow! Great job, gentleman. This is really very useful!

I’ll love to see ‘use’ implemented!

Incredible lovchy!! Have you update your sharding code with new features you mention in the recent time?

Is it applicable for Master/Slave[n] replication too?

Lovchy, regarding merging findall query issue: we are planning to build a multisite platform where each user has his own “site”. But at the end, we would want to create an “aggregation central site” with some of the data/content/information from all the users… Is this a special rare rare case you mentioned?? I hope it doesn’t. How does ebay do it for example?

BTW Qiang, I would love to see this awesome features in core too!!

Best regards,

Alex

what is the status on this one? Is there a wiki/document describing usage?