Relational AR efficiency

Info block from the definitive guide:

Quote

Info: The AR implementation in Yii is very efficient. When eager loading a hierarchy of related objects involving N HAS_MANY or MANY_MANY  relationships, it will take N+1 SQL queries to obtain the needed results. This means it needs to execute 3 SQL queries in the last example because of the posts and categories properties. Other frameworks take a more radical approach by using only one SQL query. At first look, this approach seems more efficient because fewer queries are being parsed and executed by DBMS. It is in fact impractical in reality for two reasons. First, there are many repetitive data columns in the result which takes extra time to transmit and process. Second, the number of rows in the result set grows exponentially with the number of tables involved, which makes it simply unmanageable as more relationships are involved.

I'm writing quite big portal now, so I'm very concerned about efficiency…

I'm using InnoDB tables (see mysql performance blog, it is faster, than MyISAM and I have tests for it)

That's why I was running some tests. Let's take the live example:

There is a site about clothes. There are categories (dresses, t-shirts etc), each category has size categories (dress sizes, generic sizes, UK sizes, european sizes etc), each size category holds sizes itself. For example category "generic sizes" holds S,M,L, XL, XXL.

See attached scheme.

Now running 2 tests in the actionIndex of the controller:

$t = microtime(true);


		$cmd = Yii::app()->db->createCommand('SELECT s.*, sc.*


		FROM Size s


		INNER JOIN SizeCategory sc ON s.sizeCategoryID=sc.sizeCategoryID


		INNER JOIN CategorySizeCategory csc ON csc.sizeCategoryID=sc.sizeCategoryID


		WHERE csc.categoryID=:cID');


		$cmd->bindParam(':cID', $this->categoryID, PDO::PARAM_INT);


		$sizes = $cmd->queryAll();


		$interval = microtime(true)-$t;


		$t = microtime(true);


		$category = Category::model()->with(array('sizeCategories' => array('sizes')))->findByPk($this->categoryID);


		$szs = $category->sizeCategories;


		$interval2 = microtime(true)-$t;


		$this->render('index', array('sizes' => $sizes, 'int' => $interval, 'sizes2' => $szs, 'int2' => $interval2));

First variant - low-level SQL pus binding params.

Second - using relational AR queries.

Here are relations for:

Category:

public function relations()


	{


		return array(


			'sizeCategories'=>array(self::MANY_MANY, 'SizeCategory', 'CategorySizeCategory(categoryID, sizeCategoryID)', 'order' => 'sizeCategory'),


			'products'=>array(self::HAS_MANY , 'Product', 'categoryID', 'order' => 'name'),


			'tags'=>array(self::HAS_MANY , 'CategoryTag', 'categoryID', 'order' => '??.clicksNum', 'limit' => Yii::app()->params['maxCategoryTags']),


		);


	}

SizeCategory:

public function relations()


	{


		return array(


			'sizes'=>array(self::HAS_MANY, 'Size', 'sizeCategoryID', 'order'=>'size'),


		);


	}

Here is DB structure for these tables:

-- -----------------------------------------------------


-- Table `everystyle`.`Category`


-- -----------------------------------------------------


CREATE  TABLE IF NOT EXISTS `everystyle`.`Category` (


  `categoryID` INT(5) NOT NULL AUTO_INCREMENT ,


  `name` VARCHAR(100) NOT NULL ,


  `clicksNum` INT(11) NOT NULL DEFAULT 0 ,


  `gender` ENUM('male','female','boys','girls') NOT NULL ,


  PRIMARY KEY (`categoryID`) )


ENGINE = InnoDB;





-- -----------------------------------------------------


-- Table `everystyle`.`SizeCategory`


-- -----------------------------------------------------


CREATE  TABLE IF NOT EXISTS `everystyle`.`SizeCategory` (


  `sizeCategoryID` INT(11) NOT NULL ,


  `sizeCategory` VARCHAR(100) NULL ,


  PRIMARY KEY (`sizeCategoryID`) )


ENGINE = InnoDB;





-- -----------------------------------------------------


-- Table `everystyle`.`Size`


-- -----------------------------------------------------


CREATE  TABLE IF NOT EXISTS `everystyle`.`Size` (


  `sizeID` INT(11) NOT NULL AUTO_INCREMENT ,


  `size` VARCHAR(45) NULL ,


  `sizeCategoryID` INT(11) NOT NULL ,


  PRIMARY KEY (`sizeID`) ,


  INDEX `fk_Size_sizeCategory` (`sizeCategoryID` ASC) ,


  CONSTRAINT `fk_Size_sizeCategory`


    FOREIGN KEY (`sizeCategoryID` )


    REFERENCES `everystyle`.`SizeCategory` (`sizeCategoryID` )


    ON DELETE NO ACTION


    ON UPDATE NO ACTION)


ENGINE = InnoDB;





-- -----------------------------------------------------


-- Table `everystyle`.`CategorySizeCategory`


-- -----------------------------------------------------


CREATE  TABLE IF NOT EXISTS `everystyle`.`CategorySizeCategory` (


  `categoryID` INT(11) NOT NULL ,


  `sizeCategoryID` INT(11) NOT NULL ,


  INDEX `fk_CategorySizeCategory_Category` (`categoryID` ASC) ,


  INDEX `fk_CategorySizeCategory_sizeCategory` (`sizeCategoryID` ASC) ,


  PRIMARY KEY (`categoryID`, `sizeCategoryID`) ,


  CONSTRAINT `fk_CategorySizeCategory_Category`


    FOREIGN KEY (`categoryID` )


    REFERENCES `everystyle`.`Category` (`categoryID` )


    ON DELETE NO ACTION


    ON UPDATE NO ACTION,


  CONSTRAINT `fk_CategorySizeCategory_sizeCategory`


    FOREIGN KEY (`sizeCategoryID` )


    REFERENCES `everystyle`.`SizeCategory` (`sizeCategoryID` )


    ON DELETE NO ACTION


    ON UPDATE NO ACTION)


ENGINE = InnoDB;

Results of test is:

0.00061607360839844 - SQL


0.028649091720581 - ActiveRecord

Active record is 46(!!!) times slower, than simple SQL!

Here is the secret of "efficiency" (stated in the definitive guide). While SQL query uses 1 query with 2 joins, relational query with active record uses 9 (!!!) queries. See application log:

2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SELECT s.*, sc.*


                FROM Size s


                INNER JOIN SizeCategory sc ON s.sizeCategoryID=sc.sizeCategoryID


                INNER JOIN CategorySizeCategory csc ON csc.sizeCategoryID=sc.sizeCategoryID


                WHERE csc.categoryID=:cID


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW COLUMNS FROM `SizeCategory`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW CREATE TABLE `SizeCategory`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW COLUMNS FROM `Size`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW CREATE TABLE `Size`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SELECT `Category`.`categoryID` AS t0_c0, `Category`.`intName` AS t0_c1, `Category`.`name` AS t0_c2, `Category`.`clicksNum` AS t0_c3, `Category`.`gender` AS t0_c4 FROM `Category`  WHERE `Category`.`categoryID`=1


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW COLUMNS FROM `CategorySizeCategory`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SHOW CREATE TABLE `CategorySizeCategory`


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SELECT `Category`.`categoryID` AS t0_c0, t1.`sizeCategoryID` AS t1_c0, t1.`sizeCategory` AS t1_c1 FROM `Category` LEFT OUTER JOIN `CategorySizeCategory` sizeCategories_t1 ON `Category`.`categoryID`=sizeCategories_t1.`categoryID` LEFT OUTER JOIN `SizeCategory` t1 ON t1.`sizeCategoryID`=sizeCategories_t1.`sizeCategoryID` WHERE `Category`.`categoryID`=1 ORDER BY sizeCategory


2008/12/16 11:03:53 [trace] [system.db.CDbCommand] query with SQL: SELECT t1.`sizeCategoryID` AS t1_c0, t2.`sizeID` AS t2_c0, t2.`size` AS t2_c1, t2.`sizeCategoryID` AS t2_c2 FROM `SizeCategory` t1 LEFT OUTER JOIN `Size` t2 ON t2.`sizeCategoryID`=t1.`sizeCategoryID` WHERE t1.`sizeCategoryID` IN (2, 1) ORDER BY size

Any comments on this?

Yes, it's slower because AR  needs to know your table reference. If you use a cahce method (APC, MemCache, etc), the AR will query the table definition the first time and then it will use the cached table definition.

Try it with APC!

First, the efficient of AR is relative to other AR implementations, not with direct SQL.

Since you are joining clothes with two HAS_MANY/MANY_MANY relations, AR would use 3 queries. The other 6 queries are for retrieving the meta information about DB tables, which can be saved if you turn on CDbConnection::schemaCachingDuration as described in the Guide.

Now the question is whether these 3 queries will be slower or faster than your single big query. This actually depends on the underlying data. If your DB contains only a little test data, your single big query would certain be faster because it requires less queries. However, let's say you have 1000 clothes, each belonging to 3 categories each having 10 sizes. Your single query approach will bring back 30K rows of data. As you know, this number will increase rapidly. Using AR's approach, we only need to bring back (1000+number of category+number of size) rows of data.

Thanks for clarifying it!

Yes, I missed that thing with caching table metadata, it should be extremely useful feature :)

Well, I understand about 3 queries, but currently I simulated real DB I'll have (I wrote data generator). Data is meaningless, but number of records and relations are correct. And SQL seems to be much more efficient. It has 10K records in products table and around 200-500 in others (sure, MUCH more in cross-tables).

I think, that joins are better in most cases when you have correct indexes, it does a good job helping DB to join tables.

By the way, when I was testing JOIN vs 2 queries approach (having field with comma-delimited IDs instead of cross-table), 2 queries approach appeared to be only 2-5% faster, but only with a table, that had 500K+ of records. And was slower until number of records reached 300K.

I think, if there is an option in active records, how to compose queries - large with joins or some small ones, that would be excellent!

Well, you have to consider the time of processing those rows. If AR needs to read 10K rows of data, remove duplicated ones (inevitable in the single query approach), and generate the corresponding AR objects, it may take much more time than the current algorithm. Another drawback with single query approach is that the LIMIT feature will be difficult (or inefficient) to implement.

Yes, but PHP processing is much faster, that SQL SELECT. However, I agree, that is it non-trivial and generic algorithm may take longer…

By the way, I posted this in another thread, how about allowing SQL queries in relations? Is it possible to make such changes in AR, so instead of passing a number of arguments, user puts simply SQL code, that selects all fields needed for the correspondent AR.

Is it possible to implement this option in the future? I you tell me how, maybe I can help in this, cause I like Yii and this AR idea… Simply want to make it more efficient :)

Do you mean allowing using raw SQL for relational queries?

Yes. Raw SQL, where I'll be able to use some placeholder like [ID] for the value of the primary key of the parent record and that SQL code will load all fields needed for the AR instance needed.

For example, we have AR User and AR Post, but want to load only the posts, which don't have comments. So we write something like



public function relations()


{


return array(


'posts' => array('Post', 'SELECT p.* FROM Post p INNER JOIN Comment c ON p.postID=c.postID WHERE count(c.commentID)=0 AND p.userID=[userID] GROUP BY p.postID')


);


}


All AR does is executing that query, but substituting actual userID instead of [userID].

Yes, there will be a problem when we have a collection of users, we'll have to execute a query for each user… But for one user it saves the convenience of AR, when you can write

$posts = User::model()->findByPk(10)->posts

and at the same time doesn't limit you in writing the queries…

A for me, SQL is much more simple, but I know - that’s not AR’s problem :)

Could you please create a ticket about this? I will think about it and perhaps add some other convenience design about using RAR. Thanks.

ooops, I've created an issue, it is marked as defect. Could yo change it to enhancement? I didn't find where I can set this…

I think this is probably the limit of the AR design, push it any further and it will degenerate into a complex beast. All designs have their trade-offs, for AR  the trade-off is in the joins and aggregation.

There are many other queries that is not feasible to manage in an AR design, for example, sub-queries, unions. Some sql variants allow you to do the following:

http://thedailywtf.c…-Madlebrot.aspx

I think this could be something that SQLMap could help. An interesting task would be how to integrate SQLMap with AR, at least partially.