CDataProviderIterator

I just learned about CDataProviderIterator, and I wanted to remark that, in my opinion, this is entirely the wrong approach.

The problem it solves is not a database problem - that problem was actually solved by databases themselves, which (in general) are very good at delivering a large result-set progressively, without keeping it all in memory.

The problem is a limitation of the Active Record implementation, which by design is only capable of constructing the complete result-set in-memory and returning it as a whole.

The documentation neglects to point out an important limitation of this work-around: if you’re going to iterate over millions of users in the database, for example, to send them each a personalized e-mail, you would have to order them by an auto-increment primary key, or date/time created, in ascending order. Since the iterator actually performs independent queries for “pages” of results, if you were to order the queries by, say, date of birth, and a new user signs up while this lengthy batch of queries is running, some unlucky user would get skipped.

A safer (and more memory-, bandwidth-, and CPU-efficient) approach would be to construct the query for the root entity in such a way that you avoid multiple tuples for the same root entity. For example, if Article has a many-to-many relation to Category, do not join the article-table to the category-table, since this causes the article data to repeat for every category - sure, you save a round-trip to the database, but the database also has to construct a substantially larger result-set, serialize it, and send it over the network (if the DB server is separate from the app server); the client then has to unserialize that larger result-set, and AR has to detect and ignore the duplicate data, wasting more memory and CPU time.

All those issues aside, this of course also means you can’t read through the result-set and construct entities progressively, which is my initial point.

Take a peek at, for example, NotORM, which avoids heavy result-sets and complex join-queries by using a much simpler approach.

The idea that fewer queries automatically means better performance is an old myth, and while in some rare cases that may be true, for the large majority of queries, it just isn’t true - unless your DB server is on an entirely different network from your app server, in terms of performance, the number of roundtrips is most likely the least of your concerns.

And who uses the DataProvider template for batch jobs? It is dedicated to UI components which present data in a paginated way. It is also fine for cases where you must define how to fetch some data but are not sure it will be fetched at all (you are preparing data for the UI layer). In such cases, the DataProvider keeps the information on how to fetch the data until it is really needed and fetched.

There is also a switch, ‘together’, which says how related data should be fetched: in one query, with duplicated records which must be processed as you said, or with a second, separate query.

Why not use raw SQL queries for tasks like sending e-mail to millions of users? IMHO, AR wasn’t designed for something like this.

+1

You can also iterate through the SQL result and instantiate an AR object per result-set row. This way you can use AR on a larger data set; however, the PHP garbage collector is not very good, so you will end up with large memory usage anyway.

I’ve heard the explanation about AR not being “designed for this” before, and I don’t buy it. This particular implementation clearly was not designed with this in mind, but the AR design-pattern as such does not prevent this.

Once you have your tables mapped to classes, and those classes implement lots of accessors and integrate with other odds and ends (formatting etc.) throughout your application, leaving all of that behind and dropping down to raw SQL is unacceptable - I would have to duplicate a long list of functionality that is already neatly attached to classes and available throughout the application.

Having to circumvent AR defeats the purpose of using AR in the first place - you wanted abstraction for lots of good reasons, and those reasons didn’t go away when the need to iterate over a large collection arose.

As for the ‘together’ switch, I am aware of this feature - but this particular root has a long list of BELONGS_TO relations, all of which need to be read for every record, so that would mean millions of queries.

By the way, I have solved this problem a few times before, in various ways - I’m not currently stuck with this problem. I only brought it up because the solution is never easy to implement, and the result is never elegant or optimal - and isn’t that why we use frameworks in the first place?

Others have solved this problem before, so it can be done. A half-baked solution was added to the Yii core, so the need is real. Let’s quit talking about why this can’t be done or doesn’t need to be done, and let’s talk about how to do it?

Here’s one approach I just thought of: run the query initially selecting only the primary key of the root entity, nothing else. Do not page or limit that result-set. Now iterate over those results progressively, and every time you have collected a list of e.g. 100 primary keys (or whatever is left at the end of the result set), construct another query with the same JOIN-clauses, but with a new WHERE-clause with “primary_key IN (…)” - execute this query normally, constructing all 100 root objects along with all related/aggregate objects, the way it normally happens.

Implement this pattern as a (forward-only) iterator-class, and we should be good to go.
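A minimal sketch of such a forward-only iterator, assuming nothing Yii-specific: the class name BatchIterator and the $loader callback are hypothetical names I’m making up for illustration. It consumes a list of primary keys and a loader callback; in practice, the keys would come from the key-only query, and the loader would run the “primary_key IN (…)” query and hydrate the AR objects.

```php
<?php

// Hypothetical sketch - not part of Yii. A forward-only iterator that
// consumes a list of primary keys and loads the actual objects in batches.
class BatchIterator implements Iterator
{
    private $keys;            // remaining primary keys (from the key-only query)
    private $loader;          // callable: array of keys => array of loaded objects
    private $batchSize;
    private $batch = array(); // objects of the current batch
    private $offset = 0;      // position within the current batch
    private $position = 0;    // overall position

    public function __construct(array $keys, $loader, $batchSize = 100)
    {
        $this->keys = $keys;
        $this->loader = $loader;
        $this->batchSize = $batchSize;
    }

    public function rewind()
    {
        // forward-only: rewinding simply starts consuming the key list
        $this->position = 0;
        $this->loadNextBatch();
    }

    private function loadNextBatch()
    {
        // take the next N keys (or whatever is left at the end)
        $ids = array_splice($this->keys, 0, $this->batchSize);
        $this->batch = ($ids === array()) ? array() : call_user_func($this->loader, $ids);
        $this->offset = 0;
    }

    public function valid()   { return isset($this->batch[$this->offset]); }
    public function current() { return $this->batch[$this->offset]; }
    public function key()     { return $this->position; }

    public function next()
    {
        $this->offset++;
        $this->position++;
        if (!isset($this->batch[$this->offset])) {
            $this->loadNextBatch(); // the previous batch becomes garbage-collectible
        }
    }
}
```

With this in place, a batch job is just a foreach loop, and at most one batch of objects is alive at any time.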

Any thoughts on why this wouldn’t work?

FYI, current PHP garbage collector (5.3+) works fine, as long as you avoid circular references.

To demonstrate the idea above with a real example, let’s say we have User, State and Country - User references a State, and State references a Country.

Let’s say we’re going to query 1 million User records, we want their State and Country, for some reason we want only female Users, and we want them ordered by date of birth. For some reason, we only want Users in an “active” State or Country (perhaps something to do with target markets).

Currently we get a query similar to this:




SELECT
  u.*, s.*, c.*
FROM
  user u
  INNER JOIN state s ON (...)
  INNER JOIN country c ON (...)
WHERE
  u.sex = 'F'
  AND c.active = 1
  AND s.active = 1
ORDER BY
  u.date_of_birth DESC



With the current data-provider iterator, we get similar queries with differing LIMIT clauses.

Instead, I’m proposing we start by selecting only the primary key of the root - something like this:




SELECT
  u.id -- note the change here!
FROM
  user u
  INNER JOIN state s ON (...)
  INNER JOIN country c ON (...)
WHERE
  u.sex = 'F'
  AND c.active = 1
  AND s.active = 1
ORDER BY
  u.date_of_birth DESC



Now we open this query, and progressively read through it - and every time we have a batch of 100 user IDs, we execute a separate query, like this:




SELECT
  u.*, s.*, c.* -- now we fetch the actual data
FROM
  user u
  INNER JOIN state s ON (...)
  INNER JOIN country c ON (...)
WHERE
  u.id IN (...) -- note the change here!



Note that the ORDER BY clause was removed - the order in which this batch of 100 users should appear is already known from the first query. This query can be executed normally, constructing all 100 objects in-memory (and discarding the previous 100 records).
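To build that IN clause safely, each batch query should use bound placeholders rather than interpolated IDs. A minimal sketch (the helper name buildBatchQuery is hypothetical, not part of Yii; the JOIN conditions are elided just like in the examples above):

```php
<?php

// Hypothetical helper - not part of Yii. Builds the per-batch query, with
// one positional placeholder per primary key in the batch.
function buildBatchQuery(array $ids)
{
    $placeholders = implode(', ', array_fill(0, count($ids), '?'));

    // no ORDER BY: the order of this batch is already established
    // by the key-only query
    $sql = "SELECT u.*, s.*, c.* "
         . "FROM user u "
         . "INNER JOIN state s ON (...) "
         . "INNER JOIN country c ON (...) "
         . "WHERE u.id IN ($placeholders)";

    return array($sql, $ids); // SQL plus the values to bind
}
```

The returned SQL and parameter list would then be passed to a prepared statement, one batch at a time.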

Thoughts?

With these good explanations I now understand more about this - thank you. Yes, you are right, this approach would be helpful in such cases.

To tell the truth, I haven’t looked into the source code of the DataProvider-related files for a long time. Am I right that in order to implement what you suggest, we would have to change not only CDataProviderIterator, but other things in the core package as well?

I’m not sure. I haven’t had a lot of time to look into it - I poked around in CActiveRecord and CActiveFinder very briefly, and I think they could probably be enhanced to support this pattern, with the addition of a new iterator-type.

@mindplay: you are confusing the DataProvider template with ActiveRecord. I agree that it should be possible to use ActiveRecord even for the tasks you described, but PHP simply is not good enough for such tasks. Read about the PHP object implementation and how much memory every object instance consumes, and you will see that it is impossible to handle millions of objects in PHP; Yii has nothing to do with it.

On the other hand - DataProvider is meant to provide data for visual components, not for batch processing. Read the API reference for the IDataProvider interface:

Unfortunately there are no solutions for iterating over result-sets in Yii AR, because it is very difficult to achieve: there are options like ‘together’ which allow you to fetch related HAS_MANY and MANY_MANY relations with a single query. You can emulate some of these functions with the following (populating relations will be a problem):




$recordset = ....;

foreach ($recordset as $record) {
    $model = Model::model()->populateRecord($record, true);
    ...
}



Or you can stick with your DataProvider usage, but lock the tables against updates for the duration of the data processing.

Very interesting discussion.

@redguy

What do you think about CDataProviderIterator? What could be the use cases for it?

If I’m not mistaken, you and mindplay share the same opinion that it’s not very useful, at least under the current implementation.

I think it’s very nice if we could deal with a very large set of records using CActiveRecord.

CDataProviderIterator is not even documented in official API reference :)

I think it was designed to solve such problems, but as mindplay said - there are issues you should keep in mind.

I think you could use it in scenario like this:




lockTables( 'xxxx', 'yyy', 'zzz' ); //pseudo-code

iterate CDataProviderIterator {
}

unlockTables( 'xxxx', 'yyy', 'zzz' ); //pseudo-code



Locking could be done either with MySQL "LOCK TABLES", with "SELECT … FOR UPDATE", or with a similar solution…

This would ensure that the data is not modified during processing.

Another thing: when you use joins with a HAS_MANY or MANY_MANY relation and ‘together’, pagination can produce unexpected results (for example, returning all records instead of honoring the ‘limit’). This is because in such cases records can be multiplied and then merged back while processing the results… for example, CActiveRecord::find(), which should return only one instance, limits the query results only when no relations are used:




protected function query($criteria,$all=false)
{
    $this->beforeFind();
    $this->applyScopes($criteria);

    if(empty($criteria->with))
    {
        if(!$all)
            $criteria->limit=1;



As you can see, the limit is only applied when you are looking for a single object without any relations (an empty ‘with’).

I think you’re completely missing my point.

Of course you can iterate over millions of objects in PHP - billions, for that matter - just as in any other language. And of course you can’t hold a million records in memory in PHP, any more than you can (or should) in any other language. That’s the problem I’m trying to address. This is not a language issue by any means, just a matter of designing a data-access pattern that serves the requirement.

Locking your tables will most likely just cause new problems - besides not addressing the real problem (or at least not the root of the problem), if you’re going to lock your user-table for a 3-hour batch-job, I hope you’re doing it during night hours when nobody is trying to sign up.

The together()-option must be ignored in the first query (as demonstrated in the SQL examples above) when you ask for a result-set to be fetched in batches - it can be applied to the individual batches just fine though, as far as I can figure.

It may be that CDataProviderIterator was not designed to address this issue - if so, I’m not sure what problem it is expected to solve though.

This problem can be solved - I know this for a fact because I’ve done so in the past with another object/relational-mapper, and others have solved the same problem with other ORMs.

@mindplay: I understand what you mean, but to clarify my position:

[list=1]

[*]current AR in Yii does not support iterating through a result-set; the solution you have found is a kind of workaround for this problem, but it is not ideal (as you have already noticed).

[*]you can try further workarounds (like table locking, copying data to temporary tables, etc.) to achieve your goal with Yii and its ORM, but I would not count on such solutions being merged into the framework, because they would be incompatible with some features that are already implemented (like ‘together’). Yii chose this way and there is no turning back.

[*]you can use Yii with another ORM, but you will probably lose compatibility with Zii widgets, or you will have to write wrappers to provide an interface compatible with Yii AR. If that is OK for you - go this way.

[*]for batch processing of millions of records, especially when they are wide (many columns), I would use plain DAO, because PHP, with its memory allocation and garbage collector, is not able to efficiently handle the creation and disposal of that number of objects. I once wrote a very simple ORM iterator (like the one I described earlier), and even with explicit object unsetting at the end of the loop I ended up with 5-6GB(!) of memory consumption and a few hours of script running time, when a similar job could be done with DAO in less than 1 hour and <1GB of memory… unfortunately this is not Java…

[/list]

  1. It’s not a work-around, it’s a solution - you need a different data-access pattern to handle this number of records with AR. It is just a matter of finding it and implementing it.

  2. together() enables a different data-access pattern, which you can elect to use or not. This is no different.

  3. Now you’re saying use a different ORM? If there’s another ORM that can do this, why do you insist on debating whether it’s feasible in the first place?

  4. I gave reasons why plain SQL is not an acceptable solution. I too have written an iterator for a (custom) ORM - it did not leak memory, and neither do any of the solutions I have written for Yii thus far. If you use circular references, or if your ORM implements an object identity registry, those are possible causes. Modern PHP is not Java, but it does have working garbage collection.

If you insist on making up more reasons not to solve this problem, I’ll be happy to keep shooting them down. Are you done yet? :mellow:

Clearly - you do not understand my point. I am not saying you are wrong with your idea, but:

  1. The CURRENT solution in Yii is a workaround. I am not saying that EVERY iterator is a workaround.

  2. Yes - but since such possibilities exist, you have to maintain them in future releases, or you will break backward compatibility in the framework.

  3. If you have an ORM that works fine for you - use it. Why the whole discussion? Read my summary below…

In fact, I gave you at least two solutions: workarounds with locking tables (or copying data to a temporary table), and moving to another ORM, which you said works better for you.

To summarize, there are two main points:

  1. I would also be happy if there were a real AR iterator in Yii, but I doubt it is possible to have it as part of the framework itself. More likely there will be workarounds, or third-party solutions using other ORMs, or solutions that cut some AR functionality, because Yii has features that complicate the iterator pattern…

  2. On the other hand, for very big amounts of data processed in batch I would almost always go with an SQL query, either with PHP/Yii DAO or even by putting the processing logic in the embedded DB language, because it will simply be MUCH faster and consume far fewer resources.

  1. Cutting "together()" from a batch-query is not a sacrifice - "together()" is one type of data-access pattern, and "batched()" is a different data-access pattern. You simply choose the best one for your situation.

  2. For a large batch-job, who cares if it takes an hour or an hour and a half? In most situations, the objective is to perform a large task, and since you’re rarely serving end-users (directly), performance (assuming it’s adequate to get the job done on time) is usually less of a concern.

Also consider the cost of development, and the more error-prone approach of maintaining duplicate functionality for two different access-patterns. CPU time and memory are cheap - usually a lot cheaper than developer hours. I’m not saying this is an excuse to avoid optimization, but there are times when optimizing software for performance is not a good long-term investment - often it’s better long-term to invest in features and functionality, and software with many optimizations grows increasingly prone to failure, and harder (and more expensive) to maintain.

By the way, I recently evaluated the possibility of integrating another ORM into Yii, and concluded that it’s simply not worth it - Yii is too integrated for this to be feasible with most ORMs. Sure, you can do it - but because things such as validation are integrated into a base-class (CModel), suddenly there are a lot more responsibilities your ORM integration has to take over, or support. I evaluated Propel for integration, and had to conclude it’s not feasible, at least not in any reasonable amount of time.

Since Yii is so integrated, it’s not very easy to take over one area of responsibility, without some form of “hacking”. I could move to another framework of course, but I already have a pretty substantial investment in Yii - lots of components, behaviors, etc…