Asset manager

Javache posted about an improvement the Ruby on Rails team made to their asset system, and I think it would be a very good feature to have in our asset manager as well.

The basic idea is to use a hash of file contents + directory instead of just filename + directory for published assets.

What for

We will be able to set Expires headers for all CSS/JS/images so they are cached forever. Since the name changes whenever the file content changes, a new version will be picked up immediately and then cached the same way.

Moreover, it’s much more reliable than relying on file mtime as we’re doing now.

Performance

For 20-30 CSS files, using hash_file() with SHA-1 isn’t significantly slower than just getting the mtime of a file.
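For reference, a rough micro-benchmark along these lines could look like the sketch below (the ./assets path and file count are just illustrations):

```php
<?php
// Rough micro-benchmark of the claim above, assuming ~20-30 CSS files
// in a local ./assets folder (the path is illustrative).
$files = glob(__DIR__ . '/assets/*.css');

$start = microtime(true);
foreach ($files as $file) {
    hash_file('sha1', $file);   // content-based fingerprint
}
$sha1Time = microtime(true) - $start;

$start = microtime(true);
foreach ($files as $file) {
    filemtime($file);           // mtime-based fingerprint
}
$mtimeTime = microtime(true) - $start;

printf("sha1: %.5fs, mtime: %.5fs\n", $sha1Time, $mtimeTime);
```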

+1 :)

What if a CSS file refers to some images whose file names are modified using this strategy?

You’d have to modify your CSS file to reflect the changes anyway, so it shouldn’t be a problem.

//EDIT:

Now I’ve got it…

qiang

Very good question.

The Rails implementation can use SCSS and a special helper named image_url() that’s used instead of url(). This helper produces the correct paths in the end.

When using regular CSS, you’re required to use a placeholder instead: url(<%= asset_path… %>).

Other ideas from the same Rails pipeline implementation:

  1. Allow combining/compressing assets into a single file on the fly, out of the box, plus caching the result.

  2. Allow pre-processing assets in formats like SCSS into another common format like CSS.

Regarding #1, I think http://www.yiiframework.com/extension/extendedclientscript is pretty close.
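A bare-bones illustration of what #1 boils down to (plain concatenation with a cache file keyed on the inputs; the file names are made up and there’s no minification here):

```php
<?php
// Sketch only: combine several CSS files into one cached file.
$files = array('css/reset.css', 'css/layout.css', 'css/theme.css');

// Key the cache file on the inputs' names and mtimes so it is rebuilt
// whenever any source file changes.
$key = '';
foreach ($files as $file) {
    $key .= $file . filemtime($file);
}
$cacheFile = 'assets/combined-' . sha1($key) . '.css';

if (!is_file($cacheFile)) {
    $combined = '';
    foreach ($files as $file) {
        $combined .= file_get_contents($file) . "\n";
    }
    file_put_contents($cacheFile, $combined);
}
// The page then links $cacheFile instead of the individual files.
```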

Wouldn’t this pollute the assets folder?

If every time you change an asset file it gets published under a new name, what happens to the old one?

I think instead of modifying the file names, we can add a new ‘version’ property to CApplication. CAssetManager can include this version parameter in the directory name generation. As a result, if we change the ‘version’ property, all asset files will be re-published.

In practice, the version property can be modified by a console script based on the current VCS revision number (or some other, better mechanism).

Each time we update the production server, this script will be run automatically.

This approach is “coarser” than the file-name modification approach. However, it has the benefit that it doesn’t require changing our usual workflow (i.e., no fixing of references in CSS/JS files, which could be very tricky).
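If I understand the proposal correctly, a minimal sketch could look like this. The class and property names are hypothetical, and it assumes CAssetManager derives the published directory name from a protected hash() method that subclasses can override (as in current 1.1 releases):

```php
<?php
// Hypothetical sketch - VersionedAssetManager and its $version property
// do not exist in the framework.
class VersionedAssetManager extends CAssetManager
{
    // Bumped manually or by a deploy script to force re-publication.
    public $version = '1.0';

    protected function hash($path)
    {
        // Mixing the version into the hash changes every published
        // asset URL at once when $version changes.
        return sprintf('%x', crc32($path . Yii::getVersion() . $this->version));
    }
}
```

The component would then be plugged in via the ‘assetManager’ entry of the application config, with the deploy script rewriting the ‘version’ value.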

qiang

I don’t like fixing references in CSS/JS files either. Your solution will work for medium and large deployments where we can use version control hooks, but what about small ones? Of course, you can change the version number manually, but that’s a bit too much during development.

What do you mean by small ones? And why is this needed during development?

Small ones are those hosted on shared hosting where you have access to FTP and nothing more. No SSH, no cron.

Typically assets are used for 2 things:

  1. Make it easy and reliable to publish non-webroot files.

  2. Compress/combine JS/CSS automatically.

During development you generally need only #1, but since compression and combining can break things, it’s good to enable them during development too, or at least during the final testing stage.

Anyway, your version solution looks good and currently I don’t have a better one.

I feel like calculating a hash on each call is no good.

Another idea is implemented in XenForo avatars - they append ?timestamp=1234 to the image URL and use the date the avatar was last changed as the value.
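A sketch of that query-string approach, assuming the file already lives under the web root (the helper name is made up):

```php
<?php
// Hypothetical helper: the path itself never changes, only the
// ?timestamp value, which is the file's last modification time.
function assetUrl($file)
{
    $path = Yii::getPathOfAlias('webroot') . '/' . $file;
    return Yii::app()->baseUrl . '/' . $file . '?timestamp=' . filemtime($path);
}

// Usage in a view:
// echo CHtml::image(assetUrl('images/avatar123.jpg'));
```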

grigori

I’ve tested hashing the whole file with SHA-1 versus filemtime() (I guess that’s what XenForo uses) and can say that both are equally acceptable if you have about 30 assets per page. Calculating a simpler hash of a fairly short string is even faster.

I will briefly describe the approach I use for dynamically generated images (e.g. resized ones) - I think this approach may potentially be applicable to other types of resources.

I create a folder in the root consistent with the name of a controller - for example, I might create a physical "/images" folder in the public folder, but this will initially be empty.

I then create an ImageController, and a catch-all route for files not physically present in that folder - e.g. "/images/<filename:.*>".

The action method in the ImageController now looks at the filename that was passed, i.e. everything after "/images", parses it in some way, and produces a physical file.

For example, the first time you ask for "/images/320x200/kitten.jpg", the ImageController might parse the width and height (320, 200), load the "kitten.jpg" image from a protected folder, resize the image, create the "320x200" folder and save the resized "kitten.jpg" in that folder.

The idea is that missing resources can be filled in by a controller the first time they are requested - on subsequent requests, the controller (and PHP at all, for that matter) is never even invoked.

This gives theoretically the best possible performance, with the limitation that you have to manually invalidate (erase) cached files - so if the original "kitten.jpg" is modified, you have to find "320x200/kitten.jpg", and potentially other versions, and delete them. They will then get regenerated upon next request, which will be routed to the controller.

During development, you can bypass the function that saves the file, effectively regenerating the file on every request - this is usually more desirable during development.
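A condensed sketch of that pattern, assuming Yii 1.x; the route, folder names and GD-based resizing below are illustrative, not the poster’s actual code:

```php
<?php
// Assumed URL rule (path format): 'images/<filename:.*>' => 'image/generate'
// Apache serves files that already exist under /images directly, so this
// controller only runs for missing files.
class ImageController extends CController
{
    public function actionGenerate($filename)
    {
        // e.g. $filename === '320x200/kitten.jpg'
        if (!preg_match('#^(\d+)x(\d+)/(.+)$#', $filename, $m)) {
            throw new CHttpException(404, 'Invalid image path.');
        }
        list(, $width, $height, $name) = $m;

        // Original images live in a protected folder (alias is illustrative).
        $source = Yii::getPathOfAlias('application.images') . '/' . $name;
        if (!is_file($source)) {
            throw new CHttpException(404, 'Source image not found.');
        }

        // Resize with GD (error handling omitted for brevity).
        $src = imagecreatefromjpeg($source);
        $dst = imagecreatetruecolor($width, $height);
        imagecopyresampled($dst, $src, 0, 0, 0, 0,
            $width, $height, imagesx($src), imagesy($src));

        // Save under the public folder so the web server serves it
        // directly next time, without ever invoking PHP again.
        $target = Yii::getPathOfAlias('webroot') . '/images/' . $filename;
        if (!is_dir(dirname($target))) {
            mkdir(dirname($target), 0775, true);
        }
        imagejpeg($dst, $target, 90);

        // Serve the freshly generated file for this first request.
        header('Content-Type: image/jpeg');
        readfile($target);
        Yii::app()->end();
    }
}
```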

As mentioned, I believe this pattern is applicable to other types of resources - you can think of a compressed JS or CSS file as a "resized" version of the original resource.

I want to add a real-world example of how we use the asset manager (I’ll describe it first and then comment on it).

A month to a month and a half ago we started developing the web part of a new project which had to be multi-site in essence - same code, different look. Of course theming comes to mind. The tricky part is that every site copy we need is different. Not in the sense of “change the color scheme, move some things around”, but “that’s a different website with its own design and layout - only the backend functionality is the same”.

Our initial thought was that it would require some clever code and hacking. Guess what? Nothing of the sort! The asset manager is a charm - just pass a directory to publish() and it will handle the whole directory with all subdirectories. Ha! So we just put JS, images and CSS into a dir in the theme and publish it all in one go - all relative paths for CSS and JS stay in place. Every theme just has its own layout and views, so we are able to tune the HTML for every site individually and they do not share any common HTML, JS or CSS (probably we will figure something out later with code reuse when we start making additional sites), but still we have our thing done.
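For reference, the whole-directory publishing described above boils down to something like this (the theme alias and file names are illustrative):

```php
<?php
// Publish the theme's asset directory in one go; relative url() paths
// inside the CSS keep working because the directory is copied as-is.
// The 'webroot.themes.mytheme.assets' alias is illustrative.
$assetsUrl = Yii::app()->assetManager->publish(
    Yii::getPathOfAlias('webroot.themes.mytheme.assets')
);

Yii::app()->clientScript->registerCssFile($assetsUrl . '/css/main.css');
Yii::app()->clientScript->registerScriptFile($assetsUrl . '/js/app.js');
```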

What I’m wondering is how you republish the assets when the asset is a whole dir with many files and only 1 or 2 files out of the whole set change - how do you detect that change? An even trickier question concerns the published folder name in assets - it’s based only on basename($dir). So even if we modify some files in the folder and touch the folder to change its modification time, nothing will be published under a new assets URL (as a form of versioning).

Checking for file modifications inside the dir we want to publish is obviously out of the question, but the framework could take into account the filemtime of the dir so it can be updated automatically and under a new URL (so we can make asset resource caching permanent, or at least very long, which you usually want). Implementing a touch on a dir on deployment is really a no-brainer, and the benefits are hard to ignore. Besides, it will make things even more powerful :)
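A quick sketch of that suggestion (hypothetical subclass; it assumes the same overridable hash() method as in the version sketch above):

```php
<?php
// Hypothetical: include the published path's own mtime in the hash, so
// that touching the directory on deployment yields a fresh assets URL.
class TouchAwareAssetManager extends CAssetManager
{
    protected function hash($path)
    {
        return sprintf('%x', crc32($path . Yii::getVersion() . @filemtime($path)));
    }
}
```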

A couple of thoughts regarding asset management.

First off, we could offer a wealth of features by integrating a third-party component like this one - no point in reinventing the wheel.

The other thing that occurred to me today is the whole idea of publishing files from a protected folder into a public folder, with a generated filename/path based on a hash of the original filename and modification time.

I think it’s a good time to scrutinize this idea and ask whether this design is really good or right.

For one, what happens when you clear your assets folder? Currently, any page that has cached an asset URL is going to get a 404 if the URL of that asset changes.

Perhaps this is atypical, and you may not care, but in terms of search indexing the changing URLs could also be an issue - let’s say Google indexes an image you published, perhaps your company logo, and the URL keeps changing, with the old image becoming a 404. (It might even affect search engine ranking, not sure.)

As mentioned above, I like the idea of producing assets on-demand, via a controller.

Let’s say the URL “http://site.com/assets/28b3cc8e/yii/jquery.js” mapped to an actual controller, which then generated that file. When that file has been updated and the hash no longer matches, that controller could instead return a permanent redirect, thus solving the issue with 404 errors.

In effect, you can put whatever you want in the hash then; it’s no longer important - search engines can update as needed, and browsers showing a cached page will be redirected to the latest version of each asset.

Of course, this means there has to be some kind of unique identifier for the asset-collection (e.g. module-name) in every URL, since the hash no longer means anything and is only used as a means of forcing cache updates on browsers and other agents.
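A rough sketch of that fallback, assuming a catch-all rule such as 'assets/<hash:\w+>/<path:.+>' => 'asset/resolve' that only fires when the requested file is missing on disk; everything here (AssetController, currentHashFor()) is hypothetical:

```php
<?php
class AssetController extends CController
{
    public function actionResolve($hash, $path)
    {
        // Hypothetical lookup: map the asset collection named in $path
        // (e.g. "yii") to whatever hash is currently published for it.
        $currentHash = $this->currentHashFor($path);

        if ($currentHash === null) {
            throw new CHttpException(404, 'Unknown asset.');
        }

        if ($hash !== $currentHash) {
            // Stale URL: redirect agents to the current version
            // instead of handing out a 404.
            $this->redirect(
                Yii::app()->baseUrl . "/assets/$currentHash/$path", true, 301
            );
        }

        // Hash matches but the file is missing on disk: (re)publish it
        // here and serve it, or 404 if that is impossible.
        throw new CHttpException(404, 'Asset not published.');
    }

    private function currentHashFor($path)
    {
        // Placeholder: real code would consult the asset manager or a
        // registry of collection name => current hash.
        return null;
    }
}
```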

Thoughts?

Two things that come to my mind:

  1. A permanent redirect might be the wrong answer. If a resource has changed, it is necessary to force the client to fetch the new version. I don’t know if there are appropriate headers for this purpose.

  2. Runtime overhead. Links to published resources should be handled by the web server only, without the need to invoke PHP, Yii or anything else. A fallback that invokes an action for assets that no longer exist could help in some cases. But this requires URL rewriting, and not everyone might have access to the web server config. So I think this should only be an optional feature.

Might be, I’m not 100% sure. My thinking here is that once the resource has been updated, its URL changes to something new, and so the new URL is “permanent” until (or unless) the resource changes again. There may be a more appropriate 3xx status code for this.

See my explanation above regarding dynamically generated images - I use this technique currently, for precisely that reason: there is no runtime overhead, PHP doesn’t even load, except the very first time a client asks for a resource.

I generate the image on demand the first time; on subsequent requests, Apache sees that the file now exists, and PHP never gets invoked. If you purge the folder of generated resources, those files will simply be regenerated on the next request - in other words, it’s perfectly safe to purge the cache folder.

The same pattern would work for any kind of resource - though some resources, such as stripped and minified JavaScript files, would require some configuration, e.g. to specify which files to strip and how to name the collected resource… When the resource is purged, the controller goes back to that configuration and figures out how to assemble the requested resource - this can be as labor-intensive as it needs to be, since it’ll only happen once…

Thanks for considering the best way to implement this, guys. If there were a way to tie this functionality into AWS’s S3 and CloudFront, that would be FANTASTIC!