Designers, developers, system administrators and companies spend a lot of time optimizing their networks and content delivery systems to achieve the best performance with the fewest resources possible. At first there was a simple browser/proxy cache scheme; then the “If-Modified-Since” and ETag HTTP headers were introduced. Nowadays CDNs seem to be blooming, and all the major content providers (e.g. Google, Microsoft, Yahoo) have their own set up, not only to optimize their content but also to serve copies of popular pieces of “content”, which is to say, popular script libraries like jQuery or YUI.
It isn’t yet a perfect system, mainly because browsers use URLs to match content, which means a browser will treat any URL not found in its cache as something completely new. This forces the browser to store two copies of content that may very well be exactly the same. Since sites usually link to a specific version of a library, to avoid breaking their code on an update, this will always be a bit of a problem even when the file names or paths are similar.
ETags can’t provide a global way to identify files uniquely because each HTTP server generates them differently. Likewise, the “If-Modified-Since” header relies on date and time, and there is no way to ensure that every server online keeps a synchronized clock, even if time zones were always accounted for. The most promising solution, and one fairly secure against tampering, seems to be hashing a file while caching it. A hash provides a unique, global way of identifying a file and, when coupled with the file size, becomes quite resistant to hash collisions (while also raising security a notch or two).
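The hash-plus-size fingerprint described above can be sketched in a few lines. The article doesn’t name a specific hash function, so SHA-256 is an assumption here:

```python
import hashlib


def content_fingerprint(data: bytes) -> tuple[str, int]:
    """Identify a cached file by its digest plus its size.

    Pairing the digest with the exact file size makes accidental (or
    forged) collisions even less likely, since a tampered copy would
    have to match both the hash and the length.
    """
    return hashlib.sha256(data).hexdigest(), len(data)


# Hypothetical cached script content, used only for illustration.
script = b"window.jQuery = {};"
digest, size = content_fingerprint(script)
print(digest[:16], size)
```

Two cached copies with the same fingerprint could then be treated as the same content regardless of the URL they were fetched from.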
Following current schemes, the browser would then need to send the hash of the file when requesting it. This, however, isn’t practical: sending a list with the hashes of all cached files would 1) leak browsing history and 2) be completely impractical due to the sheer number of cached files.
To uniquely identify a piece of content across a system, it is common practice to use UUIDs (or GUIDs in Microsoft lingo). UUIDs are used by various OS components (e.g. file systems) and other (distributed) systems (e.g. databases), so it’s safe to assume the same could easily be used to identify a script across multiple CDN networks while remaining 1) manageable by content authors, 2) backward compatible and 3) easy for users to implement. Ideally, the HTML script tag (among others) would gain an attribute for declaring a UUID when importing content. In the short term, however, a simple convention like “script.js?randomparams&cache_guid=…” could let browsers bootstrap the process.
On the practical side of things, a single UUID wouldn’t suffice for most users/authors. A single UUID would mean a single (universal) version of the content, whose freshness the site could never be sure of. Instead, the author would provide several UUIDs tied to versions of the content, for example one for the major version and another for the minor version, so that developers could lock browsers to a specific minor or major version. The list of possible alternates would be sent to the server, allowing the best match to be returned.
The server would, however, need to know the UUID for each piece of content. This could be achieved by letting authors place a .uuid file with the same name as the content, which the server would pick up and use to make the matches.
The whole process could be summarized as:
- A user visits a site that provides the script identified by its hash, file size and GUID.
- Another page requests a script with the same hash, file size and GUID on any URL, and the browser sends the three pieces of data in the header.
- Optionally, the exact script version requested isn’t cached but is found in the list of alternates.
- Just like with ETags, the server can send an updated copy or a 304 (Not Modified) status code.
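The steps above can be sketched as a server-side check. The `Cache-Hash`, `Cache-Size` and `Cache-UUID` header names are hypothetical; the article names the three values but not the header fields:

```python
def validate(headers: dict[str, str], record: dict[str, str]) -> int:
    """Return 304 when the browser's cached copy matches the server's
    record for the resource, else 200 (send the updated content)."""
    matches = (headers.get("Cache-Hash") == record["hash"]
               and headers.get("Cache-Size") == record["size"]
               and headers.get("Cache-UUID") == record["uuid"])
    return 304 if matches else 200


# Placeholder record for the resource currently served.
record = {"hash": "h1", "size": "70560", "uuid": "uuid-jquery-1.4.2"}
print(validate({"Cache-Hash": "h1", "Cache-Size": "70560",
                "Cache-UUID": "uuid-jquery-1.4.2"}, record))  # → 304
```

As with ETags today, a 304 carries no body, so a cache hit costs only one small round trip.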
There are apparent issues with the presented scheme, the first being that the process requires server support to work, which makes it entirely dependent on (both server and browser) software vendors to implement.
The second issue is that unless all the CDNs (and sites) updated their libraries at the same time, there would be times when the browser sends the hash, gets a “newer” file, then repeats the process on another server, looping until both serve the same version. This, however, need not happen: by sending the hashes of all the versions available in its cache, the browser lets the server simply reply with the hash of the version to be used.
Finally, a third issue arises from the proposed solution to the previous problem. If the browser sends the hashes of the cached versions of a file, that information could be matched against a list of known sites, creating a problem similar to the a:visited exploit. This, however, is only an issue if abused by developers: with ETag and If-Modified-Since available for local or private resources, there is little point in applying this scheme to them. This cache system is meant for pieces of content that should be widely available, not for content that lets third parties uniquely identify a user based on their browsing history or patterns.