Caching for Fun and Profit
If you're a content publisher or are responsible for coding a content retrieval engine, you should know how web content is cached, especially if you publish RSS/Atom feeds, and ESPECIALLY if you are writing an RSS client. Unfortunately, the interactions between the server, the client, and proxy servers/web caches can be confusing. Fortunately, Mark Nottingham (who is also very active in the Atom community) has written an excellent document called "Caching Tutorial for Web Authors and Webmasters" that is required reading for anyone coding a feed aggregator or running a blog hosting site.
Having written a few web caching engines before, I'd just like to add a few suggestions that I've found useful:
- The "last modified" date should always reflect the server's clock, while a "last fetched" date (for secondary cache expiration) should always use the client's clock. Don't get lazy: when you request a remote resource, store the Last-Modified value the server returns and send that same value back in your If-Modified-Since header next time around. Don't substitute your local clock.
- If you're on the server side of things and need to provide a Last-Modified header, truncate the date value you store (and send down) to whole seconds. HTTP dates carry no milliseconds, so a stored value with a millisecond part will never compare equal to the If-Modified-Since value the client sends back. I picked that tip up from Jason Hunter's book Java Servlet Programming. That means if you've just grabbed the local system time to mark when a page or item has been modified, you should do something like this before you set the value and send it down in the HTTP header:
    long modifiedTime = System.currentTimeMillis() / 1000 * 1000;
- There's nothing wrong with setting the ETag header on a response to be the same as the Last-Modified header in quotes.
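To make the server-side suggestions a bit more concrete, here is a minimal sketch of how they might fit together. The class and method names (HttpValidators, truncateToSeconds, lastModifiedHeader, etagFor, isNotModified) are made up for illustration. The sketch assumes the If-Modified-Since value has already been parsed to epoch milliseconds, with -1 meaning the header was absent (the convention used by the servlet API's getDateHeader). One deliberate variation: the ETag here is the truncated timestamp as a number in quotes rather than the formatted date string, which keeps the tag free of spaces.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper class; every name here is illustrative, not from any library.
final class HttpValidators {

    // Fixed-width HTTP date in GMT, e.g. "Sun, 09 Sep 2001 01:46:40 GMT".
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Drops the millisecond part, because HTTP dates only carry whole seconds.
    static long truncateToSeconds(long epochMillis) {
        return epochMillis / 1000 * 1000;
    }

    // Value for the Last-Modified response header.
    static String lastModifiedHeader(long lastModifiedMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(lastModifiedMillis));
    }

    // Assumption: the ETag is just the truncated timestamp wrapped in quotes.
    static String etagFor(long lastModifiedMillis) {
        return "\"" + lastModifiedMillis + "\"";
    }

    // True when the server may answer 304 Not Modified.
    // ifModifiedSinceMillis is -1 when the request had no If-Modified-Since header.
    static boolean isNotModified(long ifModifiedSinceMillis, long lastModifiedMillis) {
        return ifModifiedSinceMillis != -1 && lastModifiedMillis <= ifModifiedSinceMillis;
    }

    public static void main(String[] args) {
        long modifiedTime = truncateToSeconds(System.currentTimeMillis());
        System.out.println("Last-Modified: " + lastModifiedHeader(modifiedTime));
        System.out.println("ETag: " + etagFor(modifiedTime));
    }
}
```

Without the truncation, a stored time such as 1000000000123 would always be later than the whole-second value the client echoes back, so the comparison in isNotModified would never succeed and the server would resend the full page every time.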
One of the smart things that most of the blogging engines do (thank you Blogger for establishing the precedent) is to provide the illusion of dynamic content by generating static pages, and then let the smart folks who wrote the HTTP server figure out which caching parameters to set. That doesn't mean the server configuration can't be tweaked, though, especially the Cache-Control header.
Oh, don't forget about supporting gzip compression.
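On the client side, gzip support and the conditional-request advice from the list above fit together naturally, so here is a rough sketch of both. It is only an illustration: FeedFetchSupport and its methods are hypothetical names, the header values are assumed to be kept exactly as the server last sent them, and the actual network call is left out so the example stays self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical client-side helper; names are illustrative only.
final class FeedFetchSupport {

    // Builds request headers from validators stored verbatim from the last response.
    // Pass null for a validator the server never sent.
    static Map<String, String> requestHeaders(String storedLastModified, String storedEtag) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Encoding", "gzip");
        if (storedLastModified != null) headers.put("If-Modified-Since", storedLastModified);
        if (storedEtag != null) headers.put("If-None-Match", storedEtag);
        return headers;
    }

    // Decompresses the body only when the response says Content-Encoding: gzip.
    static byte[] decodeBody(byte[] body, String contentEncoding) {
        if (contentEncoding == null || !"gzip".equalsIgnoreCase(contentEncoding.trim())) {
            return body;
        }
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Compresses bytes; used here only to simulate a gzip-encoded response.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] wire = gzip("<feed/>".getBytes(StandardCharsets.UTF_8));
        byte[] body = decodeBody(wire, "gzip");
        System.out.println(new String(body, StandardCharsets.UTF_8));
        System.out.println(requestHeaders("Sun, 09 Sep 2001 01:46:40 GMT", null));
    }
}
```

Keeping the stored validators as opaque strings means the client never reformats the server's date, which avoids the local-clock mistake described earlier.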