Making My Weblog Cacheable

Before things start getting too nuts here at work, I wanted to post about some things I did yesterday to make my website more cacheable by Google. I recently asked here if anyone had a clue what I was doing wrong, and I finally got the answers I was looking for from Mike Moran. He sent me a detailed email educating me on the topic with some great pointers and helped me do the head-smack which should fix the problem:

Hi Russ. I was just skimming your entry[1] and you mentioned Google caching. I thought I'd have a quick look at that page via wget:

Resolving www.russellbeattie.com... done.
Connecting to www.russellbeattie.com[198.78.65.25]:80... connected.
HTTP request sent, awaiting response...
1 HTTP/1.1 200 OK
2 Date: Mon, 03 Feb 2003 09:09:43 GMT
3 Server: Orion/1.5.2
4 Content-Location: http://www.russellbeattie.com/notebook/
5 Content-Length: 10300
6 Set-Cookie: JSESSIONID=BOAKLABJFBIO; Path=/
7 Cache-Control: private
8 Connection: Close
9 Content-Type: text/html
...

There are two things to note here: "Set-Cookie" and "Cache-Control: private". Both of these can make your page uncacheable. In particular, "Cache-Control: private" means that "this page can only be cached in client-side caches". Whether Google, or some off-the-shelf software it uses, regards its cache as being "client-side", I don't know. Anyway, you might want to consider setting it to "Cache-Control: public" (I think).

I ran the notebook section through a Cachability checker[2] and it came up with advice for http://www.russellbeattie.com/notebook/:

"This object will be considered stale, because it doesn't have any freshness information assigned. It doesn't have a validator present. It won't be cached by proxies, because it has a Cache-Control: private header. This object requests that a Cookie be set; this makes it and other pages affected automatically stale; clients must check them upon every request. It doesn't have a Content-Length header present, so it can't be used in a HTTP/1.0 persistent connection."

If you have Safari, you may want to check out the O'Reilly book "Web Caching"[3].

I'm not sure how useful this advice is if you only want to be cached by Google. However, I would say that if you are generally cacheable by a proxy then Google will cache you.

There you go... Oh, interesting blog by the way.

[1]: http://www.russellbeattie.com/notebook/20030203.html#010018
[2]: http://www.ircache.net/cgi-bin/cacheability.py
[3]: http://www.oreilly.com/catalog/webcaching/

Mike

Thanks a TON, Mike! The ircache.net site was incredibly useful for tracking down the issues and giving me a clue. You really saved my day yesterday!

Ready for the principal problem? I had sessions enabled on the website by default! Duh! Every single person or aggregator (ugh) has been receiving a session cookie along with my pages. This is a pretty dumb-ass thing to do, and as soon as I saw the cookie info above I realized what was going on. I'm not sure why this hasn't dawned on me in the past SIX months, but it took all of 10 minutes to test and fix. That was the big problem; with the info above I then tweaked the pages a bit more so they look even more static, and thus more cacheable by proxies and by crawlers like Google.

So here's what I did: First I turned off the default session in the JSP, using the session attribute of the page directive, and bumped up the buffer so that OrionServer can calculate the size of the page and send out a Content-Length header before streaming:

<%@ page language="java" import="java.sql.*;..." session="false" buffer="60kb"%>

That change alone made a dramatic difference in how proxies see the pages. Now the pages aren't marked as "private", cookies aren't automatically sent (or session IDs appended to the URLs in case you have cookies off), and the Content-Length for my main page shows up instead of being blank. Very nice. This should also reduce memory use on my server and speed up page delivery. It's a pretty basic thing to do and I should have thought of it months ago, but for some reason it escaped me.
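For anyone who wants to check the same three things on their own pages, here is a minimal sketch of the test in plain Java. The class name CacheHeaderCheck and its problems() method are made up for illustration; they aren't part of my weblog code or of any library, and the sketch only looks at the headers discussed above, not everything a real cacheability checker examines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not from the weblog code): flags the response
// headers that were making the pages look uncacheable.
final class CacheHeaderCheck {

    // Takes header names mapped to values and returns the problems found.
    static List<String> problems(Map<String, String> headers) {
        // HTTP header names are case-insensitive, so normalise them first.
        Map<String, String> lower = new HashMap<>();
        for (Map.Entry<String, String> e : headers.entrySet()) {
            lower.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }

        List<String> found = new ArrayList<>();
        String cacheControl = lower.get("cache-control");
        if (cacheControl != null
                && cacheControl.toLowerCase(Locale.ROOT).contains("private")) {
            found.add("Cache-Control: private");
        }
        if (lower.containsKey("set-cookie")) {
            found.add("Set-Cookie present");
        }
        if (!lower.containsKey("content-length")) {
            found.add("no Content-Length");
        }
        return found;
    }

    public static void main(String[] args) {
        // Headers taken from the wget output in Mike's email.
        Map<String, String> before = new HashMap<>();
        before.put("Content-Length", "10300");
        before.put("Set-Cookie", "JSESSIONID=BOAKLABJFBIO; Path=/");
        before.put("Cache-Control", "private");
        System.out.println(problems(before));
    }
}
```

Fed the headers from the wget output, it should report the private Cache-Control and the cookie; an empty list means none of the three problems is present.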

I did have to make a few simple changes to the code, because any reference to the session object would cause errors now that the session wasn't automatically being created. Basically, in the two places I check for the session (when posting) I added "session = request.getSession(false);", which returns the existing session if there is one, or null if there isn't, without creating a new one. That lets me check whether a session is available. I think the implementation of this code would change a bit depending on the container, but that's how I ended up doing it on Orion.

The final piece of the puzzle is the Last-Modified timestamp. Since I have the posting information for all the pages anyway, I simply added a couple of lines where I iterate through my posts:

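if (count == 1) {
    // The first row is the newest post (assuming the query returns newest first),
    // so its creation time doubles as the page's Last-Modified date.
    modified = rs.getTimestamp("created").getTime();
    response.setDateHeader("Last-Modified", modified);
}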

This makes the Last-Modified date the same as my last posting, which is nice and accurate. I also overrode the getLastModified() servlet method and added the code there, but that doesn't seem to do much; I'm not sure why.
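For reference, setDateHeader() takes milliseconds since the epoch and the container turns that into an HTTP date string. The sketch below shows roughly what ends up on the wire, using plain java.time instead of the servlet API. HttpDates and its two methods are names I made up for illustration. One detail worth knowing: HTTP dates only have one-second resolution, so a millisecond timestamp from the database loses its fraction in the header.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper (not part of the servlet API): shows what a
// Last-Modified value looks like once it is written as an HTTP date.
final class HttpDates {

    // Fixed-width HTTP date format, always expressed in GMT.
    static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Formats milliseconds since the epoch as an HTTP date string.
    static String format(long epochMillis) {
        return HTTP_DATE.format(Instant.ofEpochMilli(epochMillis));
    }

    // Drops the millisecond part, since HTTP dates cannot carry it.
    static long truncateToSeconds(long epochMillis) {
        return Math.floorDiv(epochMillis, 1000L) * 1000L;
    }

    public static void main(String[] args) {
        long created = Instant.parse("2003-02-03T09:09:43Z").toEpochMilli();
        System.out.println("Last-Modified: " + format(created));
    }
}
```

If you ever compare a stored timestamp against a date a client sends back, compare the truncated value; otherwise the stored time can look newer than the header by a fraction of a second.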

So, this helped a lot in some ways, but created problems in others. First, I'm not done yet. If you check my cacheability, you'll see that for some reason the Last-Modified date can't be "validated", and I'm not sure why. But I think it's good enough to start getting cached, so I'm not horribly worried about it. Second, now that the pages seem more static, Mozilla isn't refreshing them as often as it should: specifically, the comment counts under each post and the add/edit links that appear after I log in go stale. This is probably because of my fake Last-Modified date, which I'll have to play with. But I'm making headway, which is nice.
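One possible explanation for the stale pages, offered as a guess rather than a diagnosis: when a response has a Last-Modified date but no explicit expiry, HTTP/1.1 allows a cache to estimate how long the page stays fresh, and RFC 2616 (section 13.2.4) mentions 10% of the time since the last modification as a typical fraction. Whether Mozilla uses exactly that fraction is an assumption here. The class and method names below are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of heuristic freshness: how long a cache might
// treat a page as fresh when only Last-Modified is available.
final class HeuristicFreshness {

    // Assumption: the "typical" 10% fraction mentioned in RFC 2616;
    // real browsers and proxies may pick a different value or cap it.
    static final int DIVISOR = 10;

    static Duration lifetime(Instant lastModified, Instant responseDate) {
        Duration sinceModified = Duration.between(lastModified, responseDate);
        if (sinceModified.isNegative()) {
            return Duration.ZERO; // Last-Modified in the future: treat as stale
        }
        return sinceModified.dividedBy(DIVISOR);
    }

    public static void main(String[] args) {
        Instant lastPost = Instant.parse("2003-02-03T09:00:00Z");
        Instant fetched = Instant.parse("2003-02-03T19:00:00Z");
        System.out.println(lifetime(lastPost, fetched));
    }
}
```

Under that assumption, a page whose newest post is ten hours old could be reused for about an hour without the browser asking again, which would be enough to hide new comments and the post-login links.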

The final thought on my mind is HTTP/1.1 conditional GET. It would be nice if OrionServer only sent a page when it had actually changed, especially to aggregators, but because everything is dynamic, I'm sure the data is always being generated and sent regardless of what the client asks for in its GET request.
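The decision itself is simple enough to sketch in plain Java. In a servlet or JSP, the date would come from request.getDateHeader("If-Modified-Since") and the answer would be sent with response.setStatus(HttpServletResponse.SC_NOT_MODIFIED); here the header is parsed by hand so the example stands alone. ConditionalGet and statusFor() are hypothetical names, and this is only one way to do the comparison, not something Orion does for you.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical sketch of the conditional GET decision: answer 304 when
// the client's copy is at least as new as the newest post.
final class ConditionalGet {

    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    // lastModifiedMillis: time of the newest post, in ms since the epoch.
    // ifModifiedSince: raw If-Modified-Since header value, or null if absent.
    static int statusFor(long lastModifiedMillis, String ifModifiedSince) {
        if (ifModifiedSince == null) {
            return OK; // unconditional request: send the full page
        }
        long clientCopyMillis;
        try {
            clientCopyMillis = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant()
                    .toEpochMilli();
        } catch (DateTimeParseException e) {
            return OK; // unreadable date: ignore the condition
        }
        // HTTP dates have one-second resolution, so drop the milliseconds
        // before comparing.
        long lastModifiedSeconds = Math.floorDiv(lastModifiedMillis, 1000L) * 1000L;
        return lastModifiedSeconds <= clientCopyMillis ? NOT_MODIFIED : OK;
    }

    public static void main(String[] args) {
        long newestPost = ZonedDateTime
                .parse("Mon, 03 Feb 2003 09:09:43 GMT", DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
        System.out.println(statusFor(newestPost, "Mon, 03 Feb 2003 09:09:43 GMT"));
    }
}
```

The query for the newest post's timestamp would still have to run on every request, but when the answer is 304 the rest of the page never needs to be rendered or sent.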

Any Java server experts out there have comments?

-Russ
