Sorry, for the brain dump that covers so many facets of caching all in
one really long E-mail. I spent the day wading around in the cache
manager code and ended up with a list of burning questions.
Well, I left my cache max size at 100 so far today and I haven't gotten
any nibbles. It looks like the LastAccessTimeout errors yesterday don't
want to happen again. A lot of our weird errors come from screwy bot
traffic (like one today that was truncating our proxy URLs to 255
chars). The errors yesterday sure seemed to be scattered across regular
users, but we dump out a cookie value in our bug report and it was
undefined in every instance of the error. That either means the clients
that caused the erroring requests didn't accept cookies, or it was their
first hit to the site.
I set up some tests on dev to pummel the cache manager for a while to
try and see if it was a race condition. I set my max cache size to 50,
and created a simple template that expired the entire cache, forced a
reap, shoved 500 random things in the cache, then expired it all again
and forced another reap. (I temporarily modified reap() to always run
regardless of last reap time)
I ran the template up to 8 times at once looking for concurrency
problems. I did find a couple things, but the LastAccessTimeout error
didn't happen at all.
Concurrency issue: I found that if anything is looping over the cache
meta data (like the cache policy) while reaping is going on, an item may
be deleted from the cache meta data causing an error when that item is
reached in the looping concurrent thread. All cache policies do this in
their execute method:
var objStruct = getCacheManager().getObjectPool().getpool_metadata();
This returns a pointer to the pool meta data that the reap method could
be clearing out via ObjectPool.clearKey(). One solution to this would
be to change the cache policies to run this code:
var objStruct = getCacheManager().getpool_metadata();
since the getpool_metadata method in the cacheManager duplicates the
structure first.
Several of the methods in the cache manager are probably susceptible to
race conditions since they access the pool metadata like
expireByKeySnippet(). Obviously, duplicate wouldn't work there since
the actual meta data needs to be modified.
The alternative (and probably better) solution would be to put an
exclusive named lock around all changes to the cache meta data and a
read only lock around the places that simply use it to make sure an item
isn't being cleared at the same time another method is trying to read or
modify it. This would avoid the overhead of duplicate and prevent dirty
reads from the meta data.
The other thing I found while running my tests and watching stack traces
isn't very fair since I was placing far more items in the cache than a
normal application would. Nevertheless, I would highly recommend
dumping use of the createuuid() method. Expiring a key calls it to get
a unique thread name in MTCacheManager. When a lot of cache expirations
were happening, the performance absolutely sucked and the stack traces
were consumed by that sordid createUUID function. I temporarily
replaced it with org.safehaus.uuid.UUIDGenerator and it helped a
significant amount. Another place I saw in my stack traces when I was
creating ungodly amounts of cache entries was the createUUID function
call in the frameworkSupertype.cfc inline pseduo-constructor.
Two more quick question about how cache eviction and reaping is handled.
The first is about how the LRU chooses the best object to expire from
the cache. Assuming LRU still means "Least Recently Used", I would have
expected it to sort the meta data on the lastAccesed key since the
object with the oldest lastAccesed timestamp would be the least recently
used. Instead, it sorts on the LastAccessTimeout key even though an
object in the cache with a small LastAccessTimeout value could still
have been used very recently. Even worse, most of my objects in the
cache use the default LastAccessTimeout timeout which causes them to
sort the same. Therefore, the LRU policy essentially is randomly
choosing an item to expire regardless of when it was last used. Am I
missing something about the LRU policy?
Additionally, I noticed that if you are putting objects in the cache
faster than reaping is running, the cache will grow larger than its
maximum size and never shrink back down until the objects expire
naturally. For example, let's say your eviction policy is LRU, your max
cache size is 100 and you have a full cache with 100 items in it. A
page is run that places 25 objects in the cache. since the cache is
full, each call to the cachemanager.set() will in turn call
instance.evictionPolicy.execute() which will expire 1 item per call.
So in theory, one item will be evicted for every item I place in the
cache after it is full so once the cache is reaped, I will not be over
my max cache size. The problem is, the cache policies do not check to
see whether an object is already ready for reaping. Subsequently, it
will expire the same item over and over again until it is reaped. So if
I add 25 items to a full cache, 1 object (the one that sorts first) will
be expired 25 times and then reaped at a later date leaving 124 items in
the cache even though the limit is 100. It seems that the eviction
policy should skip over cache objects that are already expired since
they are dead men walking already and keep looking until it finds the
next best match that isn't expired. A drawback of this is the fact that
the cache policy would need to duplicate the reaping logic to know which
items are ready to be reaped. Alternatively, instead of faking the
timeout and datecreated (which keeps the FIFO policy from working
correctly) there could be another piece of meta data called "evicted"
which defaulted to false and was set to true when the cache policy axed
an item. The reap method would pick those up and they would be readily
identifiable by the cache policy so they wouldn't be evicted needlessly
again. This would help ensure that one good cache item was let go for
each new item that was added over the limit.
Ok, I lied-- one more question. Is it a correct statement that the max
cache size doesn't apply to objects with a timeout of 0? If I have a
max cache size of 100, and proceed to cache 200 objects with a timeout
of 0, the eviction policy would get run 100 times, but would never
expire anything since it skips objects with no timeout. This would
leave the cache at a size of 200. Is that correct?
Ok, I'm done. Let me know what you think and feel free to correct me if
I have wrongly interpreted any of the code.
~Brad