We recently ran into an issue where our use of session locks was causing problems with the flash plugin. We had flash configured to use the session scope, and noticed that when concurrent requests arrived for the same session, the second request would sometimes hit lock timeouts within the framework code because our application’s code still held the session lock on the first request. This usually happened when an AJAX request was long-running for whatever reason.
Our application’s code is currently split between ColdBox and model-glue; we’re in the process of converting it all to ColdBox, but we have limited resources and have to maintain the legacy model-glue code until the conversion is complete. Because of this we can’t get away from session locks in our code until we’re completely on ColdBox. At that point we can look at a better way to architect it, possibly leveraging a CB feature to avoid session locks while still ensuring safety under same-session concurrency for certain code paths.
Our solution was to try switching the flash scope from ‘session’ to ‘cache’. We tested that out on our development machines and didn’t notice any problems, and so rolled the change out to our production servers. After that the session lock timeouts went away, but we then began seeing intermittent issues where requests would error in system\web\flash\AbstractFlashScope.cfc, usually on line 84 but occasionally on line 108. We’re running ColdBox version 3.6.0. The error message is: “The value returned from the getFlash function is not of type struct.”
In looking into the code there, here’s what I think is going on. Since we’ve configured it to use ‘cache’, I believe the call to getFlash is hitting ColdboxCacheFlash.cfc’s implementation of getFlash. I suspect that the flashExists call on line 77 returns true, and that the ‘get’ call on line 78 is invoking the get function in CacheBoxProvider.cfc which explicitly returns nothing in the ‘miss’ case. This would definitely cause the value returned from getFlash to not be a struct and cause the exception we’re seeing.
We have two main applications that we maintain, and are seeing the errors from both of them (for one of the applications we are not yet using the flash plugin in our application code, so I suspect it’s just ColdBox’s internal use of flash). We have on the order of dozens up to a few hundred concurrent users at a time, and we’ve only seen this error a total of 60-70 times over the last 24 hours since we switched the flash configuration from session to cache. We also note in our logs that it always happens on an AJAX request - often the same event, but sometimes other AJAX events. Because of this I think it’s a case of ‘perfect storm’ timing on multiple concurrent requests from a given session whenever it occurs.
I suspect that the flashExists check on line 82 in AbstractFlashScope returns true for one thread, and that before that thread retrieves the scope by calling getFlash on line 84, another thread runs removeFlash on line 96, so that when the first thread resumes the resource is gone. There are a couple of named locks (one in ConcurrentSoftReferenceStore.cfc -> lookup, for the flashExists call, and another in ConcurrentStore.cfc -> get, for the getFlash call), but they are separate and the first is released before the call to getFlash begins, so neither lock ensures that the resource is preserved from the time the flashExists call returns until the time the getFlash call returns.
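To make the interleaving I have in mind concrete, here is a minimal sketch in Python rather than CFML. It is not ColdBox code: ToyCache, get_flash and the between_check_and_get hook are hypothetical stand-ins, and the hook just forces the "other request removes the flash" step to land between the existence check and the get.

```python
# Language-neutral sketch of the suspected race; NOT ColdBox code.
# ToyCache, get_flash and the hook parameter are hypothetical stand-ins.

class ToyCache:
    """Cache whose get() returns None on a miss, like the 'miss' case described above."""

    def __init__(self):
        self._items = {}

    def set(self, key, value):
        self._items[key] = value

    def lookup(self, key):
        # Existence check, analogous to the flashExists step.
        return key in self._items

    def get(self, key):
        # Returns None when the key is absent.
        return self._items.get(key)

    def clear(self, key):
        # Analogous to another request removing the flash.
        self._items.pop(key, None)


def get_flash(cache, key, between_check_and_get=None):
    """Check-then-act: an existence check followed by a separate get."""
    if cache.lookup(key):
        if between_check_and_get is not None:
            # Simulates a second request running in the gap.
            between_check_and_get()
        return cache.get(key)
    return {}


KEY = "flash:session1"
cache = ToyCache()
cache.set(KEY, {"notice": "saved"})

# No interleaving: the caller gets the stored struct back.
normal = get_flash(cache, KEY)

# Interleaved removal: the check passed, but the get now misses.
raced = get_flash(cache, KEY, between_check_and_get=lambda: cache.clear(KEY))

print(type(normal).__name__, type(raced).__name__)  # dict NoneType
```

The second call is the case I believe we are hitting: the check says the flash exists, but the caller ends up with a non-struct value.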
One solution might be to modify the getFlash implementation in ColdboxCacheFlash.cfc to verify that the value returned from the get call is a struct, and if not, then return structnew() as it does in the case that flashExists returns false. This feels like a workaround, though, since it only addresses this one case and there are several places that call flashExists and follow it with code that assumes that it does exist, where for concurrent requests it may not. It seems like maybe a better approach would be a more comprehensive thread-safe one. Some of this is just guessing, since I’m not super-familiar with the internals of the framework.
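For comparison, here is a rough sketch of the two options as I understand them, again in Python with made-up names (ToyCache, VanishingCache, get_flash_guarded and LockedFlash are all hypothetical, not framework classes). The first is the narrow workaround: validate what the get returns and fall back to an empty struct. The second is the broader idea: put the check-and-get and the remove under one shared lock so they cannot interleave.

```python
import threading

# Sketch only; NOT ColdBox code. All names here are hypothetical.

class ToyCache:
    """Cache whose get() returns None on a miss."""

    def __init__(self):
        self._items = {}

    def set(self, key, value):
        self._items[key] = value

    def lookup(self, key):
        return key in self._items

    def get(self, key):
        return self._items.get(key)

    def clear(self, key):
        self._items.pop(key, None)


class VanishingCache(ToyCache):
    """lookup() answers, then the entry disappears, mimicking a concurrent remove."""

    def lookup(self, key):
        found = super().lookup(key)
        self.clear(key)
        return found


def get_flash_guarded(cache, key):
    """Workaround: never trust the existence check; validate the value instead."""
    value = cache.get(key) if cache.lookup(key) else None
    return value if isinstance(value, dict) else {}


class LockedFlash:
    """Broader approach: one lock covers check-and-get as well as remove."""

    def __init__(self, cache, key):
        self._cache = cache
        self._key = key
        self._lock = threading.Lock()

    def get_flash(self):
        with self._lock:
            # No remove_flash() can run between these two calls.
            if self._cache.lookup(self._key):
                return self._cache.get(self._key)
            return {}

    def remove_flash(self):
        with self._lock:
            self._cache.clear(self._key)


KEY = "flash:session1"

racy = VanishingCache()
racy.set(KEY, {"notice": "saved"})
guarded_result = get_flash_guarded(racy, KEY)  # {} instead of None

flash = LockedFlash(ToyCache(), KEY)
flash._cache.set(KEY, {"notice": "saved"})
before_remove = flash.get_flash()
flash.remove_flash()
after_remove = flash.get_flash()
```

Even in the locked version a caller can still get an empty struct if the remove wins the race; the difference is that it always gets a struct. Whether a single lock like this is acceptable in the framework, given that lock contention is what we were trying to escape in the first place, is something I can't judge.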
Is there a way to configure either ColdBox or the flash storage to avoid this error under a high load of same-session concurrent requests? As we continue to change our application we’ll be moving more things to AJAX requests, which I think will make these concurrency situations more likely.
Thanks in advance.