RE: [ColdBox 3.5 Final] Strange behaviors after upgrade

Ok, after a significant amount of testing I believe I have determined the source of my server instabilities after our upgrade to 3.5 as well as the change in 3.5 that triggered it.

To start with, I devised a jMeter test that only hit the ColdBox sample applications randomly at about 5 req/sec to test outside of my application. I am attaching the .jmx file for anyone who wants to try it. (You’ll want to edit the property defined in the “Test Plan” node that sets the server used for all the HTTP samplers.) I then monitored running requests, memory usage, and stack traces with SeeFusion.

The symptoms were completely reproducible hitting the ColdBox sample apps with ColdBox 3.5 and when I rolled the ColdBox version back to 3.1, I could run the test overnight without a blip in performance.

To review, the symptoms were hanging requests (always stuck trying to access the CGI scope, or path info, which is pulled from IIS via the JRUN connector), high memory usage, and the server eventually refusing requests and having to be restarted.

If you open the attached image called SeeFusion graphs.png, you will see a test using CB 3.1 and 3.5 side-by-side.

  • In the 3.1 test, it was run overnight with a steady 5-6 requests processed a minute and memory showed a healthy saw-tooth pattern that did not climb.

  • In the 3.5 test, the number of running requests tapered off over the course of an hour until the server was unresponsive. Memory usage was erratic and climbed very high during the test.
    Also, not shown in this screenshot, are 3 requests that hung for 5 minutes apiece during the 3.5 test. Open the attached file “Example long-running request with ColdBox 3.5.txt” to see an example of what they were stuck on while they hung. You can see that CF is trying to connect to IIS to get the real path of the template being processed.
    The breakthrough came when I opened up the CF Server Monitor to take a look at the CFThreads that were running (SeeFusion doesn’t show these unless you look at the full stack trace).
    Look at the attached image called “CFThread queues with ColdBox 3.5.png” and you will see that literally thousands of CFThreads were being spawned by ColdBox, all but 10 of which were being queued of course. After about 5 minutes, the queues would clear out, but memory usage would climb substantially during this time and performance would degrade.

To figure out what was going on during those spikes, refer to the attached file “example full thread dump with ColdBox 3.5.txt” and look at thread “cfthread-3”. You’ll see this was a thread spawned to reap the object store in CacheBox. Before it reaped, it tapped LogBox to log its operation, which hit the AsyncRollingFileAppender. The file appender decided to rotate the log file, and when it tried to create the zip.cfc utility, it hung trying to access the CGI scope. That sort of makes sense, since we’re inside of a CFThread, and those don’t have access to the CGI scope after the initial HTTP request has closed when connecting to the web server via a connector. (http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-75a8.html)
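To make the “no CGI in a CFThread” part concrete, here’s a rough analogy in Python (not ColdFusion code, just an illustration; `request_scope` and `worker` are my own names): data bound to the request’s thread simply isn’t there when a separate worker thread goes looking for it.

```python
import threading

# Rough analogy only: request_scope stands in for CF's per-request CGI
# scope; the worker thread stands in for a spawned CFThread.
request_scope = threading.local()
request_scope.path_info = "/index.cfm"  # set on the "request" thread

seen = {}

def worker():
    # The worker gets its own empty thread-local view, much like a
    # CFThread that can no longer ask the connector for CGI values.
    seen["has_path_info"] = hasattr(request_scope, "path_info")

t = threading.Thread(target=worker)
t.start()
t.join()
print(seen["has_path_info"])  # False: the worker never sees the request data
```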

When the logging CacheBox reap hung, it held open its exclusive named lock on my CacheBox instance, which, in turn, blocked every other CFThread that was spawned in an attempt to reap as well. See the following threads in the full thread dump: cfthread-11, cfthread-16, cfthread-15, cfthread-14, cfthread-8, cfthread-12, cfthread-10, and cfthread-7.
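If it helps to see how a guard could prevent that pileup, here’s a quick sketch in Python (an illustration of the idea only, not CacheBox code; `reap_lock` and the timings are made up): each would-be reaper does a non-blocking acquire and simply bails if a reap is already in flight, so a hung reap can’t queue thousands of threads behind it.

```python
import threading
import time

# Illustrative guard: at most one reap runs; the rest bail out instead
# of queueing behind an exclusive lock held by a slow (or hung) reap.
reap_lock = threading.Lock()
reaped, skipped = [], []

def reap(i):
    # Non-blocking acquire: if another reap is already running, give up
    # immediately rather than waiting in a queue.
    if not reap_lock.acquire(blocking=False):
        skipped.append(i)
        return
    try:
        time.sleep(0.05)  # simulate a slow reap holding the lock
        reaped.append(i)
    finally:
        reap_lock.release()

threads = [threading.Thread(target=reap, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every attempted reap either ran or bailed; none of them sat queued.
print(len(reaped) + len(skipped))  # 20
```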

The long-running requests I was seeing in SeeFusion (which weren’t CFThreads) would happen at the same time. I’m not sure why, since a regular HTTP thread should have access to CGI vars, but I assume it was due to some sort of contention in the JRUN connector. I’m open to the idea that we have something configured wonky there. (CF9.01, IIS7.5)

So, what changed in ColdBox 3.5? Good question. The answer is the addition of lines 587 and 651 in system\cache\providers\CacheBoxProvider.cfc that log an info message. Funny thing is, there’s nothing really wrong with those two lines of code. If you happen to have the asynch rolling file appender configured, then you’ll eventually hit line 73 of system\core\util\Util.cfc which is where the reaping thread would hang every time.

instance.ioFile = CreateObject("java", "java.io.File");

What the hey, there’s nothing related to the CGI scope in that line. We’re just setting something into the variables scope, and variables is searched prior to CGI. Go back to “cfthread-3” in “example full thread dump with ColdBox 3.5.txt” and you’ll see that since variables.instance doesn’t exist yet, CF is searching through all the scopes it can before it gives up and creates variables.instance.

This may be the most subtle and odd fix ever, but I was able to make all of the ColdBox 3.5 issues completely disappear by adding this line to zip.cfc:

variables.instance = {};

This defined instance in the variables scope, which stopped the built-in scope hunting from ever reaching CGI, which allowed the log file to rotate, which allowed CacheBox to log its reap, which meant thousands of reap threads no longer backlogged, ate up all my memory, or caused other random requests to also hang when accessing the IIS connectors.
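To illustrate why that one line works, here’s a toy model of scope hunting in Python (purely illustrative; only the scope names are borrowed from CF): once the key exists in the first scope searched, the expensive scope at the end of the chain is never consulted.

```python
from collections import ChainMap

# Toy model of CF's scope hunting; lookups["cgi"] counts how many times
# the CGI stand-in gets consulted.
lookups = {"cgi": 0}

class CgiScope(dict):
    """Stand-in for the CGI scope: a miss means a trip out to the connector."""
    def __missing__(self, key):
        lookups["cgi"] += 1  # count the expensive connector round-trip
        raise KeyError(key)

variables = {}
scopes = ChainMap(variables, CgiScope())  # searched in order, like CF does

# Without the fix: "instance" isn't in variables, so the hunt reaches cgi.
try:
    scopes["instance"]
except KeyError:
    pass
print(lookups["cgi"])  # 1

# With the one-line fix: pre-define instance in the first scope searched.
variables["instance"] = {}
scopes["instance"]  # found immediately; the CGI stand-in is never consulted
print(lookups["cgi"])  # still 1
```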

Takeaways:

  • For starters, I’d say that CF shouldn’t be scope hunting into the CGI scope when it’s in a CFThread. I’m not sure why CF is hanging when it does, but I’d love to get feedback on anything that might be screwy with our JRUN/IIS connectors.
  • My change to zip.cfc doesn’t really fix the issue; it just avoids it. It makes me nervous that anything that kicks off scope hunting inside a CFThread can have this kind of result.
  • In theory, CacheBox also shouldn’t spawn thousands of reaping threads. Perhaps the object store should save some sort of state to say, “Hey, I’m reaping now, no more threads please.” Then, when reaping completes, reset that flag back. In my instance, the object store would still never get reaped since the thread eventually errors out, but at least it wouldn’t create thousands of waiting threads.
  • Anything else I’m missing, or any input people want to share? I’d love to see some people grab the jMeter test and give it a try. I want to know if the CGI access inside a thread hanging like that is normal or not.
    Thanks!

~Brad

Example long-running request with ColdBox 3.5.txt (12.2 KB)

example full thread dump with ColdBox 3.5.txt (80 KB)

hit coldbox samples.jmx (32.2 KB)

WOW, very interesting indeed Brad.

I added the fix (or rather, the avoidance) as well. The CacheBox reaps only occur on set frequencies; I’m not sure if it should keep track of reaping and only reap if the previous reap has not finished.

Interesting theory, though. Can you create a ticket on exploring reaping contention and monitoring?

Luis F. Majano
CEO
Ortus Solutions, Corp
www.ortussolutions.com

ColdBox Platform: http://www.coldbox.org
Linked In: http://www.linkedin.com/pub/3/731/483
Blog: http://www.luismajano.com
IECFUG Manager: http://www.iecfug.com

Social: twitter.com/lmajano facebook.com/lmajano

Brad, I committed the changes to the dev branch. Can you try now?

Luis F. Majano

Can I also add that I am sure this was also the cause of my tests failing some time ago, with ORM stuff. I too had an issue with the caching, where I would sometimes get error messages and sometimes not; sometimes IIS hung, sometimes ColdFusion hung, all with different error messages.

But the most common was too many concurrent connections to the DB using ORM.

There was a very lengthy thread on this around August or September, and I am wondering if it is related as well. Earlier versions of ColdBox also did not have this issue. In the end I gave up because the site wasn’t going to reach that amount of traffic.