Coldbox Upgrade

I have a site that has been running on ColdBox 2.6 for the last couple
of years, and it has run smoothly. We recently upgraded to 3.1, added
injection, rewrote handlers in script, moved queries out of the views
and into the structure and stored procedures, and did various other
code cleanup that moved a great deal of processing to SQL. Now here is
the issue: we released on Tuesday night, and since then we have been
restarting ColdFusion around the clock. Uptime ranges from 6 minutes
to 1 1/2 hours. I have installed FusionReactor on the server, and what
happens is that the server runs fine and then boom, within 3-4 requests
memory spikes to over 80% and garbage collection never kicks in on the
heap. We have logged the requests, and no single page is called
frequently before the crashes. I even built a tool that runs every 3
seconds, checks free memory, and forces garbage collection when it is
below 75%. That helps keep the server up for an extra hour or so, but
the problem is that memory will be at 37% and jump to 80+% in 3-5
requests within 1-2 seconds, and then the server freezes so the tool
can't run. Does anybody have any suggestions? I am thinking I missed
some setting in 3.1 somewhere or did something stupid in refactoring
the code.
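
For readers who want to try the same kind of watchdog, here is a minimal Java sketch of the idea described above. It is not the poster's tool. The class name, the 75% used-heap threshold, and the 3-second interval are assumptions for illustration, and it assumes a JDK with lambda support (8 or later). It only measures the JVM it runs in, so to watch ColdFusion it would have to run inside the ColdFusion JVM.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeapWatchdog {
    // Percentage of the maximum heap that is currently in use.
    static double usedPercent(long maxBytes, long totalBytes, long freeBytes) {
        long usedBytes = totalBytes - freeBytes;
        return 100.0 * usedBytes / maxBytes;
    }

    // True when usage has reached the (assumed) threshold.
    static boolean shouldCollect(double usedPercent, double thresholdPercent) {
        return usedPercent >= thresholdPercent;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Check every 3 seconds, as in the post; the threshold is an assumption.
        timer.scheduleAtFixedRate(() -> {
            Runtime rt = Runtime.getRuntime();
            double pct = usedPercent(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
            System.out.printf("heap used: %.1f%%%n", pct);
            if (shouldCollect(pct, 75.0)) {
                System.gc(); // only a request; the JVM may ignore or delay it
            }
        }, 0, 3, TimeUnit.SECONDS);

        TimeUnit.SECONDS.sleep(10); // run briefly for demonstration
        timer.shutdownNow();
    }
}
```

Because System.gc() is only a hint, and because the watchdog thread stalls along with everything else once the heap is exhausted, a tool like this can delay a crash but is unlikely to prevent one, which matches what the post reports.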

Here are some specs

We are running in a cloud, currently have 2GB allocated to CF, running
with 4 processors.
Traffic is running around 5-8K page requests / hour on average and
when the site is up, average response time is under 500 ms.

Here are the JVM settings

java.args=-server -Xmx2048m -Xms2048m -Xmn512m -Dsun.io.useCanonCaches=false -XX:PermSize=192m -XX:MaxPermSize=192m -XX:+AggressiveHeap -XX:+UseParallelG

Any suggestions would be greatly appreciated.

An update: we tried taking RAM up to 12 GB and set 1 GB for perm gen. The site runs pretty consistently at 1 GB of RAM and then BAM, in less than 30 seconds Old Gen jumps to over 10 GB. I don’t know if this helps, and I’m not sure whether this is ColdBox related, but maybe you have some suggestions.

Have you enabled the ColdBox settings for caching the models, handlers, etc.?

Nolan Dubeau


Yes, we are caching the models and handlers. It is something weird.
I have seen it go anywhere from 6 minutes to 2 hours and then Memory
jumps from +/-1 GB to out of memory. Also, note, in this transition,
we are going from a 32 bit to 64 bit Windows platform.

I think 1 GB for perm gen is too much. Do you have an analyzer for the JVM spaces so you can see the percentages?

Also, I would try running varscoper to make sure I am not creating memory leaks through unscoped variables. I would also check my session timeouts and make sure the session and application scopes are not being abused.

I have found that giving session variables to bots or spiders is not good and will explode your RAM.

I believe the new FusionReactor has the JVM spaces demarcated, so you can track them before a crash.
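
If no analyzer is at hand, the JVM can report its own memory spaces through the standard java.lang.management API. The following is a rough sketch, not a FusionReactor feature. The class name and output format are made up for illustration, and it reports on the JVM it runs in, so it would need to run inside the ColdFusion JVM or be adapted to a remote JMX connection.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Locale;

public class PoolReport {
    // Formats one memory pool as "name: used of max (percent)".
    static String describe(String name, long usedBytes, long maxBytes) {
        long usedMb = usedBytes / (1024 * 1024);
        if (maxBytes <= 0) {
            // getMax() returns -1 when the pool has no defined maximum.
            return String.format(Locale.ROOT, "%s: %d MB used (no max reported)", name, usedMb);
        }
        long maxMb = maxBytes / (1024 * 1024);
        double pct = 100.0 * usedBytes / maxBytes;
        return String.format(Locale.ROOT, "%s: %d MB of %d MB (%.1f%%)", name, usedMb, maxMb, pct);
    }

    public static void main(String[] args) {
        // Pool names (eden, old gen, perm gen, ...) depend on the JVM and collector.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            System.out.println(describe(pool.getName(), usage.getUsed(), usage.getMax()));
        }
    }
}
```

Logging this output on a timer would show which space fills up just before a crash, which is the question being asked here.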

I’m working on the same project as Steve.

We switched the JVM to 1.6.0_27 and we ran for about 2.5 hours with an average of about 2 transactions per second before it crashed.
Then we restarted and it ran for 40 minutes.

I have attached screenshots of FusionReactor graphs showing what happens.

Our current JVM settings are:

java.args=-server -Xmx4096m -Xms4096m -XX:MaxPermSize=256m -XX:PermSize=256m

We were just throwing memory at it to see what difference it would make.

I will try to run var scoper. We have a 3.1 application running fine on our 32bit server. This is our first app on a 64 bit server.

crash-graphs.pdf (311 KB)

Have you tried matching the spikes with other logs?

I would be looking at what is being called at the time of the spike, and at whether queries are being held in memory longer than they need to be. If this is an ORM application, also look at what information is being returned at that time. Is it returning more information than it should?

Another area to look at is page execution times. These can sometimes show that a query is not as optimised as it could be, especially when using ORM objects.

There are hundreds of different combinations of things that can cause something like this, but it is a process of elimination that can sometimes be only achieved by matching the logs of different areas to find out what exactly might be happening.
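
One way to do that matching is to pull out every request logged in a short window before each spike and compare the windows across crashes. Below is a minimal Java sketch of that step. The log line format (an ISO timestamp, a space, then the URL), the class name, and the sample data are all assumptions for illustration and would need adapting to the real request log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SpikeWindow {
    // Returns the log lines whose timestamp falls in [spike - window, spike].
    // Assumed line format: "2011-10-01T14:03:07 /index.cfm?event=home"
    static List<String> linesBefore(List<String> logLines, LocalDateTime spike, Duration window) {
        LocalDateTime from = spike.minus(window);
        List<String> hits = new ArrayList<>();
        for (String line : logLines) {
            int space = line.indexOf(' ');
            if (space < 0) {
                continue; // no timestamp/URL separator
            }
            LocalDateTime at;
            try {
                at = LocalDateTime.parse(line.substring(0, space));
            } catch (DateTimeParseException e) {
                continue; // skip lines that do not start with a timestamp
            }
            if (!at.isBefore(from) && !at.isAfter(spike)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the thread.
        List<String> log = Arrays.asList(
            "2011-10-01T14:02:00 /index.cfm?event=home",
            "2011-10-01T14:03:07 /index.cfm?event=report",
            "2011-10-01T14:03:09 /index.cfm?event=search");
        LocalDateTime spike = LocalDateTime.parse("2011-10-01T14:03:10");
        for (String line : linesBefore(log, spike, Duration.ofSeconds(5))) {
            System.out.println(line);
        }
    }
}
```

Running the same window over several crashes and intersecting the results narrows the list of suspect requests, even when no single page stands out in any one crash.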

I just finished looking at something similar on ColdFusion 8 for a friend of mine, whose website was getting a heap of GC errors. The problem was that emails, and I mean the entire email, styles and everything, were stored in the database, and over time this became unmanageable because of a rogue query. The query didn’t need this information but returned it anyway when returning a lot of records, and hence brought the server down.

Not knowing things like whether there was a sudden spike in users, or whether search engines are getting to areas that can kill a site, leaves a lot of unknowns.

I also doubt that missing var scoping will be the cause of this, as that would lead to trashed data more than anything else, but it’s a good idea to at least eliminate it.

Another common problem is storing information in the session. If it is a lot of information, then when a search engine or bot comes through, it usually hits with multiple requests, which could very easily bring a site down very quickly.

This is one of the reasons I was so against session and application variables being pulled out and stored in the request scope, instead of just using a lock around the variables.

But do keep us informed of your progress. I have a number of small sites running live on ColdBox and am not far from releasing a major application on a site that gets around 20 thousand hits a month. The problem you’re facing is one I have done my best to avoid, and I am sure there is something I may have missed too, especially with bots.

It’s not an ORM application. I have services calling stored procedures on SQL. We analyzed and resolved any slow-running queries before trying to move to 64-bit and ColdBox 3.1.

I ran var scoper and found some issues with the app. I just made changes and I am going to test. I also ran var scoper on ColdBox, but I haven’t gone into that yet, as I am going to assume it’s my app first.

I have a spreadsheet where I captured what was running as I saw memory usage start to climb. The problem is that there was nothing consistent. It varied for each crash.

The amazing thing is that when I switched the JVM to a new version my VIEWPATH issues went away (or at least I didn’t have any during that 2.5 hour run).

Regarding bots, I am doing this in my Application.cfc. Since they don’t get a cookie, they get a 2 second session. Any feedback on this would be great.

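<cfif cgi.HTTP_COOKIE eq "">
    <!--- No cookie sent (likely a bot or spider): 2 second session --->
    <cfset this.sessionTimeout = CreateTimeSpan( 0, 0, 0, 2 )>
<cfelse>
    <!--- Cookie present: normal 2 hour session --->
    <cfset this.sessionTimeout = CreateTimeSpan( 0, 2, 0, 0 )>
</cfif>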

Thanks.

Jonathan

Is your user traffic doing something different to what your load tests are testing against?

If you can at least replicate it in load tests, you have a starting point.

I guess I should ask - do you have load tests?

Mark

I do not have load tests. I am open to suggestions on how to do a load test.

I am also open to suggestions on some good software. Most of what I have looked at seems complicated and expensive, and being a solo developer who does a lot of small work, I can’t really afford anything top of the range.

So all suggestions are welcome here too.

I’ve been using JMeter for years now, and have always liked it.
http://jakarta.apache.org/jmeter/

It does all sorts of stress testing, not only just HTTP requests, and is very flexible.

The structure of setting up tests took me a while to wrap my head around at first, but now that I have, it’s really easy to implement. It even comes with a proxy web server that records your mouse clicks in a browser to help you get started building a test suite.
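
JMeter is the tool recommended here. For anyone curious what a load test does underneath, the following Java sketch fires concurrent requests and records latencies. It is a toy, not a JMeter replacement. The class name, thread counts, and target URL are hypothetical, and it should only ever be pointed at a test or staging server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadSketch {
    // Runs `request` perThread times on each of `threads` workers;
    // returns one latency in milliseconds per attempt, including failed ones.
    static List<Long> run(int threads, int perThread, Runnable request) {
        List<Long> latenciesMs = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < perThread; i++) {
                    long start = System.nanoTime();
                    try {
                        request.run();
                    } catch (RuntimeException e) {
                        // a failed request still counts as an attempt
                    }
                    latenciesMs.add((System.nanoTime() - start) / 1_000_000);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return latenciesMs;
    }

    static double averageMs(List<Long> latenciesMs) {
        if (latenciesMs.isEmpty()) {
            return 0.0;
        }
        long sum = 0;
        for (long ms : latenciesMs) {
            sum += ms;
        }
        return (double) sum / latenciesMs.size();
    }

    // A request that performs one HTTP GET and discards the response.
    static Runnable httpGet(String url) {
        return () -> {
            try {
                HttpURLConnection c = (HttpURLConnection) URI.create(url).toURL().openConnection();
                c.setConnectTimeout(5000);
                c.setReadTimeout(30000);
                c.getResponseCode(); // sends the request and waits for the status line
                c.disconnect();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }

    public static void main(String[] args) {
        // Hypothetical staging URL; replace with a real test target.
        List<Long> ms = run(10, 20, httpGet("http://localhost:8500/index.cfm"));
        System.out.println(ms.size() + " requests, average " + averageMs(ms) + " ms");
    }
}
```

A real test plan would also replay realistic page sequences and sessions, which is what JMeter's recording proxy is for.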

I hope you guys (and everyone else reading) walk away from this realising that changing a large part of your application, and not putting it under some reasonably strenuous load testing was a really bad idea, and something you will endeavour to avoid in the future.

You have to replicate the load your application would get on a server that is as close to production as you can get it, for exactly this reason - unit tests, functional tests and just general clicking around don’t cover what will happen under real load, and unfortunately, under real load, is where all the weird *^%^&$ happens.

(Apologies if that was harsh, but I see this happen a lot, and it is avoidable, so just wanted to drive the point home)

Mark

Mark, no arguments from me. I have just been slack on this part for a number of years, mainly because most of what I have worked on has had a low user base.

But it’s as good a time as any to get into it, for the exact reasons you have mentioned.

To all, thank you for your input. Mark, you are not too harsh. Unfortunately, we have a client who is demanding release, and we have just about run out of time to have it running effectively by Wednesday. Through a great deal of education this weekend, I stumbled on this document, http://training.figleaf.com/curriculum/upload/AdminCF800_Unit05_configuringPerformance.pdf, and it is helping. I still have some questions, but we held up for almost 7 1/2 hours after making the suggested changes, and given that our record before was 2 hours, that is a great stride.

Jonathan also fixed some var scoping issues we found in some of the corners of the project (mostly old code) and the site is really flying. I firmly believe after all of the research this weekend, this is a JVM issue dealing with garbage collection.

Thank you to all of you for your suggestions and if you have any others, we are greatly interested.

And Mark, I am looking at JMeter right now.