

VMware Communities Blog : February 2008


All Systems Go: Feb 26

Posted by RDellimmagine VMware Moderator Feb 26, 2008

We've fixed the last open issues since Friday's system upgrade:

1. Inbound email replies are being posted again. If you receive an email notification for a community or thread, you can once again reply to it, and your reply will be posted.
2. New account registration (the "register" link if you're not logged in) is working again, so new members can create accounts and start participating.

Also, we've done some additional tuning on the system, and we have been seeing better overall performance. There is still more we need to do on site performance, but I believe we have made improvements in the last week. Let us know what you're seeing: Has Communities Site Performance Improved?

2 Comments Permalink

System Status: Feb 25

Posted by RDellimmagine VMware Moderator Feb 25, 2008

Some participants are reporting better performance since we rolled out the third node. As I've said before, we still have some additional tuning to do, so I expect to make additional improvements as we identify and remove performance bottlenecks going forward.

Private messaging is working again; we corrected a misconfiguration that had inadvertently disabled it.

We are still working on fixing inbound email (replying to an email notification), so please do not use that feature until I announce here that it is working again. Registration for new accounts is also not working, but I expect to have that fixed in the next day or so.

0 Comments Permalink

Performance appears to be back at standard levels. We are continuing to monitor the system closely and assess tuning opportunities, as we believe faster performance should be achievable shortly.

Outbound email notifications (e.g. when you click 'Receive email notifications' in the Actions box) are now working again. However, if you reply to these notifications, the replies are currently not getting posted. I expect this to be fixed by Monday at the latest.

Private messages may be disabled for a couple of days. Sorry if this causes any inconvenience.

0 Comments Permalink

As of 9am Pacific on Saturday, February 23, we've run overnight with the third node added to the VMware Communities application cluster.

We are working on the errors in one of the three nodes that I reported in last night's blog post. Performance is somewhat slow, although I am finding the site usable, and community members are posting content. Our first priority is to monitor and improve performance, which should benefit from the additional compute power in the cluster.

We also see that private messaging and email notifications are not working. We believe this is also a configuration issue, and we are working on that as well.

Otherwise, all functions appear to be available. Please post in VMware Communities Feedback if you see any other issues.

Thanks, Robert

0 Comments Permalink

Additional Compute Power

Posted by RDellimmagine VMware Moderator Feb 23, 2008

Tonight we added a third node to the cluster running the VMware Communities application. Performance is good as I write this, but we are investigating an error we are seeing on one of the three nodes. I will update before Monday with the status on resolving that error as well as the overall performance we see this weekend. Thank you for your patience with the downtime required for this change.

0 Comments Permalink

We are on track to add a third node to the cluster on Friday, February 22 at 6pm Pacific time, which will require a maximum of 30 minutes downtime for some configuration changes. Caveat: We are still doing some testing, and will have a go/no-go meeting on Friday afternoon.

Update on System Status

1. Load Balancer Configuration: The load balancer configuration change described in the previous post removes one cause of gateway errors, but not all of them. I believe the remaining errors are related to a timeout that occurs when the system is under load, but we are still investigating the root cause.
2. Redirecting Google Traffic: We've confirmed that we are not getting spikes in traffic caused by Google since the load balancer configuration changes were made. So this is a positive step.
3. Third Node: As stated above, we are on track to roll this out Friday, February 22, and expect it to improve baseline performance.

0 Comments Permalink

We continue to work to improve VMware Communities performance and ensure consistently acceptable page load times. We are tuning the system in the following key areas:

1. Load Balancer Configuration: Today, we updated the F5 load balancer configuration to send requests straight to the two clustered Tomcat instances running the communities application, bypassing Apache. This will remove the cause of the Gateway Timeout messages many of you have reported. If one of the nodes in the application cluster is down, the load balancer will now redirect all traffic to the functioning node.
2. Redirecting Google traffic: The performance slowdowns of the last several weeks correlate directly with increased load caused by automated services, such as Google or RSS readers, indexing or accessing the VMware Communities site. We have been aware of this issue since December, when we updated the robots.txt file on the servers to disallow crawling by Google and reconfigured Apache to redirect Google crawls to a mirror of the site. This fix appeared to work; however, in the last couple of weeks, we have seen that when the system is under load, Google traffic affects the application anyway. The load balancer configuration changes described in #1 above should help resolve this: instead of Google being redirected by Apache running on the same server as the application, it is now redirected by the load balancer. This should remove the cause of the recent performance slowdowns while ensuring that Google continues to index the VMware Communities content. As I write this, we are still investigating whether other changes are required to isolate Google and other automated service traffic, so I will update later this week when I have confirmed this fix.
3. Adding a third node: This will add 50% more processing capacity and should allow the application to handle traffic peaks better. We are taking advantage of this change to review all system configuration settings across the cluster. We will implement the third node in the next two weeks.
In addition, we have made two application setting changes to temporarily increase performance:
a. Query caching: Many of you have noticed the "Your message was posted successfully, but there will be a short delay before it is viewable in the thread" message when you post. We turned on query caching, which reduces system load by about 20% by not requiring the application to rebuild the thread when a new message is posted. The query cache duration was originally set to 10 seconds, but we reduced it to 5 seconds, which should reduce how often community participants see the message. Our current thinking is to remove query caching once we stabilize performance.
b. Status Level Calculator: The status level calculator refresh rate has been set to 12 hours, which reduces system load. We will reset the status level calculation refresh rate to a shorter interval when we determine that doing so won’t negatively affect system performance.
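
As a rough illustration of the failover behavior described in item 1, the following Python sketch routes requests round-robin across the nodes currently marked healthy. It is not the F5 configuration itself: the class, its method names, and the node names (`tomcat-1`, `tomcat-2`) are hypothetical and chosen only for this example.

```python
class LoadBalancer:
    """Round-robin routing over healthy nodes only (illustrative sketch)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)        # fixed routing order
        self.healthy = set(self.nodes)  # nodes currently able to serve traffic
        self._next = 0                  # position of the next candidate node

    def mark_down(self, node):
        """Stop sending traffic to a node that failed its health check."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Resume sending traffic to a node that has recovered."""
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the next healthy node, skipping any that are down."""
        if not self.healthy:
            raise RuntimeError("no healthy nodes available")
        # At most one full pass is needed to find a healthy node.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = LoadBalancer(["tomcat-1", "tomcat-2"])  # hypothetical node names
first_four = [lb.route() for _ in range(4)]  # alternates between both nodes
lb.mark_down("tomcat-1")
after_failure = [lb.route() for _ in range(3)]  # every request goes to tomcat-2
```

With both nodes up, requests alternate between them; once one node is marked down, every request goes to the remaining node until the failed node is marked up again.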

0 Comments Permalink

We are changing the UserPointLevelCache timeout from 1 hour to 12 hours. This will reduce the number of times that point level calculations are made for each user, and the resulting reduction in CPU load is expected to improve system performance. It also means that when you post to a community or are awarded points for a helpful or complete answer, you may not see your point total updated for up to 12 hours. I expect this to be a temporary change: once we have stabilized performance, we will return the timeout to 1 hour.
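
To illustrate how a longer cache timeout trades freshness for reduced load, here is a minimal Python sketch of a time-based cache. It is not the actual UserPointLevelCache implementation: `TTLCache`, its parameters, the injected clock, and the sample user are assumptions made purely for illustration.

```python
import time


class TTLCache:
    """Cache that recomputes a value only after its timeout has expired."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock        # injectable so the example can simulate time
        self._entries = {}        # key -> (value, expires_at)
        self.recomputations = 0   # counts how often the expensive work ran

    def get(self, key, compute):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]       # still fresh: serve the cached value
        value = compute(key)      # expired or missing: do the expensive work
        self.recomputations += 1
        self._entries[key] = (value, now + self.timeout)
        return value


# Simulated clock and point totals (hypothetical data).
now = [0.0]
points = {"alice": 100}
cache = TTLCache(timeout_seconds=12 * 3600, clock=lambda: now[0])

cache.get("alice", points.get)           # computed and cached
points["alice"] = 110                    # user earns points
now[0] = 6 * 3600
stale = cache.get("alice", points.get)   # six hours later: still the old total
now[0] = 12 * 3600
fresh = cache.get("alice", points.get)   # after the timeout: recomputed
```

In this sketch the expensive calculation runs only twice over twelve hours, at the cost of the user seeing the old total until the entry expires.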

4 Comments Permalink

On-Going Tuning

Posted by RDellimmagine VMware Moderator Feb 6, 2008

I wanted to give an update on VMware Communities performance. Overall, we have been seeing some intermittent slowness again this week, and we are working on a couple of areas to address it:

1. Networking errors: We have identified and are in the process of debugging some networking issues between the two nodes in the cluster.

2. Private network: We are currently testing and plan to roll out a private network between the two nodes, which will increase the transfer rate of cluster information between them.

3. Automatic restarts: From early December until mid-January, the communities application ran with consistent performance, but then started slowing down. This problem was resolved when we rebooted at the operating system level, which makes us think there is a resource leak. We are therefore implementing an automated restart at the operating system level on a periodic basis.

4. Resourcing: We are putting additional Java experts on these problems to get them fixed quickly. Performance and system stability are our top priority.

We've had system monitoring in place since early December, which includes automatic restarts of the application when performance thresholds are reached. The root cause analysis that we perform at every application restart consistently points to the issues listed above.
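
The idea of restarting the application when a performance threshold is reached can be sketched roughly as follows. This Python example is illustrative only and does not describe the monitoring actually in place: the class name, the rolling-average rule, the threshold and window values, and the `restart` callback are all assumptions.

```python
from collections import deque


class RestartMonitor:
    """Trigger a restart when the rolling average response time is too high."""

    def __init__(self, threshold_seconds, window, restart):
        self.threshold = threshold_seconds
        self.samples = deque(maxlen=window)  # most recent response times
        self.restart = restart               # callback that performs the restart

    def record(self, response_seconds):
        """Record one measurement; return True if a restart was triggered."""
        self.samples.append(response_seconds)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        if window_full and average > self.threshold:
            self.restart()
            self.samples.clear()  # start measuring afresh after the restart
            return True
        return False


restart_log = []
monitor = RestartMonitor(
    threshold_seconds=2.0,  # assumed threshold
    window=3,               # assumed number of samples to average
    restart=lambda: restart_log.append("restart"),
)
results = [monitor.record(t) for t in (1.0, 5.0, 5.0, 1.0)]
```

Averaging over a window rather than reacting to a single slow response avoids restarting on a momentary spike; a real setup would also capture diagnostics before restarting so that root cause analysis is possible.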

0 Comments Permalink

VMware Communities Blog

Status updates and the behind-the-scenes story of VMware Communities