Notify Events Stopping
Basically the problem is to do with notifies stopping working for no apparent reason, logs show nothing out of the ordinary, so I'm pretty stumped as to what's causing it, just wondering if anyone's seen this before or has any ideas on what could cause it. Atm it's a bit of a blocker, as we can't really use the space for anything useful for any length of time.
I've seen this randomly occurring mostly on a 2 machine linux setup, and also less frequently on my windows machine (but that doesn't run continuously). Restarting the space from scratch always resolves the issue, but of course isn't a very good solution. I've seen it happen after 5 minutes and after 2 weeks, so it's seems a bit odd as to what triggers it.
It affects notifies both in the embedded space between local and cluster PU's and also other PUs and remote client connections to the cluster. Normal space I/O seems to still work perfectly for everything involved, it's just as if the code inside the cluster that sends out the notifies to all listeners local or otherwise has completely gone into meltdown. Once this has happened any new connection to the space has the same problem also, cpu and memory usage on all machines involved looks average. Oh and I once saw this only happen to one of the GSC's so I was left with half the cluster working, it's extermely odd.
The setup I have it running is 2 partitions, 1 per GSC. We've had this problem almost since day 1, atm we're using the latest java premium 6.6.4 version with java 6.
Just as an example, we have tomcat running which might have say 5 sessions connected to the space at once, one session writes and object to the cluster, the local embedded pu that should have been notified of the write never gets the event. As I say this works 100% fine normally, until the space gets in a twist and everything grinds to a halt.
Space setup is as follows.
<os-core:space id="space" url="/./spaceFX" mirror="true" external-data-source="hibernateDataSource">
<os-core:filter-provider ref="serviceExporter"/>
<os-core:properties>
<props>
<prop key="cluster-config.cache-loader.external-data-source">true</prop>
<prop key="cluster-config.cache-loader.central-data-source">true</prop>
<prop key="space-config.engine.cache_policy">1</prop>
<prop key="space-config.external-data-source.usage";>read-only</prop>
<prop key="-Dcom.gs.cluster.cache-loader.central-data-source">true</prop>
<prop key="cluster-config.mirror-service.url">jini://*/mirror-service_container/mirror-service</prop>
</props>
</os-core:properties>
</os-core:space>
<os-core:giga-space id="gigaSpace" space="space" tx-manager="transactionManager" />
<os-core:local-tx-manager id="transactionManager" space="space" default-timeout="5000">
<os-core:renew pool-size="2" duration="3000" round-trip-time="1500" />
</os-core:local-tx-manager>
<os-core:distributed-tx-manager id="distTransactionManager" default-timeout="5000" />
<os-sla:sla cluster-schema="partitioned-sync2backup" number-of-instances="2" number-of-backups="1" max-instances-per-vm="1" />
<os-core:giga-space-context />
<os-core:giga-space-late-context />
<os-core:giga-space id="gigaSpaceCluster" space="space" clustered="true" />
As you can see each partition deals with both local and dist transactions.
General layout of the space is embedded space services process work which comes purely from event notifies, these object write/updates that trigger the notifies come from 2 places, deployed PUs or remote tomcat connections to the space.
The tomcat sessions use the space via code rather than any spring wiring if that makes any difference, although we used to use spring, and i'm fairly sure we had this issue then also.
The archiving to DB still works after notifies stops working, so this doesn't appear to be the cause. We do rely on a lot of notifies, could this be overloading things? the only thing I can think of trying is moving everything to polling containers rather than notify containers, and also moving from tomcat to the embedded jetty support.
-
- md5-488c8f2eac8b91335e1c3e99685ac423
Having changed the embedded space to use a polling container, those seem to keep on working fine from what I can tell.
I can confirm that after leaving the app running for 30 minutes (~65,000 notifies in total, split 50:50 more or less accross the cluster), the 3 clients I had connected (changed to share 1 notify this time) failed on partition #1 of 2.
This in effect means the polling container received 65,000 writes of obj class A (~32k on each partition), then generated 65,000 updates to some class B objects, the 1 notify container on the tomcat side got events for these class B space entry updates (must have been something less than 65,000) until half of the cluster went wrong, leaving notifies only coming from the 2nd partition GSC. I imagine if I leave it longer both partitions will stop sending notifies to the tomcat client.
I'm thinking it's perhaps linked to notify delivery volume to the tomcat notify container?
This thread was imported from the previous forum.
For your reference, the original is available here