Notify Events Stopping

The problem is that notifies stop working for no apparent reason. The logs show nothing out of the ordinary, so I'm pretty stumped as to what's causing it. Has anyone seen this before, or does anyone have ideas on what could cause it? At the moment it's a bit of a blocker, as we can't really use the space for anything useful for any length of time.

I've seen this occur randomly, mostly on a 2-machine Linux setup, and less frequently on my Windows machine (but that doesn't run continuously). Restarting the space from scratch always resolves the issue, but of course that isn't a very good solution. I've seen it happen after 5 minutes and after 2 weeks, so it's unclear what triggers it.

It affects notifies both in the embedded space, between local and cluster PUs, and also other PUs and remote client connections to the cluster. Normal space I/O still seems to work perfectly for everything involved; it's just as if the code inside the cluster that sends out the notifies to all listeners, local or otherwise, has completely gone into meltdown. Once this has happened, any new connection to the space has the same problem. CPU and memory usage on all machines involved looks average. Oh, and I once saw this happen to only one of the GSCs, so I was left with half the cluster working. It's extremely odd.

The setup I have running is 2 partitions, 1 per GSC. We've had this problem almost since day 1. At the moment we're using the latest Java premium 6.6.4 version with Java 6.

Just as an example, we have Tomcat running, which might have, say, 5 sessions connected to the space at once. One session writes an object to the cluster, and the local embedded PU that should have been notified of the write never gets the event. As I say, this works 100% fine normally, until the space gets in a twist and everything grinds to a halt.

Space setup is as follows.

<os-core:space id="space" url="/./spaceFX" mirror="true" external-data-source="hibernateDataSource">
        <os-core:filter-provider ref="serviceExporter"/>
        <os-core:properties>
            <props>
                <prop key="cluster-config.cache-loader.external-data-source">true</prop>
                <prop key="cluster-config.cache-loader.central-data-source">true</prop>
                <prop key="space-config.engine.cache_policy">1</prop>
                <prop key="space-config.external-data-source.usage">read-only</prop>
                <prop key="-Dcom.gs.cluster.cache-loader.central-data-source">true</prop>
                <prop key="cluster-config.mirror-service.url">jini://*/mirror-service_container/mirror-service</prop>
            </props>
        </os-core:properties>
</os-core:space>

<os-core:giga-space id="gigaSpace" space="space" tx-manager="transactionManager" />
<os-core:local-tx-manager id="transactionManager" space="space" default-timeout="5000">
    <os-core:renew pool-size="2" duration="3000" round-trip-time="1500" />
</os-core:local-tx-manager>
<os-core:distributed-tx-manager id="distTransactionManager" default-timeout="5000" />

<os-sla:sla cluster-schema="partitioned-sync2backup" number-of-instances="2" number-of-backups="1" max-instances-per-vm="1" />

<os-core:giga-space-context />
<os-core:giga-space-late-context />

<os-core:giga-space id="gigaSpaceCluster" space="space" clustered="true" />

As you can see, each partition deals with both local and distributed transactions.

The general layout is that embedded space services process work that comes purely from event notifies. The object writes/updates that trigger the notifies come from 2 places: deployed PUs or remote Tomcat connections to the space.

The Tomcat sessions use the space via code rather than any Spring wiring, if that makes any difference, although we used to use Spring and I'm fairly sure we had this issue then too.

The archiving to DB still works after notifies stop working, so this doesn't appear to be the cause. We do rely on a lot of notifies; could this be overloading things? The only thing I can think of trying is moving everything to polling containers rather than notify containers, and also moving from Tomcat to the embedded Jetty support.

Having changed the embedded space to use a polling container, those seem to keep on working fine from what I can tell.
I can confirm that after leaving the app running for 30 minutes (~65,000 notifies in total, split more or less 50:50 across the cluster), the 3 clients I had connected (changed to share 1 notify this time) failed on partition #1 of 2.

In effect, the polling container received 65,000 writes of class A objects (~32k on each partition), then generated 65,000 updates to some class B objects. The 1 notify container on the Tomcat side got events for these class B space entry updates (must have been something less than 65,000) until half of the cluster went wrong, leaving notifies only coming from the 2nd partition's GSC. I imagine if I leave it longer, both partitions will stop sending notifies to the Tomcat client.

I'm thinking it's perhaps linked to notify delivery volume to the tomcat notify container?

This thread was imported from the previous forum.

asked 2009-05-20 04:21:05 -0500

aparry

updated 2013-08-08 09:52:00 -0500

jaissefsfex

2 Answers

Has this been resolved?
Shay

answered 2009-05-21 21:42:38 -0500

shay hassidim

Think I may have sorted it. It was an uncaught exception on a notify container in one of the embedded services, which caused GS to silently throw an org.openspaces.events.adapter.ListenerExecutionFailedException, which I think may have eventually killed the space's ability to deliver event notifications. It wasn't being logged anywhere, so I never saw it.

Will post again if I can confirm for certain that was the cause, as it's most likely to be a bug in GS code.

answered 2009-05-20 08:41:43 -0500

aparry

Comments

Yes, I think it has, Shay. Where the problem was happening very frequently, it's now not been seen for at least 2 days.

I think it's almost certainly something to do with ListenerExecutionFailedException being generated/thrown on an embedded service due to uncaught exceptions from the service @SpaceDataEvent method, and due to the large volume of notifies it eventually kills the notify system somehow. That's the theory anyways.
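A minimal sketch of the workaround this theory implies: catch and log everything inside the listener so that no exception escapes back to whatever dispatches the notify. This is plain Java with no GigaSpaces dependency. `GuardedListener`, `guard` and the `Consumer`-based listener shape are hypothetical stand-ins for a real `@SpaceDataEvent` method, and whether swallowing the exception fully avoids the notify failure is an assumption based on this thread, not confirmed behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of the OpenSpaces API.
class GuardedListener {
    private static final Logger LOG = Logger.getLogger(GuardedListener.class.getName());

    // Wraps a listener so a RuntimeException is logged and recorded
    // instead of propagating to the caller (here, the notify dispatcher).
    public static <T> Consumer<T> guard(String name, Consumer<T> delegate, List<Throwable> failures) {
        return event -> {
            try {
                delegate.accept(event);
            } catch (RuntimeException e) {
                LOG.log(Level.SEVERE, "Listener '" + name + "' failed on event: " + event, e);
                failures.add(e);
            }
        };
    }

    public static void main(String[] args) {
        List<String> handled = new ArrayList<>();
        List<Throwable> failures = new ArrayList<>();

        // Stand-in for a listener body that throws a NullPointerException on bad input.
        Consumer<String> risky = quote -> handled.add(quote.trim());
        Consumer<String> safe = guard("onRequestNotify", risky, failures);

        safe.accept(null);      // the NPE is caught and logged, not rethrown
        safe.accept(" quote "); // later events are still processed

        System.out.println("handled=" + handled + ", failures=" + failures.size());
    }
}
```

In a real listener the same try/catch would sit inside the `@SpaceDataEvent` method itself, so the failure at least shows up in the logs rather than disappearing silently.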

aparry ( 2009-05-22 02:35:32 -0500 )

I think you should consider moving the notification piece to a collocated notify container. The notifications will be very fast and there will not be any network overhead. Each partition will notify its own notify container to perform whatever is needed. This will be more scalable and simpler to manage. Shay

shay hassidim ( 2009-05-22 08:37:10 -0500 )

Thanks for the advice, Shay. Yeah, I'll admit we do rely on notifies too much, to be honest, but the events have to get to Tomcat somehow due to the real-time nature of the system, and it's not something we'll be looking at improving anytime soon.

Anyway, I can confirm that this problem can be triggered just by uncaught exceptions, either on a collocated container or, as I've just experienced, on a regular PU notify container: 1 uncaught exception there resulted in notifies on 1 partition no longer working.

Stack trace was as follows

[providerQuoteNotifyContainer] Execution of event listener failed
org.openspaces.events.adapter.ListenerExecutionFailedException: Listener method 'onRequestNotify' threw exception; nested exception is java.lang.NullPointerException
    at org.openspaces.events.adapter.AbstractReflectionEventListenerAdapter.onEventWithResult(AbstractReflectionEventListenerAdapter.java:191)
    at org.openspaces.events.adapter.AbstractResultEventListenerAdapter.onEvent(AbstractResultEventListenerAdapter.java:79)
    at org.openspaces.events.AbstractEventListenerContainer.invokeListener(AbstractEventListenerContainer.java:136)
    at org.openspaces.events.notify.AbstractNotifyEventListenerContainer.invokeListenerWithTransaction(AbstractNotifyEventListenerContainer.java:682)
    at org.openspaces.events.notify.SimpleNotifyEventListenerContainer$NotifyListenerDelegate.notify(SimpleNotifyEventListenerContainer.java:203)
    at com.j_spaces.core.client.NotifyDelegator.notify(NotifyDelegator.java:233)
    at com.j_spaces.core.client.RemoteEventListenerExporter$TransientDelegator.notify(RemoteEventListenerExporter.java:94)
    at com.j_spaces.core.lrmi.LRMIRemoteEventListener.notify(LRMIRemoteEventListener.java:91)
    at com.j_spaces.core.Notifier.notifyEvent(Notifier.java:243)
    at com.j_spaces.core.Notifier.operate(Notifier.java:143)
    at com.j_spaces.core.server.processor.RemoteEventBusPacket.execute(RemoteEventBusPacket.java:60)
    at com.j_spaces.core.Notifier.dispatch(Notifier.java:62)
    at com.j_spaces.core.Notifier.dispatch(Notifier.java:31)
    at com.j_spaces.kernel.WorkingGroup$TaskWrapper.run(WorkingGroup.java:60)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.NullPointerException ...

I'd write a small test case but am too busy atm.

aparry ( 2009-05-22 09:17:12 -0500 )

I would suggest reporting this problem via the support portal.
Shay

shay hassidim ( 2009-05-22 09:25:00 -0500 )
