Welcome to the new Gigaspaces XAP forum. To recover your account, please follow these instructions.

Ask Your Question
0

Strange crash with very late recover

Hi, We experienced a strange crash today around 2PM JST noticed in a GS client that a 3rd party connects to. The issue appeared at 2:02 disconnecting the client yet the first exception that appeared was:

07 May 2015 14:51:31,498 JST ..... Error during message processing
...: org.openspaces.core.RemoteDataAccessException: Can not locate an available primary space. Total attempts: [20]; nested exception is java.rmi.RemoteException: Can not locate an available primary space. Total attempts: [20]
        .....
Caused by: org.openspaces.core.RemoteDataAccessException: Can not locate an available primary space. Total attempts: [20]; nested exception is java.rmi.RemoteException: Can not locate an available primary space. Total attempts: [20]
        at org.openspaces.core.exception.DefaultExceptionTranslator.internalTranslate(DefaultExceptionTranslator.java:108)
        at org.openspaces.core.exception.DefaultExceptionTranslator.translate(DefaultExceptionTranslator.java:49)
        at org.openspaces.core.DefaultGigaSpace.read(DefaultGigaSpace.java:316)
        at org.openspaces.core.DefaultGigaSpace.read(DefaultGigaSpace.java:308)
        at org.openspaces.core.DefaultGigaSpace.read(DefaultGigaSpace.java:304)
        ......
        ... 1 more
Caused by: java.rmi.RemoteException: Can not locate an available primary space. Total attempts: [20]
        at com.gigaspaces.internal.client.spaceproxy.router.old.DefaultLoadBalanceExceptionHanlder.handleBackupRetriesExceeded(DefaultLoadBalanceExceptionHanlder.java:151)
        at com.gigaspaces.internal.client.spaceproxy.router.old.LoadBalanceRequest.prepareForMethodCallImpl(LoadBalanceRequest.java:286)
        at com.gigaspaces.internal.client.spaceproxy.router.old.LoadBalanceRequest.prepareForMethodCall(LoadBalanceRequest.java:269)
        at com.gigaspaces.internal.client.spaceproxy.actions.SpaceProxyImplReadTakeAction.readTakeBasic(SpaceProxyImplReadTakeAction.java:221)
        at com.j_spaces.core.cluster.ReadTakeBroadcastPacket.executeAction(ReadTakeBroadcastPacket.java:35)
        at com.j_spaces.core.cluster.BroadcastExecutionPacket.execute(BroadcastExecutionPacket.java:120)
        at com.gigaspaces.internal.client.spaceproxy.SpaceProxyImpl$ParallelProxyConsumer.dispatch(SpaceProxyImpl.java:559)
        at com.gigaspaces.internal.client.spaceproxy.SpaceProxyImpl$ParallelProxyConsumer.dispatch(SpaceProxyImpl.java:531)
        at com.j_spaces.kernel.WorkingGroup$TaskWrapper.run(WorkingGroup.java:67)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

We are using GS 9.0.0 and: 2 machines: 1 GSM/LH 2 GSCs on each

<os-sla:sla cluster-schema="partitioned-sync2backup"
            number-of-instances="2"
            number-of-backups="1"
            max-instances-per-vm="1">
</os-sla:sla>

Some parts from the logs:

Machine 1:
gsm

2015-05-07 00:05:44,117 GSM INFO [com.gigaspaces.grid.gsm.provision] - Allocated processing unit instance [my-processor.2 [2]] on [GSC pid[62427] host[myprod2/10.20.8.21]]
2015-05-07 00:05:44,183 GSM INFO [com.gigaspaces.grid.gsm.services] - Managing Service: [my-processor.2 [2]] at host: [10.20.8.21]
2015-05-07 00:05:44,433 GSM INFO [com.gigaspaces.grid.gsm.provision] - Allocated processing unit instance [my-processor.1 [1]] on [GSC pid[61914] host[myprod2/10.20.8.21]]
2015-05-07 00:05:44,451 GSM INFO [com.gigaspaces.grid.gsm.services] - Managing Service: [my-processor.1 [1]] at host: [10.20.8.21]
2015-05-07 00:05:44,674 GSM INFO [com.gigaspaces.grid.gsm.provision] - Allocated processing unit instance [my-processor.1 [2]] on [GSC pid[47433] host[myprod1/10.20.8.20]]
2015-05-07 00:05:44,694 GSM INFO [com.gigaspaces.grid.gsm.services] - Managing Service: [my-processor.1 [2]] at host: [10.20.8.20]
2015-05-07 00:05:44,818 GSM INFO [com.gigaspaces.grid.gsm.provision] - Allocated processing unit instance [my-processor.2 [1]] on [GSC pid[47825] host[myprod1/10.20.8.20]]
2015-05-07 00:05:44,835 GSM INFO [com.gigaspaces.grid.gsm.services] - Managing Service: [my-processor.2 [1]] at host: [10.20.8.20]
2015-05-07 14:01:33,484 GSM INFO [com.gigaspaces.grid.gsm] - Unregistered GSC - [GSC pid[61914] host[myprod2/10.20.8.21]], count [3]
2015-05-07 14:04:06,306 GSM WARNING [org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler] - Suspecting failure of service: [my-processor.1 [1]] pid[61914] host[myprod2/10.20.8.21] - RTT[196.3 sec]. Retrying to reach service.; Caused by: java.rmi.ConnectIOException: LRMI transport protocol over NIO broken connection with ServerEndPoint: [NIO://10.20.8.21:59324/pid[61914]/3543236829081722_3_1495612997316588651]; nested exception is:
        java.net.SocketTimeoutException: connect timed out
2015-05-07 14:07:02,632 GSM WARNING [org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler] - Detected failure of service: [my-processor.1 [1]] pid[61914] host[myprod2/10.20.8.21]. This service cannot be reached.; Caused by: java.rmi.ConnectException: Connect Failed to [NIO://10.20.8.21:59324/pid[61914]/3543236829081722_3_1495612997316588651], handshake failure; nested exception is:
        java.io.IOException: Connection reset by peer
2015-05-07 14:08:49,448 GSM INFO [com.gigaspaces.grid.gsm.provision] - Attempting to allocate processing unit instance [my-processor.1 [1]] on [GSC pid[62427] host[myprod2/10.20.8.21]]
2015-05-07 18:34:41,109 GSM INFO [com.gigaspaces.grid.gsm.provision] - Allocated processing unit instance [my-processor.1 [1]] on [GSC pid[62427] host[myprod2/10.20.8.21]]
2015-05-07 18:34:41,128 GSM INFO [com.gigaspaces.grid.gsm.services] - Managing Service: [my-processor.1 [1]] at host: [10.20.8.21]

gsc1

2015-05-07 14:04:09,234 my-processor.1 [2] INFO [com.gigaspaces.replication.channel.mySpace1.primary-backup-sync-1.mySpace1_1] - Incoming Replication Channel lost connection to source:
[  source: mySpace_container1:mySpace  ]
[  target: mySpace_container1_1:mySpace ; source url: mySpace_container1:mySpace  ]
[  source machine connection url: Not available  ]
[  replication group name: primary-backup-sync-1 ]
2015-05-07 18:34:37,144 my-processor.1 [2] INFO [com.gigaspaces.cluster.active_election] - [mySpace_container1_1:mySpace] Identified primary [mySpace_container1:mySpace] failure on [10.20.8.21
] machine.
2015-05-07 18:34:37,145 my-processor.1 [2] INFO [com.gigaspaces.cluster.active_election] - Current space instance [mySpace_container1_1:mySpace] has been elected to be Primary.

gsc2

2015-05-07 09:14:36,092 my-processor.2 [1] INFO [com.gigaspaces.localcache] - LocalCache#5-mySpace was created successfully. Total time: 70 milliseconds
2015-05-07 18:30:39,103 my-processor.2 [1] INFO [com.gigaspaces.client] - Accessed an inactive backup space mySpace_container1_1:mySpace, retrying [1/20]
2015-05-07 18:30:40,104 my-processor.2 [1] INFO [com.gigaspaces.client] - Accessed an inactive backup space mySpace_container1_1:mySpace, retrying [1/20]
2015-05-07 18:30:41,108 my-processor.2 [1] INFO [com.gigaspaces.client] - Accessed an inactive backup space mySpace_container1_1:mySpace, retrying [1/20]
2015-05-07 18:30:42,109 my-processor.2 [1] INFO [com.gigaspaces.client] - Accessed an inactive backup space mySpace_container1_1:mySpace, retrying [1/20]
....
2015-05-07 18:32:01,151 my-processor.2 ...
(more)

asked 2015-05-07 06:11:27 -0600

lukeh gravatar image
edit retag flag offensive close merge delete

3 Answers

Sort by ยป oldest newest most voted
0

Hi, We need to see the GSC's logs in order to understand what happen to the server side and why did the primary space crashed. Can you please open a support case in our support portal? And one of our representatives will investigate and get back to you.

answered 2015-05-10 03:05:26 -0600

inbarc gravatar image
edit flag offensive delete link more
0

Hi, We'll contact support, yet we need to speak with our contact cause we didn't use that in vrey long time and there were a lot of changes in our company including changing email addresses. Meanwhile we thought we had fixed it by giving more ram but unfortunately it happened again. If there is a way to get more help here based on the following before we go to support please do:

GC log (note is not Full GC)

Desired survivor size 1179648 bytes, new threshold 1 (max 15)
 [PSYoungGen: 21728K->896K(21888K)] 400324K->379852K(707968K), 9906.3520390 secs] [Times: user=9594.11 sys=0.00, real=9904.84 secs]

Process:

build     59823 18.8  1.7 24860016 1122536 ?    Sl   00:02 187:31 /usr/java/jdk1.7/bin/java -server -showversion -XX:+AggressiveOpts -XX:MaxPermSize=256m -XX:+HeapDumpOnOutOfMemoryError -Xmx6096m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintCommandLineFlags -XX:+PrintFlagsFinal -Xloggc:logs/20150515-000201.gc.log -DagentId= -DgsaServiceID= -Xbootclasspath/p::/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/xml/serializer.jar:/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/xml/xalan.jar:/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/xml/xercesImpl.jar:/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/xml/xml-apis.jar -cp ::/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/sigar/sigar.jar:/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/..:/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../lib/platform/boot/gs-boot.jar: -Dsun.rmi.dgc.client.gcInterval=36000000 -Dsun.rmi.dgc.server.gcInterval=36000000 -Djava.rmi.server.hostname=10.20.8.21 -Djava.rmi.server.RMIClassLoaderSpi=default -Djava.rmi.server.logCalls=false -Dcom.gs.jini_lus.groups=crossprod -Dcom.gs.jini_lus.locators=10.20.8.20,10.20.8.21 -Dcom.gs.logging.debug=false -Djava.util.logging.config.file=/home/build/gigaspaces-xap-premium-9.0.0-ga/bin/../config/gs_logging.properties com.gigaspaces.start.SystemBoot com.gigaspaces.start.services="GSC"

We switched to new hardware recently, better with more ram then previous, and we didn't had similar issue before. It ran before with less ram and the only other difference is that before we used custom ports (we had on the old hardware -Dcom.gigaspaces.start.httpPort=7267 -Dcom.gigaspaces.system.registryPort=7266 -Dcom.gs.multicast.discoveryPort=7265 -Dcom.gs.onewaynotify=false -Dcom.gs.transport_protocol.lrmi.bind-port=7268-7285) which we now abandoned and use default. Any possible issue here? Also, we are running with java 7 update 03 cause we had issues with the version of GS we are using with newer java versions but that was the same on the old hardware (something related to serialization, the issue was posted around here). Xmx was increased after we had the issue last time. (I didn't touched :MaxPermSize=256m as it came from setenv.sh can it be the problem? from JConsole I can see ~11800 loaded classes after that long gc ended, again same on old hardware)

Thanks, Lucian

answered 2015-05-15 03:32:33 -0600

lukeh gravatar image
edit flag offensive delete link more

Comments

Can you please confirm all read/take/readMultiple/takeMultiple/etc using template/SqlQuery with indexed fields?-------- Original message --------From: forum@ask.gigaspaces.orgDate:05/15/2015 4:31 AM (GMT-05:00)

shay hassidim gravatar imageshay hassidim ( 2015-05-15 04:15:44 -0600 )edit

Most likely not, we have lot of space classes without indexes. Better, can you give me an explicit example on what we shoudn't do and what we should do to make sure we are on the same page. PS: recovered my support account ( na6.salesforce.com ). Should I just link there this thread?

lukeh gravatar imagelukeh ( 2015-05-15 04:25:11 -0600 )edit

Please open a support case and include all info required.Please review your application code and space classes and check all relevant fields are indexed properly.

Remember - bigger/less than queries on numerical field should use an extended index.Additional indexes might consume additional heap.

shay hassidim gravatar imageshay hassidim ( 2015-05-15 04:35:47 -0600 )edit

I have this for example:

protected SQLQuery<OrderData> getSellOrders(Double lowRange, Double hiRange) {
    if (sellOrders != null) {
        sellOrders.setParameter(3, lowRange);
        sellOrders.setParameter(4, hiRange);
        return sellOrders;
    }
    String querystr = "symbol=? AND side=? AND state='new' AND price>=? AND price<=? AND manipulatorType=? ORDER BY price ASC , addTime ASC";
    //String querystr   = "symbol=? AND side=? AND state='new' ORDER BY addTime ASC";
    sellOrders = new SQLQuery<OrderData>(OrderData.class.getName(), querystr);
    sellOrders.setParameter(1, symbol);
    sellOrders.setParameter(2, Side.SELL);
    sellOrders.setParameter(3, lowRange);
    sellOrders.setParameter(4, hiRange);
    sellOrders.setParameter(5, manipulatorType);
    return sellOrders;
}

where:

    @SpaceProperty(index = SpaceProperty.IndexType.EXTENDED)
public Double getPrice() {
    return price;
}

but no index on addTime

Do you mean something like this can cause the GC described. Should I just index the addTime too? What about other simple pojo template querries, cause I belive this is the only sql with comparation and order by in the project. Edit: actually that's the only SQLQuerry, all other read/take are with POJO templates with values and nulls. Opened support case.

lukeh gravatar imagelukeh ( 2015-05-15 04:53:20 -0600 )edit

How many objects returned from this query?

The fact you have the price field indexed is good. Having the symbol , state and manipulatorType indexed will help as it may speed up the query execution. XAP use the smallest candidate set (evaluated via the indexes) to perform the matching.

The problem might not be related to these missing indexes - still , it will be good to index these and see if the problem still exists.

shay hassidim gravatar imageshay hassidim ( 2015-05-15 10:19:22 -0600 )edit
0

It really seems that we actually ran into this issue: https://groups.google.com/forum/#!top... We still need to run a few more tests on upgraded kernel to be 100% sure but the fact the processes recover after jstack/jmap -F is consistent with that issue. And the architecture + kernel version were as described there.

answered 2015-05-19 04:53:26 -0600

lukeh gravatar image
edit flag offensive delete link more

Your Answer

Please start posting anonymously - your entry will be published after you log in or create a new account.

Add Answer

Question Tools

1 follower

Stats

Asked: 2015-05-07 06:11:27 -0600

Seen: 798 times

Last updated: May 19 '15