NIO broken connection error

Has anyone seen the following error messages? We are using GigaSpaces XAP 5.1. We saw this issue in production, where it caused writes from a number of processes across the network to stop for a period of 5 minutes. No other network issues were detected during this time.

java.rmi.ConnectIOException: LRMI transport protocol over NIO broken connection with ServerEndPoint: [NIO://10.33.24.35:58878/1214515242332_1]; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at com.j_spaces.obf.hx.invoke(SourceFile:251)
    at com.j_spaces.obf.he.invoke(SourceFile:83)
    at $Proxy10.checkIfPendingAnswer(Unknown Source)

java.rmi.ConnectIOException: LRMI transport protocol over NIO broken connection with ServerEndPoint: [NIO://10.33.24.35:58878/1214515242332_1]; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at com.j_spaces.obf.hx.invoke(SourceFile:251)
    at com.j_spaces.obf.he.invoke(SourceFile:83)

java.rmi.ConnectIOException: LRMI transport protocol over NIO broken connection with ServerEndPoint: [NIO://10.33.24.35:58878/1214515242332_1]; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at com.j_spaces.obf.hx.invoke(SourceFile:251)
    at com.j_spaces.obf.he.invoke(SourceFile:83)
    at $Proxy9.cancelLocalXtn(Unknown Source)
    at com.j_spaces.core.lrmi.LRMIRemoteSpaceImpl.cancelLocalXtn(SourceFile:574)


{quote}This thread was imported from the previous forum. For your reference, the original is [available here|http://forum.openspaces.org/thread.jspa?threadID=2427]{quote}

asked 2008-06-29 10:23:18 -0500

minorhuffman

updated 2013-08-08 09:52:00 -0500

jaissefsfex

1 Answer


It seems you are using transactions.
How are you getting the transaction manager? By calling new or the getInstance method?
What is the expiration_time_interval? Change it to 10000.
How many concurrent clients do you have? Are they calling read/take with a timeout > 0? Can you shrink the timeout period?
What is the transaction timeout? Can you shrink it?
Can you run JConsole and count the LRMI threads you have running on the space side?
Can you provide more details about the space topology? Is it clustered? Persistent?
Shay
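
As a rough programmatic counterpart to counting LRMI threads in JConsole, the sketch below counts live threads by name from inside a JVM. It uses only the standard JDK. The assumption that the space's LRMI threads carry "LRMI" in their names is not confirmed anywhere in this thread, and the class and method names are made up for illustration.

```java
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Counts live threads whose names contain a marker string.
// ASSUMPTION: LRMI threads have "LRMI" somewhere in their names; check the
// real names in JConsole's Threads tab and adjust the marker if they differ.
final class LrmiThreadCounter {

    // Pure helper: how many of the given names contain the marker (case-insensitive).
    static long countMatching(Collection<String> threadNames, String marker) {
        String needle = marker.toLowerCase(Locale.ROOT);
        return threadNames.stream()
                .filter(name -> name.toLowerCase(Locale.ROOT).contains(needle))
                .count();
    }

    // Snapshot of the names of all live threads in the current JVM.
    static List<String> liveThreadNames() {
        return Thread.getAllStackTraces().keySet().stream()
                .map(Thread::getName)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println("LRMI threads: " + countMatching(liveThreadNames(), "LRMI"));
    }
}
```

The count is only meaningful when the code runs inside the JVM that hosts the space, for example from a worker deployed with it.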


answered 2008-06-29 10:45:01 -0500

shay hassidim

Comments

Hi Shay,

The main question here is what would cause "java.net.SocketTimeoutException: connect timed out" and, as a result, "NIO broken connection"? This happened all of a sudden during a normal trading day when everything was business as usual.

To answer your questions, yes we use transactions. We use the getInstance() method to get an instance of the TransactionManager. The leaseTime for the transaction is 5 mins as set during the TransactionFactory.create(). Is this the transaction timeout you referred to? How would you set the expiration_time_interval for a transaction?

The first instance of the exception is thrown during the space.take() operation. Please see the Java stack trace below. Subsequently we also got the same exceptions during a space.write() operation.

There can be about 20 concurrent clients writing to the space at any given time. The space take() timeout is set to 30 secs. The write is set to Lease.FOREVER. Is this a good setting?

The space is remote to the clients, and we do NOT use clustered space and persistence.

More details on the stack trace and source code (IP and port removed):

java.rmi.ConnectIOException: LRMI transport protocol over NIO broken connection with ServerEndPoint: [NIO://IP:port/1214515242332_1]; nested exception is:
    java.net.SocketTimeoutException: connect timed out
    at com.j_spaces.obf.hx.invoke(SourceFile:251)
    at com.j_spaces.obf.he.invoke(SourceFile:83)
    at $Proxy10.checkIfPendingAnswer(Unknown Source)
    at com.j_spaces.core.lrmi.LRMIRemoteSpaceImpl.checkIfPendingAnswer(SourceFile:350)
    at com.j_spaces.core.client.JSpaceProxy.read(SourceFile:1573)
    at com.j_spaces.core.client.JSpaceProxy.take(SourceFile:374)
    at com.j_spaces.core.client.JSpaceProxy.take(SourceFile:362)
    at com.oms.messaging.OMSMessageBusSpaceImpl$Poller.run(OMSMessageBusSpaceImpl.java:224)
Caused by: java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:519)
    at com.j_spaces.obf.ea.timeoutOccured(SourceFile:50)
    at com.j_spaces.obf.iz.a(SourceFile:368)
    at com.j_spaces.obf.iz.b(SourceFile:353)
    at com.j_spaces.obf.iz.a(SourceFile:255)
    at com.j_spaces.obf.ji.run(SourceFile:177)

    /*
     * relevant code
     */
    private TransactionManager txnMgr;
    private IJSpace space;

    space = (IJSpace) SpaceFinder.find(msgSpaceUrl);
    txnMgr = LocalTransactionManager.getInstance(space);

    Transaction.Created created = null;
    OMSMessageRouter router = null;
    try {
        created = TransactionFactory.create(txnMgr, 1000 * 60 * 5); // 5-minute lease
        router = (OMSMessageRouter) space.take(template, created.transaction, 1000 * 30); // 30-second timeout
        if (router != null) {
            listener.onOMSMessage(message);
        } else {
            LOGGER.log(Level.FINEST, id + " interrupted: " + isInterrupted());
        }
    } catch (Throwable e) {
        LOGGER.log(Level.WARNING, "", e);
    }

Edited by: Phil Jung on Jun 29, 2008 12:51 PM

Edited by: Phil Jung on Jun 29, 2008 12:52 PM

Edited by: Phil Jung on Jun 29, 2008 12:53 PM

philjung (2008-06-29 12:51:01 -0500)

Any special reason for the 5 min tx timeout?

I suggest shrinking this to the real maximum time the tx would take (a few seconds, I guess; probably 35 seconds). You can also renew the tx using the LRM if you need to and can't estimate the maximum tx time.

Check the space schema to change the expiration_time_interval value.

The write lease time is OK.

Please roll back the tx when the take operation returns null. Not doing so might be the reason for the connection leak.

Shay

shay hassidim (2008-06-29 13:05:49 -0500)
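
The rollback advice above can be sketched as a take that always finishes its transaction: commit when an entry was taken, abort when the take returns null or throws. `Txn`, `TxnSpace`, and `PollOnce` are hypothetical stand-ins so the example compiles on its own; they are not the GigaSpaces or Jini API. The real commit, abort, and take calls also declare checked exceptions, which are left out here.

```java
// Stand-in types for illustration only; NOT the GigaSpaces or Jini API.
// Txn loosely mirrors a transaction handle, TxnSpace a space proxy's take().
interface Txn {
    void commit();
    void abort();
}

interface TxnSpace<T> {
    // Returns null when nothing matched before the timeout elapsed.
    T take(Txn txn, long timeoutMillis);
}

final class PollOnce {

    // Takes one entry under the given transaction. The transaction is always
    // finished: committed after a successful take, aborted otherwise.
    static <T> T takeOrAbort(TxnSpace<T> space, Txn txn, long timeoutMillis) {
        boolean committed = false;
        try {
            T entry = space.take(txn, timeoutMillis);
            if (entry != null) {
                txn.commit();
                committed = true;
            }
            return entry;
        } finally {
            if (!committed) {
                // Covers the null result as well as any exception thrown above.
                txn.abort();
            }
        }
    }
}
```

In a real poller, the processing of the taken entry would sit between the take and the commit, so that a processing failure also ends in an abort.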

Hi Shay,

Thanks for the suggestions. Are you suggesting that the connections were not being closed due to the transactions not being rolled back, hence the connection leak? If this is the case then it would imply that this should only happen when the client is trying to initiate a new connection?

It seems though that the "java.net.SocketTimeoutException: connect timed out" exceptions were thrown during the take and write operations.

Just FYI, to answer your previous question, the expiration_time_interval is set at 60000.

Thanks -Phil

philjung (2008-06-29 14:58:27 -0500)
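
A back-of-the-envelope sketch of how unaborted transactions could pile up with the numbers quoted in this thread (5-minute lease, 30-second take timeout, about 20 clients). It rests on assumptions the thread does not confirm: each client runs one poller, every take starts a new transaction, a transaction that is never aborted stays open until its lease expires, and the queue is idle so every take times out. The class name is made up.

```java
import java.util.concurrent.TimeUnit;

final class PendingTxnEstimate {

    // Rough upper bound on transactions still open for one idle poller that
    // starts a new transaction per take() and never aborts after a null result.
    static long maxOpenPerPoller(long leaseMillis, long takeTimeoutMillis) {
        return leaseMillis / takeTimeoutMillis;
    }

    public static void main(String[] args) {
        long lease = TimeUnit.MINUTES.toMillis(5);        // transaction lease from the thread
        long takeTimeout = TimeUnit.SECONDS.toMillis(30); // take() timeout from the thread
        int clients = 20;                                 // "about 20 concurrent clients"

        long perPoller = maxOpenPerPoller(lease, takeTimeout);
        System.out.println(perPoller + " per poller, " + (perPoller * clients) + " in total");
    }
}
```

Under those assumptions the estimate is 10 open transactions per poller and 200 overall, which is why a shorter lease or an explicit abort shrinks the number quickly.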

It is hard to determine the exact reason for the problem you are facing. It might result from a combination of a bug in the old build you are using and something your application does explicitly (not rolling back the transaction).

Having a test case that reproduces the problem would be very helpful. It would also help to monitor the space's open TCP connections with tools such as netstat, to monitor the LRMI threads within the JVM running the space, and to correlate both with the application activity. We can "attach" to the space a worker, built on an IWorker implementation, that dumps the number of TCP connections and LRMI threads. We can help you with that.

I suggest you submit a support case with a test case.

Check the bugs fixed in the 5.1 and 6.x releases. Relevant issues may already have been fixed.

One more configuration change I would suggest is to increase the kernel thread pool.

Can you reproduce the problem easily? Can you rerun the application with a rollback call after take returns null, a shorter tx timeout, a bigger kernel thread pool (see max_threads in the space schema), and a shorter lease manager time interval? Let's see whether it behaves better.

Needless to say, moving to 6.5 would be the best solution. We have re-architected our transport layer since 5.x, so I doubt you will encounter this problem with 6.5. It gives much better control over open connections than the old releases.

Shay

Edited by: Shay Hassidim on Jun 29, 2008 6:52 PM

shay hassidim (2008-06-29 18:03:49 -0500)
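
The netstat monitoring suggested above could be scripted roughly as follows: run `netstat -an`, then count ESTABLISHED lines that mention the space port (58878 in the stack traces). This is a sketch under assumptions: the output format differs between operating systems, the match is a simple text match on either endpoint, and the class and method names are made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class ConnectionCounter {

    // Counts ESTABLISHED lines where some endpoint ends in ":<port>" or ".<port>".
    static long countEstablished(List<String> netstatLines, int port) {
        Pattern endpoint = Pattern.compile("[:.]" + port + "\\s");
        return netstatLines.stream()
                .filter(line -> line.contains("ESTABLISHED"))
                .filter(line -> endpoint.matcher(line).find())
                .count();
    }

    // Runs "netstat -an" and returns its output lines; netstat must be on the PATH.
    static List<String> runNetstat() throws IOException {
        Process process = new ProcessBuilder("netstat", "-an")
                .redirectErrorStream(true)
                .start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("Established on 58878: " + countEstablished(runNetstat(), 58878));
    }
}
```

Logging this count periodically alongside the application's activity would show whether connections accumulate before the timeouts start.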

Attached is a small IWorker implementation that can monitor the process memory and the Java process's connections.

The space schema should include:
<MonitorWorker>
    <enabled>true</enabled>
    <class-name>com.gigaspaces.util.MonitorWorker</class-name>
    <arg>70</arg>
    <description>MonitorWorker </description>
    <active-when-backup>true</active-when-backup>
    <shutdown-space-on-init-failure>false</shutdown-space-on-init-failure>
    <instances>1</instances>
</MonitorWorker>

Shay

Attachments

  1. MonitorWorker.java
shay hassidim (2008-06-29 20:35:50 -0500)
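
The attached MonitorWorker.java is not reproduced in this thread. As a stand-alone illustration of the memory side of such monitoring, the sketch below reports heap usage as a percentage using only the JDK's `Runtime`. It is not the attached worker, it does not implement IWorker, and the class name is made up; no meaning is implied for the `<arg>70</arg>` value in the schema.

```java
final class HeapUsageCheck {

    // Percentage of the maximum heap currently in use, capped at 100.
    static int usedPercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (int) Math.min(100L, usedBytes * 100L / maxBytes);
    }

    // Heap usage of the current JVM, based on Runtime's counters.
    static int currentUsedPercent() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return usedPercent(used, rt.maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("Heap used: " + currentUsedPercent() + "%");
    }
}
```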
