# maxLeaseDuration causes 100% CPU on LUS for hour(s)

We were experiencing peak CPU load (100%) on the LUS for hours after deploying our application on a cluster with 15 GSCs, 1 LUS, and 1 GSM per machine, on a total of 2 strong machines (24 logical cores). In total this means 30 GSCs across both machines.

We are using quite a lot of PUs (6 in total). All of them use a primary/backup schema with 1 backup and sync replication, and each PU has a dedicated GSC.

<os-sla:sla cluster-schema="partitioned-sync2backup" number-of-instances="2" number-of-backups="1"
max-instances-per-vm="4">


The reason was that I tried out a max lease duration of 3 sec in combination with low LRMI settings as described in the "Failure Detection" doc.

-Dcom.gs.transport_protocol.lrmi.request_timeout=3s
-Dcom.gs.transport_protocol.lrmi.connect_timeout=3s
-Dcom.gs.transport_protocol.lrmi.inspect_timeout=1000
-Dcom.gs.jini.config.maxLeaseDuration=3000


My questions are:

1. Most importantly: Why does this cause peak CPU load (and high network load) on the LUS for an hour?
2. What is affected by the "maxLeaseDuration" setting?
3. Is the remote proxy connectivity affected?
4. Are transactions affected? For example, is it OK to have a total tx time of 5 sec but a maxLeaseDuration of 3 sec?
5. I read that a GSC is dropped from the cluster if it does not respond within maxLeaseDuration. Is this right? ( http://ask.gigaspaces.org/question/53... )

Note: We haven't profiled our max GC pauses yet, but I don't think they exceed 3 sec.


David

Firstly, this type of question should probably be raised as a support case. Considering how many cores underlie your GSCs, it doesn’t seem as though you have enough resources. I would recommend not going below two cores for an 8G GSC, and trying not to go below one core for a 4G GSC. All of this depends on garbage collection, which we will return to later.
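
To make the core budget concrete, here is a rough sketch in Python. It uses the figures from the question (15 GSCs per machine, 24 logical cores). The question does not say whether the 24 cores are per machine or across both machines, so both readings are computed as labeled assumptions. The function name is only illustrative, and the LUS and GSM processes, which also need CPU, are not counted.

```python
def cores_per_gsc(logical_cores: int, gscs: int) -> float:
    """Average number of logical cores available to each GSC on one machine."""
    if gscs <= 0:
        raise ValueError("gscs must be positive")
    return logical_cores / gscs


# Figure from the question: 15 GSCs per machine.
# Reading A (assumption): 24 logical cores on each machine.
per_machine = cores_per_gsc(24, 15)
# Reading B (assumption): 24 logical cores across both machines, i.e. 12 each.
shared = cores_per_gsc(12, 15)

print(f"24 cores per machine: {per_machine:.1f} cores per GSC")
print(f"24 cores in total:    {shared:.1f} cores per GSC")
```

Either result can then be compared against the per-GSC guidance above for the heap size actually in use.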

Given the number of GSCs and PUs you have, there is an awful lot of network activity taking place. You could check this by running netstat and looking at both established and listening sockets. Using such low LRMI timeouts in such an environment places a lot of strain on CPU and network.
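
One possible way to do the netstat check is to count sockets per TCP state. The sketch below is in Python and assumes the Linux net-tools column layout of `netstat -ant` (state in the sixth column); other platforms print different columns. The sample text is made up for illustration and is not output from the cluster in question.

```python
import subprocess
from collections import Counter


def count_socket_states(netstat_output: str) -> Counter:
    """Count TCP sockets per state in Linux `netstat -ant` style output."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


def run_netstat() -> str:
    """Run netstat on the local machine (not called here; needs net-tools)."""
    result = subprocess.run(
        ["netstat", "-ant"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Made-up sample in the assumed format, for illustration only.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:7000            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.1:7000           10.0.0.2:51234          ESTABLISHED
tcp        0      0 10.0.0.1:7000           10.0.0.2:51235          ESTABLISHED
tcp        0      0 10.0.0.1:40112          10.0.0.2:7000           TIME_WAIT
"""

states = count_socket_states(SAMPLE)
print("ESTABLISHED:", states["ESTABLISHED"])
print("LISTEN:", states["LISTEN"])
```

On a real host, `count_socket_states(run_netstat())` gives the same breakdown; comparing the counts before and after changing the LRMI settings shows whether connection churn is part of the problem.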

maxLeaseDuration is covered here:

Large clusters would be anything over 16,1; perhaps you should use 10000/20000.
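
As a rough illustration of why a short maximum lease is expensive: in a lease-based registry, every registration has to be renewed before its lease expires, so renewal traffic at the LUS grows as the lease shrinks. The Python sketch below assumes exactly one renewal per lease period per registration and a hypothetical count of 40 registrations. Real clients generally renew some time before expiry, so treat the numbers as a relative comparison under those assumptions, not as a measurement or a full explanation of the CPU load.

```python
def renewals_per_second(registrations: int, max_lease_ms: int) -> float:
    """Renewal requests per second, assuming one renewal per lease period."""
    if max_lease_ms <= 0:
        raise ValueError("max_lease_ms must be positive")
    return registrations * 1000.0 / max_lease_ms


REGISTRATIONS = 40  # hypothetical number of leases held at the LUS

# 3000 is the value from the question; 10000 and 20000 are the values suggested above.
for lease_ms in (3000, 10000, 20000):
    rate = renewals_per_second(REGISTRATIONS, lease_ms)
    print(f"maxLeaseDuration={lease_ms} ms -> about {rate:.1f} renewals/s")
```

Under these assumptions, moving from 3000 ms to 10000 ms cuts the renewal rate by a factor of about 3.3, and 20000 ms by a factor of about 6.7.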

Remote proxies would also be affected in such a scenario, as they will need keepalives to all the partitions to which they are connected, further complicating the picture. Transactions should not be affected directly, but they may suffer from follow-on effects such as SpaceUnavailableExceptions.

Lastly, given that you have not yet measured garbage collection performance, it is quite possible that GC is seriously affecting system performance. Failure to set -Xms == -Xmx, for example, results in incremental heap allocations, which can make collection up to twice as slow, depending on other settings. Not allocating sufficient new generation space can result in new generation objects being tenured prematurely, so that they are not collected during minor collections. All of this can result in frequent major collections, and is a good candidate for a root cause.
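
A quick way to audit the -Xms == -Xmx point is to compare the two flags on each GSC's command line. The following Python sketch is only an illustration: the helper names are hypothetical, it handles the k, m, and g size suffixes only, and it assumes that when a flag is repeated the last occurrence is the one that counts.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_bytes(jvm_args, flag):
    """Size in bytes from the last `flag` (e.g. "-Xms") in jvm_args, or None."""
    pattern = re.compile(rf"^{re.escape(flag)}(\d+)([kKmMgG]?)$")
    value = None
    for arg in jvm_args:
        match = pattern.match(arg)
        if match:
            # Assumption: a later occurrence overrides an earlier one.
            value = int(match.group(1)) * _UNITS[match.group(2).lower()]
    return value


def heap_is_fixed(jvm_args):
    """True only if both -Xms and -Xmx are present and equal in size."""
    xms = heap_bytes(jvm_args, "-Xms")
    xmx = heap_bytes(jvm_args, "-Xmx")
    return xms is not None and xms == xmx


print(heap_is_fixed(["-Xms2g", "-Xmx2048m"]))   # same size, different units
print(heap_is_fixed(["-Xms512m", "-Xmx2g"]))    # heap can grow at runtime
```

The first call reports a fixed heap because 2g and 2048m are the same size; the second does not, which is the configuration the paragraph above warns about.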

Please review the ‘Moving into Production Checklist’ on our wiki, paying special attention to the System and JVM tuning sections. Let us know if this improves overall performance and whether the CPU issue is resolved, or at least mitigated.

Regards

John


Thanks for the answer. I will create a support case for this, but it would be nice if you could answer the following questions: 1) You say we should plan for at least 2 cores for an 8GB GSC? (Currently we have 8 vCPUs for 16 GSCs with max 2GB each.) 2) You say a large cluster is anything over 16 PUs with 1 backup per PU? Is this correct? Or what do you mean by 16,1? 3) You say remote proxies can get SpaceUnavailableExceptions if the LRMI keepalives/heartbeats fail, time out, or the TCP connection is closed?

( 2015-08-17 03:20:23 -0500 )

1) More or less. It depends on how much processing you do in the grid. Having a number of polling containers would increase the computational requirements.

2) 16,1 means sixteen primary partitions with each one having a backup. At and above that size special tuning activities will probably be required for a stable grid.

3) Yes. Even extended garbage collection pauses could cause this exception if they lasted more than 20 seconds.
