Recommendations for efficient sync replication under high traffic?

Hi there,

We are load testing a data model PU from multiple remote clients (4). We expect roughly 4 million objects per hour (equivalent to 1 million events from a business logic perspective) to be inserted (plus many more updates) into the space hosted by the data model PU.

The data model PU itself consists of four partitions (on four GSCs with 1.8GB memory each) with synchronous backups enabled via partitioned-sync2backup. The hardware it's running on is an 8-core Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32GB RAM, RedHat Linux. The data model objects themselves are trivial given that they map to database entities, so there are no embedded objects within them (i.e. they're flat). I also followed the Gigaspaces tuning guide ( http://wiki.gigaspaces.com/wiki/displ... ) in terms of increasing the open file handles, process count (16,0000) and network connection backlog settings, which helped a lot on initial performance load tests.

On the initial test run, it only processed 400K events in the hour before the primary PUs failed, which I believe was because the backup spaces could not keep up with the replication. In some cases, the backups went down, which is likely why the following error started to occur.

Error in replication while replicating [key=9765837 data=UPDATE: EntryPacket[typeName=com.[xxxx.xxxx.xxx.xxxx],uid=-1226133799^48^NickFeedWAE1^0^0,version=607396,operationId=ID:4756768322011589611.32916854]]; Caused by: com.gigaspaces.cluster.replication.ReplicationException: Timeout exceeded [10000ms] while waiting for packet to be consumed

With replication disabled and the same setup as above, it understandably performed a lot better, processing 920K events. To address the replication bottleneck, given the somewhat limited resources I have (particularly hardware, having just the one server), I experimented with the lab setup: I changed the data model PU to three partitions instead, modified the GSC JVM options (1.8GB RAM -> 3.8GB, to compensate for the reduced GSC count and because a primary and a backup can co-reside in a GSC), reduced the concurrent client feeders to two, and updated the replication settings for the PU as follows (the consume-timeout is admittedly a workaround; I also doubled the redo log capacity):

            <prop key="cluster-config.groups.group.repl-policy.sync-replication.target-consume-timeout">21000</prop>
            <prop key="cluster-config.groups.group.repl-policy.redo-log-capacity">300000</prop>
            <prop key="cluster-config.groups.group.repl-policy.redo-log-memory-capacity">300000</prop>
            <prop key="cluster-config.groups.group.repl-policy.on-redo-log-capacity-exceeded">drop-oldest</prop>

The resulting combination led to over 720K events being processed per hour, which is much better. To me, it seemed to be all about finding the optimal balance between CPU usage (75%-85% avg.) + open LRMI connections (number of client feeders) + GSC JVM settings (roughly 3.5GB per GSC) + number of data model partitions (3) + hardware (8-core CPU, 32GB RAM) vs. the number of lab servers available (1).

When I tried any more partitions on the data model in this lab, the CPU usage went up to 95%-99%, which I'm thinking starves replication of resources (or perhaps it was the resulting increase in LRMI connection threads), so the above replication timeout occurred much more quickly on load tests. I'll be honest and state that the replication parameter tweaks above are more a workaround hack than a solution in themselves; perhaps all they do is delay the impending ReplicationException timeout for another hour or so (which they did), and this could simply boil down to requiring at least another server to spread the workload. The other changes certainly did improve the situation though, as I had the same consume-timeout of 21000ms for other load test runs.

So, in a real-world scenario where I gather the above lab setup should be distributed across at least 2-3 servers rather than just the one, what recommendations could you give me for ensuring that our data model PU can take a high volume of traffic comfortably with sync replication enabled and no unplanned downtime? I know 97% of this comes down to experimenting with various setups ourselves, as we did above, but there are hopefully some strategies (e.g. partitioning setup, GSC setup with primary/backup, etc.) that we can use as a guideline to begin with. I'd be grateful for any assistance on this matter.

Kind regards, Tom

This thread was imported from the previous forum. For your reference, the original is available here: http://forum.openspaces.org/thread.jspa?threadID=4122

asked 2013-07-23 04:07:59 -0500 by coleman5

updated 2013-08-08 09:52:00 -0500 by jaissefsfex

2 Answers


Tom,

Changing sync replication settings would not generate a massive performance boost.

These would allow you to improve the performance dramatically:

- Better data model - consider an embedded relationship model when possible instead of having separate space objects.
- Use readById - avoids a query and provides access directly to the data without any broadcast calls to the entire cluster.
- Use indexes - if you need to query, have proper indexes on all fields used with matching or SQL queries. For bigger/less-than/between queries use the extended index.
- Better serialization - consider implementing Externalizable for nested objects or use binary serialization with large collections.
- Use batch operations - consider using writeMultiple, etc. when relevant.
- Use paging - use the IteratorBuilder when accessing a large amount of objects.
- Use delta updates - consider the GigaSpace.change operation. It will have a drastic performance boost when updating data.
- Use async operations - consider using asyncChange, the ONE_WAY write modifier, etc. when available. It will generate a massive performance boost.
- Use partial read - consider using query projections to retrieve only the specific portions needed.
- Co-locate data and business logic - implement a Task / DistributedTask to be used with the GigaSpace.execute operation, or use a colocated notify/polling container to move processing business logic to the data side. This avoids serialization and network usage and should have a dramatic performance boost.
- Intelligent partitioning - partition data based on the business logic and not based on some unique value. This will allow the collocated logic to access its data without any network calls. If needed, run a local cache/local view to store "reference data" within each PU instance together with the transactional data.

A short sketch of several of these calls follows below.
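To make a few of these concrete, here is a minimal sketch against the XAP Java API (the Trade class, its fields, the routing key and the values are made-up placeholders, not the actual data model; each public class would live in its own file):

    import com.gigaspaces.annotation.pojo.SpaceClass;
    import com.gigaspaces.annotation.pojo.SpaceId;
    import com.gigaspaces.annotation.pojo.SpaceIndex;
    import com.gigaspaces.annotation.pojo.SpaceRouting;
    import com.gigaspaces.async.AsyncFuture;
    import com.gigaspaces.client.ChangeSet;
    import com.gigaspaces.client.WriteModifiers;
    import com.gigaspaces.query.IdQuery;
    import com.j_spaces.core.client.SQLQuery;
    import org.openspaces.core.GigaSpace;
    import org.openspaces.core.executor.Task;
    import org.openspaces.core.executor.TaskGigaSpace;

    // "Trade" is an assumed flat space class standing in for the actual data model.
    @SpaceClass
    public class Trade {
        private String id;
        private String account; // business key, used for routing
        private Double amount;

        @SpaceId(autoGenerate = false)
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }

        @SpaceRouting // partition by a business key rather than an arbitrary unique value
        public String getAccount() { return account; }
        public void setAccount(String account) { this.account = account; }

        @SpaceIndex // indexed because queries match on it
        public Double getAmount() { return amount; }
        public void setAmount(Double amount) { this.amount = amount; }
    }

    public class FeederSketch {
        public void feed(GigaSpace gigaSpace, Trade[] batch) {
            // Batch write: one remote call instead of batch.length calls.
            gigaSpace.writeMultiple(batch);

            // One-way write: skips the ack round trip (only where a lost write is tolerable).
            gigaSpace.write(batch[0], WriteModifiers.ONE_WAY);

            // readById with the routing value: hits a single partition, no broadcast call.
            Trade t = gigaSpace.readById(Trade.class, batch[0].getId(), batch[0].getAccount());

            // Delta update via the Change API: ships only the changed field, not the object.
            gigaSpace.change(new IdQuery<Trade>(Trade.class, t.getId(), t.getAccount()),
                    new ChangeSet().set("amount", 42d));

            // Partial read: project only the needed fields (needs a release with projections).
            SQLQuery<Trade> query = new SQLQuery<Trade>(Trade.class, "amount > ?");
            query.setParameter(1, 1000d);
            query.setProjections("id", "amount");
            Trade[] slim = gigaSpace.readMultiple(query, 100);
        }
    }

    // Colocated logic: the task executes inside the partition owning the routing key,
    // so it runs next to the data with no entries serialized back to the client.
    public class CountTradesTask implements Task<Integer> {
        @TaskGigaSpace
        private transient GigaSpace colocated;

        public Integer execute() throws Exception {
            return colocated.count(new Trade());
        }
    }
    // usage: AsyncFuture<Integer> f = gigaSpace.execute(new CountTradesTask(), "ACCOUNT-42");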

Shay

answered 2013-07-25 19:38:24 -0500 by shay hassidim

Comments

Thanks Eitan and Shay,

Yes, we are incorporating quite a few of the suggestions you have made, but there are some very interesting points here. I also found the following Gigaspaces guides to be of great reference too: Capacity Planning http://wiki.gigaspaces.com/wiki/display/SBP/CapacityPlanning , JVM Tuning for 64-bit OS http://wiki.gigaspaces.com/wiki/display/XAP91/TuningJavaVirtualMachines and Production environment tuning http://wiki.gigaspaces.com/wiki/display/SBP/MovingintoProductionChecklist . Much appreciated.

Thanks again, Tom

coleman5 ( 2013-07-26 04:49:04 -0500 )

Hi Tom,

Can you post your benchmark or something similar to it?
What's your email, so we can continue this thread offline?

Eitan

answered 2013-07-24 07:14:33 -0500 by eitany
