
Nested or recursive broadcast remoting call performance

I have two questions here.

  1. I have discovered that there appears to be some overhead in remote calls that grows with the number of nodes in the cluster. On our Sparc T2 test machine, with 32 partitions deployed to 32 GSCs, I've measured about a 24 ms call time for a remote call. I haven't been able to track down the source of the latency.

  2. I have set up a remote service that calls itself (broadcasting to the cluster) to benchmark the performance of nested remote calls. Call time appears to increase exponentially with the number of recursions. This is troubling, as we were planning to make use of nested remote calling. Is making nested broadcast calls a bad idea?

Larry

{quote}This thread was imported from the previous forum. For your reference, the original is [available here|http://forum.openspaces.org/thread.jspa?threadID=2451]{quote}

asked 2008-07-11 10:02:12 -0600 by larrychu

updated 2013-08-08 09:52:00 -0600 by jaissefsfex

1 Answer


Larry,
I presume you are using sync-remoting. It should be much faster and more scalable than async-remoting.
Can you increase the following setting as part of the cluster config and see if you get better results?
<proxy-broadcast-threadpool-max-size>
Make it 128; the default is 64.
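As a sketch, the element would look something like the fragment below. The element name is taken from the answer above, but where exactly it nests inside the cluster schema depends on your XAP version, so treat the placement as an assumption:

```xml
<!-- Illustrative only: raise the broadcast proxy thread pool from the
     default of 64 to 128. Exact placement within the cluster schema
     files (config/schemas) varies by version. -->
<proxy-broadcast-threadpool-max-size>128</proxy-broadcast-threadpool-max-size>
```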

Shay

answered 2008-07-11 13:21:19 -0600 by shay hassidim

Comments

Thanks for the hint. I'll try the async remoting after I try increasing the thread pool parameters.

You'll have to forgive me, because I don't understand how to properly set the configuration. I tried each of the following, but none of them affected performance:

  1. adding -Dcluster-config.groups.group.load-bal-policy.proxy-broadcast-threadpool-max-size=128 as a system property to the gsc.sh startup script
  2. adding <prop key="cluster-config.groups.group.load-bal-policy.proxy-broadcast-threadpool-max-size">128</prop> to the pu.xml
  3. adding cluster-config.groups.group.load-bal-policy.proxy-broadcast-threadpool-max-size=128 to gs.properties

I know the documentation states that the gs.properties "custom properties file" needs to be explicitly specified, but it doesn't state how to specify it.

larrychu ( 2008-07-11 16:47:51 -0600 )

In case you missed this: make sure you use sync-remoting (implemented via takeMultiple + filter) and not async-remoting (implemented via 2 pairs of write + take).

Are you using 6.0?

To configure the proxy-broadcast-threadpool-max-size you should change the relevant cluster xsl file located at config/schemas.

Changing the cluster config via XPath is supported only with 6.5.

Are you sure you are not fully utilizing the machine CPU?

What is the latency with fewer partitions? How do you measure it?

Are you sure the relevant business logic in the service is not accessing remote spaces or non-indexed data?

Do you have a service with empty business logic as part of your benchmark?

Shay

shay hassidim ( 2008-07-11 17:32:25 -0600 )
shay hassidim wrote:

In case you missed this: make sure you use sync-remoting (implemented via takeMultiple + filter) and not async-remoting (implemented via 2 pairs of write + take).

Yes, I am using sync-remoting by specifying:

<os-remoting:sync-proxy id="gridRecursionServiceProxy"
    giga-space="gigaSpace"
    interface="com.my.GridRecursionService"
    broadcast="true">
  <os-remoting:result-reducer ref="gridRecursionServiceReducer"/>
</os-remoting:sync-proxy>

I didn't know that async-remoting was slower.

> Are you using 6.0?

No, I'm using 6.5. Does overriding via gs.properties work with 6.5?

> Are you sure you are not fully utilizing the machine CPU?

Well, actually I am noticing high CPU utilization with the deeply recursive calls. However, since the service does so little, it is puzzling why the CPU is so heavily loaded. CPU utilization spikes between 50% and 100% with recursion more than one level deep.

> What is the latency with fewer partitions?

I experience less latency with fewer partitions.

> How do you measure it?

I have a client program that runs on my desktop and calls the service on the cluster, which is the single Sparc T2. Latency is measured on the client side as the time the call takes to execute. After deploying the 32 partitions to 32 GSCs, I run the client program. The client program first "warms up" by doing 5 recursive calls each at 0, 1, and 2 levels of recursion. It then executes 50 calls each at 0, 1, and 2 levels of recursion in sequence, with a 600 ms gap between calls to allow garbage collection to settle. It then reports the median time for each level of recursion as the final result.
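The measurement loop described above can be sketched roughly as follows. This is a minimal standalone illustration of the warm-up/median methodology, not the actual client: the `GridRecursionService` stub below stands in for the real sync-proxy wiring to the cluster.

```java
import java.util.Arrays;

// Sketch of the client-side benchmark: warm up each recursion depth,
// collect 50 timed calls with a settle gap between them, and report
// the median latency per depth.
public class RecursionBenchmark {

    // Mirrors the service interface shown later in this thread.
    interface GridRecursionService {
        long call(int numberOfRecursions);
    }

    // Median of the raw latency samples, in milliseconds.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // For an even count, take the mean of the two middle values.
        return (n % 2 == 1) ? sorted[n / 2]
                            : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
    }

    // Times a single remote call and converts nanoseconds to milliseconds.
    static long timeCall(GridRecursionService service, int depth) {
        long start = System.nanoTime();
        service.call(depth);
        return (System.nanoTime() - start) / 1_000_000;
    }

    static void run(GridRecursionService service, long settleMs)
            throws InterruptedException {
        for (int depth = 0; depth <= 2; depth++) {
            for (int i = 0; i < 5; i++) timeCall(service, depth); // warm-up
            long[] samples = new long[50];
            for (int i = 0; i < 50; i++) {
                samples[i] = timeCall(service, depth);
                Thread.sleep(settleMs); // 600 ms in the real run, so GC can settle
            }
            System.out.println("Recursive Query: Level " + depth
                    + " Median: " + median(samples) + " ms");
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // In-memory stub so the sketch runs by itself; the real benchmark
        // would inject the os-remoting sync-proxy here and pass settleMs = 600.
        run(depth -> 0L, 0);
    }
}
```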

These are some sample results:
Recursive Query: Level 0 Median: 18 ms
Recursive Query: Level 1 Median: 146 ms
Recursive Query: Level 2 Median: 1,614 ms

I have also recorded a median time of a 3rd level recursion at ~65,000 ms.

I welcome any input as to how to conduct a better benchmark.

> Are you sure the relevant business logic in the service is not accessing remote spaces or non-indexed data?

The service is implemented as a method that takes a single parameter determining the number of remaining recursions to execute; it returns its execution time on the server.

public interface GridRecursionService {
  public long call(int numberOfRecursions);
}

public class GridRecursionServiceImpl implements GridRecursionService {
  private GridRecursionService proxy;

  public long call(int numberOfRecursions) {
    long start = System.nanoTime();
    if (numberOfRecursions > 0) {
      proxy.call(numberOfRecursions - 1);
    }
    long end = System.nanoTime();
    return end - start;
  }

  public void setProxy(GridRecursionService proxy) {
    this.proxy = proxy;
  }
}

> Do you have a service with empty business logic as part of your benchmark?

When the service is called with zero as the number of recursions, there is virtually no business logic, each ...(more)

larrychu ( 2008-07-11 19:20:09 -0600 )

Larry, let's look at the results you are getting:
- Recursive Query: Level 0 Median: 18 ms
- Recursive Query: Level 1 Median: 146 ms. This is 8 times slower than level 0; theoretically it should be only 2 times slower.
- Recursive Query: Level 2 Median: 1,614 ms. This is 90 times slower than level 0 and 11 times slower than level 1; theoretically it should be only 3 times slower than level 0.

In theory, in a perfect world, on a machine with unlimited cores and resources, the results would have been:
- Recursive Query: Level 0 Median (1 cycle): 18 ms
- Recursive Query: Level 1 Median (2 cycles): 18 x 2 = 36 ms
- Recursive Query: Level 2 Median (3 cycles): 18 x 3 = 54 ms
That is, response time would scale linearly.

Still, as we know, this is not what is going on.

Let's look at the numbers from a different angle, the thread-usage view:
- Level 0 means 32 concurrent threads invoking the service simultaneously, plus at least 32 concurrent threads at the service side to respond: ~64 concurrent threads needed.
- Level 1 means (32 invoking threads x 32) plus (32 responding threads at the service side x 32) = 1024 x 2 = 2048 concurrent threads needed: 32x more than level 0.
- Level 2 means the same as level 1 x 32 = 65,536 concurrent threads needed: 1024x more than level 0.
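The fan-out arithmetic can be checked with a quick calculation. This sketch assumes 32 partitions and that each broadcast level multiplies both the invoking and responding threads by the partition count, as in the breakdown above:

```java
// Quick check of the broadcast fan-out thread counts: every in-flight
// call needs an invoking thread plus a responding thread, and each
// nested broadcast level multiplies the number of in-flight calls
// by the partition count.
public class BroadcastFanout {

    static long threadsNeeded(int partitions, int level) {
        long threads = 2L * partitions;   // level 0: callers + responders
        for (int i = 0; i < level; i++) {
            threads *= partitions;        // each nesting level fans out again
        }
        return threads;
    }

    public static void main(String[] args) {
        for (int level = 0; level <= 2; level++) {
            System.out.println("Level " + level + ": ~"
                    + threadsNeeded(32, level) + " concurrent threads");
        }
    }
}
```

With 32 partitions this yields 64, 2048, and 65,536 threads for levels 0, 1, and 2, matching the figures in the comment.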

The above means you are stressing the machine heavily, incurring huge thread context-switch overhead, which impacts the overall response time.

So the results you are getting are actually impressive. Response time is not degrading in proportion to the thread count; the degradation is in fact much smoother. Looking at level 1, the effective degradation is only a factor of 4: we compare the slowdown of level 1 relative to level 0 (8x) against the number of threads needed for level 1 relative to level 0 (32x). Remember, the theoretical slowdown should be a factor of 2. So you are not in such bad shape.

I suggest running such tests on multiple machines; this should give better results. You are getting better numbers with fewer partitions because you are using fewer threads and there is less thread context switching. In addition, I suggest tuning the TCP configuration on the machine, the JVM settings, and the OS threading behavior. The out-of-the-box settings are not optimized for this kind of activity.

Shay

shay hassidim ( 2008-07-12 06:39:24 -0600 )

Shay,

Thanks for the analysis. It is reasonable to conclude that, because the number of threads required to service the requests grows exponentially with recursion depth, once we have saturated the CPU we should expect an exponential increase in response time.

Would it be safe to conclude that if low response times are an important requirement, we should avoid recursive remote broadcast calls, given the exponential growth of remote-call overhead? Furthermore, consumption of network and OS resources also grows exponentially with broadcast call depth.

Best Regards, Larry

larrychu ( 2008-07-15 12:15:55 -0600 )
