

Performance of GROUP BY

If I execute a JDBC GROUP BY over a clustered space I get about 2 business requests per second.

If I execute a Distributed Task and then inside that task a JDBC GROUP BY to the local space I get about 25 business requests a second.

If the WHERE clause matches a couple of thousand records, the grouping is that slow; if the WHERE clause matches only a handful, it jumps back to 1,200 business requests per second.

Any ideas?

Edited by: John Smith on Dec 17, 2010 2:08 PM

{quote}This thread was imported from the previous forum. For your reference, the original is [available here|http://forum.openspaces.org/thread.jspa?threadID=3563]{quote}

asked 2010-12-17 13:52:05 -0500


updated 2013-08-08 09:52:00 -0500


1 Answer


JDBC GROUP BY performance will be improved in future versions to leverage indexes more effectively.
To improve the performance of your Distributed Task (the one issuing the JDBC GROUP BY), deploy the space with more partitions than you currently have.
For example - If you currently use 3 partitions across 3 GSCs try 15 partitions across 3 GSCs.
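As a sketch, the SLA for such a deployment might look like the following (assuming the partitioned cluster schema without backups; the exact schema name and max-instances-per-vm value would need to match your environment and memory budget):

```xml
<os-sla:sla cluster-schema="partitioned-sync2backup"
            number-of-instances="15"
            max-instances-per-vm="5"
            number-of-backups="0"/>
```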
Shay

answered 2010-12-17 16:27:52 -0500


Comments

15 partitions even on the same VM? How do you create multiple partitions on 3 GSCs?

I found another way to do my GROUP BY as well, which is just to do a SELECT, then loop through the array returned by readMultiple and insert unique records into a HashSet. It's a lot faster.

infectedrhythms ( 2010-12-19 19:20:35 -0500 )

If I understand it correctly, I will do...

<os-sla:sla cluster-schema="partitioned-sync2backup" number-of-instances="15" max-instances-per-vm="5" number-of-backups="0"/>

infectedrhythms ( 2010-12-20 08:37:03 -0500 )

Yep. The GSM will evenly distribute the partitions across all the GSCs.

shay hassidim ( 2010-12-20 08:41:32 -0500 )

Still too slow. I tried 12 partitions, 4 per machine, as I do not have the RAM for more. Unless you are saying that 15 is the magic number :p

We are talking about 40,000 records, and the value that is slow matches about 13,000 of the 40,000. I'm pretty sure I have the right indexes.

This is the executor of my task...

It's a local task that uses the injected clustered space to create a clustered JDBC connection.

public String execute() throws Exception {
    String query = "SELECT field1, field2, field3 FROM com.mycom.domain.model.MyTable "
            + "WHERE field4 = 'blahblah' GROUP BY field1, field2, field3";

    Connection con = getConnection();
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(query);

    HashSet<String> field1 = new HashSet<String>();
    HashSet<String> field2 = new HashSet<String>();
    HashSet<Long> field3 = new HashSet<Long>();

    while (rs.next()) {
        field1.add(rs.getString(1));
        field2.add(rs.getString(2));
        field3.add(rs.getLong(3));
    }

    // report the distinct count of each field
    return "Count1=" + field1.size() + ",Count2=" + field2.size() + ",Count3=" + field3.size();
}

public Connection getConnection() throws Exception {
    Class.forName("com.j_spaces.jdbc.driver.GDriver");

    GConnection connection = GConnection.getInstance(clusteredSpace.getSpace());
    connection.setUseSingleSpace(false); // false = clustered, true = single space

    return connection;
}

And I execute as follows in my servlet...

AsyncFuture<String> future = space.executorBuilder(new ConcatReducer())
        .add(new MyTask())
        .add(new MyOtherTask())
        .execute();

String result = future.get();

I found another way that is a lot faster but more cumbersome to implement...

I basically use a distributed task and query locally using a standard WHERE with no GROUP BY, then use hash sets to reproduce the "group by" and the count I require. So I create a list of hash sets, reduce those, and then just get the size.
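The reduce step described above can be sketched with plain collections (a minimal sketch; the class and method names here are hypothetical, not part of the XAP API): each partition's task returns the set of distinct grouped values it saw locally, and the reducer unions the per-partition sets before taking the size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DistinctReduceSketch {

    // Each partition's task returns the distinct values it found locally.
    // The reducer unions the per-partition sets; the size of the union is
    // the global distinct count the GROUP BY was being used to compute.
    static int reduceDistinctCount(List<Set<String>> perPartitionSets) {
        Set<String> merged = new HashSet<String>();
        for (Set<String> partitionSet : perPartitionSets) {
            merged.addAll(partitionSet);
        }
        return merged.size();
    }

    public static void main(String[] args) {
        List<Set<String>> results = new ArrayList<Set<String>>();
        results.add(new HashSet<String>(Arrays.asList("a", "b")));
        results.add(new HashSet<String>(Arrays.asList("b", "c")));
        // union is {a, b, c}
        System.out.println(reduceDistinctCount(results)); // prints 3
    }
}
```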

infectedrhythms ( 2010-12-20 09:08:47 -0500 )
infectedrhythms wrote:

I basically use a distributed task and query locally using a standard WHERE with no GROUP BY, then use hash sets to reproduce the "group by" and the count I require. So I create a list of hash sets, reduce those, and then just get the size.

This is very similar to the optimizations we are planning to add.

Shay

shay hassidim ( 2010-12-20 10:30:24 -0500 )
