
GigaSpaces XAP application with Cassandra data store

I'm currently looking at GigaSpaces for a large-scale data processing application. The application will need to process large amounts of binary data, partitioned (and preferably replicated to some degree) over a cluster of machines. Typical data sets will be too large to fit in main memory and will need to be read from persistent storage.

Processing of the data set will be distributed across the processing nodes and I am looking to achieve data locality so that processed data is read from local storage rather than requiring network access.

I am looking at Cassandra for the persistent storage, but I was wondering how data locality would be achieved with GigaSpaces. For example, I understand that space data can be persisted by using the CassandraSpaceDataSource. Is there any correlation between the partitioning of space data and data locality within Cassandra? For example, if one were to use GigaSpaces to run a map/reduce operation over a Cassandra data store, would there be a way of achieving this data locality requirement?

Many thanks for your help.

{quote}This thread was imported from the previous forum. For your reference, the original is [available here|http://forum.openspaces.org/thread.jspa?threadID=4064]{quote}

asked 2013-04-18 12:08:05 -0500 by gwynn

updated 2013-08-08 09:52:00 -0500 by jaissefsfex

1 Answer


Gareth,

The general recommendation is to run Cassandra nodes in a 1:1 ratio with the number of machines, since these usually hit the disk as their main resource, and to run GigaSpaces nodes in proportion to the number of cores (or on that order of magnitude).

There are two main options for persisting data:
- Archiver – a good fit when you need to persist new data that does not need to be updated. This also removes the data from the data grid.
- Mirror – a good fit when you need to persist data that is inserted, updated, or removed, and to keep it in the data grid for some time.
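For reference, the mirror option above is wired in the processing unit's pu.xml. A rough sketch, assuming the XAP 9.x-era Spring namespace for Cassandra persistency — the space/mirror names and the data-source/sync-endpoint bean IDs here are placeholders, and the exact attributes depend on the XAP version:

```xml
<!-- Space PU: loads data through the Cassandra data source and
     replicates changes asynchronously to the mirror service. -->
<os-core:space id="space" url="/./mySpace" mirror="true"
               space-data-source="cassandraSpaceDataSource"/>

<!-- Mirror PU (deployed separately): applies the replicated
     insert/update/remove operations to Cassandra. -->
<os-core:mirror id="mirror" url="/./mirror-service"
                space-sync-endpoint="cassandraSpaceSyncEndpoint"/>
```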

Map/reduce activity can run over the data grid or over Cassandra. The two do not necessarily need to be bundled together.

Please let me know if this is what you are looking for.

Shay

answered 2013-04-18 13:51:11 -0500 by shay hassidim

Comments

Thanks for your reply. So if I understand correctly, in my scenario on each machine there would be multiple gigaspaces nodes running and a Cassandra node would be deployed there also. In this case would there be a way of ensuring that data was always read from the local Cassandra node?

Thanks.

gwynn ( 2013-04-19 04:03:12 -0500 )

If all your Cassandra nodes are running on the same machines where the spaces are running, I don't understand why you are asking about local Cassandra nodes.
All Cassandra nodes will be "local", as they will be on the same machine as the space.

Do you plan to have a Task or a Service collocated with the space that will read directly from Cassandra? Is this how you are planning to implement your map-reduce against Cassandra?

Note that Cassandra might consume a large amount of CPU, so running Cassandra nodes on the same machines as the GigaSpaces data grid might not be a good idea.

shay hassidim ( 2013-04-19 07:53:00 -0500 )

Perhaps I didn't explain clearly. Each machine in the cluster will run Cassandra as well as the GigaSpaces processes that will perform the data processing. The dataset is potentially very large, which is why we're using Cassandra to distribute it over a number of machines, so only a partial set of the data will be locally available to a particular machine.

Thanks,

Gareth

Edited by: Gareth Wynn on Apr 19, 2013 10:49 AM

gwynn ( 2013-04-19 08:20:26 -0500 )

Gareth,

To "align" the Cassandra partitioning with the GigaSpaces partitioning, you would need either to change the GigaSpaces routing behavior to match Cassandra (override the hashCode method of the user-defined class used as the routing field of the space class), or to change the Cassandra partitioning to match GigaSpaces (using the hash value of the routing field to calculate the target partition ID), as described here: http://wiki.gigaspaces.com/wiki/display/XAP95/RoutingInPartitionedSpaces
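For reference, the routing described above boils down to a modulo over the routing field's hash. A minimal standalone sketch of that calculation, assuming the documented XAP formula (safeABS of the routing value's hashCode, modulo the partition count, as a 0-based index) — the class and method names here are illustrative, not XAP APIs:

```java
// Illustrates how a routing value maps to a partition, per the XAP routing
// formula: partition = safeABS(routingValue.hashCode()) % partitionCount.
public class RoutingDemo {

    // safeABS: Math.abs(Integer.MIN_VALUE) overflows and stays negative,
    // so that one value is handled specially.
    static int safeAbs(int value) {
        return value == Integer.MIN_VALUE ? Integer.MAX_VALUE : Math.abs(value);
    }

    // 0-based index of the partition that owns this routing value.
    static int targetPartition(Object routingValue, int partitionCount) {
        return safeAbs(routingValue.hashCode()) % partitionCount;
    }

    public static void main(String[] args) {
        // An Integer's hashCode is its value: 42 % 4 == 2.
        System.out.println(targetPartition(42, 4));
    }
}
```

Overriding hashCode on the routing field changes the left-hand side of this modulo, which is what lets a space partition land on the same machine as the Cassandra replica for the same key.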

The above is doable, but it will not allow you to leverage the elastic behavior of GigaSpaces and Cassandra. Still, with very large data sets, allowing a space partition to interact with a local Cassandra node could improve performance, since there would be no remote call to a Cassandra node running on a different machine, especially when writing data. There will be a background activity (depending on the Cassandra cluster configuration) that replicates data between the nodes of the Cassandra ring; this should be taken into consideration.

I suggest you work with our services team to review this topic in a detailed manner.

Shay

shay hassidim ( 2013-04-24 09:05:55 -0500 )
