Non-stable XAP clustering in VM environment - how to configure

Dear Sir,

We are implementing a PoC using gigaspaces-xap-premium-7.1.2-ga (Java) on top of a VM cluster built with VMware ESX 4.0.0. There are two physical servers on the same subnet; each server runs 4 VMs (CentOS 5.5, 1 GB of memory each). A startup script launches the GSA in the background (it calls "${GS_FOLDER}/bin/gs-agent.sh gsa.gsc 1 gsa.global.gsm 2 gsa.global.lus 2 > /dev/null 2>&1 &"), as described in "http://www.gigaspaces.com/wiki/display/XAP71/XAP+on+VMWare".

The distributed XAP runtime is supposed to maintain two global GSMs and two global LUSs. But after running for 10 minutes, the GUI showed 4 GSMs and 4 LUSs in total, and the numbers kept changing. It looks like a split-brain issue among the GSAs (there are 8 GSAs in total, one per VM). The same issue happened when we deployed a space with a 2-partition, 1-backup topology: the number of space instances was not stable.

The attached screen capture shows the total number of GSAs/hosts as 7 (actually 8), GSCs as 6 (should be 8), and GSMs/LUSs as 4 (should be 2), and the numbers kept changing.

This is critical, since we have to make sure the system is stable before putting it into production. Any suggestions on how to tune such an environment? Thanks!

Regards, Tianqi

Attachments

[Multicast_Sender_192.168.38.62.txt|/upfiles/13759704948671606.txt]

[2010-12-27~18.23-gigaspaces-gsc_1-192.168.38.62-3066.log|/upfiles/13759704945750806.txt]

[2010-12-27~18.24-gigaspaces-gsm_2-192.168.38.62-3178.log|/upfiles/13759704943285006.txt]

[2010-12-27~18.23-gigaspaces-gsa-192.168.38.62-3001.log|/upfiles/1375970494140906.txt]

[GSM-LUS-VMWare-VM-Cluster.png|/upfiles/13759704941774109.png]

{quote}This thread was imported from the previous forum. For your reference, the original is [available here|http://forum.openspaces.org/thread.jspa?threadID=3575]{quote}

asked 2010-12-26 20:38:31 -0500

tqwang

updated 2013-08-08 09:52:00 -0500

jaissefsfex

1 Answer


First of all, be aware that running on VMware brings a performance drop compared with running your system on a regular OS.
In particular, the average latency of remote operations is affected.

Can you switch to local GSM and LUS and see if this solves the problem?
Maybe multicast is not configured properly on your system.
Also make sure NIC_ADDR is set correctly on each instance before gs-agent is started.
Make sure you allocate each VM enough memory to accommodate your JVM heap size.
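
The NIC_ADDR advice can be enforced in the startup script itself. The following is only a minimal sketch, assuming a POSIX shell: the helper name first_routable_ipv4 is hypothetical (it is not part of XAP), and the candidate addresses are passed in by hand rather than discovered automatically.

```shell
#!/bin/sh
# Hypothetical helper (not part of XAP): print the first argument that
# looks like an IPv4 address outside 127.0.0.0/8.
# Returns non-zero if no argument qualifies.
first_routable_ipv4() {
  for addr in "$@"; do
    case "$addr" in
      127.*) continue ;;                          # loopback, unusable for clustering
      *:*) continue ;;                            # IPv6, skipped in this sketch
      *.*.*.*) printf '%s\n' "$addr"; return 0 ;; # dotted quad, accept it
    esac
  done
  return 1
}

# Example, commented out so the sketch has no side effects:
# NIC_ADDR=$(first_routable_ipv4 127.0.0.1 192.168.38.62) && export NIC_ADDR
```

Exporting NIC_ADDR this way before gs-agent starts makes a loopback value fail loudly instead of silently binding to 127.0.0.1.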

Shay

Attachments

  1. Multicast_Sender_192.168.38.62.txt
  2. 2010-12-27~18.23-gigaspaces-gsc_1-192.168.38.62-3066.log
  3. 2010-12-27~18.24-gigaspaces-gsm_2-192.168.38.62-3178.log
  4. 2010-12-27~18.23-gigaspaces-gsa-192.168.38.62-3001.log

answered 2010-12-26 20:52:39 -0500

shay hassidim

Comments

Thanks Shay,

1). By moving into local GSM/LUS, do you mean using "gs-agent.bat gsa.gsc 1 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 1 gsa.lus 1" in the startup script? If so, and I start 8 VMs, won't there be 8 GSMs?

2). I checked multicast by running "admin multicastTest -sender -ba localhost -verbose" and "admin multicastTest -receiver -ba localhost -verbose". The receiver side shows many "Received from [sender=127.0.0.1:4164] packet size: 100 bytes" lines, but the sender side only shows "(no ACK msg received)". Is this a problem?

gs> admin multicastTest -sender -ba localhost -verbose
Starting Multicast-Sender...
Started MulticastSocket=/224.0.1.187:4164, ack-reply port: 4161, ttl=1, bind interface=localhost/127.0.0.1, eventSize=100

---------- [localhost.localdomain] NETWORK INTERFACE INFO -----------
Names: eth0 / eth0
Address: fe80:0:0:0:250:56ff:fe34:2df3%2
Address: 192.168.38.62
Names: lo / lo
Address: 0:0:0:0:0:0:0:1%1
Address: 127.0.0.1
---------- [localhost.localdomain] NETWORK INTERFACE INFO -----------

3). NIC_ADDR should have been set correctly. Please refer to the GSC log in the attachment and see if anything is wrong.

4). GSC/GSM/LUS/GSA are set to 128 MB each, and each VM has 1 GB, so it should be enough.

Besides, are there any parameters to control the GSA discovery settings? Any parameter to configure GSM retryTimeout?

Thanks, Tianqi

Attachments

[Multicast_Sender_192.168.38.62.txt|/upfiles/13759704941355453.txt]

[2010-12-27~18.23-gigaspaces-gsc_1-192.168.38.62-3066.log|/upfiles/13759704947464353.txt]

[2010-12-27~18.24-gigaspaces-gsm_2-192.168.38.62-3178.log|/upfiles/13759704948464653.txt]

[2010-12-27~18.23-gigaspaces-gsa-192.168.38.62-3001.log|/upfiles/13759704946098653.txt]

tqwang (2010-12-27 00:03:39 -0500)

For the machines running LUS+GSM, use: gs-agent.bat gsa.gsc 0 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 1 gsa.lus 1

For the machines running GSC only, use: gs-agent.bat gsa.gsc 1 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 0 gsa.lus 0
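
If a single startup script is shared by all VMs, the two command lines above can be selected by role. This is a rough sketch only: the role names "manager" and "container" are labels invented for the example, and it assumes gs-agent.sh on Linux accepts the same arguments as the gs-agent.bat lines quoted above.

```shell
#!/bin/sh
# Print the gs-agent arguments for a machine role, copied from the two
# command lines quoted above. The role names are invented for this sketch.
agent_args() {
  case "$1" in
    manager)   echo "gsa.gsc 0 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 1 gsa.lus 1" ;;
    container) echo "gsa.gsc 1 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 0 gsa.lus 0" ;;
    *) echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Example, commented out so the sketch starts nothing by itself:
# "${GS_FOLDER}/bin/gs-agent.sh" $(agent_args manager) > /dev/null 2>&1 &
```

Keeping the argument strings in one place makes it harder for a VM to start with a stray global GSM or LUS count.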

The bind interface can't be 127.0.0.1. It must be a non-loopback address, such as the machine's eth0 address.

Set the LOOKUPLOCATORS variable on all the machines to the IP(s) of the machine(s) running the LUS. Example: export LOOKUPLOCATORS=ip1,ip2
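
The comma-separated value can be built from a list of LUS hosts kept in one place. A minimal sketch, assuming a POSIX shell; join_locators is a hypothetical helper name, and ip1 and ip2 remain placeholders for the real LUS addresses.

```shell
#!/bin/sh
# Hypothetical helper: join the given LUS host addresses with commas,
# the format used by "export LOOKUPLOCATORS=ip1,ip2" above.
join_locators() {
  old_ifs=$IFS
  IFS=,
  joined="$*"        # "$*" joins arguments with the first character of IFS
  IFS=$old_ifs
  printf '%s\n' "$joined"
}

# Example, commented out; ip1 and ip2 stand for the real LUS addresses:
# LOOKUPLOCATORS=$(join_locators ip1 ip2) && export LOOKUPLOCATORS
```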

Shay

Attachments

[Multicast_Sender_192.168.38.62.txt|/upfiles/13759704949040273.txt]

shay hassidim (2010-12-27 01:57:23 -0500)

Thanks, Shay.

1). It seems the command "gs-agent.bat gsa.gsc 0 gsa.global.gsm 0 gsa.global.lus 0 gsa.gsm 1 gsa.lus 1" gives the same result as running "gsm.sh". If the GSM/LUS node somehow stops, might the whole cluster collapse? So we need to start at least 2 GSM/LUS nodes. The SLA-driven approach won't work in this case.

2). For the multicastTest, after setting the real IP, the sender started receiving ACK messages. Two receivers were created, all running in VMs. The sender log is attached. The latency was not stable, ranging from a few ms to a few seconds. Are there any XAP parameters we need to tune to cope with such a poor network environment, e.g. the FaultDetectionHandler or active-election settings?

3). "export LOOKUPLOCATORS=ip1,ip2" is the unicast approach (vs. multicast), right? Or should it be combined with the multicast settings to smooth the service discovery process?

Thanks, Tianqi

Attachments

[Multicast_Sender_192.168.38.62.txt|/upfiles/13759704942434292.txt]

tqwang (2010-12-27 03:05:57 -0500)

Since my guess is that the problem lies in the VMware network configuration, I would like to start with something basic, i.e. unicast lookup service discovery. Once this is stable we can move to multicast lookup service discovery. We should start with one GSM/LUS, make sure this works, and later move to two GSM/LUS by running the command on 2 different VMs. For your tests, disable multicast discovery by setting the following system property: com.gs.multicast.enabled=false. You can set it through the EXT_JAVA_OPTIONS variable.

To tune the FaultDetectionHandler, active-election, etc., we should consult the support team. I'm not sure this is needed.

Can you run the system in an environment without VMware?
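
As a rough sketch of the EXT_JAVA_OPTIONS step, the property can be passed as a standard JVM -D option. The helper add_java_option is hypothetical; it only appends the option when it is not already present, so the script can be run repeatedly without duplicating it.

```shell
#!/bin/sh
# Hypothetical helper: append a JVM option to EXT_JAVA_OPTIONS unless it
# is already there, then export the variable for gs-agent to pick up.
add_java_option() {
  case " ${EXT_JAVA_OPTIONS:-} " in
    *" $1 "*) ;;  # already present, nothing to do
    *) EXT_JAVA_OPTIONS="${EXT_JAVA_OPTIONS:+$EXT_JAVA_OPTIONS }$1" ;;
  esac
  export EXT_JAVA_OPTIONS
}

# Disable multicast discovery, using the property named above.
add_java_option "-Dcom.gs.multicast.enabled=false"
```

This would run before gs-agent is started, on every VM, together with the LOOKUPLOCATORS setting from the earlier comment.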

Shay

shay hassidim (2010-12-27 20:55:46 -0500)

We identified that one of the 8 VMs had bad network performance, which probably led to the GSA split-brain issue (given SLA settings like gsa.global.gsm 2 gsa.global.lus 2). After removing this node and partitioning the VMs into two zones (LUS/GSM vs. GSC), the system now runs smoothly.

Thanks very much for your help!

Tianqi

tqwang (2010-12-29 01:35:51 -0500)
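
The thread does not say how the slow VM was found. One possible way to compare VMs is to collect the average ping round-trip time for each and look for an outlier. The sketch below is an assumption, not the poster's method: avg_rtt is a hypothetical helper, hosts.txt is a hypothetical file, and the parsing relies on the "min/avg/max" summary line that common ping implementations print.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print the average
# round-trip time in ms, taken from a summary line such as
# "rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms".
avg_rtt() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# Example, commented out; hosts.txt is a hypothetical file listing one
# VM address per line:
# while read -r host; do
#   printf '%s %s\n' "$host" "$(ping -c 20 "$host" | avg_rtt)"
# done < hosts.txt
```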
