sixtydoses. where od is harmless.

May 23, 2009

Edge Load Balancer Network Dispatcher – Double Collocated HA on HP-UX.

Filed under: Tech — Tags: , , , , — od @ 5:01 am

One of my recent projects was to configure the Edge load balancer on 2 servers in a high availability (HA) environment. I rarely do Edge, but the configuration is pretty straightforward. In my past projects, the Edge implementation has always been on separate boxes, which is easier than a collocated setup. In this post I’m going to share my configuration for an Edge dispatcher (MAC forwarding) that resides together with the web server (I’m using IHS) and WebSphere. Each server will use 1 IP address for both web server and dispatcher. The configuration is almost the same, but there were a few issues that I encountered, and I hope this post will be of help to those who are dealing with Edge dispatcher as well.

For a typical setup where the Edge load balancer servers do not reside on the same box as the web servers, the general rules are:
– Primary Edge – cluster IP aliased to its NIC.
– Standby Edge – cluster IP aliased to its loopback.
– Web Servers – cluster IP aliased to loopback.

These rules hold the same in a collocated environment:
– Primary Edge – cluster IP aliased to its NIC.
– Standby Edge – cluster IP aliased to its loopback.

Collocated Edge.

Double collocated HA edge.

Say I have the following:
Cluster IP – 192.168.10.10
Cluster port – 8080
Primary Edge – 192.168.10.20
Backup Edge – 192.168.10.21


default.cfg for Primary Edge:

dscontrol set loglevel 5
dscontrol set logsize 50000000
dscontrol executor start

dscontrol executor set nfa 192.168.10.20

dscontrol highavailability heartbeat add 192.168.10.20 192.168.10.21
dscontrol highavailability backup add primary auto 8880
dscontrol highavailability reach add 192.168.10.55
dscontrol highavailability reach add 192.168.10.56

dscontrol cluster add 192.168.10.10
dscontrol port add 192.168.10.10:8080

dscontrol server add 192.168.10.10:8080:192.168.10.20
dscontrol server add 192.168.10.10:8080:192.168.10.21

dscontrol manager start manager.log 10004
dscontrol man reach set loglevel 5
dscontrol man reach set logsize 50000000
dscontrol advisor start Http 192.168.10.10:8080 Http_192.168.10.10_8080.log



default.cfg for Standby Edge:

dscontrol set loglevel 5
dscontrol set logsize 50000000

dscontrol executor start

dscontrol executor set nfa 192.168.10.21

dscontrol highavailability heartbeat add 192.168.10.21 192.168.10.20
dscontrol highavailability backup add backup auto 8880
dscontrol highavailability reach add 192.168.10.55
dscontrol highavailability reach add 192.168.10.56

dscontrol cluster add 192.168.10.10
dscontrol port add 192.168.10.10:8080

dscontrol server add 192.168.10.10:8080:192.168.10.21
dscontrol server add 192.168.10.10:8080:192.168.10.20

dscontrol manager start manager.log 10004
dscontrol man reach set loglevel 5
dscontrol man reach set logsize 50000000
dscontrol advisor start Http 192.168.10.10:8080 Http_192.168.10.10_8080.log



goActive script:

This script will remove the cluster IP from loopback and alias it to the NIC.

#!/bin/ksh

CLUSTER=192.168.10.10
LOOPBACK=lo0:1

ifconfig $LOOPBACK 0.0.0.0
dscontrol executor configure $CLUSTER



goStandby script:
This script will remove the cluster IP from NIC and alias it to the loopback.

#!/bin/ksh

LOOPBACK=lo0:1
CLUSTER=192.168.10.10
NETMASK=255.255.255.192

dscontrol executor unconfigure $CLUSTER
ifconfig $LOOPBACK $CLUSTER netmask $NETMASK up



goInOp script:
This script will remove the cluster IP from all devices (loopback and NIC).

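#!/bin/ksh

CLUSTER=192.168.10.10
LOOPBACK=lo0:1
NETMASK=255.255.255.192

# Remove the cluster IP from the NIC, then take the loopback alias down.
# LOOPBACK must be defined here, otherwise ifconfig is called without an interface name.
dscontrol executor unconfigure $CLUSTER
ifconfig $LOOPBACK $CLUSTER netmask $NETMASK down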
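
A quick way to confirm what a script actually did is to check which interface currently holds the cluster IP. Here is a minimal sketch, assuming that netstat -in prints one interface per line with the interface name in the first column. The helper name iface_for_addr is made up for illustration; it is not part of Edge or HP-UX.

```shell
# Print the interface name(s) whose line in `netstat -in` style output
# (read from stdin) carries the given address.
# Assumption: the interface name is the first column of each line.
iface_for_addr() {
    addr=$1
    awk -v addr="$addr" '
        { for (i = 2; i <= NF; i++) if ($i == addr) { print $1; break } }
    '
}

# Example (not executed here):
#   netstat -in | iface_for_addr 192.168.10.10
```

If the scripts behave as described, this should print a NIC alias on the active box and lo0:1 on the standby one.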



The normal method to test whether high availability works smoothly is to unplug the network cable from the edge server. I would tail the root mail (/var/mail/root) at the same time, so I could see which HA script gets triggered when the network is interrupted. Another method is to bring down the server, by rebooting it or shutting it down. With a reboot you’ll only have a short time span to monitor the failover in action, but of course this depends on how long your servers take to start up.

But since this is a collocated environment, if I were to opt for either of the testing methods described, I wouldn’t be able to see whether the dispatcher balances requests across both web servers properly (in my case I’m using the round robin algorithm). So what I did was manually stop the executor so that failover occurs. Note that stopping the dsserver alone won’t trigger the HA scripts; in fact it is not necessary to stop the dsserver at all. Well, to be honest, even if it’s not a collocated environment I normally test the HA failover by stopping the executor, since I’m normally working remotely and unplugging the cable requires help from the sys admins. So I might as well test if it’s really working before going through all the hassle.

One of the problems that I encountered was instability. Sometimes the dispatchers would run in the right modes (one active, one standby), but most of the time both would run as active. It was very unstable, with no pattern that I could track. Even worse, sometimes when I tried to run the dispatcher as a standalone LB, all of the incoming requests would be routed directly to the web server, skipping the dispatcher completely. I was stuck with this problem for several days before I finally figured out what the culprit was.

The ibmlb module.

Every time the executor is stopped, the ibmlb module is unloaded. Every time the executor starts, the ibmlb module is loaded into the kernel. I’m lucky that I have dmesg on both servers, so based on dmesg, this is how it should look whenever you stop and start the executor:

ibmlb DLKM successfully unloaded
ibmlb DLKM successfully loaded

But what happened was, when I stopped the executor, the ibmlb module was not unloaded. Its status was busy, and I had to unload the module explicitly.

ibmlb DLKM successfully unloaded
ibmlb DLKM successfully loaded
ibmlb version is 06.01.00.00 – 20060515-232359 [wsbld265]
WARNING: moduload : module is busy, module id = 14, name = ibmlb
WARNING: moduload : module is busy, module id = 14, name = ibmlb
WARNING: moduload : module is busy, module id = 14, name = ibmlb
WARNING: moduload : module is busy, module id = 14, name = ibmlb
WARNING: moduload : module is busy, module id = 14, name = ibmlb
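
Since the module state is the thing to watch, it helps to pull the latest ibmlb event out of dmesg instead of eyeballing it. This is only a sketch: it matches the exact message texts shown in this post, and ibmlb_last_state is a name I picked for illustration.

```shell
# Classify the most recent ibmlb message in dmesg-style text read from stdin.
# Prints one of: loaded, unloaded, busy, unknown.
# Assumption: the messages look like the ones quoted in this post.
ibmlb_last_state() {
    awk '
        /ibmlb DLKM successfully unloaded/ { state = "unloaded"; next }
        /ibmlb DLKM successfully loaded/   { state = "loaded"; next }
        /module is busy/ && /name = ibmlb/ { state = "busy" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Example (not executed here):
#   dmesg | ibmlb_last_state
```

After stopping the executor, anything other than unloaded would suggest the module is still hanging around.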

I’ve not seen anything like this before (I used to configure the dispatcher on AIX servers). Consider the following test cases (ARP table checked from a different server that resides on the same segment):

TEST 1.

1) Primary active, Backup standby. Cluster IP belongs to Primary.
2) Primary down, Backup goes active. Module ibmlb is UNLOADED successfully on Primary. Cluster IP belongs to Backup.
3) Primary up in active mode, Backup goes standby. Cluster IP belongs to Primary.

TEST 2.
1) Primary active, Backup standby. Cluster IP belongs to Primary.
2) Primary down, Backup goes active. Module ibmlb is busy and still LOADED on Primary. Cluster IP belongs to Backup.
3) Primary up in active mode, Backup stays active. Cluster IP belongs to Primary, but all requests will skip dispatcher and go straight to the web server.


TEST 3.

1) Primary active, Backup standby. Cluster IP belongs to Primary.
2) Primary down, Backup goes active. Module ibmlb is UNLOADED successfully on Primary. Cluster IP belongs to Backup.
3) Primary up in active mode, Backup goes standby. Cluster IP belongs to Primary.
4) Backup down. Module ibmlb is UNLOADED successfully.
5) Backup up, running in standby mode.
6) Backup down. Module ibmlb is busy and still LOADED on backup.
7) Backup up, running in active mode (remember that Primary is in active mode too). Cluster IP belongs to Backup, but all requests will skip the dispatcher and go straight to the web server.
8) Backup down. Module ibmlb is busy and still LOADED on Backup. Explicitly unload the module using the kcmodule command until it gets UNLOADED. Cluster IP belongs to Primary.
9) Backup up, running in standby mode.

Most of the time I wasn’t able to unload it right away; I had to let the server ‘rest’ for about 15 to 20 minutes before trying to unload it again. Rebooting the server always solves this problem (the module’s next state is unused). I’m not sure if there’s a way to force a module to be unloaded though. As far as I know there’s no force flag for kcmodule.
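
The waiting and retrying can be scripted rather than done by hand. Below is a generic retry helper as a rough sketch; retry_until_ok is a made-up name, and the kcmodule call in the example comment assumes that setting the module state to unused is how the unload is requested on your HP-UX release, so check the man page first.

```shell
# Run a command repeatedly until it succeeds or the attempts run out.
# usage: retry_until_ok <attempts> <sleep_seconds> <command> [args...]
retry_until_ok() {
    attempts=$1
    pause=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$pause"
        fi
        n=$((n + 1))
    done
    return 1
}

# Example (not executed here, assumed kcmodule syntax):
# try every 5 minutes, up to 6 times.
#   retry_until_ok 6 300 kcmodule ibmlb=unused
```

This does not make the module any less busy; it just saves you from sitting there retyping the command.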

I was fooled several times because I tested the splash page of the web servers from my Opera browser. I was on a different subnet, so I guess there must be a switch/router between me and the edge servers. At times, even when the cluster IP was aliased to the Primary Edge, my browser would hit the Backup Edge because the ARP cache had not been refreshed. It was so annoying since this affects the cluster report. The rest of the testing was done by running a browser on a different server that belongs to the same subnet. At least there I could clear the ARP cache manually if I had to.
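
To see which box a test machine currently believes owns the cluster IP, look up the MAC address its ARP cache holds for that IP and compare it with the NICs of the two edge servers. A small sketch, assuming arp -a prints lines in the usual "host (ip) at mac ..." layout; arp_mac_for is a hypothetical helper name.

```shell
# Print the MAC address that an `arp -a` listing (read from stdin) maps to an IP.
# Assumption: each entry looks like "host (ip) at mac ...".
arp_mac_for() {
    ip=$1
    awk -v ip="($ip)" '
        {
            for (i = 1; i < NF; i++)
                if ($i == ip && $(i + 1) == "at") { print $(i + 2); exit }
        }
    '
}

# Example (not executed here):
#   arp -a | arp_mac_for 192.168.10.10
# A stale entry can usually be dropped with: arp -d 192.168.10.10
```

If the MAC printed is the Backup’s while the Primary is supposed to be active, the cache is stale and the test results can’t be trusted yet.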

Okay, probably this is my browser’s problem, but testing the splash page with Firefox sucks. It kept on showing the splash page even after I’d stopped both web servers and cleared the cache. It was alright with Opera though. What gives?

By the way, I’m using Edge v6.1. If you check out the Edge Fixpack page here, you’ll notice that there is no patch for HP-UX. Not a single patch. Is IBM trying to say something? Don’t use Edge on HP-UX, perhaps? Lol. Anyway, IBM packed me a patch (6.1.0.35), but it still didn’t address the module issue. I’m not sure if I could call it a patch though; it’s more like an installer since I had to reinstall everything.

Thanks to Robert Brown from IBM for assisting me on this ‘false alarm’ panic attack (initially I thought it was a network issue).

July 18, 2008

I’m ready! I’m ready!

Filed under: Life, Tech — Tags: , — od @ 2:48 am
I'm ready! I'm ready!

July 9, 2008

Too much of beans is driving me nuts.

Filed under: Life — Tags: , , — od @ 5:40 pm

This is a story of Enterprise JavaBeans, better known as EJB. I don’t like EJB. In fact, I don’t like Java. Ok, I do like Java, but I don’t understand it. Well, prolly it’s because I’m so stressed out about the upcoming exam, but seriously, there are too many terms, too many policies and all sorts of beans that need to be memorized, or preferably, understood.

*sigh*

I don’t think I am well prepared for next week’s exam. But yea, I am sort of mentally prepared for the worst.

All these beans are.. urgh.. annoying. So this morning I made myself a wallpaper in an attempt to encourage (or force) myself to believe that EJB is full of fun and goodness! Well, to all of you EJB lovers out there, I dedicate this wallpaper to you. Yes fellas, it’s free.

*rolleyes drum-rolls*

EJBean.

March 20, 2008

File name truncated – huh?

Filed under: Tech — Tags: , — od @ 5:58 pm

I was installing WAS v6.0.2 on an HP-UX B.11.31 IA-64 server and had a problem while trying to install the Update Installer for fixpack 25. Now again, the problem did not occur during the patching, but during the installation of the Update Installer itself.

Since WAS v6.0.2.21, IBM has separated the update installer package from the fix pack. This is great since it saves time by avoiding the redundant download of the update installer, which can normally be used across similar product versions. But now you must be sure that you have the correct version of the update installer to match the fixpack to be installed; IBM takes care of this with ‘FIX CENTRAL’. As for me, I’m sure I have the correct one.

Now back to the problem with the update installer. According to the logs there was a missing file: no such file or directory. Well, actually there was more than one missing file, but the installation stops each time it fails to find a particular file, and on the next re-run it detects another missing file, and so on. When I went through the directories, I found a few files with their names truncated. So that explains the missing-file errors.

I untarred the same file on my Ubuntu box, checked the file names and they were all in perfect condition. Erms. I’m confused. Why and how did the names get truncated? Probably during the sftp of the installer from my lappie to the server?
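
Instead of waiting for the installer to trip over one missing file per run, the whole set can be checked in one go. The idea in this sketch is to take a member listing from a machine where the names look right (for example tar -tf run on the Ubuntu side), copy that listing over, and report every path that does not exist in the extracted tree. The function name missing_members and the file names in the example comment are made up for illustration.

```shell
# Read archive member paths from stdin (one per line) and print those
# that do not exist under the given directory.
# usage: missing_members <extract_dir> < listing.txt
missing_members() {
    dir=$1
    while IFS= read -r member; do
        [ -e "$dir/$member" ] || printf '%s\n' "$member"
    done
}

# Example (not executed here, hypothetical file names):
#   on the good machine:  tar -tf updi.tar > updi.list
#   on the server:        missing_members /tmp/updi < updi.list
```

An empty result means every name survived; anything printed is a file whose name got mangled or lost along the way.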

A screenshot that says it all:

WAS err.

On a different note, I think it would be nice if IBM could provide a checksum for all the files available for download so that I could simply check whether the files that I downloaded are corrupted. At the moment I’m keeping a list of my own md5 checksums of all the installers that I have downloaded.
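
For what it’s worth, keeping such a list takes two tiny wrappers around md5sum. This is a sketch that assumes GNU md5sum is available (as on Ubuntu); record_md5 and verify_md5 are names I made up, and the paths in the example comment are hypothetical.

```shell
# Append a checksum entry for each given file to a list file.
# usage: record_md5 <list_file> <file>...
record_md5() {
    list=$1
    shift
    md5sum "$@" >> "$list"
}

# Re-check every file named in the list; fails if any checksum differs.
# usage: verify_md5 <list_file>
verify_md5() {
    md5sum -c "$1"
}

# Example (not executed here, hypothetical paths):
#   record_md5 ~/installers.md5 ~/installers/updi.tar
#   verify_md5 ~/installers.md5
```

Note that the list stores paths exactly as they were given, so verify from the same directory or record absolute paths. It only proves a copy matches my own download, not that the download itself was good.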

Another thing that bothers me is that sometimes I just don’t understand the Download Director that I use to download IBM software. While I like it more than http download, I am so confused by the ETA. How do you define a negative value for an ETA? This normally happens when I download multiple files at a time. Gah.

Download director.

March 14, 2008

Install WAS Base/ND v6.1.0 on Ubuntu Gutsy.

Filed under: Tech — Tags: , , , — od @ 2:07 pm

There are 2 things that need to be configured in order to install WebSphere Application Server Base/ND on Ubuntu Gutsy successfully – tested using WAS Base/ND v6.1.

1 – Ubuntu Gutsy links sh to dash instead of bash. There won’t be any error during the installation of WAS itself, but you will not be able to create any profile, so it’s useless. There are two ways to fix this: either remove the symlink and relink it to bash, or change the shebang line inside the WAS install script from #!/bin/sh to #!/bin/bash. Changing the default shell from dash to bash may make your system slower since dash is lighter than bash, but I think it is hardly noticeable. More info at https://wiki.ubuntu.com/DashAsBinSh.
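
Before touching anything, it is worth confirming what sh really points to. A minimal sketch, assuming readlink -f is available (it is on Ubuntu); sh_target is just an illustrative name.

```shell
# Print the name of the binary that a path such as /bin/sh finally resolves to.
# usage: sh_target [path]   (defaults to /bin/sh)
sh_target() {
    basename "$(readlink -f "${1:-/bin/sh}")"
}

# Example (not executed here):
#   sh_target            # prints dash on a default Gutsy install
# Relinking, if you go that way (changes the system shell, so think twice):
#   sudo ln -sf bash /bin/sh
```

If it already prints bash, the profile creation problem is coming from somewhere else.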

2 – This applies to WAS ND; I didn’t encounter any issue with Base. If you’re having a problem getting the dmgr server up, and the error in SystemOut.log is something like this:

[3/12/08 15:38:06:539 MYT] 0000000a LogAdapter E DCSV9403E: Received an illegal configuration argument. Parameter
MulticastInterface, value: 127.0.1.1. Exception is java.lang.Exception: Network Interface 127.0.1.1 was not found in
local machine network interface list. Make sure that the NetworkInterface property is properly configured!
at com.ibm.rmm.mtl.transmitter.Config.<init>(Config.java:238)
at com.ibm.rmm.mtl.transmitter.MTransmitter.<init>(MTransmitter.java:192)
at com.ibm.rmm.mtl.transmitter.MTransmitter.getInstance(MTransmitter.java:406)
at com.ibm.rmm.mtl.transmitter.MTransmitter.getInstance(MTransmitter.java:345)
at com.ibm.htmt.rmm.RMM.getInstance(RMM.java:128)
at com.ibm.htmt.rmm.RMM.getInstance(RMM.java:189)
at com.ibm.ws.dcs.vri.transportAdapter.rmmImpl.rmmAdapter.RmmAdapter.<init>(RmmAdapter.java:218)
at com.ibm.ws.dcs.vri.transportAdapter.rmmImpl.rmmAdapter.MbuRmmAdapter.<init>(MbuRmmAdapter.java:76)
at com.ibm.ws.dcs.vri.transportAdapter.rmmImpl.rmmAdapter.RmmAdapter.getInstance(RmmAdapter.java:133)
at com.ibm.ws.dcs.vri.transportAdapter.TransportAdapter.getInstance(TransportAdapter.java:161)
at com.ibm.ws.dcs.vri.common.impl.DCSCoreStackImpl.<init>(DCSCoreStackImpl.java:178)
at com.ibm.ws.dcs.vri.common.impl.DCSCoreStackImpl.getInstance(DCSCoreStackImpl.java:167)
at com.ibm.ws.dcs.vri.common.impl.DCSStackFactory.getCoreStack(DCSStackFactory.java:92)
at com.ibm.ws.dcs.vri.DCSImpl.getCoreStack(DCSImpl.java:84)
at com.ibm.ws.hamanager.coordinator.impl.DCSPluginImpl.<init>(DCSPluginImpl.java:238)
at com.ibm.ws.hamanager.coordinator.impl.CoordinatorImpl.<init>(CoordinatorImpl.java:322)
at com.ibm.ws.hamanager.coordinator.corestack.CoreStackFactoryImpl.createDefaultCoreStack(CoreStackFactoryImpl.java:82)

Chances are you have not assigned an IP address to your hostname, apart from the default 127.* address. If this is the case you won’t be able to federate nodes to the Dmgr either. So edit your hosts file. Since Edgy the hostname has been split off to 127.0.1.1, so you will see 127.0.0.1 assigned to localhost and 127.0.1.1 to your hostname. Assign your hostname to 127.0.0.1 as well, and problem solved. But if you plan to federate nodes, then assign an IP address to your hostname. Your hosts file should look something like this:

127.0.0.1 localhost YourHostName
127.0.1.1 YourHostName

Done.
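
If you want to check a box before the dmgr complains, the hosts file can be inspected with a couple of small functions. This is only a sketch under the assumption that the file uses the usual "address name [alias...]" layout; hosts_addrs_for and only_loopback are hypothetical helper names.

```shell
# Print every address that a hosts file assigns to the given name.
# usage: hosts_addrs_for <name> [hosts_file]
hosts_addrs_for() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }   # ignore comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; break } }
    ' "$file"
}

# Succeed if the name is mapped only to 127.* addresses (or not mapped at all).
# usage: only_loopback <name> [hosts_file]
only_loopback() {
    ! hosts_addrs_for "$1" "${2:-/etc/hosts}" | grep -qv '^127\.'
}

# Example (not executed here):
#   only_loopback "$(hostname)" && echo "hostname has no non-loopback address"
```

When only_loopback succeeds, node federation is the case to worry about, as described above.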