
Debugging Health Checks in Load Balancing on Google Compute Engine

When a health check fails, how can you debug it? It's easier to understand how to debug a health check if you know what a correct load-balancing configuration looks like. In this post, I'll walk you through a correct configuration, talk a bit about how health checks work, and then discuss some typical kinds of failures and how to think about debugging health checks in general. I'll assume that you have some experience with load balancing on Compute Engine. If you're new to the subject, first try the steps in Network Load Balancing in the Compute Engine documentation. 

Load balancing configuration
Let's look at a Debian GNU/Linux 7.8 (wheezy) instance running on Compute Engine. There is a package called google-compute-daemon that owns the /etc/init.d/google-address-manager startup script, as shown by running the following command:

$ dpkg-query -S /etc/init.d/google-address-manager
google-compute-daemon: /etc/init.d/google-address-manager

The address manager's job is to configure the network settings for the instance, including settings for load-balanced IP addresses. Starting with an instance that is not part of a load balancer's target pool, you can see the IP configuration by running the following commands:

$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 42:01:0a:f0:6c:91  
         inet addr:192.0.2.0  Bcast:192.0.2.0  Mask:255.255.255.255
         UP BROADCAST RUNNING MULTICAST  MTU:1460  Metric:1
         RX packets:263618 errors:0 dropped:0 overruns:0 frame:0
         TX packets:311301 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:114548264 (109.2 MiB)  TX bytes:29265762 (27.9 MiB)

lo        Link encap:Local Loopback  
         inet addr:127.0.0.1  Mask:255.0.0.0
         UP LOOPBACK RUNNING  MTU:65536  Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:0
         RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
$ ip route list table local
local 192.0.2.0 dev eth0  proto kernel  scope host  src 192.0.2.0
broadcast 192.0.2.0 dev eth0  proto kernel  scope link  src 192.0.2.0
broadcast 127.0.0.0 dev lo  proto kernel  scope link  src 127.0.0.1
local 127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1
local 127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1
broadcast 127.255.255.255 dev lo  proto kernel  scope link  src 127.0.0.1

Now, let’s look at what happens when you add the instance to a network load balancing target pool. For this example, assume that the load balancer has the IP address 198.51.100.0. When you add the instance to the target pool, the address manager logs the change in syslog:

$ grep google-address /var/log/syslog
Feb 19 00:22:04 instance-1 google-address-manager: INFO Changing public IPs from None to ['198.51.100.0'] by adding ['198.51.100.0'] and removing None
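
For reference, adding an instance to a target pool looks something like the following gcloud command; the pool name and zone here are just placeholders for this example:

$ gcloud compute target-pools add-instances my-pool --instances instance-1 --zone us-central1-f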

The route list also has a new entry for the load balancer's IP address:

$ ip route list table local
local 192.0.2.0 dev eth0  proto kernel  scope host  src 192.0.2.0
broadcast 192.0.2.0 dev eth0  proto kernel  scope link  src 192.0.2.0
broadcast 127.0.0.0 dev lo  proto kernel  scope link  src 127.0.0.1
local 127.0.0.0/8 dev lo  proto kernel  scope host  src 127.0.0.1
local 127.0.0.1 dev lo  proto kernel  scope host  src 127.0.0.1
broadcast 127.255.255.255 dev lo  proto kernel  scope link  src 127.0.0.1
local 198.51.100.0 dev eth0  proto 66  scope host

When the load balancer sends a packet to the backend, the packet is forwarded, not rewritten. In other words, when an instance receives load-balanced traffic, the destination IP address of the packet matches the external address of the load balancer. This is different from traffic that's directed to the external address of the instance itself. Such traffic goes through 1:1 network address translation (NAT) and arrives with its destination IP address set to the instance's internal IP address.
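
If you want to confirm this on an instance, one way (sketched here with the example addresses above) is to capture incoming load-balanced traffic and check that the destination address is the load balancer's rather than the instance's:

$ tcpdump -n -i eth0 dst host 198.51.100.0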

Health checks
Now that you know how traffic flows from the load balancer to the instance, you can see how the health check works. The metadata server at IP address 169.254.169.254 is responsible for sending traffic to the health check URL. The destination address of the health check is the load balancer's external address. This process mimics real incoming traffic.

The health check must be answered with an HTTP 200 status followed by a normal TCP connection closure within the time specified by the timeoutSec setting. For more information about health check options, see the documentation.
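
For reference, a health check with an explicit timeout can be created along these lines; the check name, request path, and values below are only examples, and the --timeout flag corresponds to timeoutSec:

$ gcloud compute http-health-checks create my-health-check \
    --port 80 --request-path /healthz --check-interval 5s --timeout 5s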

Types of health check failures
Here are some common reasons that health checks fail.

Failure 1: Not listening on the load balancer's address
The most common cause of health check failure is binding a service only to the instance's own IP address. Here's an example, set up with the following netcat command:

$ sudo nc -l -p 80 -s 192.0.2.0

$ netstat -an | grep :80
tcp        0      0 192.0.2.0:80       0.0.0.0:*               LISTEN

You can see that there is a service listening on port 80 but, because it's bound to the instance's address, it will never answer queries for the load balancer's external address. It's easy to fix this problem: have your server process listen on 0.0.0.0 so it responds for any address. A server configured like this responds on port 80 for the external address:

$ netstat -an | grep :80
tcp        0      0 0.0.0.0:80              0.0.0.0:*               LISTEN 
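
For example, with the same netcat used above, simply omitting the -s flag produces a listener on all addresses (assuming a traditional netcat, which defaults to 0.0.0.0 when no source address is given):

$ sudo nc -l -p 80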

Failure 2: Address not configured
An earlier version of the google-address-manager startup script had a race condition between the address manager and syslog. You can see the fix on GitHub.

When this race condition occurs, the instance will never be configured to accept traffic on the load balancer's external address because there’s no entry in the routing table. Recall that you can view the routing table by running ip route list table local. A similar issue can also occur if the Linux out-of-memory (OOM) killer runs and kills the google-address-manager daemon. If this is the case, you need to fix the condition causing the daemon not to run. In the meantime, you can start it manually as a workaround.
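
As a sketch of that workaround on this wheezy image, you can check whether the daemon is running and start it with the init script shown earlier; service management may differ on other images:

$ ps aux | grep google-address-manager
$ sudo /etc/init.d/google-address-manager start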

Failure 3: Sending an RST packet
The web server on the instance may be configured to close the health check's TCP connection with a reset (RST) packet instead of the usual TCP four-way closing handshake; some streaming media servers, for example, offer this option. In this case, running tcpdump will show what seems to be good traffic from the web server, until you look at the flags. You can see the R(ST) flag in the following output:

Flags [R.], seq 59, ack 92, win 8096
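
To isolate resets in a capture, you can filter on the RST flag directly; this is generic tcpdump filter syntax, not anything specific to Compute Engine:

$ tcpdump -n 'tcp[tcpflags] & (tcp-rst) != 0'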

If your web server offers this option, make sure it is disabled for the health check URL.

Failure 4: Taking too long to answer
If the web server does not finish responding to the health check within the configured timeout, the instance will be marked unhealthy, even if the server eventually sends an HTTP 200 response code with a proper TCP connection closure. This is an example of the kind of failure that health checks are designed to catch. However, if this happens on a server that you consider healthy, you can address it by increasing the health check timeout period.
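
If you decide to lengthen the timeout, one way to do it is to update the check with gcloud; the health check name and value here are examples only:

$ gcloud compute http-health-checks update my-health-check --timeout 10s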

Failure 5: Not answering directly with a 200 response code
The web server may be configured to redirect to a page that returns an HTTP 200 response code. The health check will not follow the redirect; it expects the health check page to return a 200 directly.
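
A quick way to see what the health check path really returns is to request it from the instance itself, where the load balancer's address is routed locally. curl does not follow redirects by default, so a redirect shows up as a 301 or 302 rather than a 200 (the address here is the example load balancer address from earlier; substitute your own health check path):

$ curl -s -o /dev/null -w '%{http_code}\n' http://198.51.100.0/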

Debugging
Here are some things to think about when you're trying to debug a load balancer that has failing health checks.

Run ip route list table local to check whether the load balancer address is properly configured. If it's not, look into why the address manager is not running. If the address is configured, run a tcpdump on the instances in the load balancer's pool that are in an unhealthy state. For example, run the following command:

$ tcpdump host 169.254.169.254

This command prints all the packets to and from the metadata server, which is the server that issues the health checks. You may need to filter further, because the instance sometimes queries the metadata server for project metadata. You don't need to see those queries.
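
One way to narrow the capture to just the health check conversation (sketched here with the example load balancer address from earlier) is to match on both the metadata server and the load-balanced address; the instance's own metadata queries won't match, because they use the instance's primary address:

$ tcpdump -n host 169.254.169.254 and host 198.51.100.0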

You might be tempted to try to debug health checks by browsing to the instance's external address, but this approach is insufficient because it doesn't probe for failures 1 and 2 above, and can also miss failure 3.

One last point to be aware of when you're debugging is how health checks affect the load balancer. If any instance is marked healthy, the load balancer will send traffic to it. If all the instances are marked unhealthy, they’ll be sent traffic anyway, so as not to drop traffic. (See items 3 and 4 in the backupPool section of the Target pools documentation.) Therefore, sometimes your load balancer looks like it's working even though all instances are marked unhealthy. This could be the case for failure modes 3 or 4; either way, you should still address the health checks so you can properly distinguish between nodes that should get traffic and ones that should not.
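
To see which instances the load balancer currently considers healthy, you can also ask the target pool directly; the pool name and region here are placeholders:

$ gcloud compute target-pools get-health my-pool --region us-central1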

- Posted by Charles Bacon, Technical Solutions Engineer

from Google Cloud Platform Blog http://googlecloudplatform.blogspot.com/2015/07/Debugging-Health-Checks-in-Load-Balancing-on-Google-Compute-Engine.html
