[Kea-users] Help diagnosing (and potentially addressing) a possible performance problem?



Klaus Steden

Hi everyone,

We've been using Kea successfully for several months now as a key part of our provisioning process. However, it seems the server we're running it on (a VM under XenServer 6.5) isn't beefy enough for the load, though I'm not 100% confident in that diagnosis.

There are currently ~200 unique subnets defined, about two-thirds of which are used to provide a single lease during provisioning, at which point the host in question assigns itself a static IP. There are 77 subnets that are actively in use (for IPMI), with the following lease attributes:

  "valid-lifetime": 4000,
  "renew-timer": 1000,
  "rebind-timer": 2000,

From what I'm seeing in the output of tcpdump, there are a LOT more requests coming in than replies going out, and netstat seems to confirm that:

# netstat -us
...
Udp:
    71774 packets received
    100 packets to unknown port received.
    565 packet receive errors
    4911 packets sent

If I monitor netstat continuously, the Recv-Q on Kea's socket fluctuates wildly from moment to moment, anywhere between 0 and nearly 500K bytes (and sometimes higher).
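
(For reference, something along these lines shows that per-socket Recv-Q continuously; the grep assumes the server is listening on the standard DHCP server port 67:)

# watch -n 1 'netstat -anu | grep ":67 "'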

The log also reports a lot of ALLOC_ENGINE_V4_ALLOC_FAIL errors after typically 53 attempts (not sure why 53, but that number seems to be the typical upper limit before failure is confirmed).

I've been experimenting over the last hour or so with tuning various kernel parameters (net.ipv4.udp_mem, net.core.rmem_default, net.core.netdev_max_backlog, etc.), but those don't appear to make any difference, and the Recv-Q remains high.
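
(For illustration, these tunables can be changed at runtime with sysctl; the values below are only examples of the kind of increases I tried, not the exact numbers and not a recommendation. They also don't persist across reboots unless added to /etc/sysctl.conf or /etc/sysctl.d/:)

# sysctl -w net.core.rmem_default=8388608
# sysctl -w net.core.netdev_max_backlog=5000
# sysctl -w net.ipv4.udp_mem="383424 511232 766848"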

Is there any way to tune the daemon to handle this kind of backlog, or a list of kernel tunables I should be looking at? And is there a clearer way to determine whether I've hit a genuine performance limitation that we're only now running into?

I've got a bare metal machine temporarily helping carry the burden and it doesn't have these issues, but then again, it's not carrying the full load; I'm loath to dedicate a whole physical server just to DHCP, but if the load is going to remain high like this, maybe that's just what I have to do.

thanks,
Klaus



Re: [Kea-users] Help diagnosing (and potentially addressing) a possible performance problem?

Rasmus Edgar

Hi Klaus,

I have seen something very similar on VMware with another application receiving a lot of UDP traffic. Unfortunately we never found a solution and switched to bare metal as a workaround, which has irked me ever since, so I'm interested in finding a root cause for these kinds of problems.

As far as I understand, and according to the netstat man page, Recv-Q is the count of bytes not yet copied by the user program connected to the socket.

Do you have any special rules, execute anything, or do DNS lookups when handling DHCP requests?

Have you read the comments on ALLOC_ENGINE_V4_ALLOC_FAIL?

"% ALLOC_ENGINE_V4_ALLOC_FAIL %1: failed to allocate an IPv4 address after %2 attempt(s)
The DHCP allocation engine gave up trying to allocate an IPv4 address
after the specified number of attempts.  This probably means that the
address pool from which the allocation is being attempted is either
empty, or very nearly empty.  As a result, the client will have been
refused a lease. The first argument includes the client identification
information.

This message may indicate that your address pool is too small for the
number of clients you are trying to service and should be expanded.
Alternatively, if you know that the number of concurrently active
clients is less than the addresses you have available, you may want to
consider reducing the lease lifetime.  In this way, addresses allocated
to clients that are no longer active on the network will become available
sooner."

Br,

Rasmus


Re: [Kea-users] Help diagnosing (and potentially addressing) a possible performance problem?

Klaus Steden

Hello Rasmus,

After about a week of analysis, it turns out it was mostly a couple of factors working in concert:

1. Lease times were too short (1 hour), resulting in request storms as entire racks' leases expired roughly simultaneously and swamped the server. I changed the default lease time to 12 hours (roughly sketched after this list), added some monitoring to keep track, and lease counts recovered and stabilized within an hour or so.

2. Some racks timed out when requesting, either because of simple network distance between the clients and the DHCP server or because packets were lost to asymmetric routing. One of the affected areas is in an entirely different segment with different routing and will eventually be firewalled off, so the fastest fix was to spin up another DHCP server and point that segment's switch IP helpers at it instead of the original server. Since the two servers don't share the same lease/reservation tables, they can use the same database without conflicts (I believe this scenario was explicitly tested by the Kea dev team and found to be stable).

3. I think the ALLOC_FAIL messages were actually red herrings from a rack or racks that haven't yet been assigned scopes, so there are no leases available to grant there yet.
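
(For reference, the lease-time change in item 1 amounts to roughly the following at global scope in kea-dhcp4.conf; the renew/rebind values here are just the original 1000/2000/4000 ratios scaled up, not the exact numbers I settled on:)

  "valid-lifetime": 43200,
  "renew-timer": 10800,
  "rebind-timer": 21600,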

Thank you for the feedback -- I was able to work through these issues using insights from your comments, a bit of "rubber duck" debugging with one of our network engineers, and some instrumentation built with InfluxDB. :-)

cheers,
Klaus


_______________________________________________
Kea-users mailing list
[hidden email]
https://lists.isc.org/mailman/listinfo/kea-users