RE: krb5-1.12 - New iprop features - tree propagation & slave notify


RE: krb5-1.12 - New iprop features - tree propagation & slave notify

Richard Basch
Now that I have stress-tested this feature, I have discovered that under high
load it is possible for slave notification events to be triggered but either
lost or, due to resource constraints, left unprocessed (perhaps the kprop is
failing), forcing a fall back to polling.

 

Overall, as long as the rate of change is not too great, I have not observed
latency issues, nor do I see any obvious bugs, except that the notification
is not synchronous, so any failure there may leave the notification
unprocessed and the client not queued to be notified of another update.

 

In other words, what I provided really is an immense improvement over the
prior replication (latency is generally reduced), but there is still an edge
case under load where it might fall back to the prior behavior. I might
change my implementation slightly so that, instead of dropping the client
after the first notification, it implements a countdown of the number of
notifications before the client is dropped from the list. This should reduce
the likelihood of the bug. I do not think slave notification should be
synchronous.
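As a rough illustration, the countdown could look something like this minimal C sketch; the struct, the field names, and the default of three credits are my assumptions for illustration, not taken from the actual patches:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical per-slave entry: instead of forgetting a slave after the
 * first notification, keep a countdown of notifications remaining. */
#define NOTIFY_CREDITS 3   /* assumed default, not from the real patch */

struct slave_ent {
    char name[64];
    int credits;           /* notifications left before the slave is dropped */
};

/* Called when kadmind notifies a slave of pending updates.
 * Returns 1 if the slave should stay on the list, 0 if it is dropped. */
static int notify_slave(struct slave_ent *s)
{
    if (s->credits <= 0)
        return 0;          /* already exhausted: drop from the list */
    s->credits--;          /* spend one notification credit */
    return s->credits > 0;
}

/* When the slave checks in (UPDATE_OK/UPDATE_NIL), its credits reset. */
static void slave_checked_in(struct slave_ent *s)
{
    s->credits = NOTIFY_CREDITS;
}
```

Only a slave that stays silent across several consecutive notifications would fall off the list; any successful check-in restores its full allowance.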

 

 

From: Richard Basch [mailto:[hidden email]]
Sent: Thursday, December 12, 2013 12:01 AM
To: '[hidden email]'
Cc: '[hidden email]'; '[hidden email]'; 'Nico Williams';
'[hidden email]'; 'Richard Basch'
Subject: RE: krb5-1.12 - New iprop features - tree propagation & slave
notify

 

Resent message (reformatted). Outlook previously munged the text wrapping &
URLs.


I cleaned up the commits (especially since I found an error in one of my man
page updates).

 

Wiki:

https://github.com/rbasch/krb5/wiki/Replication-enhancements

 

Within the wiki, it lists the commits:

 
https://github.com/rbasch/krb5/commit/0f4b84669c345cb81e60b26372fa32bb9834c093 (ulog tracking on slaves)
https://github.com/rbasch/krb5/commit/023602cc7934c96ab7130aa19481228958473749 (tree propagation feature)
https://github.com/rbasch/krb5/commit/de078a1f49a8d40ba5a5cb7fdbe22c593f30509a (slave notification)

 

A summary view is available via:

https://github.com/rbasch/krb5/compare/krb5-1.12

 

For anyone just joining the discussion: I enhanced the replication strategy
so a tree-based hierarchy can be set up (useful for improving scale-out and
for sites where multiple servers sit across slow WAN links). In addition,
when updates are registered in the database, notify events are sent to iprop
slave servers so they know there are pending updates (which reduces
replication lag). The exact methodology is described in the wiki.

 

The patches have already been tested with the final 1.12 production release
(they were previously tested against the 1.12 beta releases).

_______________________________________________
krbdev mailing list             [hidden email]
https://mailman.mit.edu/mailman/listinfo/krbdev

Re: krb5-1.12 - New iprop features - tree propagation & slave notify

Nico Williams
On Tue, Jan 14, 2014 at 8:03 PM, Richard Basch <[hidden email]> wrote:
> Overall, as long as the rate of change is not too great, I have not observed
> latency issues, nor do I see any obvious bugs, except that the notification
> is not synchronous and any failure there may result in the notification not
> being processed and the client not being queued to be notified for another
> update.

Not synchronous... relative to the response to the kadm5 client?
Indeed, no one will want that :)  Async is fine, and it's what we've
always expected.  (Which is why it's important to write new keys to
keytabs before writing them to the KDB.)

Or if you meant async as in kadmind not waiting for a slave to fetch
its updates before moving on to the next slave (or next update),
that's also fine, and very much desirable: no slave should be able to
stop replication (unless it's part of the replication hierarchy, in
which case it can stop replication downstream of itself).

> In other words, what I provided really is an immense improvement over the
> prior replication (latency is generally reduced), but there is still an edge
> case under load where it might fall back to prior behavior. I might change
> my implementation slightly so that instead of dropping the client after the
> first notification to implement a countdown for number of notifications
> before the client is dropped from the list. This will likely reduce the
> likelihood of the bug. I do not think slave notification should be
> synchronous.

Dropping slaves after some number of updates without hearing from them
sounds good to me.

Nico
--

Re: krb5-1.12 - New iprop features - tree propagation & slave notify

Basch, Richard
I just figured I would post an update, since I managed to see slaves lagging on updates under high load (and waiting out the iprop interval). Currently I post the initial notification, but it appears high-change environments still need more.

My ideal would be to leave the notification async but, if the notification fails, implement a parameterized retry count before dropping the client.

Personally, I am probably going to increase my ulog size, since the 2500-entry restriction is gone (the kdc.conf man page was not updated to reflect this).

If anyone sees a logic flaw in my code which explains the behavior, feel free to comment. It is possible I overlooked something, but it didn't jump out at me (and I figured people should be aware of the issue, albeit one which is no worse than the current code base).



----- Original Message -----
From: Nico Williams [mailto:[hidden email]]
Sent: Tuesday, January 14, 2014 10:57 PM Eastern Standard Time
To: Richard Basch <[hidden email]>
Cc: [hidden email] <[hidden email]>; Greg Hudson <[hidden email]>; Tom Yu <[hidden email]>; Basch, Richard [Tech]
Subject: Re: krb5-1.12 - New iprop features - tree propagation & slave notify

On Tue, Jan 14, 2014 at 8:03 PM, Richard Basch <[hidden email]> wrote:
> Overall, as long as the rate of change is not too great, I have not observed
> latency issues, nor do I see any obvious bugs, except that the notification
> is not synchronous and any failure there may result in the notification not
> being processed and the client not being queued to be notified for another
> update.

Not synchronous... relative to the response to the kadm5 client?
Indeed, no one will want that :)  Async is fine, and it's what we've
always expected.  (Which is why it's important to write new keys to
keytabs before writing them to the KDB.)

Or if you meant async as in kadmind not waiting for a slave to fetch
its updates before moving on to the next slave (or next update),
that's also fine, and very much desirable: no slave should be able to
stop replication (unless it's part of the replication hierarchy, in
which case it can stop replication downstream of itself).

> In other words, what I provided really is an immense improvement over the
> prior replication (latency is generally reduced), but there is still an edge
> case under load where it might fall back to prior behavior. I might change
> my implementation slightly so that instead of dropping the client after the
> first notification to implement a countdown for number of notifications
> before the client is dropped from the list. This will likely reduce the
> likelihood of the bug. I do not think slave notification should be
> synchronous.

Dropping slaves after some number of updates without hearing from them
sounds good to me.

Nico
--


Re: krb5-1.12 - New iprop features - tree propagation & slave notify

Nico Williams
On Wed, Jan 15, 2014 at 6:19 AM, Basch, Richard <[hidden email]> wrote:
> I just figured I would post an update since I managed to see slaves lagging in updates under high load (and wait the iprop interval). Currently, I post the initial notification, but it appears high-change environments still need more.
>
> My ideal would be to leave the notification async but if the notification fails, implement a parametized retry count before dropping the client.

It's only a ping, right?  So keep track of whether a slave was heard
from since pinging it and ping those not heard from again up to N
times more.  (Also, are you using RPC functions for the ping?  Are you
using the async interfaces?)

If there's a max latency tolerance you have to meet this isn't going
to do it.  You'll need something with closer to synchronous semantics
(e.g., go check all the slaves when you need to know if an update has
replicated) or actual sync semantics (the kadm5 operation does not
complete till all the slaves are updated).

> Personally, I am probably going to increase my ulog size since the 2500 restriction is gone (the kdc.conf man page was not updated to reflect such).

Of course :)  Make sure you're running 64-bit libkadm5srv consumers *only*.

Also, IMO the ulog design needs to be replaced.  It cannot handle very
large entries, for example.

Nico
--

RE: krb5-1.12 - New iprop features - tree propagation & slave notify

Basch, Richard
Actually, in thinking about it: since I am doing a fork/exec of kprop to signal the slave using the kprop dump protocol (but sending an error code), the retry logic based on error conditions should live in kprop itself, which then returns. That way, kadmind/kpropd remain wholly async in their notification.

The RPC only works between kpropd and a local kadmind (in proponly mode) and from kpropd to a remote kadmind; nothing else actually implements RPC (which made the implementation "trickier"). As a refresher: kadmind on the master learns about its slaves from who contacts it and when an UPDATE_OK/UPDATE_NIL is returned. Upon the next update kadmind processes, it spawns a kprop to notify each of the slaves via the kprop (full dump) protocol, but sends an error code just before sending the database (already supported by older versions of kpropd). Upon receipt of this error, kpropd sends a USR1 to the iprop portion of kpropd (the USR1 signal is already supported), notifying it to expire its timer and poll immediately. In tree mode, a kadmind exists on intermediate hosts (supporting only the iprop RPC), and kpropd also attempts to signal the local kadmind (iprop interface) of any changes it has processed locally by sending it a NULL RPC. This last piece therefore allows full tree replication in near real time, or at least that is the theory (the high-load condition is the only thing I am still trying to optimize).

As soon as the fork to spawn kprop succeeds, the client is forgotten; if kprop fails for any reason other than the fork failing, the slave will not be notified and will receive no more notifications (but the periodic iprop poll will eventually kick in to resync the slave).
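The fire-and-forget spawn described above might look something like the sketch below. This is an illustration under my own assumptions (the function name, the argument handling, and the SIGCHLD auto-reap detail are not from the actual patch); it only shows why the parent learns nothing beyond "the fork worked":

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Fire-and-forget notification: fork, exec kprop toward a slave, then
 * forget the child.  SIGCHLD is set to SIG_IGN so the kernel reaps the
 * child automatically; the parent never learns whether kprop itself
 * succeeded, which is exactly the edge case discussed in this thread. */
static pid_t spawn_notify(const char *kprop_path, const char *slave_host)
{
    pid_t pid;

    signal(SIGCHLD, SIG_IGN);  /* auto-reap: no zombies, but no status either */
    pid = fork();
    if (pid == 0) {
        /* Child: run kprop against the slave; _exit if exec fails. */
        execl(kprop_path, "kprop", slave_host, (char *)NULL);
        _exit(127);
    }
    /* Parent: a positive pid only means the fork succeeded. */
    return pid;
}
```

Because the parent discards the child's exit status, any failure inside kprop is invisible to kadmind, which is why Richard proposes putting the retry logic inside kprop itself.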


-----Original Message-----
From: Nico Williams [mailto:[hidden email]]
Sent: Wednesday, January 15, 2014 2:55 PM
To: Basch, Richard [Tech]
Cc: [hidden email]; [hidden email]; [hidden email]; [hidden email]
Subject: Re: krb5-1.12 - New iprop features - tree propagation & slave notify

On Wed, Jan 15, 2014 at 6:19 AM, Basch, Richard <[hidden email]> wrote:
> I just figured I would post an update since I managed to see slaves lagging in updates under high load (and wait the iprop interval). Currently, I post the initial notification, but it appears high-change environments still need more.
>
> My ideal would be to leave the notification async but if the notification fails, implement a parametized retry count before dropping the client.

It's only a ping, right?  So keep track of whether a slave was heard from since pinging it and ping those not heard from again up to N times more.  (Also, are you using RPC functions for the ping?  Are you using the async interfaces?)

If there's a max latency tolerance you have to meet this isn't going to do it.  You'll need something with closer to synchronous semantics (e.g., go check all the slaves when you need to know if an update has
replicated) or actual sync semantics (the kadm5 operation does not complete till all the slaves are updated).

> Personally, I am probably going to increase my ulog size since the 2500 restriction is gone (the kdc.conf man page was not updated to reflect such).

Of course :)  Make sure you're running 64-bit libkadm5srv consumers *only*.

Also, IMO the ulog design needs to be replaced.  It cannot handle very large entries, for example.

Nico
--


RE: krb5-1.12 - New iprop features - tree propagation & slave notify

Basch, Richard
In reply to this post by Nico Williams
I think I just figured out the issue... My code does not queue the client if the server responds with UPDATE_BUSY. Originally I thought there was no code path which still returned UPDATE_BUSY, but it appears the client gets dropped when this condition is hit under high load.

I will work on a patch revision to account for this condition. This should resolve the situation without any other changes being required. My logs indicate UPDATE_BUSY is still returned (though I have to re-examine the code base to see how, since I didn't immediately see how it could occur with 1.11+ after Nico's patches were adopted).
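To illustrate the shape of the fix, the queueing decision can be sketched as below. The enum values are modeled on iprop's update result names, but the helper itself and the local enum are hypothetical, not code from the patch:

```c
#include <assert.h>

/* Status values modeled on iprop's update results (defined locally here
 * to keep the sketch self-contained). */
enum update_status {
    UPDATE_OK,
    UPDATE_NIL,
    UPDATE_BUSY,
    UPDATE_ERROR,
    UPDATE_FULL_RESYNC_NEEDED
};

/* A slave stays on the notification list whenever its reply shows it is
 * alive and tracking the log: OK, nothing-new, or merely busy.  The bug
 * described above was that UPDATE_BUSY fell through to "drop". */
static int keep_client_queued(enum update_status st)
{
    switch (st) {
    case UPDATE_OK:
    case UPDATE_NIL:
    case UPDATE_BUSY:   /* busy slave is still live: keep it queued */
        return 1;
    default:
        return 0;       /* error / resync cases: drop and let polling recover */
    }
}
```

With UPDATE_BUSY folded into the "keep" cases, a momentarily overloaded slave keeps receiving notifications instead of silently reverting to interval polling.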

-----Original Message-----
From: Nico Williams [mailto:[hidden email]]
Sent: Wednesday, January 15, 2014 2:55 PM
To: Basch, Richard [Tech]
Cc: [hidden email]; [hidden email]; [hidden email]; [hidden email]
Subject: Re: krb5-1.12 - New iprop features - tree propagation & slave notify

On Wed, Jan 15, 2014 at 6:19 AM, Basch, Richard <[hidden email]> wrote:
> I just figured I would post an update since I managed to see slaves lagging in updates under high load (and wait the iprop interval). Currently, I post the initial notification, but it appears high-change environments still need more.
>
> My ideal would be to leave the notification async but if the notification fails, implement a parametized retry count before dropping the client.

It's only a ping, right?  So keep track of whether a slave was heard from since pinging it and ping those not heard from again up to N times more.  (Also, are you using RPC functions for the ping?  Are you using the async interfaces?)

If there's a max latency tolerance you have to meet this isn't going to do it.  You'll need something with closer to synchronous semantics (e.g., go check all the slaves when you need to know if an update has
replicated) or actual sync semantics (the kadm5 operation does not complete till all the slaves are updated).

> Personally, I am probably going to increase my ulog size since the 2500 restriction is gone (the kdc.conf man page was not updated to reflect such).

Of course :)  Make sure you're running 64-bit libkadm5srv consumers *only*.

Also, IMO the ulog design needs to be replaced.  It cannot handle very large entries, for example.

Nico
--
