iprop: Problem forcing complete database sync

classic Classic list List threaded Threaded
20 messages Options
Reply | Threaded
Open this post in threaded view
|

iprop: Problem forcing complete database sync

Patrik Lundin-2
Hello,

I am currently testing a 7.4.0 based KDC master with a slave running the
7.1.0 version available in Debian Stretch.

Currently I am experiencing problems with getting iprop to force a
complete sync of the database. As is stated in iprop-log(8) truncating
the log with --reset should cause full propagations.

I belive the iprop code does not properly handle the case where the log
has been reset to version 0 on the master:
===
kdc-lab-master1:~# iprop-log truncate --reset
kdc-lab-master1:~# iprop-log dump --no-lock
nop: ver = 0, timestamp = 2017-10-04 12:51:32, len = 16
uberblock offset 40 timestamp 2017-10-04 12:51:32 version 0
===

At this point depending on the state of the slave, the propogation will
fail in one of two ways depending on if the slave has a log with changes
or not.

If the slave log has a version number greater than 0 the master will
notice it is out of sync and attempt to update it only to fail in
send_diffs() because kadm5_log_get_version_fd() returns
HEIM_ERR_EOF (and this is documented behaviour for that function when
the log is truncated and LOG_VERSION_FIRST is supplied).

The code in ipropd_master.c where we error out because 'ret' is set to
HEIM_ERR_EOF:
===
722     ret = kadm5_log_get_version_fd(server_context, log_fd, LOG_VERSION_FIRST,
723                                    &initial_version, &initial_tstamp);
724     sp = kadm5_log_goto_end(server_context, log_fd);
725     flock(log_fd, LOCK_UN);
726     if (ret) {
727         if (sp != NULL)
728             krb5_storage_free(sp);
729         krb5_warn(context, ret, "send_diffs: failed to read log");
730         send_are_you_there(context, s);
731         return ret;
732     }
===

Resulting in these logs on the master:
===
ipropd-master[3205]: connection from iprop/[hidden email]
ipropd-master[3205]: Slave iprop/[hidden email] (version 22) have later version the master (version 0) OUT OF SYNC
ipropd-master[3205]: send_diffs: failed to read log: End of file
===

... and these logs on the slave:
===
ipropd-slave[13608]: slave status change: connecting to master: kdc-lab-master1.example.com
ipropd-slave[13608]: connection successful to master: kdc-lab-master1.example.com[192.168.1.100]
ipropd-slave[13608]: ipropd-slave started at version: 22
ipropd-slave[13608]: slave status change: connected to master, waiting instructions
ipropd-slave[13608]: slave status change: up-to-date with version: 22 at 2017-10-04T12:59:38
ipropd-slave[13608]: krb5_write_message: write: Broken pipe
ipropd-slave[13608]: krb5_write_priv_message: write: Broken pipe
===

Attempting to reset the slave completely by removing the heimdal.db, the
log, and the ipropd-slave-status file (only retaining the m-key) the
slave will get stuck in a non-functional state where no database is
fetched:
===
root@kdc-lab-slave1:~# /etc/init.d/heimdal-kdc stop
Stopping heimdal-kdc (via systemctl): heimdal-kdc.service.
root@kdc-lab-slave1:~# ls /var/lib/heimdal-kdc/
heimdal.db  ipropd-slave-status  log  m-key
root@kdc-lab-slave1:~# rm /var/lib/heimdal-kdc/heimdal.db /var/lib/heimdal-kdc/ipropd-slave-status /var/lib/heimdal-kdc/log
root@kdc-lab-slave1:~# /etc/init.d/heimdal-kdc start
root@kdc-lab-slave1:~# ls /var/lib/heimdal-kdc/
ipropd-slave-status  log  m-key
===

Master logs:
===
ipropd-master[3205]: connection from iprop/[hidden email]
ipropd-master[3205]: slave iprop/[hidden email] in sync already at version 0
===

Slave logs:
===
ipropd-slave[13701]: slave status change: getting credentials from keytab/database
ipropd-slave[13701]: slave status change: creating log file
ipropd-slave[13701]: slave status change: ipropd-slave started
ipropd-slave[13705]: slave status change: connecting to master: kdc-lab-master1.example.com
ipropd-slave[13705]: connection successful to master: kdc-lab-master1.example.com[192.168.1.100]
ipropd-slave[13705]: ipropd-slave started at version: 0
ipropd-slave[13705]: slave status change: connected to master, waiting instructions
ipropd-slave[13705]: slave status change: up-to-date with version: 0 at 2017-10-04T13:10:00
===

Trying to modify the master database at this point will lead to a very
confused slave, where it appears to follow the updates, but only ends up
storing the delta:
===
root@kdc-lab-master1:~# kadmin -l
kadmin> ank --use-defaults testingprincipal
[hidden email]'s Password:
Verify password - [hidden email]'s Password:
kadmin>
===

Master logs:
===
ipropd-master[3205]: Got a signal, updating slaves 0 to 1
ipropd-master[3205]: syncing slave iprop/[hidden email] from version 0 to version 1
ipropd-master[3205]: slave iprop/[hidden email] is now up to date (1)
ipropd-master[3205]: slave iprop/[hidden email] in sync already at version 1
===

Slave logs:
===
ipropd-slave[13835]: connection successful to master: kdc-lab-master1.labkrb.it.example.com[192.168.1.100]
ipropd-slave[13835]: ipropd-slave started at version: 0
ipropd-slave[13835]: slave status change: connected to master, waiting instructions
ipropd-slave[13835]: slave status change: up-to-date with version: 0 at 2017-10-04T14:05:37
ipropd-slave[13835]: slave status change: up-to-date with version: 1 at 2017-10-04T14:06:19
ipropd-slave[13835]: slave status change: up-to-date with version: 1 at 2017-10-04T14:06:19
ipropd-slave[13835]: slave status change: up-to-date with version: 1 at 2017-10-04T14:06:49
===

... and the heimdal.db that appeared on the slave after this operation
is empty except for the specific principal created by the modification:
===
root@kdc-lab-slave1:~# kadmin -l
kadmin> list *
testingprincipal
kadmin>
===

The only way I have found to "fix" the slave at this point is to stop
the services on it, manually increment the log version past the master
version and then start the services again. Since the slave log version
is now greater than the master, and the master has a log version greater
than 0, the full download will successfully finish.

The master is at version 1:
===
root@kdc-lab-master1:~# iprop-log last-version --no-lock
version: 1
===

Stop the slave and bump the version to 2:
===
root@kdc-lab-slave1:~# /etc/init.d/heimdal-kdc stop
Stopping heimdal-kdc (via systemctl): heimdal-kdc.service.
root@kdc-lab-slave1:~# iprop-log last-version --no-lock
version: 1
root@kdc-lab-slave1:~# kadmin -l cpw -r testingprincipal
root@kdc-lab-slave1:~# iprop-log last-version --no-lock
version: 2
root@kdc-lab-slave1:~# /etc/init.d/heimdal-kdc start
Starting heimdal-kdc (via systemctl): heimdal-kdc.service.
===

Master logs:
===
ipropd-master[3205]: connection from iprop/[hidden email]
ipropd-master[3205]: Slave iprop/[hidden email] (version 2) have later version the master (version 1) OUT OF SYNC
ipropd-master[3205]: slave iprop/[hidden email] (version 2) out of sync with master (first version in log 0), sending complete database
ipropd-master[3205]: wrote new dumpfile (version 1)
ipropd-master[3205]: slave iprop/[hidden email] in sync already at version 1
===

Slave logs:
===
ipropd-slave[13938]: connection successful to master: kdc-lab-master1.example.com[192.168.1.100]
ipropd-slave[13938]: ipropd-slave started at version: 2
ipropd-slave[13938]: slave status change: connected to master, waiting instructions
ipropd-slave[13938]: receive complete database
ipropd-slave[13938]: slave status change: up-to-date with version: 1 at 2017-10-04T14:32:31
ipropd-slave[13938]: slave status change: up-to-date with version: 1 at 2017-10-04T14:32:31
===

I am aware that iprop has gone through quite a lot of changes a while
back so I wonder if this might be a regression introduced then. Since
not even a fresh 'init' of the database will cause a log version number
of "0" I guess this will only be noticed once you attempt to do manual
resets.

Regards,
Patrik Lundin
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Harald Barth-2

By looking at the code of send_diffs (ipropd_master.c:666) it looks to
me that there somewhere at the beginning of that long function, before
plowing through the log on the master, there should be something like

if (current_version == 0 || s->version == 0){

   (...)

   return send_complete (...);
}

to catch these special cases. What else should the master do
than to send everyting in these cases? I don't see any
other code which clearly handles when current_version == 0
is encountered or if the slave reported that it was at version 0
(tmp == 0 in process_msg) which then sets s->version = 0.

In addition to that, the readability of the source code in process_msg()
could improve tremendously with better variable names and fixed grammar
or better description what happended in the warning messages.

Harald.
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Patrik Lundin-2
On 2017-10-05 11:52:59, Harald Barth wrote:

>
> By looking at the code of send_diffs (ipropd_master.c:666) it looks to
> me that there somewhere at the beginning of that long function, before
> plowing through the log on the master, there should be something like
>
> if (current_version == 0 || s->version == 0){
>
>    (...)
>
>    return send_complete (...);
> }
>
> to catch these special cases. What else should the master do
> than to send everyting in these cases?I don't see any
> other code which clearly handles when current_version == 0
> is encountered or if the slave reported that it was at version 0
> (tmp == 0 in process_msg) which then sets s->version = 0.
>

There used to be code early in send_diffs() which called send_complete()
if the slave was at version 0, but it was removed in commit
539ba5fb8751d7c8741c89e99645775f131ba4fe. Checking out heimdal-1-5-branch will
show the following snippet:
===
475     /* if slave is a fresh client, starting over */
476     if (s->version == 0) {
477         krb5_warnx(context, "sending complete log to fresh slave %s",
478                    s->name);
479         return send_complete (context, s, database, current_version);
480     }
===

From what I can tell iprop has no way to differentiate between "I have no
database, therefore I consider myself at version 0" and "I have a database, in
sync with version 0".

Unless I am missing something this would mean that if something like the code
above was added back, truncating the log to version 0 would make all slaves
fetch the complete database over and over again until a modification bumps the
version at the master.

Regards,
Patrik Lundin
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Harald Barth-2

> From what I can tell iprop has no way to differentiate between "I have no
> database, therefore I consider myself at version 0" and "I have a database, in
> sync with version 0".

I think (and hope) that a database is never at version 0 as it should
be created at version 1. So after the send_complete() which was
triggered by (current_version == 0 || s->version == 0) the
current_version will be 1 or greater and I don't think looping will
happen.

> There used to be code early in send_diffs() which called send_complete()
> if the slave was at version 0, but it was removed in commit
> 539ba5fb8751d7c8741c89e99645775f131ba4fe. Checking out heimdal-1-5-branch will
> show the following snippet:
> ===
> 475     /* if slave is a fresh client, starting over */
> 476     if (s->version == 0) {
> 477         krb5_warnx(context, "sending complete log to fresh slave %s",
> 478                    s->name);
> 479         return send_complete (context, s, database, current_version);
> 480     }
> ===

Removing that without putting something else in place (and I would say
not only s->version == 0 but even current_version == 0 should trigger
a send_complete) was probably a bad idea.

I would try to add something like this at ipropd_master.c:685:

if (s->version == 0 && current_version > 0) {
   krb5_warnx(context, "sending coplete database to fresh slave %s", s->name);
   write_dump(context, s, database, current_version);
   return send_complete (context, s, database, current_version, 0);
}
if (current_version == 0) {
   /* What to do here ? */
}

problem with this snipplet is that it only works if current_version
has not been reset to 0. So how can we recover the real current version
of the database so that we can do something like:

if (current_version == 0) {
   current_version = recover_current_version(database);
   write_dump(context, s, database, current_version);
   return send_complete (context, s, database, current_version, 0);
}


> From what I can tell iprop has no way to differentiate between "I have no
> database, therefore I consider myself at version 0" and "I have a database, in
> sync with version 0".

I think the log can be at version 0 (which means "invalid, please
recover") but the database should not be able to be at version 0.

> Unless I am missing something this would mean that if something like the code
> above was added back, truncating the log to version 0 would make all slaves
> fetch the complete database over and over again until a modification bumps the
> version at the master.

That would be unfortunate ;-)

Harald.
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Jeffrey Hutzelman
On October 6, 2017 5:56:22 AM EDT, Harald Barth <[hidden email]> wrote:
>
>Removing that without putting something else in place (and I would say
>not only s->version == 0 but even current_version == 0 should trigger
>a send_complete) was probably a bad idea.

It shouldn't be needed. First, a full dump is automatically triggered if the master doesn't have enough log entries to bring the slave from its current version to the master version. so if a slave us at version 0, you either get a complete log replay out a full dump, either of which is sufficient. In practice, it should always be a full dump, because the log should never contain an entry for version 1.

It's also not necessary and in fact potentially harmful to check the master's version for 0.  A real database can never be at version 0, so if you see that, something is wrong, and copying that database wholesale to another server is probably a bad idea.

Remember that truncation does not zero the version. It throws away the existing log entries and writes a new log containing a single transaction with a newer version than the existing log. This forces a full update for any slave that was not already up to date.

>I think the log can be at version 0 (which means "invalid, please
>recover") but the database should not be able to be at version 0.

only an empty, never-initialized log can be at version 0.

databases don't have versions; they're entirely an artifact of the log

>> Unless I am missing something this would mean that if something like
>the code
>> above was added back, truncating the log to version 0 would make all
>slaves
>> fetch the complete database over and over again until a modification
>bumps the
>> version at the master.
>
>That would be unfortunate ;-)

it would, but truncating the log doesn't set the version to 0.

>Harald.


Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Patrik Lundin-2
On 2017-10-06 09:29:33, Jeffrey Hutzelman wrote:
>
> It shouldn't be needed. First, a full dump is automatically triggered
> if the master doesn't have enough log entries to bring the slave from
> its current version to the master version. so if a slave us at version
> 0, you either get a complete log replay out a full dump, either of
> which is sufficient. In practice, it should always be a full dump,
> because the log should never contain an entry for version 1.
>

This is not in line with the current state of the iprop-log manual (and
behaviour) on 7.4.0:
===
truncate
[...]
    If --reset is given, then the given, configured, or default log file will be truncated and will start at version 1.  This forces full
    propagations to all slave KDCs.

    Otherwise the log will be truncated but some entries will be preserved, as specified by the --keep-entries and/or --max-bytes options.
    The largest number of --keep-entries entries that are available and fit in the given --max-bytes option will be used.  The --keep-entries
    -option -defaults -to -100, -and -the --max-bytes option defaults to the log-max-size parameter in the configuration.
===

Based on that information it seems like expected behaviour to be able to
reset the log in order to "start over". I am not clear on if "start at
version 1" means "the log should be at version 1 directly after a
--reset truncation", or if it means "the first entry added after reset
will use version 1".

As it stands in 7.4.0, using "--reset" will make the log start at
version 0:
===
# iprop-log last-version --no-lock
version: 22
# iprop-log truncate --reset
# iprop-log last-version --no-lock
version: 0
# iprop-log dump --no-lock
nop: ver = 0, timestamp = 2017-10-06 16:17:42, len = 16
uberblock offset 40 timestamp 2017-10-06 16:17:42 version 0
===

>
> It's also not necessary and in fact potentially harmful to check the
> master's version for 0.  A real database can never be at version 0, so
> if you see that, something is wrong, and copying that database
> wholesale to another server is probably a bad idea.
>

Some assumption seems to have been broken if that is the case given the
above displayed behaviour.

> Remember that truncation does not zero the version. It throws away the
> existing log entries and writes a new log containing a single
> transaction with a newer version than the existing log. This forces a
> full update for any slave that was not already up to date.

This is no longer true, a truncate without --reset will keep a
configured amount of entries, so in the case with the newly init'ed
database it will only update the timestamps on the initial uber record
and not touch any of the entries.

> >
> >That would be unfortunate ;-)
>
> it would, but truncating the log doesn't set the version to 0.
>

See above.

Regards,
Patrik Lundin
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
In reply to this post by Patrik Lundin-2
Trungcating the log is not the same thing as resetting it, and preserves
the version number.
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
In reply to this post by Jeffrey Hutzelman
On Fri, Oct 06, 2017 at 09:29:33AM -0400, Jeffrey Hutzelman wrote:
> On October 6, 2017 5:56:22 AM EDT, Harald Barth <[hidden email]> wrote:
> It's also not necessary and in fact potentially harmful to check the
> master's version for 0.  A real database can never be at version 0, so
> if you see that, something is wrong, and copying that database
> wholesale to another server is probably a bad idea.

We do think there may be version rollover issues (it's only 32 bits...),
but that's not very realistic for Kerberos.  We feel that
lib/kadm5/log.c is solid now, but the ipropd-master.c and ipropd-slave.c
code still has dark corners and need a thorough code review.

> Remember that truncation does not zero the version. It throws away the
> existing log entries and writes a new log containing a single
> transaction with a newer version than the existing log. This forces a
> full update for any slave that was not already up to date.

Correct.

> >I think the log can be at version 0 (which means "invalid, please
> >recover") but the database should not be able to be at version 0.
>
> only an empty, never-initialized log can be at version 0.
>
> databases don't have versions; they're entirely an artifact of the log

Correct as well.

Having the HDB and the iprop layers be so disjoint has pros and cons.
The biggest pro is that iprop can work for any HDB backend that doesn't
have its own replication system.  The biggest con is that the
disjointedness can mean that if the two get out of sync, you have a
problem  (e.g., save the iprop log, turn off ipropd-master, do a bunch
of transactions, restore the iprop log, restart ipropd-master: the
slaves will miss those transactions).

Nico
--
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
In reply to this post by Patrik Lundin-2
On Fri, Oct 06, 2017 at 04:48:20PM +0200, Patrik Lundin wrote:

> On 2017-10-06 09:29:33, Jeffrey Hutzelman wrote:
> > It shouldn't be needed. First, a full dump is automatically triggered
> > if the master doesn't have enough log entries to bring the slave from
> > its current version to the master version. so if a slave us at version
> > 0, you either get a complete log replay out a full dump, either of
> > which is sufficient. In practice, it should always be a full dump,
> > because the log should never contain an entry for version 1.
>
> This is not in line with the current state of the iprop-log manual (and
> behaviour) on 7.4.0:
> ===
> truncate
> [...]
>     If --reset is given, then the given, configured, or default log file will be truncated and will start at version 1.  This forces full
>     propagations to all slave KDCs.
>
>     Otherwise the log will be truncated but some entries will be preserved, as specified by the --keep-entries and/or --max-bytes options.
>     The largest number of --keep-entries entries that are available and fit in the given --max-bytes option will be used.  The --keep-entries
>     -option -defaults -to -100, -and -the --max-bytes option defaults to the log-max-size parameter in the configuration.
> ===

So, the log now always has an uber entry, and that entry has the same
header/footer as all the others, so it needs a version number, and it
will get version 0.  But it's a no-op entry, so it has no HDB contents.
What the uber entry has is metadata about what entries in the log have
been written to the HDB.

So, yes, the log will start at version 0, but it won't have any real
content at version 0, and the first real entry will be version 1.

> Based on that information it seems like expected behaviour to be able to
> reset the log in order to "start over". I am not clear on if "start at
> version 1" means "the log should be at version 1 directly after a
> --reset truncation", or if it means "the first entry added after reset
> will use version 1".

Yes.

> As it stands in 7.4.0, using "--reset" will make the log start at
> version 0:
> ===
> # iprop-log last-version --no-lock
> version: 22
> # iprop-log truncate --reset
> # iprop-log last-version --no-lock
> version: 0
> # iprop-log dump --no-lock
> nop: ver = 0, timestamp = 2017-10-06 16:17:42, len = 16
> uberblock offset 40 timestamp 2017-10-06 16:17:42 version 0
> ===

Looks right.  Though I see that tests/kdc/check-iprop.in does not test
that iprop-log truncate --reset forces full props.  We should improve
that.

> > Remember that truncation does not zero the version. It throws away the
> > existing log entries and writes a new log containing a single
> > transaction with a newer version than the existing log. This forces a
> > full update for any slave that was not already up to date.
>
> This is no longer true, a truncate without --reset will keep a
> configured amount of entries, so in the case with the newly init'ed
> database it will only update the timestamps on the initial uber record
> and not touch any of the entries.

Correct.

Nico
--
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
We're going to revisit the when-to-send_complete() logic.  We've had one
case where a slave was stuck doing repeated send_completes.

It's also problematic that send_complete() is synchronous and
ipropd-master does not fork() and is not multi-threaded.  We've seen a
case where one send_complete caused other slaves to timeout and
reconnect and missed enough transactions that they then had to... get a
full prop, which then led to a cycle of slaves doing nothing but full
props!

The thing we're likely to do to fix the latter is fork() to
send_complete(), or else fork() per-slave (which will complicate the
writing of the stats file).

Nico
--
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Henry B Hotz
On thing that’s conspicuously missing from this discussion is any historical context for how the version numbers are *supposed* to be handled. It seems like most of these problems are recent, or at least recent-ish.

IIUC the deal is (should be? used to be? Please correct!):

1) On initial creation, the log contains a version 0 no-op, making the db version 1.

2) On connection, the slave tells the master what version it has. If it doesn’t match what the master has then the master sends updates to bring them in sync.
2a) If the master’s change log is insufficient, (or the difference is “too big), then it sends the whole DB.
2b) If the difference is small enough, then the master just replays the change log from where the slave is.

3) Seems to me that the handling of the heartbeat messages ought to mirror the initial connection logic, or else make no attempt to do anything to the DB at all. Anything else is clearly risky and unnecessarily complex. (I never worried about them because I had already implemented external processes to deal with the issue. Somebody else should write this bullet.)

A new DB (on a slave) is guaranteed to have a smaller version number than the master (if the master is actually populated), so will always get a complete download.

Truncation, preserving the version number is safe and periodically necessary.

I do not remember the --reset option, but it’s clearly dangerous. How can it be used safely, knowing only the above?

(Where is Love when you need him?)

Personal email.  [hidden email]



Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
On Sat, Oct 07, 2017 at 06:53:32PM -0700, Henry B (Hank) Hotz, CISSP wrote:
> On thing that’s conspicuously missing from this discussion is any
> historical context for how the version numbers are *supposed* to be
> handled. It seems like most of these problems are recent, or at least
> recent-ish.

The previous system had... a lot of serious issues.  We mostly rewrote
lib/kadm5/log.c, but mostly also kept the design as it was.  We added
the uberblock to help with atomicity and to help find the end of the log
quickly, and made the iprop log function as a roll-forward log for the
HDB.  We mostly did not rewrite the ipropd daemons though.

> IIUC the deal is (should be? used to be? Please correct!):
>
> 1) On initial creation, the log contains a version 0 no-op, making the
> db version 1.

Pretty much.

> 2) On connection, the slave tells the master what version it has. If
> it doesn’t match what the master has then the master sends updates to
> bring them in sync.

Yes.

> 2a) If the master’s change log is insufficient, (or the difference is
> “too big), then it sends the whole DB.

There is not and never was an "is too big" heuristic.  If the slave was
1e6 entries behind, and th emaster had those 1e6 entries, then those
would be sent.

The new system automatically truncates the log (rewrites it, actually,
preserving N entries) as necessary.  This functions as an "is too big"
heuristic.

> 2b) If the difference is small enough, then the master just replays
> the change log from where the slave is.

Yes.

> 3) Seems to me that the handling of the heartbeat messages ought to
> mirror the initial connection logic, or else make no attempt to do
> anything to the DB at all. Anything else is clearly risky and
> unnecessarily complex. (I never worried about them because I had
> already implemented external processes to deal with the issue.
> Somebody else should write this bullet.)

I don't follow.

> A new DB (on a slave) is guaranteed to have a smaller version number
> than the master (if the master is actually populated), so will always
> get a complete download.
>
> Truncation, preserving the version number is safe and periodically
> necessary.

Yes.

> I do not remember the --reset option, but it’s clearly dangerous. How
> can it be used safely, knowing only the above?

It's no different than removing the log and restarting the master.

We didn't change the iprop _protocol_, but we've considered it.  If we
did modify it, then we'd a) make the version numbers larger, b) use
{vno, timestamp} rather than just {vno} to identify state, then if you
reset the log on the master then the master would be able to
send_complete() to slaves with, say, version 2.

So far we've tried hard to support graceful upgrades.  But we've been
tempted to make more radical changes.

For example, one thing we might do (no promises) is to make the HDB
interface for the kadm5 API (if we don't just burn that API altogether,
though as much as we dislike it, it's actually valueable just because of
the existing codebase using it, such as Russ' Wallet, or Roland
krb5_admin stack) just... a SQL interface.  We'd probably keep libhdb
for the KDC, and have the iprop system write old-style HDBs for the KDC,
but not for the admin interfaces.  We might then throw away the existing
iprop system and replace it with an RDBMS replication system.
PostgreSQL comes to mind, though we could also build a suitable system
out of SQLite3.

Key to all of that would be an implementation plan that makes it easy to
do all of this, otherwise it couldn't happen.  In the Heimdal tradition,
that would probably imply some sort of compiler.  (One thing I've toyed
with is modifying asn1_compile to support generation of code to use "SQL
rules", as it were, to encode to/from an RDBMS.)

But anyways, for now, and for as long as we don't choose to make such
radical changes, graceful upgrades are supported, at least to some
degree: the iprop protocol has not been modified, the iprop log format
has not been modified (the ubeblock is a nop, which already existed).
This has limited us somewhat.  In particular we have problems to deal
with like vno rollover, and how to gracefully deal with master-side
iprop log reset (spoilers: we can't!).

We've still managed to make significant improvements to the iprop
system, and we'll be making more (mostly we'll make ipropd-master fork()
per-slave processes, do a complete review of the ipropd daemons, and fix
the bugs we're aware of so far.

Nico
--
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Harald Barth-2
In reply to this post by Harald Barth-2

Have I got this right?

1. Does write_dump(context, s, database, current_version) always write
a comptele dump no matter the value of current_version?

2. Does write_dump(context, s, database, current_version) only use
current_version to set the version of the dump?

3. I's named "send_complete", but what does it need the _two_ version
   numbers for?
   send_complete (context, s, database, current_version, oldest_version)
   What does it use "oldest_version" for?


        /*
         * If the current dump has an appropriate version, then we can
         * break out of the loop and send the file below.
         */

        if (ret == 0 && vno != 0 && st.st_mtime > initial_log_tstamp &&
            vno >= oldest_version && vno <= current_version)
            break;

   It would be nice to get a better explanation what this is supposed to
   do. I don't get it from the comment.

Harald.
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams

On Mon, Oct 9, 2017 at 3:19 AM Harald Barth <[hidden email]> wrote:

Have I got this right?

1. Does write_dump(context, s, database, current_version) always write
a comptele dump no matter the value of current_version?

2. Does write_dump(context, s, database, current_version) only use
current_version to set the version of the dump?

3. I's named "send_complete", but what does it need the _two_ version
   numbers for?
   send_complete (context, s, database, current_version, oldest_version)
   What does it use "oldest_version" for?


        /*
         * If the current dump has an appropriate version, then we can
         * break out of the loop and send the file below.
         */

        if (ret == 0 && vno != 0 && st.st_mtime > initial_log_tstamp &&
            vno >= oldest_version && vno <= current_version)
            break;

   It would be nice to get a better explanation what this is supposed to
   do. I don't get it from the comment.


I'll see about updating the commentary.  The oldest version is the oldest I the iprop log, so if the dump is older then it can't help the slave to send it that dump and we must dump again.  This is here to avoid a pathological case where we'll be sending full dumps to a bunch of slaves and we end up re-dumping the HDB once per-slave -- or where we don't re-dump and end up in a send complete loop.

Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
In reply to this post by Harald Barth-2

On Mon, Oct 9, 2017 at 3:19 AM Harald Barth <[hidden email]> wrote:

Have I got this right?

1. Does write_dump(context, s, database, current_version) always write
a comptele dump no matter the value of current_version?

Yes.  BTW, there's a bug here in that we don't remove the dump on failure or rename it into place on success.  This means that a failure in the middle of dumping may leave a valid-looking but incomplete dump that may then be sent to slaves.

2. Does write_dump(context, s, database, current_version) only use
current_version to set the version of the dump?

Yes.

3. I's named "send_complete", but what does it need the _two_ version
   numbers for?
   send_complete (context, s, database, current_version, oldest_version)
   What does it use "oldest_version" for?

Answered separately.

Nico
-- 

Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams

On Mon, Oct 9, 2017 at 3:40 PM Nico Williams <[hidden email]> wrote:
On Mon, Oct 9, 2017 at 3:19 AM Harald Barth <[hidden email]> wrote:

Have I got this right?

1. Does write_dump(context, s, database, current_version) always write
a comptele dump no matter the value of current_version?

Yes.  BTW, there's a bug here in that we don't remove the dump on failure or rename it into place on success.  This means that a failure in the middle of dumping may leave a valid-looking but incomplete dump that may then be sent to slaves.

Actually, i tell a lie.  This is only a problem on version zero.
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Patrik Lundin-2
In reply to this post by Nico Williams
On 2017-10-06 10:03:37, Nico Williams wrote:
> Trungcating the log is not the same thing as resetting it, and preserves
> the version number.

To summarize this tread: The --reset flag should currently not be used
in a production systems since ipropd-master is unable to parse the
resulting log file.

In general the concensus seems to be that version numbers never
decrement during normal operations. If you need to trigger a full dump you
should do normal truncations on the master preserving few enough entries
that slaves are unable to catch up and probably --reset the log on
slaves so they do not think they can get away with just fetching the
delta.

Since loading a database dump using "kadmin -l load <dumpfile>" does not
increment the iprop log version, this means that if you are migrating to
a new master which has no log you will need to do something like this:

#1. Stop the current master.
#2. Dump the database on the current master.
#3. Load the database on the new master.
#4. Do some random modification of the database on the new master via
    kadmin -l in order to set the log version to at least 2.
#5. Make sure the log history does not look complete since this
    means a fresh slave (at version 0) will just try to get the delta:
    iprop-log truncate --keep-entries=1
#6. Start processes on the new master.
#7. Point slaves to the new master. They will now either have a log
    version greater than the master, and hence be considered out of sync, or they
    will be at version 0, unable to reach version 2 via log entries
    since version 1 was removed from the log in step #5.

Regards,
Patrik Lundin
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
On Thu, Oct 12, 2017 at 05:38:32PM +0200, Patrik Lundin wrote:
> On 2017-10-06 10:03:37, Nico Williams wrote:
> > Trungcating the log is not the same thing as resetting it, and preserves
> > the version number.
>
> To summarize this tread: The --reset flag should currently not be used
> in a production systems since ipropd-master is unable to parse the
> resulting log file.

No, the master is perfectly able to to parse the new log.  The issue is
that it's not enough to ensure that slaves get full props.

A fix for that and other things is in the works.  For now just don't use
--reset.

> In general the concensus seems to be that version numbers never
> decrement during normal operations. If you need to trigger a full dump you
> should do normal truncations on the master preserving few enough entries
> that slaves are unable to catch up and probably --reset the log on
> slaves so they do not think they can get away with just fetching the
> delta.

Yes.

Maybe we should add a record type (probably a variant of nop) that
indicates the desire to force a full prop at that point in time.

> Since loading a database dump using "kadmin -l load <dumpfile>" does not
> increment the iprop log version, this means that if you are migrating to
> a new master which has no log you will need to do something like this:

kadmin load *does* truncate and reset the log, but it leaves it at
version 0, which is what i think you meant.

> #1. Stop the current master.
> #2. Dump the database on the current master.
> #3. Load the database on the new master.
> #4. Do some random modification of the database on the new master via
>     kadmin -l in order to set the log version to at least 2.

Just version 1 will do, but yes.

> #5. Make sure the log history does not look complete since this
>     means a fresh slave (at version 0) will just try to get the delta:
>     iprop-log truncate --keep-entries=1

You could stop the ipropd-slave processes, remove their log, and restart
them.

> #6. Start processes on the new master.
> #7. Point slaves to the new master. They will now either have a log
>     version greater than the master, and hence be considered out of sync, or they
>     will be at version 0, unable to reach version 2 via log entries
>     since version 1 was removed from the log in step #5.

Yes.

Nico
--
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Patrik Lundin-2
On 2017-10-12 11:38:57, Nico Williams wrote:
> On Thu, Oct 12, 2017 at 05:38:32PM +0200, Patrik Lundin wrote:
> >
> > To summarize this tread: The --reset flag should currently not be used
> > in a production systems since ipropd-master is unable to parse the
> > resulting log file.
>
> No, the master is perfectly able to to parse the new log.  The issue is
> that it's not enough to ensure that slaves get full props.
>

A 7.4.0 ipropd-master chokes on the log when it is truncated with
--reset as I showed in my initial message to the list:
http://www.h5l.org/pipermail/heimdal-discuss/2017-October/000277.html:
===
If the slave log has a version number greater than 0 the master will
notice it is out of sync and attempt to update it only to fail in
send_diffs() because kadm5_log_get_version_fd() returns
HEIM_ERR_EOF (and this is documented behaviour for that function when
the log is truncated and LOG_VERSION_FIRST is supplied).
===

> > Since loading a database dump using "kadmin -l load <dumpfile>" does not
> > increment the iprop log version, this means that if you are migrating to
> > a new master which has no log you will need to do something like this:
>
> kadmin load *does* truncate and reset the log, but it leaves it at
> version 0, which is what i think you meant.
>

Ah, then it should cause the ipropd-master process to fail to send any
logs afterwards based on the above mentioned failure mode.

> > #1. Stop the current master.
> > #2. Dump the database on the current master.
> > #3. Load the database on the new master.
> > #4. Do some random modification of the database on the new master via
> >     kadmin -l in order to set the log version to at least 2.
>
> Just version 1 will do, but yes.
>

No, if you leave it at version 1 the slave will "get in sync with
version 1 by means of replaying log entries" which leaves it with an
empty database:
===
root@kdc-lab-master1:~# iprop-log last-version --no-lock
version: 1
===

===
root@kdc-lab-slave1:~# ls /var/lib/heimdal-kdc/
m-key

root@kdc-lab-slave1:~# /etc/init.d/heimdal-kdc start
Starting heimdal-kdc (via systemctl): heimdal-kdc.service.
===

The slave logs:
===
ipropd-slave[28181]: slave status change: getting credentials from keytab/database
ipropd-slave[28181]: slave status change: creating log file
ipropd-slave[28181]: slave status change: ipropd-slave started
ipropd-slave[28185]: slave status change: connecting to master: kdc-lab-master1.example.com
ipropd-slave[28185]: connection successful to master: kdc-lab-master1.labkrb.it.su.se[192.168.1.100]
ipropd-slave[28185]: ipropd-slave started at version: 0
ipropd-slave[28185]: slave status change: connected to master, waiting instructions
ipropd-slave[28185]: slave status change: up-to-date with version: 1 at 2017-10-16T16:18:44
ipropd-slave[28185]: slave status change: up-to-date with version: 1 at 2017-10-16T16:18:44
===

===
root@kdc-lab-slave1:~# ls /var/lib/heimdal-kdc/
heimdal.db  ipropd-slave-status  log  m-key

root@kdc-lab-slave1:~# kadmin -l list '*'
root@kdc-lab-slave1:~#
===

Only by going to version 2 and severing the log delta between the master
and the fresh slave version (0) will a complete dump be performed.

--
Patrik Lundin
Reply | Threaded
Open this post in threaded view
|

Re: iprop: Problem forcing complete database sync

Nico Williams
On Mon, Oct 16, 2017 at 04:27:35PM +0200, Patrik Lundin wrote:

> On 2017-10-12 11:38:57, Nico Williams wrote:
> > On Thu, Oct 12, 2017 at 05:38:32PM +0200, Patrik Lundin wrote:
> > >
> > > To summarize this tread: The --reset flag should currently not be used
> > > in a production systems since ipropd-master is unable to parse the
> > > resulting log file.
> >
> > No, the master is perfectly able to to parse the new log.  The issue is
> > that it's not enough to ensure that slaves get full props.
> >
>
> A 7.4.0 ipropd-master chokes on the log when it is truncated with
> --reset as I showed in my initial message to the list:
> http://www.h5l.org/pipermail/heimdal-discuss/2017-October/000277.html:
> ===
> If the slave log has a version number greater than 0 the master will
> notice it is out of sync and attempt to update it only to fail in
> send_diffs() because kadm5_log_get_version_fd() returns
> HEIM_ERR_EOF (and this is documented behaviour for that function when
> the log is truncated and LOG_VERSION_FIRST is supplied).
> ===

Ah, OK, I'll fix this too.  Thanks.

> > > #1. Stop the current master.
> > > #2. Dump the database on the current master.
> > > #3. Load the database on the new master.
> > > #4. Do some random modification of the database on the new master via
> > >     kadmin -l in order to set the log version to at least 2.
> >
> > Just version 1 will do, but yes.
> >
>
> No, if you leave it at version 1 the slave will "get in sync with
> version 1 by means of replaying log entries" which leaves it with an
> empty database:

Oh, nice failure mode.  But version 1 should suffice, so I'll make it
so.

Thanks!

Nico
--