Questions about restoring master kdc from a backup....

Questions about restoring master kdc from a backup....

Renata Maria Dart

Hi, we believe we have a corrupt entry in our heimdal.db and
we plan to roll back to an earlier known good copy.  We think
it is corrupt because although kinit and kadmin commands
continue to work,

1. the dump command never completes and the dump file grows forever,
2. kadmin commands using a wildcard '*' never come back
3. iprop is no longer able to copy over to the slaves
4. and entries in the console logs for the master and slave indicate a problem

I was hoping someone could advise on the steps.  We think all that is
needed is:

1.  stop kdc on master
2.  replace the heimdal.db with the good version
3.  start up kdc on master

But I am wondering about the slaves and whether or not iprop will be
able to work properly.  We have manually copied over the heimdal.db
from the master to some slaves that are in a reserved part of our
network and whose users needed to see the updates that have been made
to the master heimdal.db but weren't being propagated by iprop.

1.  Should I kill iprop on the slaves before making the master kdc
switch?
2.  Do I need to remove the /var/heimdal/log file on the slaves
before starting iprop back up?  And should I remove that file on the master?
3.  Should I remove everything in the /var/heimdal directory on
the slaves (except the key)?
4.  Should I do 3 and then run hprop instead?

Thanks for any assistance,

Renata


Re: Questions about restoring master kdc from a backup....

Paul Robert Marino
If it is a corrupt database then yes, you will have to revert it on them all.
So yes: stop iprop on all of the slaves, then fix the master, then bring each slave offline to bring it back to the same state. Then bring each one back online and restart iprop.

As far as the restore goes, each transaction gets an ID number. I'm not sure where that ID number comes from, so I would restore the whole directory.

But before you do that, save yourself some pain and check whether the problem exists only on the master or on the slaves as well. If the issue is only on the master, you can simply stop one of the slaves, rsync its directory back to the master, and then start everything back up again, because a slave can be promoted to master just by stopping the iprop slave process, starting the master process, and then pointing the other slaves to it. In a previous job I used to have keepalived do this for me automatically in case the master went down; I just tied kadmin and kpasswd to the VIP and let DNS provide the three static addresses to my KDCs for the authentication ports.
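
Roughly, the promotion looks like this (hostnames are placeholders and the daemons would normally be managed by your init system, so treat this as a sketch):

    # On the slave being promoted:
    pkill ipropd-slave                    # stop replicating from the old master
    kadmind &                             # start the admin server here
    ipropd-master &                       # start serving iprop to the remaining slaves

    # On each remaining slave, point ipropd-slave at the new master:
    pkill ipropd-slave
    ipropd-slave new-master.example.com &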



Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
In reply to this post by Renata Maria Dart
Renata Maria Dart <[hidden email]> writes:

> Hi, we believe we have a corrupt entry in our heimdal.db and we plan to
> roll back to an earlier known good copy.  We think it is corrupt because
> although kinit and kadmin commands continue to work,

> 1. the dump command never completes and the dump file grows forever,
> 2. kadmin commands using a wildcard '*' never come back
> 3. iprop is no longer able to copy over to the slaves
> 4. and entries in the console logs for the master and slave indicate a
> problem

Hi Renata!

We had exactly the same problem with the stanford.edu production realm in
the past.  I'm fairly sure that it was caused by a lot of kadmin.local
changes in very close succession, which would point to some sort of
locking or race bug, but I was never able to track it down.  (Not that I
tried very hard.)

The underlying problem is that your Berkeley DB file has a loop.  (There's
a next entry pointer that points back to a previous part in the same
chain.)  This is pretty obvious if you watch kadmin list * with strace.
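
If you want to see it for yourself, something like this works as a rough sketch (the database path and exact syscall names are assumptions for a typical Linux box with the stock /var/heimdal layout):

    # Trace the reads kadmin does against heimdal.db while listing; on a
    # looping chain the same offsets come around again and again and the
    # command never finishes.
    strace -f -e trace=openat,read,pread64,lseek \
        -o /tmp/kadmin-list.trace kadmin -l list '*'
    less /tmp/kadmin-list.trace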

> I was hoping someone could advise on the steps.  We think all that is
> needed is:

> 1.  stop kdc on master
> 2.  replace the heimdal.db with the good version
> 3.  start up kdc on master

Correct, that's what we did.

> But I am wondering about the slaves and whether or not iprop will be
> able to work properly.  We have manually copied over the heimdal.db from
> the master to some slaves that are in a reserved part of our network and
> whose users needed to see the updates that have been made to the master
> heimdal.db but weren't being propagated by iprop.

> 1.  Should I kill iprop on the slaves before making the master kdc
> switch?

Yes.

> 2.  Do I need to remove the /var/heimdal/log file on the slaves before
> starting iprop back up?  And should I remove that file on the master?

So, my recommendation would be to purge the iprop log on the master with
iprop-log truncate at about the same time that you replace heimdal.db.

My recollection is that if you then just bring the slaves back up without
doing anything at all with them, iprop is smart enough to figure out that
the master is reporting a newer version but has no incremental log and
will request a full replication, and all the right things will happen.
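
Concretely, on the master that amounts to something like the following (paths are the stock /var/heimdal defaults, and the stop/start lines stand in for whatever your init system actually does, so treat this as a sketch rather than exact commands):

    # 1. stop the KDC stack on the master
    pkill ipropd-master
    pkill kadmind
    pkill kdc

    # 2. put the known-good database in place
    cp /backups/heimdal.db.good /var/heimdal/heimdal.db

    # 3. purge the incremental propagation log
    iprop-log truncate

    # 4. bring everything back up
    kdc &
    kadmind &
    ipropd-master &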

> 3.  Should I remove everything in the /var/heimdal directory on
> the slaves (except the key)?

You should not have to do this.

> 4.  Should I do 3 and then run hprop instead?

You shouldn't need to.  iprop is a strict superset of what hprop can do.
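
So on each slave the whole procedure should be roughly this (the master hostname is a placeholder):

    # Stop the replication client before switching the master, then restart
    # it afterwards; with the master's log truncated it should notice the
    # version mismatch and request a full propagation on its own.
    pkill ipropd-slave
    # (replace heimdal.db on the master and truncate its log here, as above)
    ipropd-slave kdc-master.example.com &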

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Renata Maria Dart
In reply to this post by Paul Robert Marino
Thanks very much for your input!


Renata


Re: Questions about restoring master kdc from a backup....

Renata Maria Dart
In reply to this post by Russ Allbery-2
Hi Russ, we have been spinning up many batch hosts, which create entries
in the kdc very close together, and we were wondering whether that was
the cause.  We have stopped it for now.  Once we get a functioning
kdc master/slave back, I think we will have to slow down the rate of
those changes.  Thanks for your feedback, Russ.  Hope all is well with
you in your life-after-Stanford.

Renata


Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
Renata Maria Dart <[hidden email]> writes:

> Hi Russ, we have been spinning up many batch hosts which make entries in
> the kdc very close together and we were wondering if that wasn't the
> cause.  We have stopped it for now....once we get a functioning kdc
> master/slave back I think we will have to slow down the rate of those
> changes.  Thanks for your feedback Russ.  Hope all is well with you in
> your life-after-Stanford.

Yeah, that would be what I'd be suspicious of.  I set off this bug when
setting password expiration times for large numbers of users in a tight
loop.

BTW, a note on the other feedback you received: when we had this happen,
everything up to the last iprop propagation was fine, and the slave KDCs
had a good copy of the database.  So we didn't restore from backup; we
just grabbed the database from one of the slave KDCs.

Everything's going well for me!  Very engrossing, though, and sadly not a
lot of Kerberos involved (at least yet -- I'm still working on it).

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Harald Barth-2

With the different database backends behaving so differently (everything
from hash duplicates to loops), should some advice be given on what to
use for production and what the default should be?

> Everything's going well for me!

Nice to hear. Here too, I'm actually currently on parental leave and
will be working again in fall.

Harald.

Re: Questions about restoring master kdc from a backup....

Daniel Kouril
In reply to this post by Russ Allbery-2
On Fri, Apr 08, 2016 at 06:26:26PM -0700, Russ Allbery wrote:
> We had exactly the same problem with the stanford.edu production realm in
> the past.  I'm fairly sure that it was caused by a lot of kadmin.local
> changes in very close succession, which whould point to some sort of
> locking or race bug, but I was never able to track it down.  (Not that I
> tried very hard.)

We also experienced the problem a while ago. As far as I can recollect,
we managed to fix the database using the db*_recover command.
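
Roughly, what we did looked like this (the utility may be called db_recover, db4_recover, db5_recover, etc. depending on which Berkeley DB your Heimdal is built against, and /var/heimdal is an assumed path; work on a copy first):

    pkill kadmind
    pkill kdc
    cp -a /var/heimdal /var/heimdal.pre-recover   # keep a copy of the broken state
    db_recover -v -h /var/heimdal                 # run the matching db*_recover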

Daniel

Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
In reply to this post by Harald Barth-2
Harald Barth <[hidden email]> writes:

> With the different database backends behaving so differently (all from
> hash duplicates to loops), should there be some advice be given what to
> use for production and what should be the default?

Is LDB ready for prime time, in the sense of having been banged on heavily
in production at some large site?  I'm very, very conservative about such
things, so I'd probably keep using Berkeley DB until people told me I was
completely foolish for doing so, since at least I know what its bugs look
like.

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Quanah Gibson-Mount-3
--On Sunday, April 10, 2016 8:52 AM -0700 Russ Allbery <[hidden email]>
wrote:

> Harald Barth <[hidden email]> writes:
>
>> With the different database backends behaving so differently (all from
>> hash duplicates to loops), should there be some advice be given what to
>> use for production and what should be the default?
>
> Is LDB ready for prime time, in the sense of having been banged on heavily
> in production at some large site?  I'm very, very conservative about such
> things, so I'd probably keep using Berkeley DB until people told me I was
> completely foolish for doing so, since at least I know what its bugs look
> like.

Do you mean LMDB?  I was wondering this as well.  We use LMDB based
backends for OpenLDAP and Postfix, but I haven't heard one way or the other
yet of anyone using LMDB with Heimdal (or cyrus-sasl for that matter).

--Quanah

--

Quanah Gibson-Mount
Platform Architect
Zimbra, Inc.
--------------------
Zimbra ::  the leader in open source messaging and collaboration
A division of Synacor, Inc

Re: Questions about restoring master kdc from a backup....

Paul Robert Marino
Well, the problem here is with using kadmin.local: when you do that you are directly modifying the database, and as a result you are bypassing the kadmind server, which provides more safety.
The kadmind server will only allow one action to happen on the database at a time and queues the following modifications until the previous one has completed; this provides an inherent lock. The local version bypasses that and can allow concurrent modifications which may step on each other.
That's also a big part of the reason why the slave was OK in this case. As you may have noticed, in my previous email I suggested checking the slave, because whenever I've seen this before it was caused by the kadmin.local command, and the slaves are never impacted because it breaks the iprop transaction.

kadmin.local is meant for manual emergency operations only, and should never be used by automated scripts. For automation, use the APIs; there is a great Perl module for Heimdal on CPAN that I've used for automation many times, and it has never once caused these kinds of problems. But I have seen this happen when admins who didn't want to be bothered by the kadmin command prompting them for their password used kadmin.local while another change was happening.



Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
[hidden email] writes:

> Well the problem here is with using kadmin.local‎ when when you do that
> you are directly modifying the database, and as a result are bypassing
> the kadmind server which provides more safety.

That's not supposed to be true of kadmin -l (sorry, not kadmin.local; that
was an MIT-ism in my reply).  That's what database locks are for.  It's
possible that there's a bug in the locking in Heimdal, but this is
definitely supposed to work.  See all the invocations of hdb_lock in
lib/hdb in the Heimdal source.
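
That's easy to check against the source tree:

    git clone https://github.com/heimdal/heimdal.git
    grep -rn hdb_lock heimdal/lib/hdb/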

> The kadmind server will only allow one action to happen on the database
> at a time and queues the following modifications until the previous one
> completed, this provides an inherent lock.

This is not true of Heimdal, as is immediately obvious from the fact that
kadmind can be run via inetd and multiple copies can be running at the
same time.  I'm pretty sure there's no reason why you can't encounter
exactly the same locking bug with kadmind except that it's so slow that
people rarely try to use it for major bulk operations.

> That's also a big part of the reason why the slave was OK in this
> case. If you noticed in my previous email I suggested checking the
> slave, because whenever I've seen this before it was caused by the
> kadmin.local command and the slaves are never impacted due to the fact
> that it breaks the iprop transaction.

In our case, it did not break the iprop transaction, actually.  All the
changes propagated fine, even though the master database was corrupted.

You're correct that iprop serializes all changes, and therefore avoids any
races, but the kadmin server does not do this.  Maybe it does in MIT, but
my impression was that even in MIT it forks children and does not
serialize changes.

I'm pretty sure there's some sort of rare but long-standing bug in the HDB
layer for Berkeley DB that causes a lock to not be held when it should be.

> Kadmin.local is meant for manual emergency operations only, and should
> never be used by automated scripts.

There may be a bug, but this is not the intention, nor is this the
documented behavior.

Also, using libkadm5clnt instead of libkadm5srv (since that's what we're
actually talking about here) is *painfully* slow, in both Heimdal and MIT,
which is why we use libkadm5srv for heavy automation, not because we don't
know how to use keytabs with libkadm5clnt.

> For automation use the API'S there is a great Perl module for Hiemdal on
> CPAN I've used for automation many times and never once has it caused
> these kind of problems.

Note that if you link that Perl module with libkadm5srv, it does exactly
the same thing kadmin -l does.

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Paul Robert Marino
Russ

Well, I knew kadmin.local is the MIT version (they also support a command line option in the regular command), but someone said it before (namely you), and I thought it was clearer than kadmin -l. Also, I've seen versions of RPMs that add aliases to the bash profile that make it work like MIT's in that respect.

Also, I never run kadmind from xinetd; I always run it as a standalone daemon, mostly because it works well, but also because doing it that way makes automated master failover possible (see what I was saying earlier about keepalived). Sorry I left that detail out.

Also, I never run my automation scripts on my authentication servers, so I highly doubt what you are saying about the Perl module is completely true, since there is no local database to edit, but admittedly I've never looked too deeply into it.

Sometimes automation needs speed limits :). Things go fast and race conditions pop up in the locking (locking without message serialization and/or queuing is always vulnerable to intermittent race conditions). There may be reasons for the slower performance of the libraries that you aren't aware of that add safety, either by design or by coincidence. I know it's not ideal, but sometimes you just have to deal with it. That said, I am sure there are safety measures that are not working the way they were intended, but until that's fixed, if ever, we need to deal with it.

By the way, wow, that's wonderful how the person who asked for help sent you the logs and traces off the list, so you could determine that iprop replicated the bad entries and magically fixed them.
That's amazing. I personally can only speak to the times I've seen it happen and the results of my lab stress tests, which say the slave fails the transaction and iprop gets stuck, but doesn't replicate the bad data.





Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
[hidden email] writes:

> Also I never run kadmin in xinetd I always run it as a standalone
> daemon, mostly because it works well, but also because doing it that way
> makes automated master failover possible (read what I was saying earlier
> about keepalived). Sorry I left that detail out.

Yeah, but for the discussion that we're having, I think it doesn't matter.
Take a look at wait_for_connection() in kadmin/kadm_conn.c.  kadmind forks
a separate child for each incoming connection when running as a standalone
server.  So there's nothing that prevents changes from being made at the
same time apart from database locking.

> Also I never run my automation scripts on my authentication servers so I
> highly doubt what you are saying about the ‎Perl module is completely
> true since there is no local database to edit, but admittedly I've never
> looked to deep into it.

This is how the kadmin library code works.  Both libkadm5srv and
libkadm5clnt provide the same API.  One of them works on a local database,
and the other one sends connections over the wire protocol.  In general,
you can change a program that uses the kadmin wire protocol to one that
modifies the database directly by just linking with the other library.
It's kind of a neat, weird property that's not widely used outside of that
library pair.

> ‎Sometimes automation needs speed limits :). Things are going fast and
> now a race conditions pop up in the locking (locking without message
> serialization and or queuing is always vulnerable to intermittent race
> conditions).

Oh, sure, you can make races less likely by slowing down.  But that just
means there's a bug.  The bug may be hard enough to fix that it's worth
working around it instead, but it would be even better if we could just
fix the bug.  :)

> By the way wow that's wonderful how the person who asked for help sent
> you the logs, and traces off the list, so you could determine ‎that iprop
> replicated the bad entries and magically fixed them.

> That's amazing I personally can only speak to the times I've seen it
> happen and the results of my lab stress tests, which say the slave fails
> the transaction and iprop gets stuck, but doesn't replicate the bad
> data.

To be clear, I wasn't talking about Renata's problem there.  I was talking
about the time I encountered the same bug myself while I was running
Stanford's main campus KDCs.  I spent a bit of time with strace and logs
and whatnot figuring out the details of what had happened before replacing
the database.

It wouldn't *surprise* me if this bug usually broke iprop, of course, but
I think it won't *necessarily* depending on exactly where the database
corruption gets introduced.

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Paul Robert Marino
I decline to answer further on the list because it's going in a direction that makes the community look bad (flame war). If you would like to continue this as a technical debate, please email me directly. I would be happy if we came up with answers we can agree upon and made some combined helpful recommendations to the larger community at the end of it.


Re: Questions about restoring master kdc from a backup....

Thomas M. Payerle-3
In reply to this post by Daniel Kouril
There are multiple ways in which a DB can be corrupted, but we definitely
ran into some issues in which we had databases that differed between the
master and the slaves, although each individually was good wrt Berkeley
DB tools like db_recover.

I suggest that Renata do something like
kadmin -l list -s -o principal,kvno,mod_time '*' > temp-kdc.dump
on the master and each of the slaves and compare them.  I would not
be surprised to see issues on both the master and the slave copies
of the DB. I believe it is in theory possible to migrate individual
records between such (i.e., do a dump of one DB, extract the records, and do
a merge into the other DB), but I would not generally recommend it (it is
way beyond my comfort zone).  But at least knowing what the differences
are (and knowing what you might need to fix after reverting the master to
the slave version, or pushing a full master to the slaves) is of value.

*Most* of our DB corruption issues were traced to a web-based script,
through which incoming students could set their initial password and possibly
choose a new principal name, that did not handle users hitting the submit button
repeatedly, resulting in multiple threads talking to kadmind at the same time
and modifying the same principal(s).  We eventually fixed that, greatly reducing
the incidence of database synchronization issues between the master and slaves.

But we also started doing a kadmin list as above nightly on the master and each
of the slaves to detect synchronization issues early (do it in the middle of the
night to minimize the likelihood of legitimate DB updates occurring while the dumps
are running.  The above options produce a single line per principal, with enough data
that it should catch most issues, yet without very sensitive data.)

(Without getting into the kadmin -l discussion, the above does NOT modify
the DB on any of the KDCs, so this should be a safe operation.)
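
If it helps, the nightly check is roughly the following (hostnames are placeholders and it assumes ssh access from a monitoring host to each KDC; the dump command is the one given above):

    for kdc in kdc-master kdc-slave1 kdc-slave2; do
        ssh "$kdc" "kadmin -l list -s -o principal,kvno,mod_time '*'" \
            | sort > "/tmp/kdc-dump.$kdc"
    done
    diff /tmp/kdc-dump.kdc-master /tmp/kdc-dump.kdc-slave1
    diff /tmp/kdc-dump.kdc-master /tmp/kdc-dump.kdc-slave2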


Tom Payerle
IT-ETI-EUS [hidden email]
4254 Stadium Dr (301) 405-6135
University of Maryland
College Park, MD 20742-4111

Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
"Thomas M. Payerle" <[hidden email]> writes:

> But we also started doing a kadmin list as above nightly on the master
> and each of the slaves to detect synchronization issues early (do it in
> the middle of the night to minimize likelihood of legit DB updates
> occuring while the dumps are running.  The above options provide a
> single line per principal, with enough data that is should catch most
> issues, yet without very sensitive data.)

Oh, yeah, good point -- we did that too, just to make sure it would
complete.

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Jeffrey Altman-2
In reply to this post by Russ Allbery-2
On 4/10/2016 7:56 PM, Russ Allbery wrote:
> This is not true of Heimdal, as is immediately obvious from the fact that
> kadmind can be run via inetd and multiple copies can be running at the
> same time.  I'm pretty sure there's no reason why you can't encounter
> exactly the same locking bug with kadmind except that it's so slow that
> people rarely try to use it for major bulk operations.

What would be most useful is if someone could develop a test that
reproduces and detects the corruption.   99.9% of the work of
identifying the cause of a bug and fixing it is having a test case.

Thank you.

Jeffrey Altman




Re: Questions about restoring master kdc from a backup....

Russ Allbery-2
Jeffrey Altman <[hidden email]> writes:
> On 4/10/2016 7:56 PM, Russ Allbery wrote:

>> This is not true of Heimdal, as is immediately obvious from the fact
>> that kadmind can be run via inetd and multiple copies can be running at
>> the same time.  I'm pretty sure there's no reason why you can't
>> encounter exactly the same locking bug with kadmind except that it's so
>> slow that people rarely try to use it for major bulk operations.

> What would be most useful is if someone could develop a test that
> reproduces and detects the corruption.   99.9% of the work of
> identifying the cause of a bug and fixing it is having a test case.

Yeah, absolutely.  Unfortunately, whatever triggers the race seems to be
pretty rare.  I did exactly the same operation on Stanford's KDCs five or
ten times after having induced the corruption and never got it to happen
again.

Probably the next thing to try (and sadly I don't have the time to do this
at the moment) is to spin up something that does a lot of kadmin changes
in parallel, like five or ten streams of kadmin -l modifications at the
same time, against a reasonably large database, and see if you can get it
to blow up.
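
Something along these lines, purely as a sketch against a scratch realm (the principal names, the realm, and the particular add/modify flags are placeholders; never point this at a production database):

    # Five concurrent streams of local changes against a test database.
    for stream in 1 2 3 4 5; do
        (
            for i in $(seq 1 1000); do
                # any frequently-changed attribute will do; password
                # expiration is what originally triggered it for us
                kadmin -l add --random-key --use-defaults \
                    "stress${stream}-${i}@EXAMPLE.COM"
                kadmin -l modify --pw-expiration-time=2030-01-01 \
                    "stress${stream}-${i}@EXAMPLE.COM"
            done
        ) &
    done
    wait
    # then see whether a dump still terminates
    kadmin -l list -s '*' >/dev/null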

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Questions about restoring master kdc from a backup....

Jeffrey Altman-2
On 4/11/2016 12:36 PM, Russ Allbery wrote:

> Jeffrey Altman <[hidden email]> writes:
>> On 4/10/2016 7:56 PM, Russ Allbery wrote:
>
>>> This is not true of Heimdal, as is immediately obvious from the fact
>>> that kadmind can be run via inetd and multiple copies can be running at
>>> the same time.  I'm pretty sure there's no reason why you can't
>>> encounter exactly the same locking bug with kadmind except that it's so
>>> slow that people rarely try to use it for major bulk operations.
>
>> What would be most useful is if someone could develop a test that
>> reproduces and detects the corruption.   99.9% of the work of
>> identifying the cause of a bug and fixing it is having a test case.
>
> Yeah, absolutely.  Unfortunately, whatever triggers the race seems to be
> pretty rare.  I did exactly the same operation on Stanford's KDCs five or
> ten times after having induced the corruption and never got it to happen
> again.
>
> Probably the next thing to try (and sadly I don't have the time to do this
> at the moment) is to spin up something that does a lot of kadmin changes
> in parallel, like five or ten streams of kadmin -l modifications at the
> same time, against a reasonably large database, and see if you can get it
> to blow up.
Russ,

I replied to your e-mail but didn't mean to single you out with the reply.

The optimal test would be something that could reside in the test suite
and not involve a production deployment.

Jeffrey Altman


