Proposed new krb5 FILE ccache protocol

Nico Williams
Below is a description of a new FILE ccache that is backwards
interoperable and compatible with the current one, improves read
performance and probably write performance as well, recovers from
corruption, and is concurrency-safe without using POSIX file locking.
This design is Viktor's and mine.

The new ccache would consist of two components in the filesystem:

 - a "main file" with the same format as FILE ccache, containing only the
   header, start TGT, and cc configs;

 - an ancillary directory (mkdtemp()'ed, named from the main file) with
   "hash buckets" which are basically FILE ccaches containing creds that
   hash into them.

The ccache type would still be "FILE", for backwards interop/compat,
since the new ccache type would be backwards interoperable/compatible.

All writes to any one file are to be renameat(2)-into-place writes (or
rename(2), if renameat(2) is not available).

Ancillary directories should be named krb5ccd-XXXXXX.

Files in the ancillary directory should be named as follows:

    krb5cctmp-<generation-number>-<bucket-number>

Hashing is needed because we cannot use unescaped principal names as
components of filenames (they can contain '/', and they might be too
long).  Hashing into a fixed number of buckets also keeps the amount of
copying down and speeds up lookups.  The bucket number will be between 0
and 63 (say); krb5_creds that are neither cc configs nor start TGTs will
be hashed into a bucket number.
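
As a rough illustration of the bucket naming and hashing described
above, here is a minimal C sketch.  The hash function (FNV-1a), the
helper names bucket_for() and bucket_name(), and the use of unparsed
principal strings are all assumptions made for the example; the proposal
only says which fields get hashed (listed later in this message) and
that there are about 64 buckets.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBUCKETS 64  /* the "0 and 63 (say)" range from the text */

/* FNV-1a over one field, including its terminating NUL so that
 * ("a", "bc") and ("ab", "c") do not hash identically. */
static uint32_t fnv1a(uint32_t h, const char *s)
{
    size_t i, n = strlen(s) + 1;

    for (i = 0; i < n; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Hypothetical helper: map client, server, and session key enctype
 * to a bucket number in [0, NBUCKETS). */
static unsigned bucket_for(const char *client, const char *server, int enctype)
{
    char etype[16];
    uint32_t h = 2166136261u;

    snprintf(etype, sizeof(etype), "%d", enctype);
    h = fnv1a(h, client);
    h = fnv1a(h, server);
    h = fnv1a(h, etype);
    return h % NBUCKETS;
}

/* Build "krb5cctmp-<generation>-<bucket>" into buf; -1 if it does not fit. */
static int bucket_name(char *buf, size_t len, unsigned gen, unsigned bucket)
{
    int n = snprintf(buf, len, "krb5cctmp-%u-%u", gen, bucket);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Any stable hash would do here; what matters is that every implementation
sharing a ccache computes the same bucket for the same credential.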

The prefix is necessary to prevent a security vulnerability on systems
that lack renameat(2).  (A malicious user could create a symlink with
the same name as the old ancillary directory to get the loser of two
racing kdestroys to unlink files elsewhere.  This is avoided by using
unlinkat(2) where available.  Where unlinkat(2) is not available, the
prefix makes the names to be unlinked useless to the attacker.)

The generation number is there to ensure that running kinit results in
old tickets disappearing atomically, even if the removal process is
interrupted.  The generation number is incremented on every kinit.
Racing to do this is fine as either way old tickets are -or at least
appear to be- removed.

Writes will generally be delayed until krb5_cc_close() time;
krb5_cc_{initialize, gen_new, new_unique, store_cred, destroy, ...}()
will queue up tasks in memory.  But when storing a credential that is
neither a cc config nor a start TGT, krb5_cc_store_cred() should act
immediately because some apps hold ccaches open for long periods of time
(perhaps this decision can be based on whether KRB5_TC_OPENCLOSE is set).

(krb5_cc_last_change_time() may need to stat all buckets, or writing can
utimes(2) touch the main file.)

All the logic can be in a single source file in the library; no changes
to kinit, kvno, or kdestroy should be necessary.  This should be a
drop-in replacement for src/lib/krb5/ccache/cc_file.c.

The protocols for initializing, writing to, and destroying the new
ccache type are described below.

 - Search for a credential:

   a) read the main file; if the credential is found there, return it,
      else get the ancillary directory name,
   b) hash the credential to a bucket,
   c) search that bucket.

   In all cases treat errors when reading a credential (e.g., lengths
   that are too long) as EOF.  I.e., treat corruption as EOF, not as any
   sort of fatal error.

   Enumeration (klist) is very similar: every bucket is iterated.
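
The "treat corruption as EOF" rule can be sketched as follows.  This is
only an illustration: it assumes a simplified record layout (a 4-byte
big-endian length followed by that many bytes), which is not the real
FILE ccache encoding, and MAX_CRED_LEN and scan_records() are names
invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_CRED_LEN (1u << 20)  /* assumed sanity limit, not from the proposal */

/* Return how many leading bytes of buf hold complete, plausible records
 * and store the number of such records in *count.  Everything after the
 * returned offset is treated as if it were past EOF. */
static size_t scan_records(const unsigned char *buf, size_t len, size_t *count)
{
    size_t off = 0;

    *count = 0;
    while (len - off >= 4) {
        uint32_t rlen = ((uint32_t)buf[off] << 24) |
                        ((uint32_t)buf[off + 1] << 16) |
                        ((uint32_t)buf[off + 2] << 8) |
                        (uint32_t)buf[off + 3];

        if (rlen == 0 || rlen > MAX_CRED_LEN || rlen > len - off - 4)
            break;              /* corrupt or truncated: pretend EOF */
        off += 4 + (size_t)rlen;
        (*count)++;
    }
    return off;
}
```

A reader built this way never returns a fatal error for a damaged tail;
it simply stops, which is what lets writers leave partial entries behind.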

 - Write a new start TGT or cc config:

   a) read the main file to copy all other cc configs,
   b) mk[o]stemp() a new main file,
   c) write the new contents to the main file,
   d) rename(2) the new main file into place.
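
Steps (b) through (d) are the usual write-to-temp-then-rename pattern.
A minimal sketch, assuming the temporary file can be created next to the
main file (replace_file() is a hypothetical helper, and the ".XXXXXX"
suffix is just a choice made for the example):

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Atomically replace path with len bytes from data.
 * Returns 0 on success, -1 on failure (the old file is left untouched). */
static int replace_file(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    const char *p = data;
    int fd, ok = 1;
    int n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);

    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;
    fd = mkstemp(tmp);          /* temp file in the same directory */
    if (fd < 0)
        return -1;
    while (ok && len > 0) {
        ssize_t w = write(fd, p, len);

        if (w <= 0) {
            ok = 0;
        } else {
            p += w;
            len -= (size_t)w;
        }
    }
    if (close(fd) != 0)
        ok = 0;
    if (ok && rename(tmp, path) == 0)
        return 0;               /* readers see the old or the new file */
    unlink(tmp);
    return -1;
}
```

Because the temporary file lives in the same directory as the target,
the rename never crosses filesystems; it also means the caller needs
write access to that directory.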

 - Write a new credential that is not a cc config or start TGT:

   a) read the main file to find the ancillary directory name,
   b) hash the credential to a bucket,
   c) mk[o]stemp() a new bucket in the ancillary directory,
   d) write the new cred to the new bucket,
   e) open the old bucket and copy up to N bytes (or creds) from the old
      bucket to the new if the old one exists,
   f) renameat(2) the new bucket into place.

   Note that the copy at step (e) can be a straight copy using write(2)
   of the old bucket mmap()ed in.  There's no need to iterate over
   bucket entries to write whole entries.  Partial tail entries are
   corrupt, but that's OK since we ignore tail corruption at read-time.
   This means that the copy step can be faster than iterative read and
   copy.  Indeed, this copy could even be done using aio_write(), with
   the renameat(2) done immediately after starting the write -- since
   incompletely-written entries will be treated as EOF, no harm results.

   Things to be hashed: cname, crealm, sname, srealm, session key
   enctype.

 - Initialization (first time):

   a) if the ccache exists and contains the ancillary directory name and
      start TGT, then re-initialize (see below), else continue,
   b) mkdtemp() the ancillary directory;
   c) generate a prefix for filenames in the ancillary directory;
   d) mk[o]stemp() the new main file including cc configs for various
      things *including* the name of the ancillary directory and the
      prefix for filenames in it;
   e) rename(2) the main file into place.

   Note that multiple initializations can race, and some may leave
   empty temp directories lying around.

 - Re-initialization (e.g., kinit -R):

   a) read the existing main file to get the ancillary directory name
      and current generation number,
   b) set up a new main file with the same ancillary directory, and
      increment the generation number,
   c) rename(2) the new main file into place,
   d) and clean up the ancillary directory by iterating the previous
      generation's bucket file names and using unlinkat(2) (if
      available, else unlink(2)) to remove them, ignoring errors from
      unlinkat(2).

   There's a race condition here that can leave old buckets around, but
   the tickets stored there should be non-renewable, non-start TGTs, and
   should expire eventually.

   Interruption in the middle of unlinking old buckets can also leave
   them lying around.

 - Destroy:

   a) read the ancillary directory name and bucket prefix from the main
      file,
   b) unlink(2) the main file,
   c) unlinkat(2) all the files in the ancillary directory by iterating
      over the possible bucket names (ignoring errors from unlinkat(2)),
   d) rmdir() the ancillary directory.

   Interruption can leave old buckets lying around, but the tickets
   stored there should be non-renewable, non-start TGTs, and should
   expire eventually.

BENEFITS:

 - ABSOLUTELY NO POSIX FILE LOCKING.

 - Only depends on filesystem synchronization primitives that are
   concurrency-safe and also thread-safe.

 - Speeds up ccache lookups through hashing, by putting an upper bound
   on ccache and bucket size, by greatly reducing contention, and by
   putting most-recently-acquired tickets closest to the front in each
   bucket.

 - Speeds up writes by reducing contention, both by using hashing and by
   pushing all contention to renameat(2) (which greatly reduces the
   amount of time for which locks are held [in the filesystem]).

   Writes are made slower by the need to copy buckets, but this can be
   made asynchronous, and anyways, the amount to copy can be tuned.

 - Self-cleaning.  No need to kinit just to remove old tickets, though
   kinit (and kinit -R) will still have that effect (purposefully, to
   avoid surprises).

 - Backwards compatible:

    - old FILE ccache implementations may still corrupt the main file,
      but new implementations will recover automatically as corruption
      will not affect the main file entries written by new
      implementations;

    - old FILE ccache implementations will not find cached non-start
      TGTs written by new ones, but that's OK.

ISSUES:

 - kdestroys may be interrupted, leaving buckets with what should be
   non-renewable, non-start-TGTs lying around.  This is acceptable as
   the credentials in question will expire soon enough.

   If this is not acceptable then it can be addressed simply by
   super-encrypting ticket session keys in a key stored only in the main
   file.  Or even encrypting every bucket entry with a key stored only
   in the main file.

 - Racing initial ccache initializations can result in orphaned
   ancillary directories.  See above.

   Naming ancillary directories with a recognizable prefix allows for
   periodic cleanup of orphaned ancillary directories.

 - Racing krb5_cc_set_config() with kinit / kinit -R can result in
   losing the new start TGT.  In practice this won't happen as all calls
   to krb5_cc_set_config() are in the context of initialization.

We can make the worst-case only leak storage.  Malicious users cannot
cause others to leak storage.

Nico
--
_______________________________________________
krbdev mailing list             [hidden email]
https://mailman.mit.edu/mailman/listinfo/krbdev

Re: Proposed new krb5 FILE ccache protocol

Greg Hudson
On 01/27/2014 05:50 PM, Nico Williams wrote:
> Below is a description of a new FILE ccache that is backwards
> interoperable and compatible with the current one, improves read
> performance and probably write performance as well, recovers from
> corruption, and is concurrency-safe without using POSIX file locking.

After discussing this with a few people, I don't think we would want an
implementation of this in libkrb5, for a few reasons:

* It carries a lot of complexity, some of which is user-visible.  What
used to access only the named file would now create a directory
alongside the file, which in edge cases can get left behind.

* It assumes that the process has write access to the parent directory
of the ccache path.

* It makes assumptions about what kinds of creds are precious and what
kinds are discardable.  Such assumptions can run afoul of unusual use
cases, such as application servers using S4U2Self.

However, we would be happy to have a pluggable ccache layer.  With our
plugin system, a pluggable ccache layer would naturally allow a
dynamically loaded module to replace the built-in FILE type.

For the POSIX file locks problem, I think there are simpler solutions.
For example, we could use fcntl plus flock on platforms where that
works, and fcntl plus a global mutex (accepting the cost of serializing
all ccache operations) on platforms where it doesn't.  In the long term
the best answer is file-private locks[1], if that ever catches on.

For performance scalability, I think the best answer is a daemon-based
ccache type, which is something we want anyway for better OS X
integration.  I know that doesn't meet the requirement of interoperating
with old code, but the complexity of maintaining FILE compatibility
while also achieving scalable performance is just too high for us.  A
ccache daemon has some relatively trivial options for achieving good
performance, such as keeping file and memory copies of each ccache and
using the memory ccache for reads.

Other changes I would like to have include:

* In the FILE cache implementation, marshal creds to memory and write
them out all at once with O_APPEND.  That wouldn't give absolute
guarantees, but in practice it should greatly reduce the incidence of
interleaved or partially written credentials even in the face of failed
locks and sudden process death.  This would also allow a substantial
amount of code to be shared between cc_file.c and cc_keyring.c.

* Better recovery for the FILE ccache type if we run into a partially
written cred.  I'm not sure exactly what is required.

* API changes for atomic ccache refresh (issue #7707), probably by
improving krb5_cc_move().

* API changes to supplant the TC_OPENCLOSE flag, which doesn't play well
with threads.  At a minimum, a fix for issue #7804 (unsetting
TC_OPENCLOSE causes writes to fail).
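
The first item in the list above (marshal to memory, then write once
with O_APPEND) might be sketched like this.  The record layout (a 4-byte
big-endian length plus payload) is a simplification for illustration and
not the real FILE ccache encoding, and append_record() is a hypothetical
helper.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Marshal one record into memory and append it with a single write(2).
 * Returns 0 if the whole record was written, -1 otherwise. */
static int append_record(const char *path, const void *data, uint32_t len)
{
    size_t total = (size_t)len + 4;
    unsigned char *buf = malloc(total);
    ssize_t w;
    int fd;

    if (buf == NULL)
        return -1;
    buf[0] = (unsigned char)(len >> 24);
    buf[1] = (unsigned char)(len >> 16);
    buf[2] = (unsigned char)(len >> 8);
    buf[3] = (unsigned char)len;
    memcpy(buf + 4, data, len);

    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    /* With O_APPEND the offset is set to end-of-file as part of the write. */
    w = write(fd, buf, total);
    free(buf);
    if (close(fd) != 0)
        return -1;
    return (w >= 0 && (size_t)w == total) ? 0 : -1;
}
```

As the message says, this gives no absolute guarantee (a short write is
still possible), but each record normally lands as one contiguous unit.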

[1]
http://jtlayton.wordpress.com/2014/01/07/file-private-posix-locks-aka-un-posix-locks/

Re: Proposed new krb5 FILE ccache protocol

Nico Williams
On Tue, Jan 28, 2014 at 08:21:59PM -0500, Greg Hudson wrote:
> After discussing this with a few people, I don't think we would want an
> implementation of this in libkrb5, for a few reasons:

Thanks for your thoughtful reply.  Please bear with me below.

> * It carries a lot of complexity, some of which is user-visible.  What
> used to access only the named file would now create a directory
> alongside the file, which in edge cases can get left behind.

I think the only user-visible aspect is the edge cases where directories
get left behind, which shouldn't happen often and are of little concern
(they'd be in $TMPDIR, where one expects to find garbage).

> * It assumes that the process has write access to the parent directory
> of the ccache path.

The ancillary directory can be in $TMPDIR (we can assume at least /tmp),
and the main file can be written by truncation as a fallback (with all
the problems that that entails).

The ccache could be created in a directory just for that purpose (e.g.,
in the krb5_cc_new_unique() case), thus making it easier to identify and
automatically clean up any garbage leaked in these edge cases.

The ancillary directory could also be written as a fork, where that's
supported, thus making it even easier to clean up.

(It doesn't really matter where the ancillary directory is (especially
if the session keys stored in it are encrypted, and even more so if all
creds stored in it are encrypted), though it probably shouldn't be
written on, say, an NFS filesystem that requires Kerberos credentials to
access it, but that's true of ccache files in general, and especially
when the filesystem is accessed in the clear over the network.)

> * It makes assumptions about what kinds of creds are precious and what
> kinds are discardable.  Such assumptions can run afoul of unusual use
> cases, such as application servers using S4U2Self.

In the GSS cases S4U* tickets always go into new ccaches.

I see that kvno(1) has -U and -P options for S4U.  Is this actually used
in real life, or is this just for diagnostic purposes?  What
applications will search a ccache for S4U credentials specifically, as
opposed to using whatever credentials they find in the ccache?  If there
are any such applications, how do they keep their precious S4U tickets
alive?

> However, we would be happy to have a pluggable ccache layer.  With our
> plugin system, a pluggable ccache layer would naturally allow a
> dynamically loaded module to replace the built-in FILE type.

That's fair and very much appreciated.

> For the POSIX file locks problem, I think there are simpler solutions.
> For example, we could use fcntl plus flock on platforms where that
> works, and fcntl plus a global mutex (accepting the cost of serializing
> all ccache operations) on platforms where it doesn't.  In the long term
> the best answer is file-private locks[1], if that ever catches on.

That covers 99% of the cases, and all of the ones I care about today.
It doesn't have the other benefits of the new design, which I do want.

> For performance scalability, I think the best answer is a daemon-based
> ccache type, which is something we want anyway for better OS X
> integration.  I know that doesn't meet the requirement of interoperating
> with old code, but the complexity of maintaining FILE compatibility
> while also achieving scalable performance is just too high for us.  A
> ccache daemon has some relatively trivial options for achieving good
> performance, such as keeping file and memory copies of each ccache and
> using the memory ccache for reads.

Yes, this is highly desirable.  A decent non-MEMORY ccache is still
needed for the kcm case (daemon restarts should not wipe out
credentials, unless that's desired and the daemon is configured to use
the MEMORY ccache type).

On IRC tonight we noticed at least one additional problem with the
current FILE implementation: it's supposed to be thread-safe to share a
single ccache with multiple threads, but it's not because of the
KRB5_TC_OPENCLOSE flag reset/restore business.  This can be fixed of
course.  Assuming a kcm is not threaded or that it has a ccache handle
per-thread then this wouldn't have to be fixed for the kcm case, but it
should still be fixed for the general case.

> Other changes I would like to have include:
>
> * In the FILE cache implementation, marshal creds to memory and write
> them out all at once with O_APPEND.  That wouldn't give absolute
> guarantees, but in practice it should greatly reduce the incidence of
> interleaved or partially written credentials even in the face of failed
> locks and sudden process death.  This would also allow a substantial
> amount of code to be shared between cc_file.c and cc_keyring.c.

In the existing one, yes, this is desirable.

> * Better recovery for the FILE ccache type if we run into a partially
> written cred.  I'm not sure exactly what is required.

Pretend you've hit EOF.  Also truncate at the corrupted entry's start
offset, so as to clear up the corruption.
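
A minimal sketch of that recovery, again assuming a simplified layout of
4-byte big-endian length plus payload rather than the real FILE ccache
encoding.  truncate_at_corruption() is a hypothetical helper, and a real
implementation would have to hold the ccache lock while doing this.

```c
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Walk the records in fd and cut the file at the start of the first
 * entry that runs past EOF.  Returns the resulting file size, or -1 on
 * an I/O error. */
static off_t truncate_at_corruption(int fd)
{
    struct stat st;
    unsigned char hdr[4];
    off_t off = 0;

    if (fstat(fd, &st) != 0)
        return -1;
    while (st.st_size - off >= 4) {
        uint32_t rlen;

        if (pread(fd, hdr, 4, off) != 4)
            return -1;
        rlen = ((uint32_t)hdr[0] << 24) | ((uint32_t)hdr[1] << 16) |
               ((uint32_t)hdr[2] << 8) | (uint32_t)hdr[3];
        if (rlen == 0 || (uint64_t)rlen > (uint64_t)(st.st_size - off - 4))
            break;              /* partial or bogus entry: stop here */
        off += 4 + (off_t)rlen;
    }
    /* off is now the start offset of the first bad entry (or EOF). */
    if (off < st.st_size && ftruncate(fd, off) != 0)
        return -1;
    return off;
}
```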

> * API changes for atomic ccache refresh (issue #7707), probably by
> improving krb5_cc_move().

Yes, write a MEMORY or new_unique ccache and krb5_cc_move() it into
place -- that's the only reasonable way to get atomicity out of this API
for writing.  You'll have to move part of the move operation into the
ccache plugin interface.

If you have rename(2) into place in mind... you'll need write permission
to the directory holding the ccache (see above).  Otherwise you must
continue to rely on file locking to make these things atomic.

> * API changes to supplant the TC_OPENCLOSE flag, which doesn't play well
> with threads.  At a minimum, a fix for issue #7804 (unsetting
> TC_OPENCLOSE causes writes to fail).

A transactional API for storing credentials would be useful.  But
krb5_cc_move() and krb5_cc_copy_creds() are probably sufficient (if they
become transactional under the hood).

As for KRB5_TC_OPENCLOSE... I'd just make it a no-op and deprecate it.
Instead each cursor should get its own file descriptor for the ccache
for thread-safety, and the file should be re-opened at the start of any
search for creds -or when storing creds- if the file has changed
(st_dev, st_ino).  After all, that's actually the effect of
KRB5_TC_OPENCLOSE, and therefore its apparent purpose: notice ccache
re-initialization *except* when in the middle of iterating all creds
(since otherwise the offset of each cred becomes invalid if the file
changes, leading to apparent corruption).  (Also, re-opening the ccache
in the current design means dropping the locks that allow those offsets
to remain valid.)

The whole KRB5_TC_OPENCLOSE business is an implementation detail that
didn't need to be exposed.

Nico

Re: Proposed new krb5 FILE ccache protocol

Russ Allbery-2
Nico Williams <[hidden email]> writes:

> The ancillary directory can be in $TMPDIR (we can assume at least /tmp),
> and the main file can be written by truncation as a fallback (with all
> the problems that that entails).

I'm dubious that the Kerberos libraries can safely assume that $TMPDIR or
/tmp are available.  Do they currently assume that somewhere?  (I'm
thinking of chroot cases, SELinux and other MAC use cases, jails,
namespace restrictions on Linux, etc.)

--
Russ Allbery ([hidden email])              <http://www.eyrie.org/~eagle/>

Re: Proposed new krb5 FILE ccache protocol

Nico Williams
On Tue, Jan 28, 2014 at 10:41 PM, Russ Allbery <[hidden email]> wrote:

> Nico Williams <[hidden email]> writes:
>
>> The ancillary directory can be in $TMPDIR (we can assume at least /tmp),
>> and the main file can be written by truncation as a fallback (with all
>> the problems that that entails).
>
> I'm dubious that the Kerberos libraries can safely assume that $TMPDIR or
> /tmp are available.  Do they currently assume that somewhere?  (I'm
> thinking of chroot cases, SELinux and other MAC use cases, jails,
> namespace restrictions on Linux, etc.)

Good question.  It's not about what POSIX says either.  All bets are
off with chroot (since it's up to whoever sets up the space to chroot
into).  As for jails and such, I very much expect them to have a /tmp
if they are to be general-purpose.

Nico

Re: Proposed new krb5 FILE ccache protocol

Nico Williams
On Tue, Jan 28, 2014 at 11:01 PM, Nico Williams <[hidden email]> wrote:

> On Tue, Jan 28, 2014 at 10:41 PM, Russ Allbery <[hidden email]> wrote:
>> Nico Williams <[hidden email]> writes:
>>
>>> The ancillary directory can be in $TMPDIR (we can assume at least /tmp),
>>> and the main file can be written by truncation as a fallback (with all
>>> the problems that that entails).
>>
>> I'm dubious that the Kerberos libraries can safely assume that $TMPDIR or
>> /tmp are available.  Do they currently assume that somewhere?  (I'm
>> thinking of chroot cases, SELinux and other MAC use cases, jails,
>> namespace restrictions on Linux, etc.)

And to answer your question, krb5_cc_new_unique/gen_new() for the FILE
ccache refer to /tmp by default, and a variety of functions,
including krb5_cc_default_name(), support expansion of a TEMP token to
whatever $TMPDIR is.

That said, there are surprisingly few uses of mkstemp() and friends in
the library.

Nico