Fwd: Gss context refresh failure due to clock skew

Fwd: Gss context refresh failure due to clock skew

Adamson, Andy
Bringing in the NFSv4 and Kerberos list.

—>Andy

> Begin forwarded message:
>
> From: Andy Adamson <[hidden email]>
> Subject: Gss context refresh failure
> Date: October 1, 2015 at 4:39:06 PM EDT
> To: Bruce Fields <[hidden email]>
> Cc: Simo Sorce <[hidden email]>
>
> Hi Bruce, Simo
>
> We are seeing a failure on GSS context renewal between the Linux client and NetApp, so I tested Linux client to Linux server and see a similar situation.
>
> The situation occurs as follows.
>
> 1) For simplicity, set the client and server to the same Kerberos clock skew; I leave the default of 5 minutes.
> 2) For convenience, I set the TGS lifetimes to be as short as possible, 10 minutes for Win2008R2 AD which I test with.
> 3) Client clock is set to the KDC (WinAD) clock via NTP.
> 4) Set the server clock to be 2-3 minutes ahead of the client clock, but still within the Kerberos clock skew.
> 5) Mount NFS with -o sec=krb5 - this succeeds because clocks are within the Kerberos clock skew
> 6) On the client klist -ce the machine creds to note the server TGS expiry time
> 7) On the server, wait until the client’s server TGS expiry time has just passed. Note that at this time, according to the client clock, the TGS still has 2-3 minutes of validity, but on the server the TGS has expired.
> 8) On the client, make a new directory: mkdir /mntpoint/blah
> The client sends an NFS ACCESS (for the mkdir) call which gets back an AUTH_ERROR, GSS credential problem.
> This pokes the client to do an upcall to refresh the GSS context.
> 9) The client sends a NULL call RPCSEC_GSS_INIT, using the TGS in the credential cache (no call to KDC for new TGS) because the client clock says the TGS is still valid.
>
> 10) ONTAP: The server replies to the NULL call with a GSS minor status of GSS_S_CREDENTIALS_EXPIRED.
>
> 10) LINUX: The server does not reply to the NULL RPCSEC_GSS_INIT call; in fact, the Linux server sends a FIN to the client gssd connection.
>
> 11) The mkdir request gets permission denied
>
> 12) Wait until the client clock is past the server TGS expiry time
> 13) re-try the mkdir - it succeeds after a successful GSS INIT NULL call exchange for both servers.
>
> In the ONTAP case, the accept_sec_context call into the MIT libraries fails (even though krb5_check_clockskew() is apparently called). I believe the gss_proxy on Linux also just calls into the MIT libraries.
>
> Shouldn’t these refresh calls succeed? Isn’t the Kerberos clock skew supposed to handle this situation?
>
> —>Andy
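
The timing in steps 6 to 13 can be modelled in a few lines. This is a minimal sketch, not code from either server: the function name `ticket_views`, the 150-second offset, and the strict comparison on both sides are assumptions chosen to match the description above. Times are seconds since the ticket was issued, measured on the client clock.

```python
def ticket_views(client_now, server_ahead, ticket_end):
    """Return (client_valid, server_valid) for a cached service ticket.

    client_now   -- current time on the client clock, in seconds
    server_ahead -- how far the server clock runs ahead of the client
    ticket_end   -- ticket end time as stamped by the KDC
    Assumes both sides compare the end time with no clock skew allowance.
    """
    server_now = client_now + server_ahead
    return client_now < ticket_end, server_now < ticket_end


TICKET_END = 600     # 10-minute service ticket
SERVER_AHEAD = 150   # hypothetical: server is 2.5 minutes ahead

# Mid-lifetime: both sides accept the ticket (step 5, the mount succeeds).
before_window = ticket_views(300, SERVER_AHEAD, TICKET_END)

# One minute before expiry on the client clock: the client keeps using the
# cached ticket, but the server already sees it as expired (steps 8 to 11).
in_window = ticket_views(540, SERVER_AHEAD, TICKET_END)

# Just past expiry on the client clock: the client no longer trusts the
# cached ticket, so it asks the KDC for a new one (steps 12 and 13).
after_window = ticket_views(601, SERVER_AHEAD, TICKET_END)
```

Under these assumptions the failure window is exactly as long as the offset between the two clocks, and it ends only when the client itself decides the ticket has expired.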


_______________________________________________
krbdev mailing list             [hidden email]
https://mailman.mit.edu/mailman/listinfo/krbdev

Re: Fwd: Gss context refresh failure due to clock skew

Greg Hudson
Sorry for the delay; Andy's mail got stuck in the krbdev moderation
queue by mistake.

On 10/01/2015 05:30 PM, Adamson, Andy wrote:
> The situation occurs as follows.

I am a little bit confused by this description because of terminology
issues.  In your description, you appear to use the phrase "TGS" to
refer to service tickets (i.e. tickets whose service principal is
nfs/server.name), but I can't be sure.  The actual meaning of "TGS" is
"ticket-granting service," i.e. the KDC service whose principal name is
krbtgt/REALM.

> 2) For convenience, I set the TGS lifetimes to be as short as possible, 10 minutes for Win2008R2 AD which I test with.

Are you setting the maximum lifetime for nfs/server.name tickets to 10
minutes, but still allowing ticket-granting tickets to have a lifetime
of multiple hours?

>> 12) Wait until the client clock is past the server TGS expiry time
>> 13) re-try the mkdir - it succeeds after a successful GSS INIT NULL call exchange for both servers.

If I understand correctly, this request succeeds because
krb5_get_credentials() ignores the expired cached service ticket and
makes a TGS request for a new service ticket.  The cache now contains:

* A ticket for krbtgt/REALM with hours remaining
* A ticket for nfs/server.name which expired recently
* Another ticket for nfs/server.name which expires in ten minutes

Is that correct?
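
The client-side behaviour described above can be sketched as a toy model. This is not the MIT krb5 implementation: `Ticket`, `get_service_ticket`, and `request_from_kdc` are hypothetical names, and the model captures only the rule that an expired cached ticket is skipped and a newly requested one is added to the cache.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    service: str
    endtime: int  # seconds, on the client clock


def get_service_ticket(cache, service, now, request_from_kdc):
    """Return a usable ticket for `service`, consulting the cache first."""
    for tkt in cache:
        if tkt.service == service and tkt.endtime > now:
            return tkt  # still valid by the client clock: no KDC traffic
    # Every cached ticket for this service has expired; `request_from_kdc`
    # stands in for the TGS request made with the TGT.
    new_tkt = request_from_kdc(service, now)
    cache.append(new_tkt)  # the expired entry stays in the cache
    return new_tkt


cache = [Ticket("krbtgt/REALM", 3600), Ticket("nfs/server.name", 600)]
issue = lambda service, now: Ticket(service, now + 600)  # 10-minute tickets

early = get_service_ticket(cache, "nfs/server.name", 590, issue)
late = get_service_ticket(cache, "nfs/server.name", 610, issue)
```

In this model `early` is the original cached ticket, while `late` is a fresh one, leaving three entries in the cache as in the list above.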

> Shouldn’t these refresh calls succeed? Isn’t the Kerberos clock skew supposed to handle this situation?

I think this case doesn't arise often because people don't often set
maximum service ticket lifetimes to be shorter than maximum TGT
lifetimes.  If the TGT itself has expired or is about to expire, some
out-of-band agent needs to refresh the TGT somehow, and it doesn't
matter all that much whether the failure comes from the client or the
server.

That said, your scenario should work, and it doesn't.  The primary cause
is an explicit check added to the krb5 mech's gss_accept_sec_context()
implementation in 1996 (before the MIT krb5 1.0 release), which checks
the ticket endtime with no allowance for clock skew.  I don't know
precisely why the check was added, but my guess is that it is for the
computation of the context validity lifetime; it would make no sense to
tell the application "the authentication succeeded and the resulting
context is valid for the next -3 minutes."

Perhaps a better choice would be to remove this check, and instead add
the clock skew to the validity lifetime of GSS krb5 acceptor contexts.
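
To make the difference concrete, here is a small sketch of the two acceptance rules. It is only an illustration of the idea, not the code in the krb5 mechanism; the function names and the 300-second constant (the default skew mentioned earlier in the thread) are assumptions.

```python
CLOCK_SKEW = 300  # seconds; the default 5-minute allowance


def accept_strict(ticket_end, server_now):
    """Check as described above: end time compared with no skew allowance."""
    return ticket_end >= server_now


def accept_with_skew(ticket_end, server_now, skew=CLOCK_SKEW):
    """Alternative: tolerate a ticket that expired less than `skew` ago."""
    return ticket_end + skew >= server_now


TICKET_END = 600

# Server clock is 2 minutes past the ticket end time.
strict_result = accept_strict(TICKET_END, 720)
skew_result = accept_with_skew(TICKET_END, 720)

# Server clock is more than 5 minutes past the end time: rejected either way.
too_late = accept_with_skew(TICKET_END, 901)
```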

Re: Gss context refresh failure due to clock skew

Adamson, Andy

> On Oct 5, 2015, at 3:10 PM, Greg Hudson <[hidden email]> wrote:
>
> Sorry for the delay; Andy's mail got stuck in the krbdev moderation
> queue by mistake.
>
> On 10/01/2015 05:30 PM, Adamson, Andy wrote:
>> The situation occurs as follows.
>
> I am a little bit confused by this description because of terminology
> issues.  In your description, you appear to use the phrase "TGS" to
> refer to service tickets (i.e. tickets whose service principal is
> nfs/server.name), but I can't be sure.  The actual meaning of "TGS" is
> "ticket-granting service," i.e. the KDC service whose principal name is
> krbtgt/REALM.

Hi Greg

Pardon my terminology gaffe. I mean a ticket for nfs/server.name.

>
>> 2) For convenience, I set the TGS lifetimes to be as short as possible, 10 minutes for Win2008R2 AD which I test with.
>
> Are you setting the maximum lifetime for nfs/server.name tickets to 10
> minutes, but still allowing ticket-granting tickets to have a lifetime
> of multiple hours?

[root@rhel6-7ga sles-kernel]# klist -ce /tmp/krb5cc_machine_ANDROSAD.FAKE
Ticket cache: FILE:/tmp/krb5cc_machine_ANDROSAD.FAKE
Default principal: nfs/[hidden email]

Valid starting     Expires            Service principal
09/30/15 11:57:02  09/30/15 12:57:02  krbtgt/[hidden email]
        renew until 10/07/15 11:57:02, Etype (skey, tkt): aes256-cts-hmac-sha1-96, aes256-cts-hmac-sha1-96
09/30/15 11:57:02  09/30/15 12:07:02  nfs/[hidden email]
        renew until 10/07/15 11:57:02, Etype (skey, tkt): arcfour-hmac, arcfour-hmac


>
>>> 12) Wait until the client clock is past the server TGS expiry time
>>> 13) re-try the mkdir - it succeeds after a successful GSS INIT NULL call exchange for both servers.
>
> If I understand correctly, this request succeeds because
> krb5_get_credentials() ignores the expired cached service ticket and
> makes a TGS request for a new service ticket.  The cache now contains:
>
> * A ticket for krbtgt/REALM with hours remaining
> * A ticket for nfs/server.name which expired recently
> * Another ticket for nfs/server.name which expires in ten minutes
>
> Is that correct?

Yes, and the new service ticket produces an RPCSEC_GSS_INIT token that has an expiry that passes the server's clock test.

>
>> Shouldn’t these refresh calls succeed? Isn’t the Kerberos clock skew supposed to handle this situation?
>
> I think this case doesn't arise often because people don't often set
> maximum service ticket lifetimes to be shorter than maximum TGT
> lifetimes.  

Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.

We see this issue all the time in NetApp QA, as we run multiple-day heavy IO tests against a Kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.


> If the TGT itself has expired or is about to expire, some
> out-of-band agent needs to refresh the TGT somehow, and it doesn't
> matter all that much whether the failure comes from the client or the
> server.

I thought that having a keytab entry and a renewable TGT was enough.

>
> That said, your scenario should work, and it doesn't.  The primary cause
> is an explicit check added to the krb5 mech's gss_accept_sec_context()
> implementation in 1996 (before the MIT krb5 1.0 release), which checks
> the ticket endtime with no allowance for clock skew.  I don't know
> precisely why the check was added, but my guess it is for the
> computation of the context validity lifetime; it would make no sense to
> tell the application "the authentication succeeded and the resulting
> context is valid for the next -3 minutes.”

That also makes no sense; simply use the Kerberos clock skew in the message. For example, if the clock skew is 5 minutes and, according to the server clock, the ticket has been expired for 2 minutes, then the message becomes "the authentication succeeded and the resulting context is valid for the next 3 minutes," as there are 3 minutes left by the server clock once the configured Kerberos clock skew is allowed for.
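
As a worked version of that arithmetic, a minimal sketch (the function name is hypothetical, and clamping at zero is an assumption so that a lifetime is never reported as negative):

```python
def acceptor_context_lifetime(ticket_end, server_now, clock_skew):
    """Seconds of context lifetime left if the skew is added to the end time."""
    return max(0, ticket_end + clock_skew - server_now)


# Ticket ended at t=600 s; the server clock reads t=720 s (expired 2 minutes
# ago); the configured clock skew is 5 minutes.
remaining = acceptor_context_lifetime(ticket_end=600, server_now=720, clock_skew=300)

# Past the skew allowance there is nothing left to report.
exhausted = acceptor_context_lifetime(ticket_end=600, server_now=1000, clock_skew=300)
```

Here `remaining` is 180 seconds, the 3 minutes in the example.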


>
> Perhaps a better choice would be to remove this check, and instead add
> the clock skew to the validity lifetime of GSS krb5 acceptor contexts.

Yes. That is my opinion.

—>Andy



Re: Gss context refresh failure due to clock skew

Greg Hudson
On 10/05/2015 03:35 PM, Adamson, Andy wrote:
>> I think this case doesn't arise often because people don't often set
>> maximum service ticket lifetimes to be shorter than maximum TGT
>> lifetimes.  
>
> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
>
> We see this issue all the time in NetApp QA as we run mutiple day heavy IO tests against a kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.

If the issue is not caused by short-lifetime service principals, then
the test scenario you described isn't representative of the real
scenario.  To reproduce the problem as it manifests in your IO tests,
you will need to adjust the TGT lifetime down to ten minutes as well as
the nfs/server lifetime.

>> If the TGT itself has expired or is about to expire, some
>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
>> matter all that much whether the failure comes from the client or the
>> server.
>
> I thought that having a keytab entry and a renewable TGT was enough.

I'm not sure why you would do both of these; if you're getting initial
creds with a keytab, there is no need to muck around with ticket renewal.

Anyway, gss_init_sec_context() never renews tickets, and only gets
tickets from a keytab when a client keytab is configured (new in 1.11).
 When tickets are obtained using a client keytab, they are refreshed
from the keytab when they are halfway to expiring, so this clock skew
issue should not arise, so I don't think that feature is being used.
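
The "halfway to expiring" rule can be sketched as follows. This is a simplified model of the behaviour described above, with a hypothetical function name; the exact test inside the library may differ in detail.

```python
def needs_keytab_refresh(start, end, now):
    """True once the ticket is at least halfway through its lifetime."""
    return now >= start + (end - start) / 2


START, END = 0, 600   # a 10-minute ticket, in seconds on the client clock
SERVER_AHEAD = 180    # hypothetical: server clock is 3 minutes ahead

not_yet = needs_keytab_refresh(START, END, 299)
due = needs_keytab_refresh(START, END, 300)

# At the moment the refresh becomes due, the server clock reads 480 s, so the
# old ticket has not yet expired from the server's point of view.
server_time_at_refresh = 300 + SERVER_AHEAD
still_valid_on_server = server_time_at_refresh < END
```

With this rule the old ticket is replaced well before the server could consider it expired, provided the clock offset is smaller than half the ticket lifetime.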

It is possible that the NFS client code has its own separate logic for
obtaining new tickets using a keytab.  If so, we need to understand how
it works.  It's possible (though unlikely) that changing the behavior of
gss_accept_sec_context() wouldn't be sufficient by itself.

Re: Gss context refresh failure due to clock skew

Adamson, Andy

> On Oct 5, 2015, at 4:02 PM, Greg Hudson <[hidden email]> wrote:
>
> On 10/05/2015 03:35 PM, Adamson, Andy wrote:
>>> I think this case doesn't arise often because people don't often set
>>> maximum service ticket lifetimes to be shorter than maximum TGT
>>> lifetimes.  
>>
>> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
>>
>> We see this issue all the time in NetApp QA as we run mutiple day heavy IO tests against a kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.
>
> If the issue is not caused by short-lifetime service principals,

I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.

I didn’t know that keeping service ticket lifetimes from being shorter than TGT lifetimes was a requirement. Neither does NetApp QA and, I suspect, neither do customers in general.

> then
> the test scenario you described isn't representative of the real
> scenario.  To reproduce the problem as it manifests in your IO tests,
> you will need to adjust the TGT lifetime down to ten minutes as well as
> the nfs/server lifetime.

Code was added to rpc.gssd, the NFS client agent that creates GSS contexts for NFS, to take into account the clock skew and get a new TGT before (now+clock skew). So if the service ticket lifetime is equal to or greater than the TGT lifetime, then all is well.
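
A rough sketch of that rule, as I read it. This is not the rpc.gssd source; the function name and the exact comparison are assumptions.

```python
def tgt_needs_refresh(tgt_end, now, clock_skew):
    """True if the TGT would expire before (now + clock skew)."""
    return tgt_end < now + clock_skew


CLOCK_SKEW = 300  # seconds

# TGT ends at t=3600 s. At t=3200 s it still outlives now + skew, so it is kept.
keep = tgt_needs_refresh(3600, 3200, CLOCK_SKEW)

# At t=3301 s it would expire inside the skew window, so a new TGT is fetched.
refresh = tgt_needs_refresh(3600, 3301, CLOCK_SKEW)
```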

>
>>> If the TGT itself has expired or is about to expire, some
>>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
>>> matter all that much whether the failure comes from the client or the
>>> server.
>>
>> I thought that having a keytab entry and a renewable TGT was enough.
>
> I'm not sure why you would do both of these; if you're getting initial
> creds with a keytab, there is no need to muck around with ticket renewal.

I wouldn’t, but QA and customers do.

>
> Anyway, gss_init_sec_context() never renews tickets, and only gets
> tickets from a keytab when a client keytab is configured (new in 1.11).
> When tickets are obtained using a client keytab, they are refreshed
> from the keytab when they are halfway to expiring,

refreshed by…?

> so this clock skew
> issue should not arise, so I don't think that feature is being used.
>
> It is possible that the NFS client code has its own separate logic for
> obtaining new tickets using a keytab.  

When an NFS request requires a GSS context, the client kernel does an upcall to rpc.gssd if the context does not exist, is not valid, or is valid but the server replies to an RPC request using that context with an RPC error indicating that its side of the GSS context has a problem. rpc.gssd then decides whether a new service ticket is required in order to send an RPCSEC_GSS_INIT message to the server and create a new GSS context. The resultant GSS context is stored in the client kernel with a lifetime equal to that of the service ticket used to create it.

If rpc.gssd calls the code that refreshes the tickets from the keytab when they are halfway to expiring, then that should mitigate the clock skew issue.
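
The upcall decision described above can be summarised in a short sketch. It is a simplified model, not kernel or rpc.gssd code; `CtxState`, `needs_upcall`, and `new_context_expiry` are hypothetical names.

```python
from enum import Enum


class CtxState(Enum):
    MISSING = "missing"
    INVALID = "invalid"
    VALID = "valid"


def needs_upcall(ctx_state, server_reported_ctx_problem):
    """Upcall when there is no usable context, or the server rejected it."""
    return ctx_state is not CtxState.VALID or server_reported_ctx_problem


def new_context_expiry(service_ticket_end):
    """The new context lives exactly as long as the ticket that created it."""
    return service_ticket_end


no_context = needs_upcall(CtxState.MISSING, False)
healthy = needs_upcall(CtxState.VALID, False)
rejected_by_server = needs_upcall(CtxState.VALID, True)
```

The third case is the one in this thread: the client believes its context is fine, and only the server's error reply triggers the upcall.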


> If so, we need to understand how
> it works.  It's possible (though unlikely) that changing the behavior of
> gss_accept_sec_context() wouldn't be sufficient by itself.



Re: Gss context refresh failure due to clock skew

Benjamin Kaduk-2
On Mon, 5 Oct 2015, Adamson, Andy wrote:

>
> > On Oct 5, 2015, at 4:02 PM, Greg Hudson <[hidden email]> wrote:
> >
> > On 10/05/2015 03:35 PM, Adamson, Andy wrote:
> >>> I think this case doesn't arise often because people don't often set
> >>> maximum service ticket lifetimes to be shorter than maximum TGT
> >>> lifetimes.
> >>
> >> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
> >>
> >> We see this issue all the time in NetApp QA as we run mutiple day heavy IO tests against a kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.
> >
> > If the issue is not caused by short-lifetime service principals,
>
> I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.
>
> I didn’t know setting the service ticket lifetimes to not be less than
> TGT lifetimes was a requirement. Neither does NetApp QA and I suspect,
> neither do customers in general.
It's not a requirement.  (Greg explicitly said "That said, your scenario
should work, and it doesn't." in his first message.)

> > then
> > the test scenario you described isn't representative of the real
> > scenario.  To reproduce the problem as it manifests in your IO tests,
> > you will need to adjust the TGT lifetime down to ten minutes as well as
> > the nfs/server lifetime.
>
> Code was added to rpc.gssd, the NFS client agent that creates GSS
> contexts for NFS, to take into account the clock skew and get a new TGT
> before (now+clock skew). So if the service ticket lifetime is equal to
> or greater than the TGT lifetime, then all is well.
>
> >
> >>> If the TGT itself has expired or is about to expire, some
> >>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
> >>> matter all that much whether the failure comes from the client or the
> >>> server.
> >>
> >> I thought that having a keytab entry and a renewable TGT was enough.
> >
> > I'm not sure why you would do both of these; if you're getting initial
> > creds with a keytab, there is no need to muck around with ticket renewal.
>
> I wouldn’t, but QA and customers do.
>
> >
> > Anyway, gss_init_sec_context() never renews tickets, and only gets
> > tickets from a keytab when a client keytab is configured (new in 1.11).
> > When tickets are obtained using a client keytab, they are refreshed
> > from the keytab when they are halfway to expiring,
>
> refreshed by…?
The GSS library itself.
http://k5wiki.kerberos.org/wiki/Projects/Keytab_initiation and
http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html#default-client-keytab
give a little bit of intro, though this feature could benefit from better
documentation.

-Ben

> > so this clock skew
> > issue should not arise, so I don't think that feature is being used.
> >
> > It is possible that the NFS client code has its own separate logic for
> > obtaining new tickets using a keytab.
>
> When an NFS request requires a GSS context, if the context does not
> exist, is not valid, or if it is valid but the server replies to an RPC
> request using a GSS context with an RPC error that indicates it’s side
> of the GSS context has a problem, the client kernel does an upcall to
> rpc.gssd which then decides if a new service ticket is required to send
> an RPCSEC_GSS_INIT message to the server to create a new GSS context.
> The resultant GSS context is stored in the client kernel with a lifetime
> equal to the service ticket used to create it.
>
> If rpc.gssd calls the code that refreshes the tickets from the keytab
> when they are half way to expiring’ then that should mitigate the clock
> skew issue.
>
>
> > If so, we need to understand how
> > it works.  It's possible (though unlikely) that changing the behavior of
> > gss_accept_sec_context() wouldn't be sufficient by itself.
>
>

Re: Gss context refresh failure due to clock skew

Adamson, Andy

> On Oct 5, 2015, at 8:02 PM, Benjamin Kaduk <[hidden email]> wrote:
>
> On Mon, 5 Oct 2015, Adamson, Andy wrote:
>
>>
>>> On Oct 5, 2015, at 4:02 PM, Greg Hudson <[hidden email]> wrote:
>>>
>>> On 10/05/2015 03:35 PM, Adamson, Andy wrote:
>>>>> I think this case doesn't arise often because people don't often set
>>>>> maximum service ticket lifetimes to be shorter than maximum TGT
>>>>> lifetimes.
>>>>
>>>> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
>>>>
>>>> We see this issue all the time in NetApp QA as we run mutiple day heavy IO tests against a kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.
>>>
>>> If the issue is not caused by short-lifetime service principals,
>>
>> I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.
>>
>> I didn’t know setting the service ticket lifetimes to not be less than
>> TGT lifetimes was a requirement. Neither does NetApp QA and I suspect,
>> neither do customers in general.
>
> It's not a requirement.  (Greg explicitly said "That said, your scenario
> should work, and it doesn't." in his first message.)

Hi Ben

OK.  This does mean that until this gets addressed, we will need to point this out to administrators.

>
>>> then
>>> the test scenario you described isn't representative of the real
>>> scenario.  To reproduce the problem as it manifests in your IO tests,
>>> you will need to adjust the TGT lifetime down to ten minutes as well as
>>> the nfs/server lifetime.
>>
>> Code was added to rpc.gssd, the NFS client agent that creates GSS
>> contexts for NFS, to take into account the clock skew and get a new TGT
>> before (now+clock skew). So if the service ticket lifetime is equal to
>> or greater than the TGT lifetime, then all is well.
>>
>>>
>>>>> If the TGT itself has expired or is about to expire, some
>>>>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
>>>>> matter all that much whether the failure comes from the client or the
>>>>> server.
>>>>
>>>> I thought that having a keytab entry and a renewable TGT was enough.
>>>
>>> I'm not sure why you would do both of these; if you're getting initial
>>> creds with a keytab, there is no need to muck around with ticket renewal.
>>
>> I wouldn’t, but QA and customers do.
>>
>>>
>>> Anyway, gss_init_sec_context() never renews tickets, and only gets
>>> tickets from a keytab when a client keytab is configured (new in 1.11).
>>> When tickets are obtained using a client keytab, they are refreshed
>>> from the keytab when they are halfway to expiring,
>>
>> refreshed by…?
>
> The GSS library itself.
> http://k5wiki.kerberos.org/wiki/Projects/Keytab_initiation and
> http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html#default-client-keytab
> give a little bit of intro, though this feature could benefit from better
> documentation.

Thanks for the info

—>Andy

>
> -Ben
>
>>> so this clock skew
>>> issue should not arise, so I don't think that feature is being used.
>>>
>>> It is possible that the NFS client code has its own separate logic for
>>> obtaining new tickets using a keytab.
>>
>> When an NFS request requires a GSS context, if the context does not
>> exist, is not valid, or if it is valid but the server replies to an RPC
>> request using a GSS context with an RPC error that indicates it’s side
>> of the GSS context has a problem, the client kernel does an upcall to
>> rpc.gssd which then decides if a new service ticket is required to send
>> an RPCSEC_GSS_INIT message to the server to create a new GSS context.
>> The resultant GSS context is stored in the client kernel with a lifetime
>> equal to the service ticket used to create it.
>>
>> If rpc.gssd calls the code that refreshes the tickets from the keytab
>> when they are half way to expiring’ then that should mitigate the clock
>> skew issue.
>>
>>
>>> If so, we need to understand how
>>> it works.  It's possible (though unlikely) that changing the behavior of
>>> gss_accept_sec_context() wouldn't be sufficient by itself.
>>
>>

Re: Gss context refresh failure due to clock skew

Adamson, Andy

> On Oct 6, 2015, at 10:53 AM, Andy Adamson <[hidden email]> wrote:
>
>>
>> On Oct 5, 2015, at 8:02 PM, Benjamin Kaduk <[hidden email]> wrote:
>>
>> On Mon, 5 Oct 2015, Adamson, Andy wrote:
>>
>>>
>>>> On Oct 5, 2015, at 4:02 PM, Greg Hudson <[hidden email]> wrote:
>>>>
>>>> On 10/05/2015 03:35 PM, Adamson, Andy wrote:
>>>>>> I think this case doesn't arise often because people don't often set
>>>>>> maximum service ticket lifetimes to be shorter than maximum TGT
>>>>>> lifetimes.
>>>>>
>>>>> Not the cause of the issue. The service ticket lifetime of 10 minutes is just there for testing this issue as I needed to wait until the service ticket had ‘expired’ on the server - but not yet on the client.
>>>>>
>>>>> We see this issue all the time in NetApp QA as we run mutiple day heavy IO tests against a kerberos mount. If the server clock is ahead of the client clock, permission denied errors stop the test as the first service ticket “expires” on the server but not on the client.
>>>>
>>>> If the issue is not caused by short-lifetime service principals,
>>>
>>> I was wrong - you are right, it is caused by service ticket lifetimes being shorter than TGT lifetimes.


Actually, setting the service ticket lifetime to be equal to (or greater than, if this is possible) the TGT lifetime will not help. Just as in the example I sent, the application will get permission denied during the time difference between the client and server clock.

Say the server clock is ahead of the client clock by X seconds and the service ticket on the client has X seconds of lifetime left before it expires on the client, which means it has already expired on the server. The client sends an NFS request, which will be rejected by the server with a bad context RPC level AUTH_ERROR, as the GSS context on the server will have expired (context lifetime == service ticket lifetime). The server reply prompts the client to do an upcall to gssd to refresh the GSS context. gssd sends an RPCSEC_GSS_INIT message using the service ticket (as it still has X seconds of valid lifetime), and the server will reject the RPCSEC_GSS_INIT request with GSS_S_CREDENTIALS_EXPIRED.

—>Andy



>>>
>>> I didn’t know setting the service ticket lifetimes to not be less than
>>> TGT lifetimes was a requirement. Neither does NetApp QA and I suspect,
>>> neither do customers in general.
>>
>> It's not a requirement.  (Greg explicitly said "That said, your scenario
>> should work, and it doesn't." in his first message.)
>
> Hi Ben
>
> OK.  This does mean that until this gets addressed, we will need to point this out to administrators.
>
>>
>>>> then
>>>> the test scenario you described isn't representative of the real
>>>> scenario.  To reproduce the problem as it manifests in your IO tests,
>>>> you will need to adjust the TGT lifetime down to ten minutes as well as
>>>> the nfs/server lifetime.
>>>
>>> Code was added to rpc.gssd, the NFS client agent that creates GSS
>>> contexts for NFS, to take into account the clock skew and get a new TGT
>>> before (now+clock skew). So if the service ticket lifetime is equal to
>>> or greater than the TGT lifetime, then all is well.
>>>
>>>>
>>>>>> If the TGT itself has expired or is about to expire, some
>>>>>> out-of-band agent needs to refresh the TGT somehow, and it doesn't
>>>>>> matter all that much whether the failure comes from the client or the
>>>>>> server.
>>>>>
>>>>> I thought that having a keytab entry and a renewable TGT was enough.
>>>>
>>>> I'm not sure why you would do both of these; if you're getting initial
>>>> creds with a keytab, there is no need to muck around with ticket renewal.
>>>
>>> I wouldn’t, but QA and customers do.
>>>
>>>>
>>>> Anyway, gss_init_sec_context() never renews tickets, and only gets
>>>> tickets from a keytab when a client keytab is configured (new in 1.11).
>>>> When tickets are obtained using a client keytab, they are refreshed
>>>> from the keytab when they are halfway to expiring,
>>>
>>> refreshed by…?
>>
>> The GSS library itself.
>> http://k5wiki.kerberos.org/wiki/Projects/Keytab_initiation and
>> http://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html#default-client-keytab
>> give a little bit of intro, though this feature could benefit from better
>> documentation.
>
> Thanks for the info
>
> —>Andy
>>
>> -Ben
>>
>>>> so this clock skew
>>>> issue should not arise, so I don't think that feature is being used.
>>>>
>>>> It is possible that the NFS client code has its own separate logic for
>>>> obtaining new tickets using a keytab.
>>>
>>> When an NFS request requires a GSS context, if the context does not
>>> exist, is not valid, or if it is valid but the server replies to an RPC
>>> request using a GSS context with an RPC error that indicates it’s side
>>> of the GSS context has a problem, the client kernel does an upcall to
>>> rpc.gssd which then decides if a new service ticket is required to send
>>> an RPCSEC_GSS_INIT message to the server to create a new GSS context.
>>> The resultant GSS context is stored in the client kernel with a lifetime
>>> equal to the service ticket used to create it.
>>>
>>> If rpc.gssd calls the code that refreshes the tickets from the keytab
>>> when they are half way to expiring’ then that should mitigate the clock
>>> skew issue.
>>>
>>>
>>>> If so, we need to understand how
>>>> it works.  It's possible (though unlikely) that changing the behavior of
>>>> gss_accept_sec_context() wouldn't be sufficient by itself.
>>>
>>>

Re: Gss context refresh failure due to clock skew

Greg Hudson
On 10/07/2015 09:22 AM, Adamson, Andy wrote:
> Actually, setting the service ticket lifetime to be equal to (or greater than if this is possible) the TGT lifetime will not help. Just as in the example I sent, the application will get permission denied during the time difference between the client and server clock.

That is expected.  What is not expected, in this variant, is that
gss_init_sec_context() will succeed by itself once the client believes
the TGT and service ticket to have expired.  Apologies for any
miscommunication on this point.

There may be something in the calling code which refreshes the TGT in
this situation.  If so, then to fully understand the scenario, we need
to know how the calling code decides when to refresh the TGT.

I opened a ticket about this issue here:

    http://krbdev.mit.edu/rt/Ticket/Display.html?id=8268

Re: Gss context refresh failure due to clock skew

Adamson, Andy

> On Oct 7, 2015, at 10:45 AM, Greg Hudson <[hidden email]> wrote:
>
> On 10/07/2015 09:22 AM, Adamson, Andy wrote:
>> Actually, setting the service ticket lifetime to be equal to (or greater than if this is possible) the TGT lifetime will not help. Just as in the example I sent, the application will get permission denied during the time difference between the client and server clock.
>
> That is expected.  What is not expected, in this variant, is that
> gss_init_sec_context() will succeed by itself once the client believes
> the TGT and service ticket to have expired.  Apologies for any
> miscommunication on this point.
>
> There may be something in the calling code which refreshes the TGT in
> this situation.  If so, then to fully understand the scenario, we need
> to know how the calling code decides when to refresh the TGT.
>
> I opened a ticket about this issue here:
>
>    http://krbdev.mit.edu/rt/Ticket/Display.html?id=8268

—— from the ticket: ——

This unnecessarily strict check causes a particularly bad experience
when (a) the client's clock is slightly ahead of the server's clock,
and (b) the maximum service ticket lifetime is lower than the maximum
TGT lifetime.

—— ——
I think both a) and b) are incorrect.

For a), you got it backwards: this occurs when the server clock is ahead of the client clock.

For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, because the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.


Here is how I would describe it:

Change this:

This unnecessarily strict check causes a particularly bad experience
when (a) the client's clock is slightly ahead of the server's clock,
and (b) the maximum service ticket lifetime is lower than the maximum
TGT lifetime.  In that circumstance, the client will acquire a new
service ticket using the TGT if the client sees the credential as
expired, but the application will experience an authentication
failure if only the server sees the credential as expired.


to this:

This unnecessarily strict check causes a particularly bad experience
when the server's clock is slightly ahead of the client's clock.
In that circumstance, the client will use the almost-expired service
ticket in constructing the gss_init_sec_context call and will not acquire
a new service ticket using the TGT. The application will experience
an authentication failure because the server calling gss_accept_sec_context
sees the credential as expired.
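
To make the timing concrete, the scenario can be sketched in a few lines of Python. This is only an illustration of the two expiry checks, not the krb5 code: the function names, the `allowed_skew` parameter, and the numbers are assumptions, and all times are in seconds.

```python
def client_sees_valid(endtime, client_now):
    """The client reuses the cached service ticket while its own clock is before endtime."""
    return client_now < endtime


def server_accepts(endtime, server_now, allowed_skew=0):
    """Acceptor-side expiry check; allowed_skew=0 models the strict check."""
    return server_now < endtime + allowed_skew


endtime = 600          # service ticket end time (10-minute ticket)
server_offset = 150    # assumed: server clock runs 2.5 minutes ahead of the client
client_now = 500       # client clock: the ticket still has 100 s left

server_now = client_now + server_offset

reuse = client_sees_valid(endtime, client_now)                    # client keeps the cached ticket
strict = server_accepts(endtime, server_now)                      # strict acceptor rejects it
tolerant = server_accepts(endtime, server_now, allowed_skew=300)  # 5-minute allowance accepts it
print(reuse, strict, tolerant)
```

With these numbers the client reuses the ticket, the strict check rejects it, and a check that allows the default 5-minute skew accepts it.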





Re: Gss context refresh failure due to clock skew

Greg Hudson
On 10/07/2015 11:00 AM, Adamson, Andy wrote:

> —— from the ticket: ——
>
> This unnecessarily strict check causes a particularly bad experience
> when (a) the client's clock is slightly ahead of the server's clock,
> and (b) the maximum service ticket lifetime is lower than the maximum
> TGT lifetime.
>
> —— ——
> I think both a) and b) are incorrect.
>
> For a), you got it backwards: this occurs when the server clock is ahead of the client clock.

Yes, I did write the wrong thing there; I will follow up on that.

> For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, because the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.

I will try one more time to communicate what I mean:

* If the service ticket end time is less than the TGT end time, then
gss_init_sec_context() fails during the clock skew window, and starts
succeeding again afterwards.

* If the service ticket and TGT have both expired (according to the
server), then gss_init_sec_context() fails, and keeps failing
afterwards, unless there is some out-of-band agent refreshing expired TGTs.

Put another way: we expect authentications to start failing around the
time the TGT expires.  We do not expect authentications to start failing
around the time a service ticket expires, if the TGT is still valid.
That is what I refer to as a "particularly" bad experience.

If that isn't clear, perhaps we should ignore this as a moot point; it
doesn't really affect how we plan to change the krb5 code.
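
As a rough illustration of the two cases above, the window during which a strict acceptor rejects the cached ticket can be computed directly. This is a sketch under assumptions: `failure_window` and its fields are hypothetical names, times are seconds on the client clock, and the acceptor is assumed to apply no skew allowance.

```python
from typing import NamedTuple, Optional


class Window(NamedTuple):
    start: int      # client time at which the server starts seeing the ticket as expired
    end: int        # client time at which the client itself sees the ticket as expired
    recovers: bool  # True if the TGT outlives the service ticket, so a new ticket can be fetched


def failure_window(service_end: int, tgt_end: int, server_offset: int) -> Optional[Window]:
    """Return the failure window, or None when the server clock is not ahead."""
    if server_offset <= 0:
        return None
    return Window(
        start=service_end - server_offset,
        end=service_end,
        recovers=tgt_end > service_end,
    )


short_ticket = failure_window(service_end=600, tgt_end=36000, server_offset=150)
both_expire = failure_window(service_end=600, tgt_end=600, server_offset=150)
no_skew = failure_window(service_end=600, tgt_end=36000, server_offset=0)
print(short_ticket, both_expire, no_skew)
```

With a 10-minute service ticket and the server 150 seconds ahead, failures start at 450 seconds and last until the client itself sees the ticket expire at 600. Only the first case recovers on its own afterwards.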

Re: Gss context refresh failure due to clock skew

Olga Kornievskaia
On Wed, Oct 7, 2015 at 11:08 AM, Greg Hudson <[hidden email]> wrote:

> On 10/07/2015 11:00 AM, Adamson, Andy wrote:
>> —— from the ticket: ——
>>
>> This unnecessarily strict check causes a particularly bad experience
>> when (a) the client's clock is slightly ahead of the server's clock,
>> and (b) the maximum service ticket lifetime is lower than the maximum
>> TGT lifetime.
>>
>> —— ——
>> I think both a) and b) are incorrect.
>>
>> For a), you got it backwards: this occurs when the server clock is ahead of the client clock.
>
> Yes, I did write the wrong thing there; I will follow up on that.
>
>> For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, because the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.
>
> I will try one more time to communicate what I mean:
>
> * If the service ticket end time is less than the TGT end time, then
> gss_init_sec_context() fails during the clock skew window, and starts
> succeeding again afterwards.
>
> * If the service ticket and TGT have both expired (according to the
> server), then gss_init_sec_context() fails, and keeps failing
> afterwards, unless there is some out-of-band agent refreshing expired TGTs.
>
> Put another way: we expect authentications to start failing around the
> time the TGT expires.  We do not expect authentications to start failing
> around the time a service ticket expires, if the TGT is still valid.

Why not? This is not what should happen according to the theory of the
Kerberos protocol. Let's use slightly generic terms: a TGT is a
credential that proves the client's identity to the KDC. The TGT or its
lifetime has no relevance in the context of authentication between the
client and a kerberized service, in this case an NFS server. A
service ticket is a credential that is used to prove the client's identity
to the NFS server. An NFS service ticket should be accepted as valid
within some configurable clock skew of its end time.

> That is what I refer to as a "particularly" bad experience.
>
> If that isn't clear, perhaps we should ignore this as a moot point; it
> doesn't really affect how we plan to change the krb5 code.

Re: Gss context refresh failure due to clock skew

Simo Sorce-3
On 07/10/15 11:46, Olga Kornievskaia wrote:

> On Wed, Oct 7, 2015 at 11:08 AM, Greg Hudson <[hidden email]> wrote:
>> On 10/07/2015 11:00 AM, Adamson, Andy wrote:
>>> —— from the ticket: ——
>>>
>>> This unnecessarily strict check causes a particularly bad experience
>>> when (a) the client's clock is slightly ahead of the server's clock,
>>> and (b) the maximum service ticket lifetime is lower than the maximum
>>> TGT lifetime.
>>>
>>> —— ——
>>> I think both a) and b) are incorrect.
>>>
>>> For a), you got it backwards: this occurs when the server clock is ahead of the client clock.
>>
>> Yes, I did write the wrong thing there; I will follow up on that.
>>
>>> For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, because the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.
>>
>> I will try one more time to communicate what I mean:
>>
>> * If the service ticket end time is less than the TGT end time, then
>> gss_init_sec_context() fails during the clock skew window, and starts
>> succeeding again afterwards.
>>
>> * If the service ticket and TGT have both expired (according to the
>> server), then gss_init_sec_context() fails, and keeps failing
>> afterwards, unless there is some out-of-band agent refreshing expired TGTs.
>>
>> Put another way: we expect authentications to start failing around the
>> time the TGT expires.  We do not expect authentications to start failing
>> around the time a service ticket expires, if the TGT is still valid.
>
> Why not?

Because technically the client can acquire a new ticket at any time if
the TGT is valid, but in this case it does not acquire a new
ticket, and the authentication fails.

> This is not what should happen according to the theory of
> Kerberos protocol. Let's use slightly generic terms, TGT is a
> credential that proves client's identity to the KDC. TGT or its
> lifetime has no relevance in the context of authentication between the
> client and a kerberized service, in this case an NFS server. Then a
> service ticket is a credential that is used to prove client's identity
> to the NFS server. The lifetime of the NFS service ticket should be
> allowed to be valid within some configurable clock skew.

Yes, but this is not what Greg was referring to :)

Simo.

>> That is what I refer to as a "particularly" bad experience.
>>
>> If that isn't clear, perhaps we should ignore this as a moot point; it
>> doesn't really affect how we plan to change the krb5 code.


--
Simo Sorce * Red Hat, Inc * New York

Re: Gss context refresh failure due to clock skew

Olga Kornievskaia
On Wed, Oct 7, 2015 at 9:48 PM, Simo Sorce <[hidden email]> wrote:

> On 07/10/15 11:46, Olga Kornievskaia wrote:
>> On Wed, Oct 7, 2015 at 11:08 AM, Greg Hudson <[hidden email]> wrote:
>>> On 10/07/2015 11:00 AM, Adamson, Andy wrote:
>>>> —— from the ticket: ——
>>>>
>>>> This unnecessarily strict check causes a particularly bad experience
>>>> when (a) the client's clock is slightly ahead of the server's clock,
>>>> and (b) the maximum service ticket lifetime is lower than the maximum
>>>> TGT lifetime.
>>>>
>>>> —— ——
>>>> I think both a) and b) are incorrect.
>>>>
>>>> For a), you got it backwards: this occurs when the server clock is ahead of the client clock.
>>>
>>> Yes, I did write the wrong thing there; I will follow up on that.
>>>
>>>> For b), the relationship between the TGT lifetime and the service ticket lifetime is irrelevant. Only the service ticket lifetime has any effect, because the client will use a valid service ticket to construct an RPCSEC_GSS_INIT request regardless of the TGT lifetime value.
>>>
>>> I will try one more time to communicate what I mean:
>>>
>>> * If the service ticket end time is less than the TGT end time, then
>>> gss_init_sec_context() fails during the clock skew window, and starts
>>> succeeding again afterwards.
>>>
>>> * If the service ticket and TGT have both expired (according to the
>>> server), then gss_init_sec_context() fails, and keeps failing
>>> afterwards, unless there is some out-of-band agent refreshing expired TGTs.
>>>
>>> Put another way: we expect authentications to start failing around the
>>> time the TGT expires.  We do not expect authentications to start failing
>>> around the time a service ticket expires, if the TGT is still valid.
>>
>> Why not?
>
> Because technically the client can acquire a new ticket at any time if
> the TGT is valid, but in this case instead it fails to acquire a new
> ticket and fails the authentication.
>

The client should not acquire a new service ticket when, according to
its own clock, it holds a non-expired one. The problem is that
the server considers the ticket expired because it allows no slack for
skewed clocks, and that is incorrect.

It's not clear that this issue is agreed upon. Whether or not a new
service ticket is acquired later by the client is not in question.

If the server implements a reasonable clock skew policy, the
client-side code gets the chance to detect that the service ticket has
expired and renew it. That functionality already works properly on the
client side.

Alternatively, the client-side code can be changed to handle a
CREDENTIALS_EXPIRED error by acquiring a new service ticket at that
point, which the code does not currently do.
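
The second alternative could look roughly like the following Python sketch. It is not the rpc.gssd or krb5 code: `CredentialsExpired`, `establish_context`, `make_strict_server`, and the callables passed in are hypothetical stand-ins, and a ticket is reduced to its end time in seconds.

```python
class CredentialsExpired(Exception):
    """Stand-in for the acceptor reporting a CREDENTIALS_EXPIRED error."""


def establish_context(cached_endtime, init_context, fetch_ticket):
    """Try the cached service ticket first; on expiry, fetch a new one and retry once."""
    try:
        return init_context(cached_endtime)
    except CredentialsExpired:
        fresh_endtime = fetch_ticket()  # assumed to use the still-valid TGT
        return init_context(fresh_endtime)


def make_strict_server(server_now):
    """Build a fake acceptor that allows no clock skew."""
    def init_context(ticket_endtime):
        if server_now >= ticket_endtime:
            raise CredentialsExpired(ticket_endtime)
        # The context lifetime follows the ticket used to create it.
        return {"context_expires": ticket_endtime}
    return init_context


# The server clock reads 650, so the cached ticket (end time 600) has expired there.
init = make_strict_server(server_now=650)
ctx = establish_context(600, init, fetch_ticket=lambda: 1200)
print(ctx)
```

The retry is deliberately limited to one attempt: if the fresh ticket is also rejected, the error propagates instead of looping.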

>> This is not what should happen according to the theory of
>> Kerberos protocol. Let's use slightly generic terms, TGT is a
>> credential that proves client's identity to the KDC. TGT or its
>> lifetime has no relevance in the context of authentication between the
>> client and a kerberized service, in this case an NFS server. Then a
>> service ticket is a credential that is used to prove client's identity
>> to the NFS server. The lifetime of the NFS service ticket should be
>> allowed to be valid within some configurable clock skew.
>
> Yes, but this is not what Greg was referring to :)
>
> Simo.
>
>>> That is what I refer to as a "particularly" bad experience.
>>>
>>> If that isn't clear, perhaps we should ignore this as a moot point; it
>>> doesn't really affect how we plan to change the krb5 code.
>
>
> --
> Simo Sorce * Red Hat, Inc * New York