ipropd slow when pushing complete database

ipropd slow when pushing complete database

Adam Lewenberg
I sometimes push the entire master database to a replica. I do this
using ipropd. The process is to stop ipropd-slave on the slave, delete
the database and log files, and then restart the ipropd-slave service.
The entire database is then pushed to the slave.

When pushing the database via ipropd to a traditional VM the process
takes a very long time, upwards of 5 to 6 *minutes*. I also get these
speeds when pushing to a stretch container running as a slave. In the
past, when using Heimdal 7.1 and pushing to a jessie container, the
speed was quite fast, around 10 seconds.

For comparison, I did an scp of a copy of the database file from the
master to the slave, and that only took about 6 seconds.

I realize that doing an ssh copy is not at all the same thing as doing
an ipropd copy from version 0, but something must be wrong if the ipropd
copy is 50 to 60 times slower.

I am running Heimdal version 7.5 on Debian stretch machines. The
database file is around 430 MB and the log file is around 23MB.
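
For reference, the figures above work out as follows. This is only a back-of-the-envelope sketch using the numbers quoted in this post; the "rate" for ipropd is an effective figure derived from the file size, since ipropd sends records rather than the raw file.

```python
# Figures quoted in the post above.
db_mb = 430                   # database file size in MB
scp_s = 6                     # scp copy time in seconds
iprop_s = (5 * 60, 6 * 60)    # ipropd full push: 5 to 6 minutes, in seconds

# How many times slower the ipropd push is than the scp copy.
slowdown = tuple(t / scp_s for t in iprop_s)

# Effective throughput in MB/s for each method.
scp_rate = db_mb / scp_s
iprop_rate = tuple(db_mb / t for t in iprop_s)

print(f"slowdown: {slowdown[0]:.0f}x to {slowdown[1]:.0f}x")
# -> slowdown: 50x to 60x
print(f"scp: {scp_rate:.1f} MB/s")
# -> scp: 71.7 MB/s
print(f"ipropd: {iprop_rate[1]:.2f} to {iprop_rate[0]:.2f} MB/s")
# -> ipropd: 1.19 to 1.43 MB/s
```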

Anyone have any troubleshooting steps to find out why ipropd is so slow?

Thanks, Adam Lewenberg







Re: ipropd slow when pushing complete database

Jeffrey Altman-2
On 12/19/2018 1:21 PM, [hidden email] wrote:

> I sometimes push the entire master database to a replica. I do this
> using ipropd. The process is to stop ipropd-slave on the slave, delete
> the database and log files, and then restart the ipropd-slave service.
> The entire database is then pushed to the slave.
>
> When pushing the database via ipropd to a traditional VM the process
> takes a very long time, upwards of 5 to 6 *minutes*. I also get these
> speeds when pushing to a stretch container running as a slave. In the
> past, when using Heimdal 7.1 and pushing to a jessie container, the
> speed was quite fast, around 10 seconds.
>
> For comparison, I did an scp of a copy of the database file from the
> master to the slave, and that only took about 6 seconds.
>
> I realize that doing an ssh copy is not at all the same thing as doing
> an ipropd copy from version 0, but something must be wrong if the ipropd
> copy is 50 to 60 times slower.
>
> I am running Heimdal version 7.5 on Debian stretch machines. The
> database file is around 430 MB and the log file is around 23MB.
>
> Anyone have any troubleshooting steps to find out why ipropd is so slow?
>
> Thanks, Adam Lewenberg
Looking at the changes from 7.1 to 7.5 there isn't anything obvious in
ipropd that could alter the performance.

  git log -p heimdal-7.1.0..heimdal-7.5.0 lib/kadm5/

I would start by comparing Heimdal 7.5 on jessie with 7.1 on jessie, or
7.1 on stretch with 7.5 on stretch.

I'm sure that there have been many changes in the network plumbing
between jessie and stretch with regards to containers.  Start by
reducing the variables.

Jeffrey Altman



Re: ipropd slow when pushing complete database

Adam Lewenberg
In reply to this post by Adam Lewenberg
Maybe 5 minutes is to be expected. What are others seeing as the time it
takes a slave to pull down a complete database using iprop?




Re: ipropd slow when pushing complete database

Nico Williams
In reply to this post by Adam Lewenberg
Hi, you need the commits from master listed below (which will be in
8.0), which disable sync writes while doing bulk HDB loads.  It's the
per-record fsync()s that kill performance.

If you copied the HDB with scp and loaded it with kadmin -l, you'd find
it's still slow -- the network has nothing to do with it :/
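
To illustrate why a per-record fsync() is so costly, here is a minimal Python sketch. It is not Heimdal code: `ToyStore`, its `set_sync()` method and the `load()` helper are hypothetical names, only loosely modelled on the idea of switching sync off for a bulk load and syncing once at the end. The record count and record size are arbitrary assumptions.

```python
import os
import tempfile
import time


class ToyStore:
    """Hypothetical append-only record store; NOT Heimdal's HDB API."""

    def __init__(self, path):
        self.f = open(path, "wb")
        self.sync = True  # fsync after every record by default

    def set_sync(self, on):
        # Toggle per-record syncing. When syncing is turned back on,
        # flush once so everything written so far reaches the disk.
        self.sync = on
        if on:
            self.flush()

    def flush(self):
        self.f.flush()
        os.fsync(self.f.fileno())

    def put(self, record):
        self.f.write(record)
        if self.sync:
            self.flush()

    def close(self):
        self.flush()
        self.f.close()


def load(path, records, bulk):
    """Write all records; return elapsed seconds."""
    store = ToyStore(path)
    if bulk:
        store.set_sync(False)      # no fsync per record
    start = time.perf_counter()
    for r in records:
        store.put(r)
    if bulk:
        store.set_sync(True)       # one fsync for the whole load
    elapsed = time.perf_counter() - start
    store.close()
    return elapsed


if __name__ == "__main__":
    records = [os.urandom(512) for _ in range(500)]
    with tempfile.TemporaryDirectory() as d:
        slow = load(os.path.join(d, "a.db"), records, bulk=False)
        fast = load(os.path.join(d, "b.db"), records, bulk=True)
    print(f"per-record fsync: {slow:.3f}s, single fsync: {fast:.3f}s")
```

How large the gap is depends entirely on the storage. On a filesystem that ignores or cheaply satisfies fsync() (tmpfs, some virtualised disks) the two timings may be nearly identical.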

Our HDB is larger than yours, so we needed this sooner...

commit 7d5f8bb051ca84592d1196bf5d5522da5a50f9d6
Author: Nicolas Williams <[hidden email]>
Date:   Tue Oct 10 12:18:57 2017 -0500

    Disable sync during kadmin load

commit 305dc816525f461f9bfe640d87f671f53f0e0fc6
Author: Nicolas Williams <[hidden email]>
Date:   Tue Oct 10 12:11:26 2017 -0500

    Disable sync during iprop receive_everything()

    Doing an fsync per-record when receiving the complete HDB is a performance
    disaster.  Among other things, if the HDB is very large, then one slave
    receving a full HDB can cause other slaves to timeout and, if HDB write
    activity is high enough to cause iprop log truncation, then also need full
    syncs, which leads to a cycle of full syncs for all slaves until HDB write
    activity drops.

    Allowing the iprop log to be larger helps, but improving receive_everything()
    performance helps even more.

commit 5bcbe2125b18160f6ad348b15f8036ffedc15770
Author: Nicolas Williams <[hidden email]>
Date:   Tue Oct 10 13:06:21 2017 -0500

    Add hdb_set_sync() method


Re: ipropd slow when pushing complete database

Jeffrey Altman-2
In reply to this post by Adam Lewenberg
If the statement that transfers took 10s with 7.1 and Jessie is
accurate, then the expected time should be 10s.

Nico replied with a list of changes that are present on the "master"
branch which will speed up writes by avoiding a disk sync operation for
each and every record.

One question that you should also ponder is whether the storage
infrastructure on which the Stretch systems are running is equivalent to
the infrastructure on which the Jessie systems were running.  Perhaps the
Jessie systems ignored disk sync operations.
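
One way to compare the two storage setups is to measure how long a small write followed by fsync() takes on each host. The sketch below is an assumption about how one might do that in Python; `fsync_latency` is a hypothetical helper, and the default count and write size are arbitrary.

```python
import os
import tempfile
import time


def fsync_latency(directory, count=200, size=512):
    """Return mean seconds per write+fsync of `size` bytes in `directory`."""
    payload = b"\0" * size
    # Create the scratch file inside the directory under test so the
    # measurement hits the same filesystem as the database would.
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)
        return (time.perf_counter() - start) / count
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling `fsync_latency()` on the directory that holds the HDB on both a Jessie and a Stretch system gives comparable numbers. A result in the low microseconds would suggest that syncs are not actually reaching stable storage on that host.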

Jeffrey Altman




Re: ipropd slow when pushing complete database

Nico Williams
On Wed, Dec 19, 2018 at 07:35:23PM -0500, Jeffrey Altman wrote:
> If the statement that transfers took 10s with 7.1 and Jessie is
> accurate, then the expected time should be 10s.
>
> Nico replied with a list of changes that are present on the "master"
> branch which will speed up writes by avoiding a disk sync operation for
> each and every record.

I don't know why 7.1 would have been fast and 7.5 slow.  I looked
through the commit history, but nothing obvious caught my attention.

> One question that you should also ponder is whether the storage
> infrastructure on which the Stretch systems are running is equivalent to
> the infrastructure on which the Jessie systems were running.  Perhaps the
> Jessie systems ignored disk sync operations.

Good question!

Nico

Re: ipropd slow when pushing complete database

Harald Barth-2
In reply to this post by Adam Lewenberg

Have you considered that this could be a problem with TCP packet sizes and
Nagle's algorithm? https://en.wikipedia.org/wiki/Nagle%27s_algorithm

We identified back in 2015 that kadmin get * is slow because of
this, and I don't think that was ever fixed because one needs to
restructure the whole send/receive program flow for this. To start
with one could use TCP_NODELAY and see if that fixes the problem.

Whether TCP_NODELAY is a permanent fix that should go into the
production code is another question (which was never solved for the
kadmin case if I recall correctly).
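
For anyone who wants to try this, disabling Nagle's algorithm is a single socket option. The sketch below shows it in Python purely as an illustration; `connect_nodelay` is a hypothetical helper and not part of Heimdal. In C the equivalent is a `setsockopt()` call with level `IPPROTO_TCP` and option `TCP_NODELAY`.

```python
import socket


def connect_nodelay(host, port, timeout=10.0):
    """Open a TCP connection with Nagle's algorithm disabled."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # A non-zero value turns TCP_NODELAY on, so small writes are sent
    # immediately instead of being held back for coalescing.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

Reading the option back with `sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)` returns a non-zero value once it is set.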

This was just a guess,
Harald.

Re: ipropd slow when pushing complete database

Nico Williams
On Thu, Dec 20, 2018 at 09:40:39AM +0100, Harald Barth wrote:

> Have you considered that this could be a problem with TCP packet sizes and
> Nagle's algorithm? https://en.wikipedia.org/wiki/Nagle%27s_algorithm
>
> We identified back in 2015 that kadmin get * is slow because of
> this, and I don't think that was ever fixed because one needs to
> restructure the whole send/receive program flow for this. To start
> with one could use TCP_NODELAY and see if that fixes the problem.
>
> Whether TCP_NODELAY is a permanent fix that should go into the
> production code is another question (which was never solved for the
> kadmin case if I recall correctly).
>
> This was just a guess,

It's a pretty good one.

The fsync() thing is a big deal by itself.  We should definitely make
sure we disable Nagle in most TCP protocols in Heimdal.

Nico