[Buildroot] [PATCH 00/19] support: limit install-time instrumentation to current package's files (branch yem/files-list-2)

Yann E. MORIN yann.morin.1998 at free.fr
Mon Jan 7 22:05:35 UTC 2019


Hello All!

Currently, the instrumentation steps that we run after a package is
installed get confused about which files that package is actually
responsible for.

The first problem is that all .la files are tweaked after a package is
installed, and thus those files are all then newer than the built
stampfile of that package, and consequently all .la files are accounted
to that package.

The second problem is that, during development and after a user
requested a package reinstall (but not a rebuild!), the built
stampfile is much older, and thus all files that have been installed
since the package was last built are accounted to that package.

Those two problems are caused by 7fb6e782542f, when we switched away
from an md5 comparison between the state before and after the
installation, to a time-based comparison against the built stampfile.

Furthermore, during development, the list of installed files can get
out of sync with what is really installed. For example, if a user were
to modify the source of a package, and trigger a re-configure, rebuild,
or re-install, then we'd remove the list of previously installed files
before generating the list of currently installed files. If files
installed in the previous installation are no longer installed, they are
still present in the target (or staging or host), but no longer
accounted to the package that installed them.

Additionally, when two or more packages install the same file and it has
the same content, we don't care much about which actually installed it,
as they would all have installed the exact same file. The size could be
assigned to any of those packages, and the licensing terms of any of
those packages may be applied to that file. The case is most prominent
with the fftw family of packages (soon to come) that install the same
headers and the same utilities.

Finally, there is one prominent file that gets _updated_ (and not
replaced) by many packages: the info page index, which packages update
when they install their own info pages. We currently report that file,
when in fact it does not end up in target, and thus we don't care about
how its content came to be. And more generically, we don't care about any
file that we eventually remove as part of our target-finalize cleanups.

This series is thus an attempt at fixing all those issues.

First and foremost, the series addresses the limitation that causes the
first two problems: we do not have a way to know when the install steps
were started (or any other step, for that matter, but we're currently
only interested in the install steps). So, the first few patches make it
so that we can introduce a new timestamp file at the beginning of each
step.

Then, with the information about the beginning of the install step, we
can now limit the .la files tweaking to just those files that were
actually installed by a package. And then we use that same stamp file to
limit the listing of installed files accountable to the current package.
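
As an illustration only (this is not the code from the series), the
time-based listing described above can be sketched in Python as follows.
The stamp file name `.stamp_install_start` and the function names are
hypothetical; the sketch only assumes that a stamp file is touched right
before the install commands run.

```python
import os
import tempfile
import time


def files_newer_than(root, stamp):
    """Return the sorted paths under root (relative to root) that were
    modified after the stamp file was last touched."""
    stamp_mtime = os.lstat(stamp).st_mtime_ns
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat: look at the entry itself, do not follow symlinks
            if os.lstat(path).st_mtime_ns > stamp_mtime:
                found.append(os.path.relpath(path, root))
    return sorted(found)


def demo():
    """Build a tiny fake target/ with one old and one new .la file."""
    with tempfile.TemporaryDirectory() as top:
        target = os.path.join(top, "target")
        os.makedirs(os.path.join(target, "usr", "lib"))
        old = os.path.join(target, "usr", "lib", "libold.la")
        new = os.path.join(target, "usr", "lib", "libnew.la")
        # Hypothetical stamp name, kept outside of target/
        stamp = os.path.join(top, ".stamp_install_start")
        for path in (old, stamp, new):
            open(path, "w").close()
        # Explicit mtimes, so the demo does not depend on the
        # timestamp granularity of the filesystem.
        now = time.time()
        os.utime(old, (now - 20, now - 20))
        os.utime(stamp, (now - 10, now - 10))
        os.utime(new, (now, now))
        return files_newer_than(target, stamp)
```

Here, only libnew.la is listed, since libold.la predates the stamp. The
same list can then drive both the .la tweaking and the file accounting.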

Then the series addresses the same-identical-file-from-many-packages
problem. To do so, it partially restores the md5sum of the files, but this is
limited to only those files actually touched during the install of the
current package (see above), and is only run at the end of the install,
not before. Thus, this is much faster than the original situation
that did the md5 of all files before and after, because it now acts on
cache-hot files only.
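
A minimal sketch of that idea, in Python, assuming we already have the
list of paths touched by the current install (the function names and the
chunk size are illustrative, not taken from the series):

```python
import hashlib
import os


def md5_of_file(path, chunk_size=65536):
    """Return the hex md5 of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_of_touched(root, relpaths):
    """Hash only the files that the current install touched.

    relpaths are paths relative to root; nothing else under root is
    read, which is what keeps the cost proportional to one package.
    """
    return {rel: md5_of_file(os.path.join(root, rel)) for rel in relpaths}
```

The point is that the cost scales with what the package just installed,
rather than with the whole content of target/, staging/, or host/.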

That part is split in two: first, the format of the packages-file-list
files is modified to be more resilient to weird filenames, which then
allows us to expand it with arbitrarily more fields. A python helper is
provided to abstract the new format, and the consumers of those files
are updated to use the helper (with one script being rewritten in
python). Then we make use of this new format to store the md5 of the
files contents, which we eventually use to decide whether to report the
file or not.
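
To give an idea of why the format matters, here is a sketch of one
possible record layout and its parser. To be clear, this layout is purely
hypothetical and is NOT the format introduced by the series: it assumes a
record is the filename, a NUL byte, then comma-separated fields up to a
newline. Since a filename can contain anything but NUL, this stays
parseable even with commas or newlines in filenames, and more fields can
be appended later.

```python
def format_record(filename, package, md5):
    """Serialise one record; all arguments are bytes.

    Hypothetical layout: <filename> NUL <package>,<md5> LF
    """
    return filename + b"\0" + package + b"," + md5 + b"\n"


def parse_records(data):
    """Yield one dict per record found in data (bytes)."""
    pos = 0
    while pos < len(data):
        # The filename ends at the first NUL; it may contain newlines.
        nul = data.index(b"\0", pos)
        # The other fields never contain a newline, so the first one
        # after the NUL terminates the record.
        eol = data.index(b"\n", nul)
        package, md5 = data[nul + 1:eol].split(b",")
        yield {"file": data[pos:nul], "pkg": package, "md5": md5}
        pos = eol + 1
```

Hiding this behind a small helper means the consumers do not have to
care about the on-disk layout when new fields are added.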

Now, files that are missing from the destination directory are no longer
eligible for being reported as being touched by more than one package.

And finally, now that we have a dependable check for uniqueness, we can
add an option in the menuconfig to turn the current warning into a hard
error when uniqueness is not met.
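
Put together, the uniqueness rules described above could look like the
following sketch. This is not the actual check-uniq-files script; the
function names, the (file, package, md5) tuples and the `fatal` flag are
assumptions made for the sake of the example.

```python
from collections import defaultdict


def non_unique_files(records, exists):
    """Return {file: [packages]} for files that deserve a report.

    records is an iterable of (file, package, md5) tuples;
    exists is a callable telling whether a file is still present
    in the destination directory.
    """
    by_file = defaultdict(dict)
    for filename, package, md5 in records:
        # Keyed by package: a package reinstalling its own file
        # never counts as a second owner.
        by_file[filename][package] = md5
    report = {}
    for filename, owners in by_file.items():
        if len(owners) < 2:
            continue  # a single package touched it
        if len(set(owners.values())) == 1:
            continue  # same content from all packages
        if not exists(filename):
            continue  # removed by the finalize cleanups
        report[filename] = sorted(owners)
    return report


def check(records, exists, fatal=False):
    """Return an exit status: non-zero only if fatal and not unique."""
    report = non_unique_files(records, exists)
    for filename, packages in sorted(report.items()):
        kind = "error" if fatal else "warning"
        print("%s: %s is touched by: %s"
              % (kind, filename, ", ".join(packages)))
    return 1 if (fatal and report) else 0
```

With `fatal` left unset, this keeps today's behaviour of only warning;
setting it corresponds to the new menuconfig option.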

Since this is a time-sensitive topic, here are a few timings before and
after this series, over 6 runs on an idle machine, with the following
configuration:

  - prebuilt glibc toolchain
  - 233 packages, most pretty small and building fast
  - target/:  215MiB, 14922 files, directories, symlinks...
  - staging/: 625MiB, 29029 files, directories, symlinks...
  - host/:    2.1GiB, 44129 files, directories, symlinks...

                best           minutes:seconds          worst   mean
    before:     36:20   36:22   36:23   36:24   36:27   36:28   36:24
    after:      36:29   36:31   36:32   36:33   36:35   36:37   36:33

So, this is a 9s overhead over 2184s (36:24, before), i.e. a mere 0.4%
increase in time over the full build, or just about a 38ms overhead per
package on average. This overhead is real, but is still very far from
the huge one that was chopped off by 7fb6e782542f.
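
For those who want to double-check the arithmetic, the figures above can
be recomputed from the table with a few lines of Python (the variable
names are mine; the timings and the package count are the ones quoted
above):

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string to a number of seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


BEFORE = ["36:20", "36:22", "36:23", "36:24", "36:27", "36:28"]
AFTER = ["36:29", "36:31", "36:32", "36:33", "36:35", "36:37"]
PACKAGES = 233

mean_before = sum(map(to_seconds, BEFORE)) / len(BEFORE)
mean_after = sum(map(to_seconds, AFTER)) / len(AFTER)

overhead = mean_after - mean_before            # seconds per full build
percent = 100.0 * overhead / mean_before       # relative to 'before'
per_package_ms = 1000.0 * overhead / PACKAGES  # average per package

print("overhead: %.1fs, %.1f%%, %.0fms per package"
      % (overhead, percent, per_package_ms))
```

With the unrounded means, the overhead comes out at about 8.8s, which is
where the 9s, 0.4% and roughly 38ms per package come from.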

Additionally, the time for re-installing the last package does not
suffer from the large number or size of files already present.
Best result of three builds (to be cache-hot), for one target package
with a staging install, and for one host package:

            skeleton-init-common-reinstall    host-patchelf-reinstall
    before:            8.258s                       4.951s
    after:             4.514s                       5.034s
    delta:             -3.744s                     +0.083s

So, basically, what this means is that, during development, reinstalling
a previous package is faster. This is because, even though we spend (a
little tiny wee bit) more time when listing files due to the md5sum
(and really, that's just a few additional milliseconds per package),
we get repaid a hundredfold because the list is now accurate, and we
can limit ourselves to tweaking only the corresponding .la files, but
also limit the check-bin-arch to only those files actually interesting.

The host packages are still slightly impacted as we can see for
host-patchelf, because the check-bin-arch does not apply to them, so the
gain from running check-bin-arch only on just-installed files can't
apply to host packages. Still, the impact is minor.

I'd like to particularly thank Nicolas Cavallari for their valuable
input about the issues they encountered with the previous and current
situations. Many thanks! :-)


Regards,
Yann E. MORIN.


The following changes since commit 8e928a8389d88e0f64f04ee1b3aa4985dcfd373f

  Makefile, manual, website: Bump copyright year (2019-01-06 21:30:34 +0100)


are available in the git repository at:

  git://git.buildroot.org/~ymorin/git/buildroot.git

for you to fetch changes up to c7478b1fd1c92508f346f1a8626374d742c9c327

  core: add optional failure when 2+ packages touch the same file (2019-01-07 23:04:09 +0100)


----------------------------------------------------------------
Yann E. MORIN (19):
      infra/pkg-generic: display MESSAGE before running PRE_HOOKS
      infra/pkg-generic: create $(@D) before running PRE_HOOKS
      infra/pkg-generic: introduce new stampfile at the beginning of all steps
      infra/pkg-generic: use \0 to separate .la files as they are found
      infra/pkg-generic: tweak only .la files installed by the current package
      infra/pkg-generic: only list files installed by the current package
      infra/pkg-generic: offload same-package filtering to check-uniq-file
      support/check-uniq-files: decode as many strings as possible
      support: add parser in python for packages-file-list files
      support: rewrite check-bin-arch in python
      support: introduce new format for packages-file-list files
      infra/pkg-generic: store md5 of just-installed files
      support/check-uniq-file: invert condition logic
      support/check-uniq-files: don't report files of the same content
      support/check-uniq-files: use argparse to enfore required options
      core: check unique files in the corresponding finalize step
      core: check for unique target files after all our cleanups
      core: ignore non-unique files that have disapeared
      core: add optional failure when 2+ packages touch the same file

 Config.in                        |   8 ++
 Makefile                         |  22 ++++-
 package/pkg-generic.mk           |  41 +++++---
 support/scripts/brpkgutil.py     |  38 ++++++++
 support/scripts/check-bin-arch   | 205 +++++++++++++++++++++------------------
 support/scripts/check-uniq-files |  69 +++++++------
 support/scripts/size-stats       |  14 +--
 7 files changed, 255 insertions(+), 142 deletions(-)

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 223 225 172 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'

