[Buildroot] [PATCH] core/instrumentation: filter-out packages that install identical files

Yann E. MORIN yann.morin.1998 at free.fr
Thu Jan 3 22:38:32 UTC 2019


Thomas, All,

On 2019-01-03 22:37 +0100, Thomas Petazzoni spake thusly:
> On Thu,  3 Jan 2019 17:04:31 +0100, Yann E. MORIN wrote:
> > Currently, we check that no two packages write to the same files, as a
> > sanity check. We do so by checking which files were touched since the
> > end of the build (aka begining of the installation).
> > 
> > However, when the packages do install the exact same file, i,e, the
> > same content, we in fact do not really care what package had provided
> > said file.
> > 
> > In the past, we avoided that situation because we were md5sum-inf every
> > files before and after installation. Anything that changed was new or
> > modified, and everything that did not change was not modified (but could
> > have been reinstalled).
> > 
> > However, since 7fb6e78254 (core/instrumentation: shave minutes off the
> > build time), we're now using mtimes, and we're in the situation that the
> > exact same file installed by two-or-more packages is reported.
> > 
> > In such a situation, it is not very interesting to know what package
> > installed the file, because whatever the ordering, or whatever the
> > subset of said packages, we'd have ended up with the same file anyway.
> > One prominent case where this happens, if the fftw familly of pcackages,
> > that all install different libraries, but install the same set of
> > identical headers and some common utilities.
> > 
> > The rationale for 7fb6e78254 was that the impact on the build time was
> > horrible, so we can't revert it.
> > 
> > However, we can partially restore use of md5 to detect if the files were
> > actually modified or not, and limiting the md5sum-ing only to those
> > actually modified files. The comparison of the md5 is then offloaded to
> > later, during the final check-uniq-files calls.
> > 
> > Since the md5sum is run only on new files, they are cache-hot, and so
> > the md5 is not going to storage to fetch them (and if they were not
> > cache-hot, it would mean memory pressure for a reason or another, and
> > the system would already be slowed for other reasons).
> > 
> > Using a defconfig with a various set of ~236 packages, the build times
> > reported by running "time make >log.file 2>&1", on an otherwise unloaded
> > system, were (smallest value of 5 runs each):
> > 
> >     before:     35:15.92
> >     after:      35:22.03
> > 
> > Which is a slight overhead of just 6.11s, which is less than 26ms per
> > package. Also, the system, although pretty idle, was still doing
> > background work like fetching mails and such, so the overhead is not
> > even consistent across various measurements...
> 
> Overall, I find the idea good and useful.
> 
> A couple of comments/questions (here and below):
> 
>  - I believe we should still fix the fact that the .la file timestamp
>    is changed everytime we $(SED) the .la files, even if they are not
>    changed. This will also help reduce the number of files to check
>    using md5sum: a .la file should be fixed only once, and its
>    timestamp never touched again.

That is a different change, indeed, that should be done in the staging
install commands. I can tackle that.

>  - Could we make a check-uniq-files error a hard build failure, as a
>    follow-up patch from yours ? So far, check-uniq-files errors have
>    been a bit useless, because they do not cause the build to fail, so
>    nobody really cared.

That's fine, I already have the patch ready here: ;-)
    https://git.buildroot.org/~ymorin/git/buildroot/commit/?h=yem/misc&id=2c1ba76ee86a8f6bd9b9d4a67fa1e9bf28e840da

> > diff --git a/package/pkg-generic.mk b/package/pkg-generic.mk
> > index f5cab2b9c2..7c6a995c19 100644
> > --- a/package/pkg-generic.mk
> > +++ b/package/pkg-generic.mk
> > @@ -64,12 +64,14 @@ GLOBAL_INSTRUMENTATION_HOOKS += step_time
> >  # $(3): suffix of file  (optional)
> >  define step_pkg_size_inner
> >  	@touch $(BUILD_DIR)/packages-file-list$(3).txt
> > -	$(SED) '/^$(1),/d' $(BUILD_DIR)/packages-file-list$(3).txt
> > +	$(SED) '/^$(1)  /d' $(BUILD_DIR)/packages-file-list$(3).txt
> 
> Why is the separator changed from , to space ?

See below, to match the separator with md5.

> >  	cd $(2); \
> > -	find . \( -type f -o -type l \) \
> > +	find . -xtype f \
> 
> Why are you changing from -type f -o -type l to -xtype f ? Is this
> really related to the change ? Is -xtype supported in old version of
> find ?

-xtype is what we were using before 7fb6e78254:
    https://git.buildroot.org/buildroot/commit/package/pkg-generic.mk?id=7fb6e782542fc440c2da226ec4525236d0508b77

> >  		-newer $($(PKG)_DIR)/.stamp_built \
> > -		-exec printf '$(1),%s\n' {} + \
> > -		>> $(BUILD_DIR)/packages-file-list$(3).txt  
> > +		-print0 \
> > +	|xargs -0 -r md5sum \
> 
> Ah, you're changing the separator to two spaces, because md5sum's output
> already uses two spaces as a separator ?

Yes.

> This could break people who have scripts that parse the
> package-file-list.txt contents. Maybe we should keep the compatibility,
> and keep using , as a separator, and push the md5sum as an additional
> third column ?

Well, if they were splitting on the first comma, adding the md5 as third
column would also break their scripts.

Also, it is not sane to break on the comma after the file name, becuase
a filename is arbitrary. There should be nothing after it.

> Like:
> 	|xargs -0 -r md5sum \
> 	| sed -r -e 's/([0-9a-f]{32})  (.*)/$(1),\2,\1/'
> 
> One drawback is that it won't work with filenames that contain a comma.
> But in this case, we're forced to keep the filename field as the last
> one, and therefore break backward compatibility of the
> package-file-lists.txt format every time we want to add a new
> information to it.

In any case, we have to break backward compatibility. I actually thought
about it, but forgot to say so in the commit message, my bad...

> > +	|sed -r -e 's/^/$(1)  /' \
> > +	>> $(BUILD_DIR)/packages-file-list$(3).txt
> >  endef
> >  
> >  define step_pkg_size
> > diff --git a/support/scripts/check-uniq-files b/support/scripts/check-uniq-files
> > index fbc6b5d6e7..94deea01db 100755
> > --- a/support/scripts/check-uniq-files
> > +++ b/support/scripts/check-uniq-files
> > @@ -24,24 +24,30 @@ def main():
> >          sys.stderr.write('No type was provided\n')
> >          return False
> >  
> > -    file_to_pkg = defaultdict(list)
> > +    file_to_pkg = defaultdict(set)
> 
> You're changing from list to set, it's a bit unrelated. Should be a
> separate commit ?

Yep, will do.

> > +    file_md5 = defaultdict(set)
> >      with open(args.packages_file_list[0], 'rb') as pkg_file_list:
> >          for line in pkg_file_list.readlines():
> > -            pkg, _, file = line.rstrip(b'\n').partition(b',')
> > -            file_to_pkg[file].append(pkg)
> > +            pkg, md5, file = line.rstrip(b'\n').split(b'  ', 2)
> 
> The ", 2" in split() is to make sure that if the filename contains two
> consecutive spaces, we don't split it up ?

Exactly. We know we have exactly three fields, so we need two splits.

> > +            file_to_pkg[file].add(pkg)
> > +            file_md5[file].add(md5)
> >  
> >      for file in file_to_pkg:
> > -        if len(file_to_pkg[file]) > 1:
> > -            # If possible, try to decode the binary strings with
> > -            # the default user's locale
> > -            try:
> > -                sys.stderr.write(warn.format(args.type, file.decode(),
> > -                                             [p.decode() for p in file_to_pkg[file]]))
> > -            except UnicodeDecodeError:
> > -                # ... but fallback to just dumping them raw if they
> > -                # contain non-representable chars
> > -                sys.stderr.write(warn.format(args.type, file,
> > -                                             file_to_pkg[file]))
> > +        if len(file_to_pkg[file]) == 1 or len(file_md5[file]) == 1:
> > +            # If only one package installed that file, or they all
> > +            # installed the same content, don't report that file.
> > +            continue
> > +
> > +        # If possible, try to decode the binary strings with
> > +        # the default user's locale
> > +        try:
> > +            sys.stderr.write(warn.format(args.type, file.decode(),
> > +                                         [p.decode() for p in file_to_pkg[file]]))
> > +        except UnicodeDecodeError:
> > +            # ... but fallback to just dumping them raw if they
> > +            # contain non-representable chars
> > +            sys.stderr.write(warn.format(args.type, file,
> > +                                         file_to_pkg[file]))
> 
> I'm not sure why the whole test is inverted, it causes a lot of noise
> in the diff. If there is a good reason for it, should be a separate
> patch.

Sure, will do.

Thanks!

Regards,
Yann E. MORIN.

> Best regards,
> 
> Thomas
> -- 
> Thomas Petazzoni, CTO, Bootlin
> Embedded Linux and Kernel engineering
> https://bootlin.com

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 223 225 172 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'



More information about the buildroot mailing list