[Buildroot] [PATCH v4 5/5] support/scripts/pkgstats: add CPE reporting
Arnout Vandecappelle
arnout at mind.be
Wed May 16 23:32:38 UTC 2018
Hi Matt, Ricardo,
I haven't been following things very closely so I may miss something entirely,
however...
On 16-05-18 05:43, Ricardo Martincoski wrote:
> Hello,
>
> See below:
> - an important typo to fix;
> - some considerations about verbosity;
> - a possible trivial patch to split from this one;
> - some considerations about performance (to possibly run 2x or 4x faster, iff
> we can make some assumptions about the XML).
I thought the performance was completely dominated by the network, but
apparently it isn't?
Oh, now I see, the set_cpe_info is in fact O(#cpe_packages *
#buildroot_packages), so 17 minutes is just 1ms for the inner loop, which
doesn't sound entirely unreasonable.
> I hope other people can give more input about the last three items above.
>
> On Thu, May 10, 2018 at 03:58 PM, Matt Weber wrote:
>
>> support/scripts/pkgstats: add CPE reporting
>
> nit: support/scripts/pkg-stats: add CPE reporting
>
> [snip]
>> + def find_partial(self, cpe_str):
>> + print("CPE: Searching for partial [%s]" % cpe_str)
>
> This ...
>
>> + for cpe in self.all_cpes['cpe-list']['cpe-item']:
>> + if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
>> + return cpe['cpe-23:cpe23-item']['@name']
>> +
>> + def find(self, cpe_str):
>> + print("CPE: Searching for [%s]" % cpe_str)
>
> ... and this make the script very verbose.
>
> It generates 4000+ lines to stdout when generating the html.
> Should we remove it from the final version?
> Should we add a -v option for debug?
>
> It seems to me it can be removed.
Well, if it takes 17 minutes, it's nice to see some progress... But if you can
reduce it to 4 minutes the need isn't there so much.
[snip]
>> + cpe_dict = CPE()
>> + cpe_dict.get_xml_dict()
>> + if args.cpe_report:
>> + print("Performing Target CPE Report Analysis...")
>> + get_target_cpe_report(args.cpe_report, cpe_dict)
>> + elif args.output:
>
> This is not common in other scripts in the tree. All checks between options are
> done at the start of __main__.
> But having two different code paths is not common either (in the scripts in the
> tree), so it seems to me it makes sense here.
> Maybe others disagree.
The way I see it, there is almost no commonality between the script with the -c
option and the normal pkg-stats, so it makes no sense to have them in the same
script. I would split off the CPE class into a separate module that would be
imported into this script and a new cpe_report script.
>
>> + print("Build package list ...")
>> + packages = get_pkglist(args.npackages, package_list)
>> + print("Getting package make info ...")
>> + package_init_make_info()
>> + print("Getting package details ...")
>> + for pkg in packages:
>> + pkg.set_infra()
>> + pkg.set_license()
>> + pkg.set_hash_info()
>> + pkg.set_patch_count()
>> + pkg.set_check_package_warnings()
>> + pkg.set_current_version()
>> + pkg.set_cpe_info(cpe_dict)
>> + print("Calculate stats")
>> + stats = calculate_stats(packages)
>> + print("Write HTML")
>> + dump_html(packages, stats, args.output)
>> + else:
>> + print("Please provide the -o HTML output file arg")
>>
>>
>> __main__()
>
> About the performance:
> 17 minutes is not incredibly good but it is not terrible for 2000+ packages.
IMO it is terrible...
>
> I created 2 *poorly-tested* patches for comparison purposes only. They seem
> (again *poorly-tested*) to produce the same html output as v4 when called
> without -c.
>
> series v4 + typo fixed:
> 17 minutes
> series v4 + typo fixed + comparison patch 1:
> 8 minutes
> series v4 + typo fixed + comparison patch 2:
> 4 minutes
>
> I am OK with first having the script functional and later, in another patch,
> improving its performance while keeping the same functionality.
> Maybe others disagree.
> Maybe you like any patch below and find the time to fully test that it really
> covers all cases and decide to merge to this one.
> Maybe you foresee any use for other fields from the XML. In that case, both
> comparison patches would not fit, as they assume we will always only need to
> know cpe-23:cpe23-item/@name and that there will be always one, and only one,
> cpe-23:cpe23-item/@name per cpe-item. Notice: I did not read the specs for the
> CPE dictionary, I looked only to this script.
>
>
> comparison patch 1:
> Can we assume we will only need cpe-23:cpe23-item/@name and pre-process?
We can easily keep both the dict and the list.
> ----->8-----
> +++ b/support/scripts/pkg-stats
> @@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
> class CPE:
> - all_cpes = dict()
> + all_cpes = list()
This (both the dict() and the list()) is a bit unconventional. If two instances
of class CPE are created, they will refer to the same dict/list... Normally
members are set in the __init__ constructor. Was this intentional?
Personally, I'm a fan of the attrs module, which allows you to do this very
elegantly. Unfortunately, attrs is not in the standard library.
>
> @@ -569,4 +569,6 @@ class CPE:
> cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
> - print("CPE: Converting xml manifest to dict...")
> - self.all_cpes = xmltodict.parse(cpe_file)
> + print("CPE: Converting xml manifest to list...")
> + raw_dict = xmltodict.parse(cpe_file)
> + for cpe in raw_dict['cpe-list']['cpe-item']:
> + self.all_cpes.append(cpe['cpe-23:cpe23-item']['@name'])
> except urllib2.HTTPError:
> @@ -580,5 +582,5 @@ class CPE:
> print("CPE: Searching for partial [%s]" % cpe_str)
> - for cpe in self.all_cpes['cpe-list']['cpe-item']:
> - if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> - return cpe['cpe-23:cpe23-item']['@name']
> + for cpe in self.all_cpes:
> + if cpe_str in cpe:
> + return cpe
>
> @@ -586,5 +588,4 @@ class CPE:
> print("CPE: Searching for [%s]" % cpe_str)
> - for cpe in self.all_cpes['cpe-list']['cpe-item']:
> - if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> - return cpe['cpe-23:cpe23-item']['@name']
> + if cpe_str in self.all_cpes:
> + return cpe_str
> ----->8-----
>
>
> comparison patch 2:
> Note: I usually don't parse XML using Python, so I am not sure how future-proof
> is the string that includes the namespace. It seems (from google search results)
> people invented few hacks to disable namespaces in ElementTree. xmltodict has
> process_namespaces disabled by default.
> ----->8-----
> +++ b/support/scripts/pkg-stats
> @@ -27,3 +27,3 @@ import sys
> import urllib2
> -import xmltodict
> +import xml.etree.ElementTree as ET
> import gzip
> @@ -561,3 +561,3 @@ def dump_html(packages, stats, output):
> class CPE:
> - all_cpes = dict()
> + all_cpes = None
This should still be list() IMO, so that it has a consistent interface also in
case of e.g. HTTPError.
>
> @@ -569,4 +569,5 @@ class CPE:
> cpe_file = gzip.GzipFile(fileobj=StringIO(compressed_cpe_file.read())).read()
> - print("CPE: Converting xml manifest to dict...")
> - self.all_cpes = xmltodict.parse(cpe_file)
> + print("CPE: Converting xml manifest to list...")
> + tree = ET.fromstring(cpe_file)
> + self.all_cpes = [i.get('name') for i in tree.iter('{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item')]
So after this you get basically the same as after comparison patch 1, right? So
the xmltodict takes 4 minutes? Or am I missing something?
Oh, actually, the [... for ... iter...] is also more efficient than
for...: append() so that could be an effect here as well. But this part of the
code is only O(#cpe packages) so it shouldn't have that much impact.
> except urllib2.HTTPError:package
> @@ -580,5 +581,5 @@ class CPE:
> print("CPE: Searching for partial [%s]" % cpe_str)
> - for cpe in self.all_cpes['cpe-list']['cpe-item']:
> - if cpe_str in cpe['cpe-23:cpe23-item']['@name']:
> - return cpe['cpe-23:cpe23-item']['@name']
> + for cpe in self.all_cpes:
> + if cpe.startswith(cpe_str):
Originally it was 'in' instead of startswith(). Obviously startswith() will be
more efficient. And also more correct, I guess, or does the partial match not
need to be anchored at the beginning of the string? Well, since it always starts
with 'cpe:2.3:a:' a match somewhere in the middle seems unlikely...
> + return cpe
>
> @@ -586,5 +587,5 @@ class CPE:
> print("CPE: Searching for [%s]" % cpe_str)
> - for cpe in self.all_cpes['cpe-list']['cpe-item']:
> - if cpe['cpe-23:cpe23-item']['@name'] == cpe_str:
> - return cpe['cpe-23:cpe23-item']['@name']
> + for cpe in self.all_cpes:
> + if cpe == cpe_str:
> + return cpe
Looks to me like the lookup time can be cut in half again by merging the two
find functions:
def find(self, cpe_str):
partial_matches = []
partial_cpe_str = self.get_cpe_no_version(cpe_str)
for cpe in self.all_cpes:
if cpe == cpe_str:
return self.get_nvd_url(cpe_str)
elif cpe.startswith(partial_cpe_str):
partial_matches.append(cpe)
if partial_matches:
return 'Updated' # Or we could return the 'best' version?
Although... that probably doesn't make much of a difference.
What would probably make a difference, however, is to make two sets of cpe strings:
all_cpes = set()
all_cpes_no_version = set()
...
print("CPE: Converting xml manifest to dict...")
for cpe in raw_dict['cpe-list']['cpe-item']:
cpe_str = cpe['cpe-23:cpe23-item']['@name']
cpe_str_no_version = self.get_cpe_no_version(cpe_str)
self.all_cpes.add(cpe_str)
self.all_cpes_no_version.add(cpe_str_no_version)
...
def find(self, cpe_str):
if cpe_str in self.all_cpes:
return self.get_cpe_no_version(cpe_str)
cpe_str_no_version = self.get_cpe_no_version(cpe_str)
if cpe_str_no_version in self.all_cpes_no_version:
return 'Updated'
Regards,
Arnout
> ----->8-----
>
> Regards,
> Ricardo
>
>
>
> _______________________________________________
> buildroot mailing list
> buildroot at busybox.net
> http://lists.busybox.net/mailman/listinfo/buildroot
>
--
Arnout Vandecappelle arnout at mind be
Senior Embedded Software Architect +32-16-286500
Essensium/Mind http://www.mind.be
G.Geenslaan 9, 3001 Leuven, Belgium BE 872 984 063 RPR Leuven
LinkedIn profile: http://www.linkedin.com/in/arnoutvandecappelle
GPG fingerprint: 7493 020B C7E3 8618 8DEC 222C 82EB F404 F9AC 0DDF
More information about the buildroot
mailing list