[Buildroot] [PATCH v4 2/5] autobuild-run: initial implementation of get_reproducibility_failure_reason()
Arnout Vandecappelle
arnout at mind.be
Sun Sep 8 17:06:11 UTC 2019
Hi Atharva,
I'm very sorry that I'm so late reviewing this...
On 20/08/2019 16:52, Atharva Lele wrote:
> Analyze the JSON formatted output from diffoscope and check if
> the differences are due to a filesystem reproducibility issue
> or a package reproducibility issue.
>
> Also, discard the deltas because they might take up too much space.
>
> Signed-off-by: Atharva Lele <itsatharva at gmail.com>
> ---
> Changes v2 -> v4:
> - Change if-else to try-except
> - remove blank line
> Changes v1 -> v2:
> - Refactor using subfunctions and local variables (suggested by Thomas)
> - Added comments (suggested by Thomas)
> - Use more pythonic loops (suggested by Thomas)
> ---
> scripts/autobuild-run | 88 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 88 insertions(+)
>
> diff --git a/scripts/autobuild-run b/scripts/autobuild-run
> index 99b57dd..384cf73 100755
> --- a/scripts/autobuild-run
> +++ b/scripts/autobuild-run
> @@ -131,6 +131,7 @@ import csv
> import docopt
> import errno
> import hashlib
> +import json
> import mmap
> import multiprocessing
> import os
> @@ -599,6 +600,93 @@ class Builder:
> if reject_results():
> return
>
> + def get_reproducibility_failure_reason(reproducible_results):
This function really warrants a docstring. Especially because it does more than
the title says: it not only returns the reason, but it also updates the JSON
file with it and changes the way the diffs are stored.
> + def split_delta(delta):
> + # Take a delta and split it into added, deleted lines.
> + added = []
> + deleted = []
> + for line in delta:
> + if line.startswith("+"):
> + added.append(line)
> + if line.startswith("-"):
> + deleted.append(line)
> + return added, deleted
A more pythonesque way to do this would be
added = [line for line in delta if line.startswith("+")]
> +
> + def get_package(sourcef):
> + # Returns which package the source file belongs to.
> + with open(packages_file_list, "r") as packagef:
It is strange to have packages_file_list defined at a higher scope and then
used within this function. I think it's better to move its definition inside
this function as well.
> + for line in packagef:
> + if sourcef in line:
> + package = line.split(',')[0]
I think it is better to keep a list of packages for the JSON output, and use
the last one for the reason output. So here you would have:
packages = [line.split(',', 1)[0] for line in packagef if sourcef in line]
> +
> + try:
> + # Get package version
> + package_info = json.loads(subprocess.check_output(["make", "--no-print-directory",
> + "O=%s" % self.outputdir,
> + "-C", self.srcdir,
> + "%s-show-info" % package]))
> + if "version" in package_info[package]:
> + version = package_info[package]["version"]
> + return [package, version]
I don't know how useful it is to extract the package version... It is currently
part of the reason file mostly by accident, I think.
> + else:
> + return [package]
> + except:
> + return ["not found"]
> +
> + def cleanup(l):
> + # Takes a list and removes data which is redundant (source2) or data
> + # that might take up too much space (like huge diffs).
> + if "unified_diff" in l:
> + l.pop("unified_diff")
Condition can be avoided with
l.pop("unified_diff", None)
(or maybe that's only python3? I'm not used to legacy python any more :-)
> + if "source2" in l:
> + l.pop("source2")
> +
> + packages_file_list = os.path.join(self.outputdir, "build", "packages-file-list.txt")
> +
> + with open(reproducible_results, "r") as reproduciblef:
> + json_data = json.load(reproduciblef)
> +
> + if json_data["unified_diff"] == None:
== None makes little sense - you should just use 'not' in that case. If you
really want to check that it's None, you should use 'is None'. However, do you
really get a null entry in the json output? Isn't it that the "unified_diff"
section is missing in the json output? In that case, the
json_data["unified_diff"] would give a KeyError exception...
> + # Remove the file list because it is not useful, i.e. it only shows
> + # which files vary, and nothing more.
> + if json_data["details"][0]["source1"] == "file list":
> + json_data["details"].pop(0)
> +
> + # Iterate over details in the diffoscope output.
> + for item in json_data["details"]:
> + diff_src = item["source1"]
> + item["package"] = get_package(diff_src)
> +
> + # In some cases, diffoscope uses multiple commands to get various
> + # diffs. Due to this, it generates a "details" key for those files
> + # instead of just storing the diff in the "unified_diff" key.
> + if item["unified_diff"] == None:
> + for item_details in item["details"]:
> + diff = item_details["unified_diff"].split("\n")
> + split_deltas = split_delta(diff)
> + item_details["added"] = split_deltas[0][:100]
> + item_details["deleted"] = split_deltas[1][:100]
> + cleanup(item_details)
> + else:
> + diff = item["unified_diff"].split("\n")
> + split_deltas = split_delta(diff)
> + item["added"] = split_deltas[0][:100]
> + item["deleted"] = split_deltas[1][:100]
> + cleanup(item)
This cleanup has nothing to do with getting the failure reason. I think it
would be better to do it in a separate function (and a separate patch). Also,
I'm not really happy yet with this diff cleanup, because we loose all context
information. So maybe, for now, it's better to just use the --max-report-size
option.
> + # We currently just set the reason from first non-reproducible package in the
> + # dictionary.
> + reason = json_data["details"][0]["package"]
> +
> + # If there does exist a unified_diff directly for the .tar images, it is probably
> + # a filesystem reproducibility issue.
This comment should come after the else, or before the condition. Like this, it
looks like it belongs to the == None condition.
Maybe it's cleaner to swap the condition and put this filesystem assumption first.
Regards,
Arnout
> + else:
> + reason = ["filesystem"]
> +
> + with open(reproducible_results, "w") as reproduciblef:
> + json.dump(json_data, reproduciblef, sort_keys=True, indent=4)
> +
> + return reason
> +
> def get_failure_reason():
> # Output is a tuple (package, version), or None.
> lastlines = decode_bytes(subprocess.Popen(
>
More information about the buildroot
mailing list