Here are a couple of relevant papers on reproducibility and testing. The first defines four categories of reproducibility, along with statistical tests that automate categorising the non-bit-for-bit cases:
Changes, additions and updates to CICE fall into four categories: (I) BFB [bit-for-bit] with no further assessment required; (II) non-BFB but unlikely to be climate changing; (III) non-BFB and climate changing; and (IV) a new model configuration option requiring separate scientific assessment. This section describes the automated methods used to flag the first three categories.
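To make the first three categories concrete, here is a rough Python sketch of how such a triage could look. It is not the paper's method: the function names `is_bit_for_bit` and `classify`, the `alpha` threshold, and the choice of a large-sample paired test on the differences are all assumptions made for illustration. Real model output is autocorrelated in time, so a plain test like this one would overstate significance unless the sample size is adjusted.

```python
import math
import struct
from statistics import NormalDist, mean, stdev


def is_bit_for_bit(baseline, candidate):
    """True when both sequences hold identical IEEE-754 double bit patterns."""
    if len(baseline) != len(candidate):
        return False
    pack = struct.Struct("<d").pack
    return all(pack(a) == pack(b) for a, b in zip(baseline, candidate))


def classify(baseline, candidate, alpha=0.05):
    """Return 'I', 'II' or 'III' for a candidate run compared with a baseline.

    Hypothetical triage: category I is an exact bit match; otherwise a
    two-sided test on the mean paired difference (normal approximation,
    an assumption that ignores autocorrelation) separates II from III.
    """
    if len(baseline) != len(candidate):
        raise ValueError("runs must have the same number of samples")
    if len(baseline) < 2:
        raise ValueError("need at least two samples")
    if is_bit_for_bit(baseline, candidate):
        return "I"

    diffs = [c - b for b, c in zip(baseline, candidate)]
    spread = stdev(diffs)
    if spread == 0.0:
        # Every difference is identical: a constant non-zero offset is
        # treated as a real change, an all-zero one (e.g. -0.0 vs 0.0) is not.
        return "III" if mean(diffs) != 0.0 else "II"

    z = mean(diffs) / (spread / math.sqrt(len(diffs)))
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return "III" if p_value < alpha else "II"


if __name__ == "__main__":
    base = [2.0 + 0.01 * i for i in range(40)]
    noisy = [b + (1e-6 if i % 2 == 0 else -1e-6) for i, b in enumerate(base)]
    shifted = [n + 0.5 for n in noisy]
    print(classify(base, list(base)), classify(base, noisy), classify(base, shifted))
```

Category IV is left out on purpose, since the quoted text says a new configuration option needs a separate scientific assessment rather than an automated flag.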