Gravitational-wave observations of binary black holes allow new tests of general relativity to be performed on strong, dynamical gravitational fields. These tests require accurate waveform models of the gravitational-wave signal; otherwise, waveform errors can erroneously suggest evidence for new physics. Existing waveforms are generally thought to be accurate enough for current observations, and each of the events observed to date appears to be individually consistent with general relativity. In the near future, with larger gravitational-wave catalogs, it will be possible to perform more stringent tests of gravity by analyzing large numbers of events together. However, there is a danger that waveform errors can accumulate across events: even if the waveform model is accurate enough for each individual event, it can still yield erroneous evidence for new physics when applied to a large catalog. We present a simple linearized analysis, in the style of a Fisher matrix calculation, that reveals the conditions under which the apparent evidence for new physics due to waveform errors grows as the catalog size increases. We estimate that, in the worst-case scenario, evidence for a deviation from general relativity might appear in some tests using a catalog containing as few as 10-30 events above a signal-to-noise ratio of 20. This is close to the size of current catalogs and highlights the need for caution when performing tests of this kind.