Flaky test: ApiMainTest::testAddRequestedFieldsCurTimestamp
Closed, Resolved · Public

Description

https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php73-docker/1077/console
https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/538920/

There was 1 failure:

1) ApiMainTest::testAddRequestedFieldsCurTimestamp
…
Failed asserting that 2 is equal to 1 or is less than 1.

/workspace/src/tests/phpunit/includes/api/ApiMainTest.php:180
/workspace/src/tests/phpunit/MediaWikiIntegrationTestCase.php:454

Event Timeline


For this to happen, this bit of code would have to take over 1 second:

$api->execute();
$timestamp = $api->getResult()->getResultData()['curtimestamp'];
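// Fails whenever more than one second elapses between the API reading the
// clock for curtimestamp during execute() and time() being called here.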
$this->assertLessThanOrEqual( 1, abs( strtotime( $timestamp ) - time() ) );

On my local system it takes about 0.02 seconds.

I suppose it wouldn't hurt to adjust the code so this test can take advantage of ConvertibleTimestamp::setFakeTime(), but the question I'd really like answered is "why is CI so slow here?"
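
For illustration, a minimal sketch of what that could look like (assuming curtimestamp is produced via wfTimestamp(), which honours the fake clock; the actual fix is in the change linked below):

use Wikimedia\Timestamp\ConvertibleTimestamp;

// Freeze the clock before the API call. setFakeTime() accepts a unix
// timestamp, a timestamp string, or a callable; passing false restores
// the real clock.
$fakeTime = 1234567890;
ConvertibleTimestamp::setFakeTime( $fakeTime );

$api->execute();
$timestamp = $api->getResult()->getResultData()['curtimestamp'];
// With the clock frozen, the comparison is exact rather than a race.
$this->assertSame( $fakeTime, strtotime( $timestamp ) );

// Restore the real clock so later tests aren't affected.
ConvertibleTimestamp::setFakeTime( false );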

I suspect it relates to time() being subject to clock drift and jumps, given the test runs within a container inside a VM, so pauses in execution aren't unimaginable.

microtime() would probably be less subject to that, though I'm not sure.
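
For comparison, a monotonic source such as PHP 7.3's hrtime() isn't subject to wall-clock adjustments at all, though it only helps for measuring elapsed time, not for producing timestamps like curtimestamp. A quick sketch of the difference:

// Wall clock: can jump if NTP or the hypervisor adjusts the system time.
$wallStart = microtime( true );
// Monotonic clock (PHP 7.3+): only moves forward, immune to adjustments.
$monoStart = hrtime( true ); // nanoseconds

// ... the code being timed ...

$wallElapsed = microtime( true ) - $wallStart; // seconds (float), may be skewed
$monoElapsed = ( hrtime( true ) - $monoStart ) / 1e9; // seconds, monotonic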

I've been seeing a lot of cases in the past few days where CI would randomly take over an hour, eventually being killed and resulting in a failing test. This failure is probably just another symptom of that, surfacing when the slowdown happens to hit at just the wrong time.

Change 539362 had a related patch set uploaded (by Anomie; owner: Anomie):
[mediawiki/core@master] API: Use ConvertibleTimestamp::setFakeTime for testing curtimestamp

https://gerrit.wikimedia.org/r/539362

I've been seeing a lot of cases in the past few days where CI would randomly take over an hour, eventually being killed and resulting in a failing test. [..]

For what it's worth, the CI infrastructure hasn't gotten slower (if anything, it's gotten quite a bit faster). Rather, there's just a lot more stuff being thrown into the jobs, making them take longer (20min instead of 5min). There's also a fair amount of flakiness which, whenever it happens, requires all other jobs in the gate-and-submit pipeline around that time to restart without the unrelated preceding change applied (20min -> 40min); all of this causes more congestion and exhaustion of CI workers, which in turn leads to longer wait times before jobs start (40min -> 50min). The overall tracking task for a bunch of this is T225730.

I'm not sure that explains https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/6354/console or https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/6372/console. The problem there seems to be that the test run took a long time, not that it waited a long time to start or that it had to restart.

These were aborted after the 60min allowance was exceeded. (They used to take 8-10min, and that was on less optimised infra.) Note that these two jobs had already been running for 40+ minutes before the main PHPUnit-with-db task began.

The problem is that 40min of actual (non-phpunit) stuff was added to these jobs, along with 10min of additional PHPUnit work (mostly Wikibase's PHPUnit tests). Jobs not running Wikibase and other extensions remain unaffected. For example, the plain MW PHPUnit job still takes 5-6 minutes and includes all of core's Parser, QUnit, and WDIO tests.

The problem is that 40min of actual (non-phpunit) stuff was added to these jobs,

Was it though? https://integration.wikimedia.org/ci/job/wmf-quibble-core-vendor-mysql-php72-docker/buildTimeTrend shows the job typically takes about 20 minutes. That's still not great, but it's also not the 60 minutes those runs took.

I see. Note that the workload depends on the repo that triggers it. The Wikidata-initiated ones have additional work and are consistently taking 50-60 minutes.

The main ones for core and (most) wmf extensions are indeed around 20 minutes now. A lot of work is ongoing in this area. A few weeks ago they were nearing 22-30min (T232626), some of which has been resolved since then. When I filed T225730 we were around 12-14min.

The odd core jobs taking 60 minutes are indeed suspicious. Some kind of CPU slowdown seems most likely, especially since there is no single point where the job appears to have hung for a while (based on the Jenkins timestamps). Perhaps it's related to concurrent load on the infra.

My point was mainly that the CI infra hasn't gotten slower "in general". Looking at core's PHPUnit/ParserTests+QUnit and most gate extensions (e.g. excluding recent gate additions and the slower integration tests from Wdio, Wikibase, and Cirrus), what remains runs faster today than it ever has, in part due to MySQL and LCStore being fully in RAM and the workers having more CPU/RAM.

The random 60min spikes, though, are new I guess. I haven't seen them on patches I've submitted or reviewed yet. Worth filing a task to get to the bottom of it.

Change 539362 merged by jenkins-bot:
[mediawiki/core@master] API: Use ConvertibleTimestamp::setFakeTime for testing curtimestamp

https://gerrit.wikimedia.org/r/539362

The random 60min spikes, though, are new I guess. I haven't seen them on patches I've submitted or reviewed yet. Worth filing a task to get to the bottom of it.

Thanks