See parent task for more details.
AC:
- relforge cluster is running elasticsearch 6.8.23
Status | Assigned | Task
---|---|---
Resolved | None | T248925 Make MediaWiki release tarball compatible with PHP 8.0
Resolved | Jdforrester-WMF | T300463 Make PHP 8.0 voting on MW master
Resolved | None | T283275 Make MW master tests pass on PHP 8.0
Resolved | Reedy | T268861 CirrusSearch uses Elastica's Match class
Resolved | Reedy | T268863 Translate uses Elastica's Match class
Resolved | matthiasmullie | T268866 WikibaseMediaInfo uses Elastica's Match class
Invalid | None | T268864 WikibaseCirrusSearch uses Elastica's Match class
Resolved | Reedy | T268865 WikibaseLexemeCirrusSearch uses Elastica's Match class
Resolved | EBernhardson | T271777 Bump ruflin/elastica (and related libraries) to versions that support PHP 8.0
Resolved | Gehel | T263142 [EPIC] Upgrade Elasticsearch to version 7.10
Resolved | Gehel | T295666 Upgrade Cirrus elasticsearch clusters to 6.8.23
Resolved | bking | T301955 Upgrade relforge to elasticsearch 6.8.23
Resolved | RKemper | T278378 Pull Elasticsearch config out of Spicerack
Change 763479 had a related patch set uploaded (by Gehel; author: Gehel):
[operations/puppet@production] elasticsearch: upgrade deployment-prep to elasticsearch 6.8
Change 769109 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):
[operations/cookbooks@master] elastic: relax & restore perms during upgrade
Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:48:13Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:49:21Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:51:15Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:06:06Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:14Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955
Change 769109 merged by jenkins-bot:
[operations/cookbooks@master] elastic: relax & restore perms during upgrade
Change 769789 had a related patch set uploaded (by Bking; author: Bking):
[operations/cookbooks@master] elastic: add missing restart flag
Change 763479 merged by Razzi:
[operations/puppet@production] elasticsearch: upgrade relforge to elasticsearch 6.8
Change 769789 merged by jenkins-bot:
[operations/cookbooks@master] elastic: add missing restart flag
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:52Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:56Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:06Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:46Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:05:53Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:08:03Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:55:37Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:56:47Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Change 771072 had a related patch set uploaded (by Bking; author: Bking):
[operations/cookbooks@master] elasticsearch: remove custom restart handling
Change 771072 merged by Bking:
[operations/cookbooks@master] elasticsearch: remove custom restart handling
Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:45:44Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:59:15Z] <ryankemper> T301955 Downtimed relforge for 2 days; stuck in yellow status during upgrade b/c replica shards cannot be scheduled to a host of lower elasticsearch version than primary shards. Working on patch for our rolling-operation cookbook to disable replication during operation
Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:26:39Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:29:12Z] <ryankemper> T301955 Lifted downtime on relforge now that cluster upgrade is complete and cluster is back to green status
Upgrade complete.
Note that we ran into the following, which we had to work around by manually upgrading the second host:
{"index":"queries_27012021","shard":3,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"CLUSTER_RECOVERED","at":"2022-03-21T21:46:37.871Z","last_allocation_status":"no_attempt"},"can_allocate":"no","allocate_explanation":"cannot allocate because allocation is not permitted to any of the nodes","node_allocation_decisions":[{"node_id":"E7e7HF1YTvSql8UdZVrLBQ","node_name":"relforge1003-relforge-eqiad","transport_address":"10.64.5.37:9300","node_attributes":{"hostname":"relforge1003","rack":"A2","fqdn":"relforge1003.eqiad.wmnet","row":"A"},"node_decision":"no","deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[queries_27012021][3], node[E7e7HF1YTvSql8UdZVrLBQ], [P], s[STARTED], a[id=4DkiEULDRum86eYAs1T9_g]]"}]},{"node_id":"JYN55FKeSpSEuEqGsMzjIA","node_name":"relforge1004-relforge-eqiad","transport_address":"10.64.21.126:9300","node_attributes":{"hostname":"relforge1004","rack":"B2","row":"B","fqdn":"relforge1004.eqiad.wmnet"},"node_decision":"no","deciders":[{"decider":"node_version","decision":"NO","explanation":"cannot allocate replica shard to a node with version [6.5.4] since this is older than the primary version [6.8.23]"}]}]}
This problem is a hard blocker only on relforge, because it is a two-host cluster: once the first host is upgraded and holds the primaries, the only other host is still on the older version and cannot accept their replicas. Production does not have that constraint. However, the row awareness / allocation constraint will complicate things there, so we'll want to be sure to remove that constraint before we upgrade production.
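For anyone reading a response like the one above, here is a minimal Python sketch that lists which decider blocked the shard on each node. It is not part of the cookbook. The function name `blocking_deciders` is hypothetical, and the sample is a trimmed copy of the output quoted above, with the `same_shard` explanation shortened.

```python
import json


def blocking_deciders(explain: dict) -> list:
    """Return (node_name, decider, explanation) for every NO decision."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocked.append(
                    (node["node_name"], decider["decider"], decider["explanation"])
                )
    return blocked


# Trimmed copy of the allocation explanation quoted above.
# Only the fields the function reads are kept; the same_shard
# explanation is shortened for readability.
sample = json.loads("""
{
  "index": "queries_27012021",
  "shard": 3,
  "primary": false,
  "can_allocate": "no",
  "node_allocation_decisions": [
    {
      "node_name": "relforge1003-relforge-eqiad",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "the shard cannot be allocated to the same node on which a copy of the shard already exists"
        }
      ]
    },
    {
      "node_name": "relforge1004-relforge-eqiad",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "node_version",
          "decision": "NO",
          "explanation": "cannot allocate replica shard to a node with version [6.5.4] since this is older than the primary version [6.8.23]"
        }
      ]
    }
  ]
}
""")

for node_name, decider, explanation in blocking_deciders(sample):
    print(f"{node_name}: {decider}: {explanation}")
```

On the sample this reports `same_shard` for relforge1003 and `node_version` for relforge1004. That is the two-host deadlock described above: the only node without a copy of the shard is the one still running the older version.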
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:32Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:41Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:14:58Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:15:05Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:16:31Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:19:34Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:23Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:26Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:24:01Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955
Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:27:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955