Maniphest T301955

Upgrade relforge to elasticsearch 6.8.23
Closed, Resolved · Public · 3 Estimated Story Points
Assigned To

	bking

Authored By

	Gehel
	Feb 17 2022, 10:06 AM

Description

See parent task for more details.

AC:

relforge cluster is running elasticsearch 6.8.23

Details

Subject	Repo	Branch	Lines +/-
elastic: relax & restore perms during upgrade	operations/cookbooks	master	+35 -30
elasticsearch: remove custom restart handling	operations/cookbooks	master	+41 -50
elastic: add missing restart flag	operations/cookbooks	master	+4 -2
elasticsearch: upgrade relforge to elasticsearch 6.8	operations/puppet	production	+1 -1

Related Objects

Status	Assigned	Task
Resolved	None	T248925 Make MediaWiki release tarball compatible with PHP 8.0
Resolved	Jdforrester-WMF	T300463 Make PHP 8.0 voting on MW master
Resolved	None	T283275 Make MW master tests pass on PHP 8.0
Resolved	Reedy	T268861 CirrusSearch uses Elastica's Match class
Resolved	Reedy	T268863 Translate uses Elastica's Match class
Resolved	matthiasmullie	T268866 WikibaseMediaInfo uses Elastica's Match class
Invalid	None	T268864 WikibaseCirrusSearch uses Elastica's Match class
Resolved	Reedy	T268865 WikibaseLexemeCirrusSearch uses Elastica's Match class
Resolved	EBernhardson	T271777 Bump ruflin/elastica (and related libraries) to versions that support PHP 8.0
Resolved	Gehel	T263142 [EPIC] Upgrade Elasticsearch to version 7.10
Resolved	Gehel	T295666 Upgrade Cirrus elasticsearch clusters to 6.8.23
Resolved	bking	T301955 Upgrade relforge to elasticsearch 6.8.23
Resolved	RKemper	T278378 Pull Elasticsearch config out of Spicerack

Event Timeline

Gehel created this task. Feb 17 2022, 10:06 AM

Restricted Application added a subscriber: Aklapper. Feb 17 2022, 10:06 AM

Change 763479 had a related patch set uploaded (by Gehel; author: Gehel):

[operations/puppet@production] elasticsearch: upgrade deployment-prep to elasticsearch 6.8

https://gerrit.wikimedia.org/r/763479

gerritbot added a project: Patch-For-Review. Feb 17 2022, 10:15 AM

Gehel edited projects, added Discovery-Search (Current work); removed Discovery-Search. Feb 17 2022, 11:11 AM

Gehel moved this task from Incoming to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

• MPhamWMF moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Feb 22 2022, 8:04 PM

• MPhamWMF moved this task from In Progress to Ready for Dev -- SWE on the Discovery-Search (Current work) board.

Gehel added a subtask: T278378: Pull Elasticsearch config out of Spicerack. Mar 7 2022, 1:02 PM

The current upgrade cookbook needs to be adapted to include the steps that were executed manually on deployment-prep (write access to /etc/elasticsearch is required during the upgrade). The cookbooks also rely on a hardcoded list of servers in Spicerack, which should be addressed in T278378 first.
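
As a rough illustration of the "relax & restore perms" idea, the sketch below temporarily adds write bits to a directory and puts the original mode back afterwards, even if the step in between fails. It is a local, standalone example only: the real cookbook change acts on remote hosts, and the helper name `relaxed_permissions` and the specific mode bits are assumptions, not taken from the patch.

```python
import os
import stat
import tempfile
from contextlib import contextmanager


@contextmanager
def relaxed_permissions(path, extra_mode=stat.S_IWUSR | stat.S_IWGRP):
    """Temporarily add write bits to `path`, restoring the original mode on exit.

    Hypothetical helper: the name and the default mode bits are assumptions.
    """
    original_mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, original_mode | extra_mode)
    try:
        yield path
    finally:
        # Restore the original mode even if the upgrade step raised.
        os.chmod(path, original_mode)


if __name__ == "__main__":
    # Demonstrate on a throwaway directory rather than /etc/elasticsearch.
    with tempfile.TemporaryDirectory() as config_dir:
        os.chmod(config_dir, 0o550)
        with relaxed_permissions(config_dir):
            # The package upgrade would run here, while the directory is writable.
            print("during:", oct(stat.S_IMODE(os.stat(config_dir).st_mode)))
        print("after: ", oct(stat.S_IMODE(os.stat(config_dir).st_mode)))
```

Wrapping the restore in `finally` means a failed upgrade step does not leave the configuration directory writable.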

Gehel set the point value for this task to 3. Mar 7 2022, 4:59 PM

Change 769109 had a related patch set uploaded (by Ryan Kemper; author: Ryan Kemper):

[operations/cookbooks@master] elastic: relax & restore perms during upgrade

https://gerrit.wikimedia.org/r/769109

• razzi closed subtask T278378: Pull Elasticsearch config out of Spicerack as Resolved. Mar 9 2022, 5:00 PM

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:48:13Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:49:21Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T20:51:15Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:06:06Z] <ryankemper@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:11Z] <ryankemper@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-09T21:10:14Z] <ryankemper@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) restart without plugin upgrade (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - ryankemper@cumin1001 - T301955

RKemper reopened subtask T278378: Pull Elasticsearch config out of Spicerack as Open. Mar 9 2022, 9:27 PM

Change 769109 merged by jenkins-bot:

[operations/cookbooks@master] elastic: relax & restore perms during upgrade

https://gerrit.wikimedia.org/r/769109

Change 769789 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elastic: add missing restart flag

https://gerrit.wikimedia.org/r/769789

Change 763479 merged by Razzi:

[operations/puppet@production] elasticsearch: upgrade relforge to elasticsearch 6.8

https://gerrit.wikimedia.org/r/763479

Change 769789 merged by jenkins-bot:

[operations/cookbooks@master] elastic: add missing restart flag

https://gerrit.wikimedia.org/r/769789

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:52Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:02:56Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:06Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:04:46Z] <bking@cumin1001> END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:05:53Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-10T22:08:03Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Gehel closed subtask T278378: Pull Elasticsearch config out of Spicerack as Resolved. Mar 14 2022, 2:52 PM

Gehel moved this task from Ready for Dev -- SWE to In Progress on the Discovery-Search (Current work) board. Mar 14 2022, 3:50 PM

Gehel updated Other Assignee, added: bking.

Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:55:37Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-15T21:56:47Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

bking mentioned this in P22359 ES restart cookbook error T301955. Mar 15 2022, 10:02 PM

Change 771072 had a related patch set uploaded (by Bking; author: Bking):

[operations/cookbooks@master] elasticsearch: remove custom restart handling

https://gerrit.wikimedia.org/r/771072

bking claimed this task. Mar 21 2022, 3:12 PM

bking updated Other Assignee, removed: bking.

Change 771072 merged by Bking:

[operations/cookbooks@master] elasticsearch: remove custom restart handling

https://gerrit.wikimedia.org/r/771072

Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:45:44Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-21T21:59:15Z] <ryankemper> T301955 Downtimed relforge for 2 days; stuck in yellow status during upgrade b/c replica shards cannot be scheduled to a host of lower elasticsearch version than primary shards. Working on patch for our rolling-operation cookbook to disable replication during operation

Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:26:39Z] <bking@cumin1001> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster restart - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-03-21T22:29:12Z] <ryankemper> T301955 Lifted downtime on relforge now that cluster upgrade is complete and cluster is back to green status

Upgrade complete.

Note that we ran into the following, which we had to work around by manually upgrading the second host:

{"index":"queries_27012021","shard":3,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"CLUSTER_RECOVERED","at":"2022-03-21T21:46:37.871Z","last_allocation_status":"no_attempt"},"can_allocate":"no","allocate_explanation":"cannot allocate because allocation is not permitted to any of the nodes","node_allocation_decisions":[{"node_id":"E7e7HF1YTvSql8UdZVrLBQ","node_name":"relforge1003-relforge-eqiad","transport_address":"10.64.5.37:9300","node_attributes":{"hostname":"relforge1003","rack":"A2","fqdn":"relforge1003.eqiad.wmnet","row":"A"},"node_decision":"no","deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[queries_27012021][3], node[E7e7HF1YTvSql8UdZVrLBQ], [P], s[STARTED], a[id=4DkiEULDRum86eYAs1T9_g]]"}]},{"node_id":"JYN55FKeSpSEuEqGsMzjIA","node_name":"relforge1004-relforge-eqiad","transport_address":"10.64.21.126:9300","node_attributes":{"hostname":"relforge1004","rack":"B2","row":"B","fqdn":"relforge1004.eqiad.wmnet"},"node_decision":"no","deciders":[{"decider":"node_version","decision":"NO","explanation":"cannot allocate replica shard to a node with version [6.5.4] since this is older than the primary version [6.8.23]"}]}]}
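
The allocation explanation above is easier to read when reduced to "which decider said NO on which node". The sketch below does that for a response of the same shape. `blocking_deciders` is a hypothetical helper, and `SAMPLE` is an abbreviated copy of the response with the explanation strings shortened for the example.

```python
import json


def blocking_deciders(explain):
    """Map each node name to the (decider, explanation) pairs that answered NO."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        blockers[node["node_name"]] = [
            (decider["decider"], decider["explanation"])
            for decider in node.get("deciders", [])
            if decider["decision"] == "NO"
        ]
    return blockers


# Abbreviated copy of the response above; explanations shortened for the example.
SAMPLE = json.loads("""
{
  "index": "queries_27012021",
  "shard": 3,
  "primary": false,
  "can_allocate": "no",
  "node_allocation_decisions": [
    {"node_name": "relforge1003-relforge-eqiad",
     "deciders": [{"decider": "same_shard", "decision": "NO",
                   "explanation": "a copy of the shard already exists on this node"}]},
    {"node_name": "relforge1004-relforge-eqiad",
     "deciders": [{"decider": "node_version", "decision": "NO",
                   "explanation": "node version [6.5.4] is older than the primary version [6.8.23]"}]}
  ]
}
""")

if __name__ == "__main__":
    for node_name, reasons in blocking_deciders(SAMPLE).items():
        for decider, explanation in reasons:
            print(f"{node_name}: {decider}: {explanation}")
```

Here relforge1003 is ruled out by `same_shard` (it already holds the primary) and relforge1004 by `node_version` (it is still on 6.5.4), which leaves no node for the replica.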

This problem is a hard blocker only on relforge, because it is a two-host cluster. Production does not have that constraint. However, the row awareness / allocation constraint will complicate things, so we'll want to be sure to remove it before we upgrade production.
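
To make the two-host limitation concrete, here is a minimal sketch of the two rules visible in the allocation explanation: a replica cannot share a node with its primary, and it cannot go to a node running an older version than the primary's node. It deliberately ignores row awareness and every other allocation decider. The function names and the three-node example cluster are hypothetical.

```python
def parse_version(version):
    """Turn '6.8.23' into (6, 8, 23) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def replica_candidates(node_versions, primary_node):
    """Return nodes that could hold a replica of a shard whose primary is on `primary_node`.

    Encodes only two rules:
    - same_shard: a replica cannot share a node with its primary;
    - node_version: a replica cannot go to a node older than the primary's node.
    """
    primary_version = parse_version(node_versions[primary_node])
    return sorted(
        name
        for name, version in node_versions.items()
        if name != primary_node and parse_version(version) >= primary_version
    )


if __name__ == "__main__":
    # relforge mid-upgrade: primary on the upgraded host, one old host left.
    relforge = {"relforge1003": "6.8.23", "relforge1004": "6.5.4"}
    print(replica_candidates(relforge, "relforge1003"))

    # Hypothetical larger cluster in which two hosts are already upgraded.
    larger = {"node-a": "6.8.23", "node-b": "6.8.23", "node-c": "6.5.4"}
    print(replica_candidates(larger, "node-a"))
```

With only two hosts, a primary on the upgraded host has no eligible target for its replica until the second host is upgraded too, which matches the yellow status seen during the relforge upgrade.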

Gehel closed this task as Resolved. Mar 29 2022, 1:32 PM

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:32Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:13:41Z] <bking@cumin2002> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:14:58Z] <bking@cumin1001> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:15:05Z] <bking@cumin1001> END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin1001 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:16:31Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T13:19:34Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:23Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:23:26Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:24:01Z] <bking@cumin2002> START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

Mentioned in SAL (#wikimedia-operations) [2022-04-13T14:27:07Z] <bking@cumin2002> END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: relforge testing - bking@cumin2002 - T301955

bking mentioned this in rCCKBb110ab7f474f: elastic: relax & restore perms during upgrade. Dec 14 2022, 3:30 PM

RKemper mentioned this in rCCKBa3c120d7dcf2: elastic: add missing restart flag. Dec 14 2022, 3:30 PM

bking mentioned this in rCCKB60123a329665: elasticsearch: remove custom restart handling. Dec 14 2022, 3:30 PM
