[go: up one dir, main page]

Page MenuHomePhabricator

Remove servicerunner dependency for cxserver
Open, In Progress, MediumPublic16 Estimated Story Points

Description

Servicerunner, the base library for most of WMF's nodejs based services is unmaintained. Service runner, and its unmaintained dependencies are causing many issues:

Since cxserver is deployment is based on kubernetes and not baremetal deployments as used to happen many years back, a standard nodejs based service is sufficient.

Plan

  • Replace preq library with native http requests T350773: Remove preq and use node fetch
  • Simplify the routers. The current approach is reading all files in the routes directory, and loading the routes. Instead, use express's recommended way of handling routes - import the routes definitions and register using app.use('/', rootRoutes) ;app.use('/v1', v1Routes); etc
  • Replace the complex testing orchestration with standard express framework testing.
  • Use an ECS logger https://github.com/elastic/ecs-logging-nodejs. Remove custom error logging and request logging with ECS express middleware.
  • Remove various methods of stats reporting(statsd and Prometheus) and use one and only one way of reporting stats using a Prometheus client
  • Simplify the configuration - Current configuration assumes there are multiple services in the repo(service runner was designed for this kind of setup).
  • And finally, replace servicerunner with node cluster

Event Timeline

Change 1003609 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] WIP: Migrate from service-runner to node cluster

https://gerrit.wikimedia.org/r/1003609

The above patch is a quick run to identify the required efforts to migrate from servicerunner. It is not for merge. My proposal is to modernize various parts of cxserver, while using servicerunner as process manager. Do this migrations in iterations and at later stage when cxserver does not have a strong dependency on servicerunner other than a process manager, replace it. Doing everything in one go is too risky as cxserver is the backbone of our heavily used translation system.

Plan

  • Replace preq library with native http requests T350773: Remove preq and use node fetch
  • Simplify the routers. The current approach is reading all files in routes directory, and loading the routes. Instead, use express's recommended way of handling routes - import the routes definitions and register using app.use('/', rootRoutes) ;app.use('/v1', v1Routes); etc
  • Replace the complex testing orchestration with standard express framework testing.
  • Use an ECS logger https://github.com/elastic/ecs-logging-nodejs. Remove custom error logging and request logging with ECS express middleware.
  • Remove various methods of stats reporting(statsd and Prometheus) and use one and only one way of reporting stats using a statsd client
  • Simplify the configuration - Current configuration assumes there are multiple services in the repo(service runner was designed for this kind of setup).
  • And finally replace servicerunner with node cluster
  • Remove various methods of stats reporting(statsd and Prometheus) and use one and only one way of reporting stats using a statsd client

My understanding was that statsd was deprecated and we are supposed to use Prometheus instead: https://wikitech.wikimedia.org/wiki/Statsd "Statsd is supported (on the production realm), but new metrics producers are encouraged to use Prometheus instead."

Node 20 supprtes native testrunner. So we can use that opportunity to untangle our tests from servicerunner. cxserver tests need not depend on service runner, may be it can use express for a test server.
Initial exploration https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1055769/

Change #1055769 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] test: Replace mocha with node 20 native test runner

https://gerrit.wikimedia.org/r/1055769

Change #1055769 merged by jenkins-bot:

[mediawiki/services/cxserver@master] test: Replace mocha with node 20 native test runner

https://gerrit.wikimedia.org/r/1055769

Change #1059746 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-08-05-063332-production

https://gerrit.wikimedia.org/r/1059746

Change #1059746 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-08-05-063332-production

https://gerrit.wikimedia.org/r/1059746

Change #1064734 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] major: Drop servicerunner and setup standards express service

https://gerrit.wikimedia.org/r/1064734

This is great work! One question that I have is whether there's a plan to incorporate the changes into the service template. Would that be in scope?

This is great work! One question that I have is whether there's a plan to incorporate the changes into the service template. Would that be in scope?

Not in scope. However, once this task is completed, just like example extension, we can think of an example nodejs service - a minimal code repo with essential parts like logging, config parsing, prometheus client, routing, unit testing, CI setups etc.

(this task will take some more time to complete as the deployment without disruption need to be figured out. Also worker management- pm2? node cluster part need to be figured out)

Change #1003609 abandoned by Santhosh:

[mediawiki/services/cxserver@master] WIP: Migrate from service-runner to node cluster

Reason:

See https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1064734?usp=dashboard

https://gerrit.wikimedia.org/r/1003609

Change #1064734 merged by jenkins-bot:

[mediawiki/services/cxserver@master] major: Drop servicerunner and setup standard express service

https://gerrit.wikimedia.org/r/1064734

santhosh updated the task description. (Show Details)
santhosh updated the task description. (Show Details)

Change #1080431 had a related patch set uploaded (by Santhosh; author: Santhosh):

[operations/deployment-charts@master] Update cxserver to 2024-10-15-033213-production

https://gerrit.wikimedia.org/r/1080431

Change #1080431 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-10-15-033213-production

https://gerrit.wikimedia.org/r/1080431

Observations from staging deployment:

  1. Metrics endpoint is functional. curl -vLk https://staging.svc.eqiad.wmnet:4002/metrics returns the metrics
  2. /list api is functional curl -vLk https://staging.svc.eqiad.wmnet:4002/v2/list/mt/en/el
  3. MT with Apertium and MinT is working. Google and others need testing- difficult to test as they require JWT secret
  4. Logging is working but wrong format as per logstash 2024-10-16T07:19:25.484793362Z stdout F {"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"} is the raw entry it picked up. That is not json, logstash fails to parse it
  5. Grafana does not seem to pick up the stats from prometheus
  6. All API endpoints that require connecting to MW API is failing.

I will start from fixing logging.

Change #1080577 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Logging: Use json formatting instead of ecs

https://gerrit.wikimedia.org/r/1080577

Change #1080578 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Fix configuration reading for mwapi_req and restbase_req

https://gerrit.wikimedia.org/r/1080578

Change #1080590 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] Fix: name field in options is undefined

https://gerrit.wikimedia.org/r/1080590

Change #1080577 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Logging: Use json formatting instead of ecs

https://gerrit.wikimedia.org/r/1080577

Change #1080578 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Fix configuration reading for mwapi_req and restbase_req

https://gerrit.wikimedia.org/r/1080578

Change #1080590 merged by jenkins-bot:

[mediawiki/services/cxserver@master] Fix: name field in options is undefined

https://gerrit.wikimedia.org/r/1080590

Change #1080623 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-10-16-114208-production

https://gerrit.wikimedia.org/r/1080623

Change #1080623 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-10-16-114208-production

https://gerrit.wikimedia.org/r/1080623

Current status:

All API endpoints are functional. Metrics is working fine. Logging is working, but logstash is not parsing the ecs format logs for some reason.

@akosiaris We deployed this code in staging. Only issue we observe is ECS logging is not parsed by logstash.

the msg field has values like 2024-10-16T07:19:25.484793362Z stdout F {"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Example: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.10.16?id=E9I2lJIB-wz6FOsqP_-w
(please ignore the info level log, we changed the log level after this)

The application prints logs to stdout as json lines
{"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Do you know why this parsing failed?

@akosiaris We deployed this code in staging. Only issue we observe is ECS logging is not parsed by logstash.

the msg field has values like 2024-10-16T07:19:25.484793362Z stdout F {"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Example: https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.10.16?id=E9I2lJIB-wz6FOsqP_-w
(please ignore the info level log, we changed the log level after this)

The application prints logs to stdout as json lines
{"@timestamp":"2024-10-16T07:19:25.484Z","ecs.version":"8.10.0","log.level":"info","message":"GET /_info 200"}

Do you know why this parsing failed?

It's because of the underlying move away from docker (which is no longer supported by Kubernetes past 1.23) to containerd. Relevant task was T377132: containerd logs are not properly parsed during ingestion to logstash. I see the task was resolved on Oct 18th, the example log lines from above are from Oct 16th. Newer logs should be fine, mind checking again?

Change #1081937 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] routes: refactor and remove v1

https://gerrit.wikimedia.org/r/1081937

Mentioned in SAL (#wikimedia-operations) [2024-10-22T05:41:41Z] <kart_> Remove servicerunner dependency for cxserver (T357950, T373777)

Change #1082156 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] metrics: Fix request time histogram

https://gerrit.wikimedia.org/r/1082156

status: cxserver is deployed in production. A few metrics reporting related issues were noticed(some metrics missing) and currently being fixed. Other than that, all APIs are functional.

Change #1082156 merged by jenkins-bot:

[mediawiki/services/cxserver@master] metrics: Fix request time histogram

https://gerrit.wikimedia.org/r/1082156

Change #1082193 had a related patch set uploaded (by KartikMistry; author: KartikMistry):

[operations/deployment-charts@master] Update cxserver to 2024-10-22-112806-production

https://gerrit.wikimedia.org/r/1082193

Change #1082193 merged by jenkins-bot:

[operations/deployment-charts@master] Update cxserver to 2024-10-22-112806-production

https://gerrit.wikimedia.org/r/1082193

Mentioned in SAL (#wikimedia-operations) [2024-10-22T11:48:10Z] <kart_> Updated cxserver to 2024-10-22-112806-production (T357950)

Change #1084711 had a related patch set uploaded (by Santhosh; author: Santhosh):

[mediawiki/services/cxserver@master] refactor: Use standard way of registering express server routes

https://gerrit.wikimedia.org/r/1084711

Change #1081937 abandoned by Santhosh:

[mediawiki/services/cxserver@master] routes: refactor and remove v1

Reason:

In favor of https://gerrit.wikimedia.org/r/c/mediawiki/services/cxserver/+/1084711

https://gerrit.wikimedia.org/r/1081937

Nikerabbit set the point value for this task to 16.Mon, Nov 11, 9:35 AM

Change #1084711 merged by jenkins-bot:

[mediawiki/services/cxserver@master] refactor: Use standard way of registering express server routes

https://gerrit.wikimedia.org/r/1084711

All items in the checklist is completed. Can be closed once latest patches are deployed