CDanis (Chris Danis)
SRE @ WMF

Projects (11)

User Details

User Since
Nov 5 2018, 2:54 PM (315 w, 6 d)
Availability
Available
IRC Nick
cdanis
LDAP User
CDanis
MediaWiki User
CDanis (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 22

Joe awarded T380541: deploy RBD-backed persistent storage in the aux clusters a Love token.
Fri, Nov 22, 4:51 PM · Kubernetes, Infrastructure-Foundations

Thu, Nov 21

CDanis created T380541: deploy RBD-backed persistent storage in the aux clusters.
Thu, Nov 21, 9:13 PM · Kubernetes, Infrastructure-Foundations
CDanis added a comment to T370745: Integrate requestctl haproxy rules into our TLS terminator.

The good news: this works, we now see some hap: prefix rules appearing in x-requestctl.

Thu, Nov 21, 8:40 PM · Patch-For-Review, User-CDanis, User-Joe, conftool, Traffic
CDanis added a comment to T370745: Integrate requestctl haproxy rules into our TLS terminator.

Now deployed on all cp hosts in codfw.

Thu, Nov 21, 2:56 PM · Patch-For-Review, User-CDanis, User-Joe, conftool, Traffic

Wed, Nov 20

CDanis triaged T380402: idp.wikimedia.org should have a paging blackbox probe as High priority.
Wed, Nov 20, 6:51 PM · Observability-Alerting, Infrastructure-Foundations
CDanis created T380402: idp.wikimedia.org should have a paging blackbox probe.
Wed, Nov 20, 6:51 PM · Observability-Alerting, Infrastructure-Foundations
CDanis closed T380226: Install globaljsonlinks* tables on X1 for use with commons commons for Charts deployment, a subtask of T379689: Deploy Charts to test2wiki + Commons, as Resolved.
Wed, Nov 20, 2:31 PM · Charts (Sprint 11)
CDanis closed T380226: Install globaljsonlinks* tables on X1 for use with commons commons for Charts deployment as Resolved.
💙cdanis@mwmaint2002.codfw.wmnet ~ 🕤☕ mwscript sql.php --wiki=commonswiki  --cluster=extension1  --query 'show tables like "%globaljsonlink%"'
stdClass Object
(
    [Tables_in_commonswiki (%globaljsonlink%)] => globaljsonlinks
)
stdClass Object
(
    [Tables_in_commonswiki (%globaljsonlink%)] => globaljsonlinks_target
)
stdClass Object
(
    [Tables_in_commonswiki (%globaljsonlink%)] => globaljsonlinks_wiki
)
Wed, Nov 20, 2:31 PM · DBA, Charts (Sprint 11)

Tue, Nov 19

CDanis reopened T370745: Integrate requestctl haproxy rules into our TLS terminator as "Open".

I'm reopening this to track the rollout of this feature beyond cp4044.

Tue, Nov 19, 8:34 PM · Patch-For-Review, User-CDanis, User-Joe, conftool, Traffic
CDanis reopened T370745: Integrate requestctl haproxy rules into our TLS terminator, a subtask of T369606: Allow integrating requestctl rules into haproxy, as Open.
Tue, Nov 19, 8:31 PM · User-CDanis, User-Joe, conftool, Traffic
CDanis added a project to T313634: Survey the third-party library market for UA policy compliance: User-CDanis.
Tue, Nov 19, 5:46 PM · User-CDanis, SRE
CDanis updated subscribers of T376438: Download to PDF: HTTP 500 error on some wikis for some users.

@ihurbain just deployed the crashpad flag flip patch and (at least for now) Proton looks happier.

Tue, Nov 19, 2:15 PM · serviceops, Content-Transform-Team-WIP, Essential-Work, Electron-PDFs

Mon, Nov 18

CDanis awarded T378989: eqiad: (2x) aux-k8s-worker nodes a Party Time token.
Mon, Nov 18, 6:57 PM · Kubernetes, vm-requests, SRE
CDanis created T380218: First draft an SLO for Charts.
Mon, Nov 18, 6:41 PM · Charts (Sprint 11)
CDanis added a comment to T380142: Reimaging a kubernetes control-plane invalidates service-account tokens issued by it.

+1 to fingerprint as the key of keys

Mon, Nov 18, 5:22 PM · serviceops, Kubernetes, Prod-Kubernetes
CDanis awarded T356241: Move video transcoding to use Shellbox a Party Time token.
Mon, Nov 18, 2:53 PM · Patch-For-Review, Video, TimedMediaHandler, MW-on-K8s, serviceops

Fri, Nov 15

CDanis added a comment to T379901: Create tool to monitor and automatically delete misbehaving pods.

And now some notes pseudo-inlined with the task description.

Fri, Nov 15, 9:37 PM · serviceops, Kubernetes
CDanis added a comment to T379901: Create tool to monitor and automatically delete misbehaving pods.

I would like to see some more details on how this compares to "proper" readiness/liveness probes and in which cases we would use this instead of the builtin probes. I do understand that thumbor might be a bit of a special case here but I fear a tool like this could become a silver bullet for everything that is not working properly and needs kicking from time to time.

Fri, Nov 15, 7:43 PM · serviceops, Kubernetes
CDanis added a comment to T379570: [wmcs-cookbooks] wmcs.openstack.cloudvirt.vm_console cookbook is not working from cloudcumin hosts.

If you look at /etc/ssh/userkeys/root.d/cumin on a prod host, it contains the restrict option, which disallows PTY.

Fri, Nov 15, 3:43 PM · Cloud-VPS, Patch-For-Review, cloud-services-team (FY2024/2025-Q1-Q2)

Thu, Nov 14

CDanis edited projects for T379831: upgrade oauth2-proxy, added: observability; removed SRE.
Thu, Nov 14, 3:19 PM · Observability-Tracing

Wed, Nov 13

CDanis closed T375123: Archiving a trace more than once will duplicate all spans as Resolved.

fixed in 1.63.0 🎉

Wed, Nov 13, 9:28 PM · Observability-Tracing
CDanis closed T375123: Archiving a trace more than once will duplicate all spans, a subtask of T320549: distributed tracing v0 [minimum viable], as Resolved.
Wed, Nov 13, 9:27 PM · Epic, Observability-Tracing
CDanis created T379831: upgrade oauth2-proxy.
Wed, Nov 13, 8:52 PM · Observability-Tracing
CDanis closed T376939: New Service Request: chart-renderer as Resolved.
Wed, Nov 13, 3:52 PM · Infrastructure-Foundations, Charts, Service-deployment-requests, Services, SRE
CDanis created P71034 <Amir1> Is there a way to figure out the container connecting from "10.194.147.226:42062"?.
Wed, Nov 13, 3:03 PM

Tue, Nov 12

CDanis closed T379199: Install globaljsonlinks* tables on testcommons for Charts deployment as Resolved.
Tue, Nov 12, 9:13 PM · Charts (Sprint 10), DBA
CDanis added a comment to T379647: Timeout errors when making requests to Firebase for push notifications.

Given the shape of the error message and stack trace, I suspect that there's some new code path in FCM or its dependencies for which only setting the httpAgent param isn't sufficient, and we also need to set an environment variable.

Tue, Nov 12, 5:50 PM · Push-Notification-Service, Essential-Work, Content-Transform-Team-WIP, serviceops, Wikipedia-Android-App-Backlog
CDanis created P71024 docker-compose for simulating an app that must run behind an http proxy .
Tue, Nov 12, 5:41 PM
CDanis updated subscribers of T379628: MW script "eval.php" failing for "testcommonswiki" during train operations.

@bvibber @aude @Jdlrobson @CCiufo-WMF @Seddon FYI

Tue, Nov 12, 2:54 PM · User-brennen, Release-Engineering-Team (Radar), Train Deployments, SRE, serviceops

Thu, Nov 7

CDanis updated the task description for T379199: Install globaljsonlinks* tables on testcommons for Charts deployment.
Thu, Nov 7, 7:25 PM · Charts (Sprint 10), DBA
CDanis added a comment to T379199: Install globaljsonlinks* tables on testcommons for Charts deployment.
> show tables like '%globaljsonlinks%';
stdClass Object
(
    [Tables_in_testcommonswiki (%globaljsonlinks%)] => globaljsonlinks
)
stdClass Object
(
    [Tables_in_testcommonswiki (%globaljsonlinks%)] => globaljsonlinks_target
)
stdClass Object
(
    [Tables_in_testcommonswiki (%globaljsonlinks%)] => globaljsonlinks_wiki
)
Thu, Nov 7, 7:25 PM · Charts (Sprint 10), DBA

Wed, Nov 6

CDanis added a comment to T374683: Switchover plan from RESTbase to REST Gateway for rest_v1/page/html and rest_v1/page/title endpoints.

That would be nice, but it wouldn't have caught the problem we just encountered either - I wouldn't have thought to check for redirect=false. Known good cases are easy to cover, unknown bad cases are easy to miss...

Also, we may encounter problems that are more subtle - etag handling or cache control headers or CORS or whatnot. I checked all the stuff that I could think of a while ago, see https://docs.google.com/spreadsheets/d/10FaxUcD6y4Xjss21HfXUwVsH98RCWO7Bs9hhZuDTfFg/edit?pli=1&gid=0#gid=0. But as we just found out, I didn't think of all the relevant things to check...

Wed, Nov 6, 1:18 PM · User-notice, MW-1.44-notes (1.44.0-wmf.4; 2024-11-19), Wikimedia-Incident, serviceops, RESTBase Sunsetting, MW-Interfaces-Team

Tue, Nov 5

CDanis updated the task description for T369944: Epic: Deploy Chart extension in production.
Tue, Nov 5, 10:05 PM · serviceops-radar, Epic, Wikimedia-extension-review-queue, Wikimedia-Extension-setup, Charts
CDanis updated the task description for T372081: Deploy Chart service in production.
Tue, Nov 5, 9:59 PM · Charts (Sprint 10), serviceops-radar
CDanis updated the task description for T372081: Deploy Chart service in production.
Tue, Nov 5, 6:01 PM · Charts (Sprint 10), serviceops-radar
CDanis closed T376948: Write `chart-renderer` Helm chart, a subtask of T372081: Deploy Chart service in production, as Resolved.
Tue, Nov 5, 6:00 PM · Charts (Sprint 10), serviceops-radar
CDanis closed T376948: Write `chart-renderer` Helm chart as Resolved.
Tue, Nov 5, 6:00 PM · Charts (Sprint 9)

Mon, Nov 4

CDanis added a comment to T378989: eqiad: (2x) aux-k8s-worker nodes.

LGTM! please use groups C and D if possible, that would give full diversity across ganeti groups

Mon, Nov 4, 5:03 PM · Kubernetes, vm-requests, SRE
CDanis triaged T378453: Testing liberica with ncredir@eqiad as Medium priority.
Mon, Nov 4, 4:08 PM · Infrastructure-Foundations, netops
CDanis triaged T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams as Medium priority.
Mon, Nov 4, 4:00 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic

Fri, Nov 1

CDanis updated the task description for T376948: Write `chart-renderer` Helm chart.
Fri, Nov 1, 4:48 PM · Charts (Sprint 9)
CDanis added a comment to T376948: Write `chart-renderer` Helm chart.

The chart itself looks good, thanks! Also ran it on minikube locally.

Fri, Nov 1, 4:28 PM · Charts (Sprint 9)
CDanis added a comment to T378809: ganeti1025 VMs unresponsive Nov 1 2024.

I'm pretty confident this is the same as T348730, and I think it would be okay to return ganeti1025 to service and close this task as a dup

Fri, Nov 1, 2:27 PM · Infrastructure-Foundations, netops, SRE
CDanis renamed T348730: repeated Ganeti VMs deadlocks due to DRBD bug on bullseye from DRBD kernel error on ganeti2031 led to kernel hang to repeated Ganeti VMs deadlocks due to DRBD bug on bullseye.
Fri, Nov 1, 1:57 PM · Infrastructure-Foundations, SRE

Thu, Oct 31

CDanis updated the task description for T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Thu, Oct 31, 4:22 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
CDanis created T378742: aux-k8s: eqiad expansion, codfw creation, & future hopes and dreams.
Thu, Oct 31, 4:20 PM · Infrastructure-Foundations, SRE, Kubernetes, Epic
CDanis added a subtask for T321211: distributed tracing v1: tech debt blockers: T329657: Fix naming of the aux-k8s discovery profile in PKI.
Thu, Oct 31, 4:12 PM · Observability-Tracing, Epic
CDanis added a parent task for T329657: Fix naming of the aux-k8s discovery profile in PKI: T321211: distributed tracing v1: tech debt blockers.
Thu, Oct 31, 4:12 PM · serviceops-radar, Observability-Tracing
CDanis added a subtask for T321211: distributed tracing v1: tech debt blockers: T358189: aux-k8s cluster prometheus setup is incomplete.
Thu, Oct 31, 4:11 PM · Observability-Tracing, Epic
CDanis added a parent task for T358189: aux-k8s cluster prometheus setup is incomplete: T321211: distributed tracing v1: tech debt blockers.
Thu, Oct 31, 4:11 PM · Infrastructure-Foundations, Observability-Tracing
CDanis added a subtask for T321211: distributed tracing v1: tech debt blockers: T345894: Improve jaeger ingress deployment .
Thu, Oct 31, 4:11 PM · Observability-Tracing, Epic
CDanis added a parent task for T345894: Improve jaeger ingress deployment : T321211: distributed tracing v1: tech debt blockers.
Thu, Oct 31, 4:11 PM · User-fgiunchedi, Observability-Tracing

Mon, Oct 28

CDanis edited projects for T378385: Spike in JobQueue job backlog time (500ms -> 4-8 minutes), added: serviceops; removed SRE, observability.
Mon, Oct 28, 5:20 PM · FlaggedRevs, serviceops, WMF-JobQueue
CDanis added a comment to T376948: Write `chart-renderer` Helm chart.

I think you should choose generic-application and add on istio and service-mesh to get started.

Mon, Oct 28, 5:16 PM · Charts (Sprint 9)

Oct 25 2024

CDanis added a comment to T348763: Make eventstreams-internal available to WMF staff without an ssh tunnel.
  • add an authenticating reverse proxy using envoy or some other kind of service.

At first glance, the envoy based solution looks pretty neat and tidy, given that we already have envoy installed in every pod.

Oct 25 2024, 2:38 PM · Event-Platform, Data-Engineering, serviceops

Oct 24 2024

CDanis renamed T378076: Parsercache issues in codfw causing large-scale outage from Error: 502, Broken pipe via cp6013.drmrs.wmnet to Parsercache issues in codfw causing large-scale outage.
Oct 24 2024, 1:10 PM · SRE, DBA, Wikimedia-production-error

Oct 22 2024

CDanis created P70545 (An Untitled Masterwork).
Oct 22 2024, 6:24 PM

Oct 21 2024

CDanis updated the task description for T372081: Deploy Chart service in production.
Oct 21 2024, 8:06 PM · Charts (Sprint 10), serviceops-radar
CDanis closed T377595: Prometheus export of basic request success & latency metrics as Resolved.

Yep!

Oct 21 2024, 7:40 PM · Charts (Sprint 9)
CDanis closed T377595: Prometheus export of basic request success & latency metrics, a subtask of T372081: Deploy Chart service in production, as Resolved.
Oct 21 2024, 7:40 PM · Charts (Sprint 10), serviceops-radar
CDanis added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

21.75.192.10.in-addr.arpa. 5 IN PTR 10-192-75-21.eventstreams-production-tls-service.eventstreams.svc.cluster.local.
This is actually "eventstreams-production" on the staging-codfw cluster. Impossible to tell from this DNS response.

This is badly named.

Oct 21 2024, 3:19 PM · Kubernetes
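
For illustration, both names in the PTR record quoted above can be derived from the pod IP. The following is a minimal Python sketch, not the cluster's actual implementation: the helper names `reverse_name` and `pod_ptr_target` are hypothetical, and the `<dashed-ip>.<service>.<namespace>.svc.<cluster domain>` layout is inferred only from the single record shown in the comment. It handles IPv4 addresses only.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Return the fully qualified in-addr.arpa name queried for an IPv4 address."""
    return ipaddress.ip_address(ip).reverse_pointer + "."


def pod_ptr_target(ip: str, service: str, namespace: str,
                   cluster_domain: str = "cluster.local") -> str:
    """Build a PTR target shaped like the record quoted in the comment.

    Assumption: the layout is inferred from that one example record.
    """
    dashed = ip.replace(".", "-")
    return f"{dashed}.{service}.{namespace}.svc.{cluster_domain}."


if __name__ == "__main__":
    ip = "10.192.75.21"
    print(reverse_name(ip))
    print(pod_ptr_target(ip, "eventstreams-production-tls-service", "eventstreams"))
```

Nothing in either name identifies the cluster, which is the ambiguity the comment points out: the same service and namespace on another cluster would yield a target of the same shape.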
CDanis added a comment to T376949: UEFI and software RAID.

Jesse, not sure how much things have changed since T215183: Redundant bootloaders for software RAID but there's at least a partman recipe in there that used to work.

Oct 21 2024, 2:23 PM · Patch-For-Review, Infrastructure-Foundations
CDanis renamed T376762: Remove `.cluster.local.` suffix in PTR responses from CoreDNS upgrade so we can rewrite `.cluster.local.` suffix in PTR responses to Remove `.cluster.local.` suffix in PTR responses.
Oct 21 2024, 1:51 PM · Kubernetes
CDanis closed T344171: Reverse DNS for k8s pods IPs as Resolved.
root@db1169:~# ss -tr | grep :mysql
ESTAB 0      945     db1169.eqiad.wmnet:mysql   10-67-163-233.mediawiki.mw-api-ext.svc.cluster.local:57846            
ESTAB 0      0       db1169.eqiad.wmnet:mysql       10-67-163-180.mediawiki.mw-web.svc.cluster.local:54236            
ESTAB 0      0       db1169.eqiad.wmnet:mysql        10-67-172-94.mediawiki.mw-web.svc.cluster.local:40890            
ESTAB 0      0       db1169.eqiad.wmnet:mysql   10-67-163-151.mediawiki.mw-api-ext.svc.cluster.local:42186            
ESTAB 0      0       db1169.eqiad.wmnet:mysql   10-67-133-187.mediawiki.mw-api-int.svc.cluster.local:57268            
ESTAB 0      674     db1169.eqiad.wmnet:mysql   10-67-190-118.mediawiki.mw-api-ext.svc.cluster.local:42074            
ESTAB 0      11      db1169.eqiad.wmnet:mysql    10-67-158-92.mediawiki.mw-api-ext.svc.cluster.local:33186            
ESTAB 0      753     db1169.eqiad.wmnet:mysql       10-67-172-220.mediawiki.mw-web.svc.cluster.local:60126            
ESTAB 0      0       db1169.eqiad.wmnet:mysql       10-67-189-105.mediawiki.mw-web.svc.cluster.local:32812            
ESTAB 0      673     db1169.eqiad.wmnet:mysql        10-67-167-12.mediawiki.mw-web.svc.cluster.local:35374
Oct 21 2024, 1:46 PM · Patch-For-Review, Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis closed T344171: Reverse DNS for k8s pods IPs, a subtask of T372943: In the aftermath of T370304: Brainstorming of short- and medium-term observability / quality-of-life production changes, as Resolved.
Oct 21 2024, 1:43 PM · Sustainability (Incident Followup), MediaWiki-Platform-Team (Radar), serviceops, DBA, SRE
CDanis added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

Should we attempt this on a cluster or two? One of the stagings, and then perhaps aux?

Oct 21 2024, 1:40 PM · Kubernetes

Oct 18 2024

CDanis added a comment to T377595: Prometheus export of basic request success & latency metrics.

works on beta cluster as well:

Oct 18 2024, 6:23 PM · Charts (Sprint 9)
CDanis added a comment to T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model.

@pmiazga @mszabo will either one of you have some time soon to at least do #3 above? That's the one I know the least about.

Oct 18 2024, 5:45 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing
CDanis added a comment to T377595: Prometheus export of basic request success & latency metrics.

💙cdanis@wmftop ~/work/gits/chart-renderer 🕚☕ curl localhost:9100

# HELP nodejs_process_heap_rss_bytes process heap usage
# TYPE nodejs_process_heap_rss_bytes gauge
nodejs_process_heap_rss_bytes{service="chart-service"} 69545984

# HELP nodejs_process_heap_used_bytes process heap usage
# TYPE nodejs_process_heap_used_bytes gauge
nodejs_process_heap_used_bytes{service="chart-service"} 13865760

# HELP nodejs_process_heap_total_bytes process heap usage
# TYPE nodejs_process_heap_total_bytes gauge
nodejs_process_heap_total_bytes{service="chart-service"} 21782528

# HELP chart_renderer_duration_milliseconds Duration of chart_renderer in milliseconds
# TYPE chart_renderer_duration_milliseconds histogram
chart_renderer_duration_milliseconds_bucket{le="1",path="/v1/chart/render",method="POST",status_code="200"} 0
chart_renderer_duration_milliseconds_bucket{le="5",path="/v1/chart/render",method="POST",status_code="200"} 0
chart_renderer_duration_milliseconds_bucket{le="10",path="/v1/chart/render",method="POST",status_code="200"} 0
chart_renderer_duration_milliseconds_bucket{le="25",path="/v1/chart/render",method="POST",status_code="200"} 0
chart_renderer_duration_milliseconds_bucket{le="50",path="/v1/chart/render",method="POST",status_code="200"} 5
chart_renderer_duration_milliseconds_bucket{le="100",path="/v1/chart/render",method="POST",status_code="200"} 8
chart_renderer_duration_milliseconds_bucket{le="200",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="500",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="1000",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="2000",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="10000",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="+Inf",path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_sum{path="/v1/chart/render",method="POST",status_code="200"} 572
chart_renderer_duration_milliseconds_count{path="/v1/chart/render",method="POST",status_code="200"} 9
chart_renderer_duration_milliseconds_bucket{le="1",path="/v1/chart/render",method="GET",status_code="500"} 0
chart_renderer_duration_milliseconds_bucket{le="5",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="10",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="25",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="50",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="100",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="200",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="500",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="1000",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="2000",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="10000",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_bucket{le="+Inf",path="/v1/chart/render",method="GET",status_code="500"} 1
chart_renderer_duration_milliseconds_sum{path="/v1/chart/render",method="GET",status_code="500"} 5
chart_renderer_duration_milliseconds_count{path="/v1/chart/render",method="GET",status_code="500"} 1

Oct 18 2024, 3:41 PM · Charts (Sprint 9)
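
As a reading aid for the histogram output above: dividing a series' `_sum` by its `_count` gives the mean request duration. The sketch below is a hypothetical helper, not part of chart-renderer. It uses a deliberately simplified regular expression that only handles labelled sample lines like the ones shown, and it is not a full parser for the Prometheus text exposition format. The sample lines are copied from the POST / 200 series in that output.

```python
import re

# A few lines copied from the POST / 200 series in the output above.
SAMPLE = """\
chart_renderer_duration_milliseconds_bucket{le="50",path="/v1/chart/render",method="POST",status_code="200"} 5
chart_renderer_duration_milliseconds_bucket{le="100",path="/v1/chart/render",method="POST",status_code="200"} 8
chart_renderer_duration_milliseconds_sum{path="/v1/chart/render",method="POST",status_code="200"} 572
chart_renderer_duration_milliseconds_count{path="/v1/chart/render",method="POST",status_code="200"} 9
"""

# Simplified patterns: metric name, {labels}, value. Unlabelled samples are skipped.
LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def mean_duration_ms(samples, method, status_code):
    """Mean duration for one method/status pair, or None if no data is present."""
    total = count = None
    for name, labels, value in samples:
        if labels.get("method") != method or labels.get("status_code") != status_code:
            continue
        if name.endswith("_sum"):
            total = value
        elif name.endswith("_count"):
            count = value
    if total is None or not count:
        return None
    return total / count


if __name__ == "__main__":
    print(mean_duration_ms(parse_samples(SAMPLE), "POST", "200"))
```

For the POST series this is 572 ms over 9 requests, a mean of roughly 64 ms, which is consistent with 8 of the 9 requests falling in the `le="100"` bucket.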
CDanis updated the title for P70353 Demo of T377595 from untitled to Demo of T377595.
Oct 18 2024, 3:40 PM
CDanis created P70353 Demo of T377595.
Oct 18 2024, 3:17 PM
CDanis created T377595: Prometheus export of basic request success & latency metrics.
Oct 18 2024, 3:17 PM · Charts (Sprint 9)

Oct 17 2024

CDanis updated the task description for T372081: Deploy Chart service in production.
Oct 17 2024, 7:42 PM · Charts (Sprint 10), serviceops-radar
CDanis updated the task description for T372081: Deploy Chart service in production.
Oct 17 2024, 7:41 PM · Charts (Sprint 10), serviceops-radar
CDanis added a comment to T372081: Deploy Chart service in production.

What are the expectations regarding traffic here? Could this use Ingress instead of LVS?

Oct 17 2024, 11:02 AM · Charts (Sprint 10), serviceops-radar

Oct 16 2024

CDanis moved T376939: New Service Request: chart-renderer from Backlog to Tracking on the Charts board.
Oct 16 2024, 6:46 PM · Infrastructure-Foundations, Charts, Service-deployment-requests, Services, SRE

Oct 15 2024

CDanis added a comment to T340552: Implement and wire-up minimal OpenTelemetry tracing client compatible with OTEL data model.

@CDanis @Krinkle - what do you think would be the best areas to instrument? IMHO we should wrap entire request with a root span and instrument Database calls and the HTTPFactory. We could instrument the Parser too. For the first run we should keep it small.

Oct 15 2024, 3:34 PM · MediaWiki-Platform-Team, Patch-For-Review, Wikimedia-Hackathon-2024, MediaWiki-libs-HTTP, Observability-Tracing

Oct 11 2024

CDanis added a comment to T347430: [Data Platform] Install a Prometheus connector for Presto, pointed at thanos-query.

@Ahoelzl why was this moved to "Radar (External Teams)" column? Per @BTullis's post, I think this was awaiting DE approval before DE would work on it...?

Oct 11 2024, 1:57 PM · Data-Engineering, Data-Platform-SRE, SRE Observability

Oct 10 2024

CDanis updated the task description for T372081: Deploy Chart service in production.
Oct 10 2024, 8:08 PM · Charts (Sprint 10), serviceops-radar
CDanis created T376948: Write `chart-renderer` Helm chart.
Oct 10 2024, 8:07 PM · Charts (Sprint 9)
CDanis renamed T372081: Deploy Chart service in production from Deploy Chart service in production (placeholder, not actionable yet) to Deploy Chart service in production.
Oct 10 2024, 7:54 PM · Charts (Sprint 10), serviceops-radar
CDanis created T376939: New Service Request: chart-renderer.
Oct 10 2024, 7:24 PM · Infrastructure-Foundations, Charts, Service-deployment-requests, Services, SRE
CDanis closed T285569: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display as Resolved.

In practice the very basic alerting from systemd unit failures has been enough for every statograph issue so far.

Oct 10 2024, 6:40 PM · User-jbond, SRE-OnFire, observability, SRE
CDanis closed T285569: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display, a subtask of T202061: Implement an accurate and easy to understand status page for all wikis, as Resolved.
Oct 10 2024, 6:38 PM · Incident Tooling, SRE
CDanis added a comment to T376795: mwscript-k8s creates too many resources.

Turns out the object counts are already in Prometheus. Here's a quick plot on a dashboard: https://grafana.wikimedia.org/goto/u3dyc3kHg?orgId=1

Oct 10 2024, 4:34 PM · SRE-OnFire, Patch-For-Review, Sustainability (Incident Followup), serviceops, MW-on-K8s
CDanis created T376904: Upgrade to Jaeger v1.62.0.
Oct 10 2024, 2:08 PM · Patch-For-Review, Observability-Tracing
CDanis added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

As for the rewrite part, we are currently setting cluster.local via hiera (it is also the default). There is a discussion we can have here about whether we want that setting to be the same across all clusters or whether it makes sense to have it per cluster. We probably should have had this conversation a bit earlier before we started using it more extensively, but it's probably not too late. Does it make sense to make this unique and per cluster? This would mean some deployments will need overrides per cluster in helmfile.d, but that's probably not too bad?

Oct 10 2024, 1:25 PM · Kubernetes

Oct 9 2024

CDanis added a comment to T376267: ☂ Wikitech account linking and SUL error reporting .

@CDanis, I renamed your account, can you try doing password reset with "CDanis (WMF)" again?

Oct 9 2024, 2:44 PM · wikitech.wikimedia.org
CDanis added a comment to T376267: ☂ Wikitech account linking and SUL error reporting .
Wikitech account/LDAP: CDanis
SUL account: CDanis (WMF)
Account linked on IDM: IDK! IDM serves me a 500 Internal Server Error when logged in
I have visited MediaWiki:Loginprompt: Y
I have tried to reset my password using Special:PasswordReset: Y
Oct 9 2024, 2:39 PM · wikitech.wikimedia.org
CDanis added a comment to T376762: Remove `.cluster.local.` suffix in PTR responses.

I'm not sure though what implications the rewrite might have for cluster internal clients (where .cluster.local is perfectly valid and recommended to use). Can we make it so that the rewrite only happens for external clients?

Oct 9 2024, 12:29 PM · Kubernetes

Oct 8 2024

CDanis added a comment to T344171: Reverse DNS for k8s pods IPs.
Works in prod now:
Oct 8 2024, 9:39 PM · Patch-For-Review, Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis added a subtask for T344171: Reverse DNS for k8s pods IPs: T376762: Remove `.cluster.local.` suffix in PTR responses.
Oct 8 2024, 9:36 PM · Patch-For-Review, Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis added a parent task for T376762: Remove `.cluster.local.` suffix in PTR responses: T344171: Reverse DNS for k8s pods IPs.
Oct 8 2024, 9:36 PM · Kubernetes
CDanis created T376762: Remove `.cluster.local.` suffix in PTR responses.
Oct 8 2024, 9:35 PM · Kubernetes

Oct 2 2024

CDanis added a comment to T344171: Reverse DNS for k8s pods IPs.

OK, one weird issue I've found which is confounding but not fatal: the NodePort isn't working on v6.

Oct 2 2024, 4:06 PM · Patch-For-Review, Traffic, serviceops, Prod-Kubernetes, Kubernetes
CDanis added a comment to T376291: Authdns: automate reverse DNS zone delegation for k8s pod IP ranges.

Is there a plan to try to get away from the very long hardcoded lists in hiera?

Oct 2 2024, 3:50 PM · Patch-For-Review, Traffic, Infrastructure-Foundations, SRE

Sep 30 2024

CDanis added a comment to T354169: Evaluate usage of Kubernetes/Wikikube Tags in netbox and replace them with something if possible.

A tag or custom field or some other machine-parsable data that indicated the name of the k8s cluster associated with each prefix would be suuuuper helpful for programmatically generating the delegations for T344171.

Sep 30 2024, 6:06 PM · Infrastructure-Foundations, netbox
CDanis added a project to T373054: Use OpenTelemetry to trace the behavior of hook handlers: Observability-Tracing.
Sep 30 2024, 5:51 PM · Observability-Tracing, FY2024-25 KR 5.2 Simplify feature development

Sep 27 2024

CDanis renamed T373527: puppetserver* thrashing and requiring a power cycle as a result from puppetserver1002 thrashing and requiring a power cycle as a result to puppetserver* thrashing and requiring a power cycle as a result.
Sep 27 2024, 2:25 PM · User-Elukey, Infrastructure-Foundations, SRE

Sep 25 2024

CDanis updated subscribers of T344171: Reverse DNS for k8s pods IPs.

Janis agrees with me that this use case seems like a great excuse to experiment with externalIPs.

Sep 25 2024, 8:04 PM · Patch-For-Review, Traffic, serviceops, Prod-Kubernetes, Kubernetes