Ceph SIG/Meetings/2024-10-15

Agenda

  • Welcome and introductions
  • SIG terms of reference: https://www.mediawiki.org/wiki/Ceph_SIG
    • What is it for?
      • Developing best practices and helping to address shared problems.
      • Helping to understand current and future use cases.
    • Who is it for?
      • Both administrators and users of these storage services.
    • How do we keep in touch?
      • No objection to a mailing list
    • Any special considerations?
      • None
    • Meeting frequency?
      • 1x/month for now?
  • Current Ceph deployments
  • Current challenges
  • Build a list of future topics

Meeting Notes

  • Not recording, just taking notes
  • Discussed terms of reference
  • Monthly seems about right.
  • Introductions
    • Matthew from Data Persistence - runs two small Reef/cephadm clusters focused on S3 - looking to onboard the first user shortly. Lots of experience with multi-PB clusters in the past at scientific establishments. Going to Cephalocon.
    • Chris - SRE in the I/F team - believes WMF may be at the point where the benefits of distributed storage outweigh the downsides. Has experience with Google's Bigtable/Colossus/D stack, and Longhorn on k8s, but not Ceph.
    • David - SRE in the WMCS team - responsible for the WMCS cluster - they are using it for VMs and other items. Plugged into OpenStack, authenticates with Keystone etc. The clusters are a bit old, based on version 15 (Octopus). Some current problems with disks.
    • Fabian - Research engineer. Doesn’t feel that he has a lot of experience with storage infrastructure, but is very interested in how it can be applied to research use cases.
    • Joseph - Data Engineer - won’t be maintaining infrastructure, but will have a lot of input into how it is used by the end users. Not a lot of experience with Ceph itself.
    • Ben - SRE in the Data Platform group - has some experience with small Ceph clusters in the past. Proposed the DPE cluster a little over 2 years ago, focusing on flexibility and integration with k8s as a way of solving some challenges when new requirements arise, e.g. server sprawl. Took time, but the DPE cluster is now almost production-ready. Uses RBD, S3 and now CephFS interfaces.
  • Current Ceph deployments
    • BT: Spoke about the cephosd cluster, naming difficulty, sharing puppet with WMCS. Shared the DPE SRE Learning Circle - Ceph (Ben) slides.
    • MV: Not using WMCS puppet classes, but cephadm.
    • DC: Have you tried to do any upgrades?
    • MV: Only minor releases, but these have been good. cephadm handles doing the daemon restarts in the right order. Not tried a major upgrade yet. We use the WMF monitoring system, so didn’t want to set up the integrated monitoring. Disk replacement is super easy. ceph orch is used for adding/removing OSDs. There is an NVMe drive in each of the disk servers, but each of the OSDs is backed by an HDD.
    • FK: Really interested in seeing how this develops in terms of strategy. There are lots of use cases where data needs to be created in one realm and used in another, e.g. ML work, where data is created in DE, moved to Swift, then used in the ml-serve clusters. Are these ways of working approved, templated, or used in different ways already?
    • MV: Worth noting that the S3 interface is really robust and user-friendly - probably easier to use than CephFS for generic moving-data-around kinds of tasks (see the S3 sketch in the appendix below). There is still a bus factor of 1 for various technologies, such as Cassandra and Swift. If we can get to the point where the Ceph SIG can be called upon to help with incident response etc., that would be great.
    • BT: Would David like to discuss what the issues with disks are?
    • DC: We have had some issues for around a year so far. https://phabricator.wikimedia.org/T348643#9948101
    • DC: Error counters are increasing on different hosts within a single batch, and we are having trouble finding the cause. Dell has been called in to investigate and has sent some disks away for analysis. Still investigating. In the meantime, lots of issues with networking and HA. The usage of the cluster is increasing, and it has got to the point where depooling a rack causes lots of issues. There are 40 Ceph OSD nodes. Looking to reduce the amount of effort required for maintenance. Using packages and puppet like the DPE cluster, but we use cookbooks much more. This was the first Ceph cluster in production at WMF.
    • FK: What is stored in the Ceph cluster in WMCS?
    • DC: The main data is all of the VM hard drives in Horizon. We also offer S3 storage to users, but it’s not used much. Also used for backups etc. Would like to integrate with k8s, but this is difficult because the k8s clusters are built within Horizon projects. If the Ceph cluster goes down, it brings everything down: Quarry, PAWS, tools etc. There is a small cluster in codfw, but it’s still not used much. Have tried to get more people involved (Andrew, Francisco), but it’s still difficult.
    • FK: CloudVPS tooling is great. Would love to have more access to data from WMCS, but the data gap is making certain use cases unfeasible.
    • DC: Yes, separation of networks between WMCS and production causes lots of issues, so we are aware of this problem. Mixing HA with network isolation is difficult.
  • BT: Any future topics?
    • FK: I always have future use cases. Would like to discuss the model weights use case. These get bigger quickly, and there is a 5 GB limit on Swift right now. Would love to discuss how we could support this kind of thing better (see the multipart upload sketch in the appendix below).
    • DC: Are these models public or private?
    • FK: They are sort of public - they are owned and modified - but they are mainly served privately.
    • DC: They could possibly be served from WMCS.
    • FK: These are loaded into pods in ml-serve, then loaded into the GPU, so there will be a lot of data movement.
    • BT: Getting people to understand the different performance tiers of the DPE cluster is pretty tricky.
    • MV: Yes, this is not unique to your cluster. We often have requirements arise where people ask for X amount of Swift storage, even when Swift seems a poor technology choice. No easy answer, but there is a concerted effort to shift left and get involved with requirements early on.
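
Appendix: illustrative sketches

The following is a minimal, hypothetical sketch of the kind of generic data-moving task MV mentions above, using the radosgw S3-compatible API via boto3. The endpoint URL, bucket name and credentials are placeholders for illustration only, not real WMF configuration.

    # Hypothetical sketch: copy a file to and from a Ceph radosgw bucket via its
    # S3-compatible API. Endpoint, bucket and credentials are placeholders only.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://rgw.example.wmnet",   # placeholder radosgw endpoint
        aws_access_key_id="ACCESS_KEY",             # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    # Upload a local file to a bucket, then fetch it back somewhere else.
    s3.upload_file("dataset.parquet", "example-bucket", "datasets/dataset.parquet")
    s3.download_file("example-bucket", "datasets/dataset.parquet", "/tmp/dataset.parquet")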
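
The 5 GB figure FK mentions matches the usual single-object upload limit in Swift (and for a single PUT in the S3 API); larger objects such as model weights are normally split into a multipart upload, which the radosgw S3 API supports. A minimal sketch, again with placeholder names, of asking boto3 to use multipart transfers for large files:

    # Hypothetical sketch: upload a large model-weights file in parts so that no
    # single part exceeds the per-PUT object limit. Names and sizes are placeholders.
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3", endpoint_url="https://rgw.example.wmnet")  # placeholder endpoint

    config = TransferConfig(
        multipart_threshold=1024 * 1024 * 1024,  # switch to multipart above 1 GiB
        multipart_chunksize=512 * 1024 * 1024,   # upload in 512 MiB parts
    )
    s3.upload_file(
        "model-weights.safetensors",             # local file (placeholder name)
        "example-ml-models",                     # bucket (placeholder)
        "models/weights-v1.safetensors",         # object key (placeholder)
        Config=config,
    )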