Slow revscoring, started logging queries on the pod side, so that is gone when the pod is killed.
Answer "Is there a reason we are not logging the query into logstash?"

Aug 27 2024, 2:18 PM · Goal, Machine-Learning-Team

calbon added a comment to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU..

machines are racked but not set up. Will set up one first to figure out disk layout and then the other one. Then will release to the research team

Aug 27 2024, 2:15 PM · Data-Platform-SRE, Goal, Machine-Learning-Team

calbon added a comment to T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..

GPU hosts are racked but not set up yet
Software side slower

Aug 27 2024, 2:11 PM · Goal, Machine-Learning-Team

Aug 13 2024

calbon added a comment to T371398: Goal 4: Support product teams in deploying production models..

Update

Modernized recommendation API has been deployed to production
API gateway setup underway
Article quality LA: Ready on staging and want to bring it into production. Should we group models into common namespaces? Suggestion: create namespaces per area where the model is used: articles, revisions, images, etc.

Aug 13 2024, 2:56 PM · Goal, Machine-Learning-Team

calbon added a comment to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU..

Update:

Waiting for ml-lab machines to be delivered to the eqiad data center.

Aug 13 2024, 2:35 PM · Data-Platform-SRE, Goal, Machine-Learning-Team

calbon renamed T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production. from Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that uses an inference optimization engine in production. to Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..

Aug 13 2024, 2:33 PM · Goal, Machine-Learning-Team

calbon added a comment to T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..

Infra

Setting up the puppet roles
Can't commit puppet roles until the machines are there
Reached out to vendor

Aug 13 2024, 2:32 PM · Goal, Machine-Learning-Team

Jul 31 2024

calbon added a project to T371398: Goal 4: Support product teams in deploying production models.: Goal.

Jul 31 2024, 3:15 PM · Goal, Machine-Learning-Team

calbon added a project to T371397: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.: Goal.

Jul 31 2024, 3:14 PM · Goal, Machine-Learning-Team

calbon added a project to T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU.: Goal.

Jul 31 2024, 3:14 PM · Data-Platform-SRE, Goal, Machine-Learning-Team

calbon added a project to T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production.: Goal.

Jul 31 2024, 3:14 PM · Goal, Machine-Learning-Team

Jul 30 2024

calbon moved T369712: Request to update Readability model on Lift Wing from Unsorted to Ready To Go on the Machine-Learning-Team board.

Jul 30 2024, 2:56 PM · Lift-Wing, Machine-Learning-Team

calbon assigned T369712: Request to update Readability model on Lift Wing to AikoChou.

Jul 30 2024, 2:56 PM · Lift-Wing, Machine-Learning-Team

calbon created T371398: Goal 4: Support product teams in deploying production models..

Jul 30 2024, 2:28 PM · Goal, Machine-Learning-Team

calbon created T371397: Goal 3: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services..

Jul 30 2024, 2:28 PM · Goal, Machine-Learning-Team

calbon created T371396: Goal 2: People outside the ML team can ssh into an ml-lab machine, run a Jupyter Notebook, and run PyTorch powered by a GPU..

Jul 30 2024, 2:28 PM · Data-Platform-SRE, Goal, Machine-Learning-Team

calbon created T371395: Goal 1: Non-technical users can make a request to a Hugging Face Large Language Model that is fast in production..

Jul 30 2024, 2:28 PM · Goal, Machine-Learning-Team

Jun 18 2024

calbon moved T366528: Deployment of model updates from Unsorted to Backlog/Lift Wing on the Machine-Learning-Team board.

Jun 18 2024, 2:55 PM · Research-engineering, Machine-Learning-Team, Research

calbon assigned T366772: Solve revscoring models increased latencies for big revision sizes to AikoChou.

Jun 18 2024, 2:55 PM · Machine-Learning-Team

calbon reassigned T367293: Update blubber version in docker images from klausman to isarantopoulos.

Jun 18 2024, 2:54 PM · Machine-Learning-Team

calbon assigned T367293: Update blubber version in docker images to klausman.

Jun 18 2024, 2:53 PM · Machine-Learning-Team

calbon assigned T367537: Cloud VPS "machine-learning" project Buster deprecation to klausman.

Jun 18 2024, 2:50 PM · Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation)

calbon moved T367537: Cloud VPS "machine-learning" project Buster deprecation from Unsorted to Backlog/SRE on the Machine-Learning-Team board.

Jun 18 2024, 2:50 PM · Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation)

calbon moved T367562: Cloud VPS "wikilabels" project Buster deprecation from Unsorted to Watching on the Machine-Learning-Team board.

Jun 18 2024, 2:49 PM · Wikilabels, Machine-Learning-Team, Cloud-VPS (Debian Buster Deprecation)

calbon moved T367875: Reimage all ml-serve machines with Bookworm from Unsorted to Backlog/SRE on the Machine-Learning-Team board.

Jun 18 2024, 2:46 PM · Machine-Learning-Team

May 21 2024

calbon added a comment to T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models.

People can now pip install and use models. Right now we only have a few models - the number of models should increase over time.

May 21 2024, 2:49 PM · Goal, Machine-Learning-Team

calbon moved T363505: Pass the maximum number of uploads to the logo detection service from Unsorted to Watching on the Machine-Learning-Team board.

May 21 2024, 2:48 PM · Machine-Learning-Team, Structured-Data-Backlog

calbon moved T364089: Have problem with migrating to LiftWing from ores from Unsorted to Watching on the Machine-Learning-Team board.

May 21 2024, 2:48 PM · Machine-Learning-Team

calbon assigned T363505: Pass the maximum number of uploads to the logo detection service to kevinbazira.

May 21 2024, 2:47 PM · Machine-Learning-Team, Structured-Data-Backlog

calbon assigned T364089: Have problem with migrating to LiftWing from ores to isarantopoulos.

May 21 2024, 2:46 PM · Machine-Learning-Team

calbon moved T365226: Investigate a way to return other 2xx status code from predict in kserve from Unsorted to Backlog/Other on the Machine-Learning-Team board.

May 21 2024, 2:45 PM · Machine-Learning-Team

calbon assigned T365226: Investigate a way to return other 2xx status code from predict in kserve to achou.

May 21 2024, 2:44 PM · Machine-Learning-Team

calbon moved T365166: Update Pytorch base image to 2.3.0 from Unsorted to Ready To Go on the Machine-Learning-Team board.

May 21 2024, 2:34 PM · Machine-Learning-Team

calbon moved T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) from Unsorted to Ready To Go on the Machine-Learning-Team board.

May 21 2024, 2:34 PM · Machine-Learning-Team

calbon set the point value for T365246: Upgrade Huggingface image to kserve 0.13-rc0 (torch 2.3.0 ROCm 6.0) to 1.

May 21 2024, 2:33 PM · Machine-Learning-Team

calbon set the point value for T365166: Update Pytorch base image to 2.3.0 to 1.

May 21 2024, 2:33 PM · Machine-Learning-Team

calbon assigned T365253: Allow Kubernetes workers to be deployed on Bookworm to elukey.

May 21 2024, 2:32 PM · Machine-Learning-Team, serviceops, Kubernetes

calbon moved T365253: Allow Kubernetes workers to be deployed on Bookworm from Unsorted to Ready To Go on the Machine-Learning-Team board.

May 21 2024, 2:32 PM · Machine-Learning-Team, serviceops, Kubernetes

calbon set the point value for T365253: Allow Kubernetes workers to be deployed on Bookworm to 3.

May 21 2024, 2:31 PM · Machine-Learning-Team, serviceops, Kubernetes

calbon moved T365291: ml-serve2002 memory errors on DIMM_B1 from Unsorted to Watching on the Machine-Learning-Team board.

May 21 2024, 2:29 PM · SRE, Machine-Learning-Team, ops-codfw, DC-Ops

calbon moved T365439: Investigate why article-descriptions LiftWing API returns 404 when encoded colon is used in request URL from Unsorted to Watching on the Machine-Learning-Team board.

May 21 2024, 2:25 PM · Machine-Learning-Team

calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.

Calico improvements makes the whole workflow more streamlived
Improve our incident response procedure
Investigate CPU spikes

May 21 2024, 2:18 PM · Goal, Machine-Learning-Team

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Still can't use GPU with ROCm. But we figured out what the bug is - if the control version is upgraded to Bookworm it will be fixed.
Next step is to upgrade ml-staging to Bookworm then test.
Working on upgrading HF with newer versions with ROCm 6.0. Tested them and they work and will be posting watch.
Goal is to utilize GPU so we can deploy models from HuggingFace.

May 21 2024, 2:16 PM · Goal, Machine-Learning-Team

calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Trying to fix up a Calico networking issue in Kubernetes
- After credentials, will send patched revert risk server to ml-staging

May 21 2024, 2:07 PM · Goal, Machine-Learning-Team

May 7 2024

calbon placed T360455: Add Article Quality Model to LiftWing up for grabs.

May 7 2024, 2:24 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

calbon assigned T360455: Add Article Quality Model to LiftWing to kevinbazira.

May 7 2024, 2:23 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.

Narrowed down cause of symptoms of spike in CPU usage to feature extraction in revscoring isvc. Might be caused by some specific revids.

May 7 2024, 2:19 PM · Goal, Machine-Learning-Team

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Wait for vendor (Supermicro) to finalize order of 2x for ml-staging.
- Chris's guess is ml-staging installed at end of quarter

May 7 2024, 2:10 PM · Goal, Machine-Learning-Team

calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Working on plumbing on staging, should be done within week
- Feeling good about it

May 7 2024, 2:08 PM · Goal, Machine-Learning-Team

Apr 30 2024

calbon added a comment to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.

Logging queries and logging when things are slow is the short term goal. Knowing WHY a query takes a long time is a future question

Apr 30 2024, 2:22 PM · Goal, Machine-Learning-Team

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

We have a theory that the ROCm drivers on the debian package is not required.

Apr 30 2024, 2:19 PM · Goal, Machine-Learning-Team

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Decision point: Do we upgrade ROCm drivers?

Apr 30 2024, 2:15 PM · Goal, Machine-Learning-Team

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Update: No update

Apr 30 2024, 2:14 PM · Goal, Machine-Learning-Team

calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Rebased code after prototype.
- Waiting for istio change for making a new service, which is imminent
- Need to add new visual service that is tcp

Apr 30 2024, 2:13 PM · Goal, Machine-Learning-Team

Apr 25 2024

calbon moved T360455: Add Article Quality Model to LiftWing from Watching to Unsorted on the Machine-Learning-Team board.

Apr 25 2024, 5:07 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

Apr 23 2024

calbon added a comment to T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

GPU order for the first GPU 2x chassis is close to complete. There are some supply issues with the chassis, so the question is going to be if we want to use an upgraded chassis for the ml-staging server.

Apr 23 2024, 2:25 PM · Goal, Machine-Learning-Team

calbon added a comment to T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Merged puppet machinery to allow network policies to be generated for assorted cluster. So we can automatically generated the network policy without the 60 lines of istio config.
Will merge change to network policy to allow Istio to talk to Cassandra.

Apr 23 2024, 2:18 PM · Goal, Machine-Learning-Team

Apr 16 2024

calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from 2024 Q4: Lift Wing Python Package to 2024 Q4: Users can "pip install liftwing" and access 20% of models.

Apr 16 2024, 2:59 PM · Goal, Machine-Learning-Team

calbon added a project to T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services: Goal.

Apr 16 2024, 2:58 PM · Goal, Machine-Learning-Team

calbon created T362674: 2024 Q4 Goal: Operational Excellence - Improve base monitoring, alerting and logging of Lift Wing services.

Apr 16 2024, 2:57 PM · Goal, Machine-Learning-Team

calbon renamed T359140: 2024 Q4: Users can "pip install liftwing" and access 20% of models from Q4: Lift Wing Python Package to 2024 Q4: Lift Wing Python Package.

Apr 16 2024, 2:57 PM · Goal, Machine-Learning-Team

calbon moved T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from 2023-2024 Q4 Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.

Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team

calbon moved T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from 2023-2024 Q4 Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.

Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team

calbon moved T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from 2023-2024 Q4 Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.

Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team

calbon moved T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from 2023-2024 Q4 Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.

Apr 16 2024, 2:53 PM · Goal, Machine-Learning-Team

calbon moved T353814: Q3 2024 Goal: A plan for a training infrastructure from 2023-2024 Q4 Quarter Goals to Previous Quarter Goals on the Machine-Learning-Team board.

Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team

calbon renamed T348153: Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request. from Goal: Lift Wing users can request multiple predictions using a single request. to Q3 2024 Goal: Lift Wing users can request multiple predictions using a single request..

Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team

calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-language-agnostic to Q3 2024 Goal: Implement caching for revertrisk-language-agnostic.

Apr 16 2024, 2:52 PM · Goal, Machine-Learning-Team

calbon renamed T353337: Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models from Goal: Inference Optimization for Hugging face/Pytorch models to Q3 2024 Goal: Inference Optimization for Hugging face/Pytorch models.

Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team

calbon renamed T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production from Goal: Expand Lift Wing Cluster and add GPU capacity to production to Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .

Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team

calbon renamed T353814: Q3 2024 Goal: A plan for a training infrastructure from Goal: A plan for a training infrastructure to Q3 2024 Goal: A plan for a training infrastructure .

Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team

calbon created T362672: 2024 Q4 Goal: Revert Risk models are supported by caching in production.

Apr 16 2024, 2:51 PM · Goal, Machine-Learning-Team

calbon moved T362671: ------ from 2023-2024 Q4 Quarter Goals to Task Archive on the Machine-Learning-Team board.

Apr 16 2024, 2:46 PM · Machine-Learning-Team

calbon closed T362671: ------ as Declined.

Apr 16 2024, 2:45 PM · Machine-Learning-Team

calbon created T362671: ------.

Apr 16 2024, 2:45 PM · Machine-Learning-Team

calbon created T362670: 2024 Q4 Goal: An HuggingFace 7B LLM is hosted on ml-staging on Lift Wing powered by GPU.

Apr 16 2024, 2:45 PM · Goal, Machine-Learning-Team

Mar 26 2024

calbon renamed T353333: Q3 2024 Goal: Implement caching for revertrisk-language-agnostic from Goal: Implement caching for revertrisk-multilingual to Goal: Implement caching for revertrisk-language-agnostic.

Mar 26 2024, 2:41 PM · Goal, Machine-Learning-Team

calbon added a comment to T353338: Q3 2024 Goal: Expand Lift Wing Cluster and add GPU capacity to production .

At risk because we don't have a GPU in the data centers yet.

Mar 26 2024, 2:40 PM · Goal, Machine-Learning-Team

calbon moved T360455: Add Article Quality Model to LiftWing from Unsorted to Watching on the Machine-Learning-Team board.

Mar 26 2024, 2:35 PM · Patch-For-Review, Content-Transform-Team, Research, Machine-Learning-Team

calbon moved T360593: Create an examples directory in the repository and add a basic README.md from Unsorted to Ready To Go on the Machine-Learning-Team board.

Mar 26 2024, 2:31 PM · Machine-Learning-Team

calbon moved T360637: Bump memory for registry[12]00[34] VMs from Unsorted to Ready To Go on the Machine-Learning-Team board.

Mar 26 2024, 2:27 PM · Patch-For-Review, serviceops, Machine-Learning-Team

calbon moved T360638: Create a Pytorch base image from Unsorted to Ready To Go on the Machine-Learning-Team board.

Mar 26 2024, 2:27 PM · Patch-For-Review, Machine-Learning-Team