[25Q1] Performance improvements
Open, Medium, Public
Assigned To

Authored By

	Sharvaniharan
	Jun 19 2024, 12:03 AM

Description

Background

Long function runtimes, particularly for functions that fail due to timeouts, are hurting the user experience on Wikifunctions.

Two factors contribute to this problem:
[1] Lack of control over runtime resources, such as being unable to allocate the memory we need for priority execution. This is beyond our control.
[2] Some functions do a lot of work through many REST-based calls, which slows them down.

This epic focuses on the second issue [2]. It will contain tasks to identify and improve the parts of our function calls that can be done more efficiently.

This task is not for:

Re-architecting code to completely eliminate the use of REST calls.
Measuring the frequency and recency of function calls. These are not measurable with just a backend cache, and a solution might not be feasible as part of this work.
Acting on every possibility the spikes surface. The spikes we run could reveal many possibilities for improvement, and we might not have time to get to all of them as part of this work, but our spike should be thorough.

Approach

We will run a spike to identify potential areas of improvement.
We have identified a few areas that could be improved by restructuring our code as part of this work.
We will implement the initial ideas identified for improving cache management, incorporating the insights gained from our Q4 metrics analysis.
We will continue making cache improvements.

Acceptance Criteria

Based on the spike, we have a full understanding of the underlying issues, both those that are fixable and those that are not feasible to fix, and we have documented them.
We have fixed at least one major area where function execution code was doing more work than required.
Once fixes are in, reach out to 3 users to see if they’ve noticed a change.

Goals & Success Metrics

Fixing one major area where function execution is doing too much work results in X functions running faster.
We identify a set of functions (e.g. deploy tests) which will not change, and measure their runtimes before and after making the performance improvements; we would like to see the average runtime decrease by N%.
We have heard positive feedback about our performance improvements from the users we reached out to.
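
The before/after comparison in the second goal could be computed along these lines. This is only a minimal sketch: the function name `percent_decrease` and the sample runtimes are hypothetical, and it assumes runtimes are collected in seconds for the same fixed set of functions in both runs.

```python
from statistics import mean


def percent_decrease(before: list[float], after: list[float]) -> float:
    """Return the percentage drop in mean runtime (negative if runtimes rose)."""
    if not before or not after:
        raise ValueError("need at least one runtime sample in each list")
    mean_before = mean(before)
    if mean_before <= 0:
        raise ValueError("mean runtime before the change must be positive")
    return (mean_before - mean(after)) / mean_before * 100.0


# Hypothetical runtimes (seconds) for an unchanged set of functions.
before_s = [2.0, 4.0, 6.0]
after_s = [1.5, 3.0, 4.5]
print(f"{percent_decrease(before_s, after_s):.1f}% decrease")  # 25.0% decrease
```

Comparing means over a fixed function set keeps the result tied to the N% target; a median or percentile comparison may be worth adding if a few timeouts dominate the average.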

Related Objects

Status	Assigned	Task
Open	cmassaro	T367933 [25Q1] Performance improvements
Resolved	ecarg	T364413 Improve the logging we're doing in the orchestrator and evaluator to have a better idea of where the slowness is coming from
Resolved	ecarg	T369001 Logging epic nice-to-have: Adding stack trace for every log output
Resolved	ecarg	T369213 Logging epic nice-to-have: add equivalent logs for docker logs
Resolved	ecarg	T369560 Check if message form is from an object type in Logger param
Resolved	cmassaro	T369956 Add logging data when db fetch in orchestrator
Resolved	cmassaro	T369552 Analyze Performance Numbers and Discuss an Appropriate Spike
Resolved	cmassaro	T371837 Performance Spike: Create a Pool of "Hot" Executor Processes to Eliminate wasmedge Startup Costs
In Progress	ecarg	T372847 Implement and add duration times for function calls
Resolved	Jdforrester-WMF	T374737 Add env var to switch on function_orchestrator_function_duration_seconds metrics in Prod