Using the tool built for this purpose in T316149, conduct the following manual evaluation of section-level image suggestions:
- Evaluate results in English, Portuguese, Indonesian, Russian, Arabic, Czech, Bengali, French and Spanish Wikipedias
- Evaluate 500 random section-level image suggestions across 500 different random articles, per wiki
- Ambassadors will need to count and keep track of how many suggestions they have evaluated in their language -- the tool will not capture that.
- For each suggestion in each unillustrated article, manually decide whether the match is good, okay, or bad. Evaluators may also choose "unsure" if they're not confident in their selection.
- General comments or questions during evaluation can be posted as comments in this ticket.
The estimated time of work for manual evaluation is 3 hours for the 500 images. However, if the evaluation is not finished within 3 hours, please leave a comment.
- As a result of the evaluation done in this ticket, product managers will determine what confidence scores to move forward with and if any algorithm changes are necessary.
- We will also evaluate what percentage of bad matches comes from section alignment, and from which wikis specifically, so we can decide whether a next step is needed here
- We will also evaluate whether we want to exclude non-.jpg images from the production dataset in order to prevent icons from being suggested
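If we do decide to exclude non-.jpg images, the rule could be as simple as the sketch below. This is only an illustration of the idea, not production code; the field names (`article`, `section`, `image`) are hypothetical and do not reflect the actual suggestion schema.

```python
# Sketch: keep only .jpg/.jpeg suggestions so that icons (typically
# .svg or .png files) are filtered out before evaluation.

def filter_jpg_suggestions(suggestions):
    """Keep suggestions whose image filename ends in .jpg or .jpeg."""
    jpg_exts = (".jpg", ".jpeg")
    return [s for s in suggestions if s["image"].lower().endswith(jpg_exts)]

# Hypothetical example records:
suggestions = [
    {"article": "Lion", "section": "Habitat", "image": "Lion_savanna.JPG"},
    {"article": "Lion", "section": "Taxonomy", "image": "Info_icon.svg"},
    {"article": "Lion", "section": "Diet", "image": "Prey_chart.png"},
]

kept = filter_jpg_suggestions(suggestions)
# Only the .JPG photo survives; the .svg icon and .png chart are dropped.
```

Note that the extension check is case-insensitive, since Commons filenames use both `.jpg` and `.JPG`.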
Evaluation note
- We first want to evaluate small datasets iteratively and internally within the SD & Research and PM teams before involving ambassadors