In the manual evaluation of the link recommendation (T278864), it was reported that many of the link recommendations for viwiki were wrong (T278864#6961431). The reason seems to be that the link recommendation algorithm cannot distinguish words that differ only in tone. For example, for the page Hải Linh the first suggested link is the following (it was evaluated as wrong):
context_after " cho sự ph"
context_before " đóng góp "
link_index 0
link_target "Quản Trọng"
link_text "quan trọng"
match_index 0
score 0.5204381346702576
wikitext_offset
The selected anchor text in this case is "quan trọng"; however, the correct anchor text for a link to this target would be "quản trọng".
The reason seems to be a character-set/collation issue in the MySQL table of the anchor dictionary. Queries for the two different words return the same row, suggesting that the string comparison on the lookup column does not distinguish the different tones.
MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quan trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X Quản TrọngqK1s. |
+--------+--------------------------------+
1 row in set (0.001 sec)

MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quản trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X Quản TrọngqK1s. |
+--------+--------------------------------+
1 row in set (0.001 sec)
One possible solution is to choose a different collation such as utf8_bin (instead of the default collation of the utf8 character set), see here.
This would make the query compare strings exactly, so it could distinguish between the two words with different tones. As a result, the model would not suggest the link shown in the example above.
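To illustrate why the two lookups collide, here is a minimal Python sketch. It is not the MySQL implementation: `accent_insensitive_key` is a hypothetical helper that only approximates an accent- and case-insensitive comparison by stripping combining marks, and `binary_key` stands in for a byte-wise comparison of the kind utf8_bin performs.

```python
import unicodedata


def accent_insensitive_key(text: str) -> str:
    """Approximate an accent- and case-insensitive comparison key.

    Assumption: this only mimics the effect seen in the MySQL query;
    it is not the actual collation algorithm.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (tone and vowel diacritics) after decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def binary_key(text: str) -> bytes:
    """Byte-wise key: strings match only if their UTF-8 bytes are identical."""
    return text.encode("utf-8")


plain = "quan trọng"  # anchor text found in the article
toned = "quản trọng"  # anchor text stored in the dictionary

same_insensitive = accent_insensitive_key(plain) == accent_insensitive_key(toned)
same_binary = binary_key(plain) == binary_key(toned)
print(same_insensitive, same_binary)
```

Under the accent-insensitive key both strings reduce to the same value, which matches the behaviour of the two queries above; under the byte-wise key they stay distinct.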
This problem seems to occur only in the MySQL tables; I could not reproduce this behaviour with the corresponding files in pickle or SQLite format used in the local evaluation of the backtesting dataset.
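A plausible explanation is that both a Python dict (as loaded from a pickle file) and SQLite's default BINARY collation compare strings exactly. The sketch below checks this with an in-memory SQLite database; the table name `anchors`, its columns, and the stored values are hypothetical and do not reflect the actual schema of the anchor dictionary.

```python
import sqlite3

# Hypothetical stand-in for the anchor dictionary (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anchors (lookup TEXT PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO anchors VALUES (?, ?)", ("quản trọng", "Quản Trọng"))


def find(lookup: str) -> list:
    """Return all rows whose lookup column equals the given string."""
    cursor = conn.execute("SELECT value FROM anchors WHERE lookup = ?", (lookup,))
    return cursor.fetchall()


toned_rows = find("quản trọng")  # exact match on the stored anchor
plain_rows = find("quan trọng")  # same letters, different tone

# A dict keyed by anchor text behaves the same way: keys must match exactly.
anchors_dict = {"quản trọng": "Quản Trọng"}
plain_in_dict = "quan trọng" in anchors_dict

print(toned_rows, plain_rows, plain_in_dict)
```

Here only the exactly matching anchor returns a row, so the wrong link from the example would not be suggested with these storage formats.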