[go: up one dir, main page]

Page MenuHomePhabricator

Character encoding issues in MySQL anchor dictionaries for viwiki
Closed, ResolvedPublic

Description

In the manual evaluation of the link recommenation T278864 , it was reported that for viwiki many of the link recommendations were wrong T278864#6961431. The reason seems to be that the link recommendation algorithm is not able to distinguish words with different tone. For example for the page Hải Linh the first suggested link is (this was evaluated as wrong):

context_after " cho sự ph"
context_before " đóng góp "
link_index 0
link_target "Quản Trọng"
link_text "quan trọng"
match_index 0
score 0.5204381346702576
wikitext_offset

The selected anchor text in this case is "quan trọng", however, the correct anchor text for the link should actually be "quản trọng".

The reason seems to be a character-encoding issue in the MySQL-table of the anchor-dictionary. A query for the two different words yields the same result, suggesting that it is unable to distinguish the different tones.

MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quan trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X   Quản TrọngqK1s.      |
+--------+--------------------------------+
1 row in set (0.001 sec)MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quản trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X   Quản TrọngqK1s.      |
+--------+--------------------------------+
1 row in set (0.001 sec)

One possible solution is to choose a different character-encoding such as utf8_bin (instead of utf8), see here.
This would allow the query to distinguish between the two words with different tone. As a result, the model would not suggest the link shown as an example above.

This problem seems to only occur in the MySQL-tables and I could not reproduce this behaviour with the corresponding files in pickle- or sqlite-format used in the local evaluation of the backtesting dataset.

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

From a debugging session yesterday:

MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quan trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X   Quản TrọngqK1s.      |
+--------+--------------------------------+
1 row in set (0.001 sec)

MariaDB [addlink]> SET NAMES utf8;
Query OK, 0 rows affected (0.000 sec)

MariaDB [addlink]> SELECT id, value FROM lr_viwiki_anchors WHERE lookup = 'quan trọng';
+--------+--------------------------------+
| id     | value                          |
+--------+--------------------------------+
| 558069 | �}q X   Quản TrọngqK1s.      |
+--------+--------------------------------+
1 row in set (0.000 sec)

Then I converted the table to utf8bin:

MariaDB [addlink]> ALTER TABLE lr_viwiki_anchors CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin;

And that fixed the issue:

MariaDB [addlink]> SELECT id, value, lookup FROM lr_viwiki_anchors WHERE lookup = 'quản trọng';
+--------+--------------------------------+----------------+
| id     | value                          | lookup         |
+--------+--------------------------------+----------------+
| 558069 | �}q X   Quản TrọngqK1s.      | quản trọng     |
+--------+--------------------------------+----------------+
1 row in set (0.002 sec)
MariaDB [addlink]> SELECT id, value, lookup FROM lr_viwiki_anchors WHERE lookup = 'quan trọng';
Empty set (0.000 sec)

Change 676286 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Adjust character set to utf8mb4_bin

https://gerrit.wikimedia.org/r/676286

Apparently case insensitive collations (we use utf8mb4_unicode_ci now) are also accent-insensitive. Modern versions of MySQL / MariaDB allow setting those two flags separately but that's since 8.0 / 10.4 and we are still on 10.1 (which is roughly equivalent to 5.7 I think).

If we don't care about case insensitivity, binary is certainly the best choice for a lookup table as it is easy to reason about and also improves performance somewhat.

Apparently case insensitive collations (we use utf8mb4_unicode_ci now) are also accent-insensitive. Modern versions of MySQL / MariaDB allow setting those two flags separately but that's since 8.0 / 10.4 and we are still on 10.1 (which is roughly equivalent to 5.7 I think).

If we don't care about case insensitivity, binary is certainly the best choice for a lookup table as it is easy to reason about and also improves performance somewhat.

When storing the anchors in the anchor-dictionary they are all lower-cased via python's str.lower()
(see code). When looking up candidate-anchors extracted from an article's wikitext, I do the lookup with the lower-cased string (see code).
From this I think we dont care about case-sensitivity, but do care about accent-sensitivity.

Change 676286 merged by jenkins-bot:

[research/mwaddlink@main] Adjust character set to utf8mb4_bin

https://gerrit.wikimedia.org/r/676286

Change 678820 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[research/mwaddlink@main] Remove ALTER statements for primary key and utf8mb4

https://gerrit.wikimedia.org/r/678820

Change 678820 merged by jenkins-bot:

[research/mwaddlink@main] Remove ALTER statements for primary key and utf8mb4

https://gerrit.wikimedia.org/r/678820

Change 681298 had a related patch set uploaded (by Kosta Harlan; author: Kosta Harlan):

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/681298

Change 681298 merged by jenkins-bot:

[operations/deployment-charts@master] linkrecommendation: Bump version

https://gerrit.wikimedia.org/r/681298

Etonkovidova subscribed.

Based on this @PPham comment in T278864 it seems the issue has been solved.