[go: up one dir, main page]

Page MenuHomePhabricator

RFC: Increase the strictness of mediawiki SQL code and leverage database code blockers for scalability
Closed, ResolvedPublic

Description

Introduce the following policies and recommendations

  • Code that touches the database must be compatible with the following MySQL SQL modes:
    • TRADITIONAL (equivalent to STRICT_TRANS_TABLES, STRICT_ALL_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER)
    • ONLY_FULL_GROUP_BY
  • The MediaWiki software must enforce the SQL mode used (once the review period is complete) to the one mentioned above. Extensions must not attempt to change the SESSION value of sql_mode.
  • Unsupported (old) versions of databases must be compatible as the MediaWiki compatibility list indicates, but performance efforts should be concentrated on supported versions of databases and its default or widely recommended configuration parameters
  • All tables must have a primary key. When a candidate for primary key could not be created (for example, if all columns can be repeated), an auto_increment or another arbitrary value, depending on the case, has to be added.
  • Undeterministic queries and unsafe statements for binlog should be avoided as they would return/write different results in a replication environment. The latter can be detected as warnings with the text " [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT". Those include "INSERT ... SELECT" when using an auto_increment key, "UPDATE ... LIMIT" without an ORDER BY, using undeterministic function like SYSDATE(). More info at: https://dev.mysql.com/doc/refman/5.6/en/replication-rbr-safe-unsafe.html
  • To help understand which code is and isn't compatible with the above recommendations, Continuous Integration checks must run MySQL in sql_mode='TRADITIONAL,ONLY_FULL_GROUP_BY'. For no longer than 6 months, it should be "non-voting". Also, production must log its MySQL warnings (errors after the sql_mode change to TRADITIONAL), all or a sample of them, to easily identify the pending issues.
  • In order to enforce the previous policies, new functionalities must be rejected if they are not compatible with the above stricter standards. A process will be created to review existing functionalities, with the goal of enforcing them for all MediaWiki-related code within 6 months.

Some of these guidelines are now in https://www.mediawiki.org/wiki/Development_policy#Database_policy

Rationale

MySQL in the past was a "toy" database, not scalable, non-transactional and very little constraints to input queries. While that allowed fast code development, it also allows very insecure practices. In the past 10 years, MySQL has evolved to be a scalable, fully ACID storage engine, and allows better enforcement of inputs, to the point that that strict mode will be the compiled defaults in the soon to be released MySQL 5.7 (and they were the defaults in configuration in 5.6).

How strict MySQL is with input is controlled by its sql_mode variable. I am not suggesting becoming incompatible with the loose old defaults, but making sure we are compatible with the stricter configuration, and, at the same time, make MediaWiki code more secure (and WMF infrastucture, too, by switching to the stricter defaults).

TRADITIONAL mode

Here is an example of the effects of this change:

MariaDB [test]> CREATE TABLE test (
  id int NOT NULL PRIMARY KEY, 
  multivalue enum('wikipedia','wiktionary') NOT NULL, 
  date datetime NOT NULL, 
  lang_code CHAR(2) NOT NULL
);
Query OK, 0 rows affected (0.01 sec)

MariaDB [test]> INSERT INTO test (multivalue, date, lang_code) VALUES ('mediawiki', -1, 'this is a very long text');
Query OK, 1 row affected, 4 warnings (0.00 sec)

MariaDB [test]> SELECT * FROM test;
+----+------------+---------------------+-----------+
| id | multivalue | date                | lang_code |
+----+------------+---------------------+-----------+
|  0 |            | 0000-00-00 00:00:00 | th        |
+----+------------+---------------------+-----------+
1 row in set (0.00 sec)

(The MySQL command-line client does not show the four warnings unless you SHOW WARNINGS or change its config.)
Full example: http://www.slideshare.net/jynus/query-optimization-with-mysql-57-and-mariadb-10-even-newer-tricks/41

Silent truncation without error has caused lots of bugs (some of them, security issues) in the few months I have been a DBA for MediaWiki, specially due to mistakes on binary length vs character length. In code terms we have the equivalent to a "stack overflow".

If we enforce sql_mode TRADITIONAL, this results in:

MariaDB [test]> SET sql_mode='TRADITIONAL';
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> INSERT INTO test (multivalue, date, lang_code) VALUES ('mediawiki', -1, 'this is a very long text');
ERROR 1364 (HY000): Field 'id' doesn't have a default value
MariaDB [test]> INSERT INTO test (id, multivalue, date, lang_code) VALUES (1, 'mediawiki', -1, 'this is a very long text');
ERROR 1265 (01000): Data truncated for column 'multivalue' at row 1
MariaDB [test]> INSERT INTO test (id, multivalue, date, lang_code) VALUES (1, 'wikipedia', -1, 'this is a very long text');
ERROR 1292 (22007): Incorrect datetime value: '-1' for column 'date' at row 1
MariaDB [test]> INSERT INTO test (id, multivalue, date, lang_code) VALUES (1, 'wikipedia', now(), 'this is a very long text');
ERROR 1406 (22001): Data too long for column 'lang_code' at row 1
MariaDB [test]> INSERT INTO test (id, multivalue, date, lang_code) VALUES (1, 'wikipedia', now(), 'es');
Query OK, 1 row affected (0.00 sec)

Setting the mode in session will make us potentially incompatible (and it is a waste of a round trip to the database), so the proposal here is

Set the mode in WMF database configuration and reject (-2) new code that is incompatible, while reviewing existing code.

Being compatible with stricter mode in SQL code is 100% compatible with looser modes.

A simplified, but larger explanation of TRADITIONAL is:

TRADITIONAL (equivalent to STRICT_TRANS_TABLES, STRICT_ALL_TABLES, NO_ZERO_IN_DATE, NO_ZERO_DATE, ERROR_FOR_DIVISION_BY_ZERO, NO_AUTO_CREATE_USER, NO_ENGINE_SUBSTITUTION)

  • STRICT_TRANS_TABLES: avoids invalid values (e.g truncations) for transactional tables (InnoDB)
  • STRICT_ALL_TABLES: avoids invalid values (e.g truncations) for all tables, including MyISAM
  • NO_ZERO_IN_DATE: avoids setting dates with day of the month 0, month 0, etc.
  • NO_ZERO_DATE: disallows the '0000-00-00 00:00:00' value
  • ERROR_FOR_DIVISION_BY_ZERO: error on division by 0 instead of returning null
  • NO_AUTO_CREATE_USER: only affects MySQL user creation (DBA task), disallows creating MySQL users with GRANT, avoiding creating passwordless users
  • NO_ENGINE_SUBSTITUTION: only affects table creation (DBA task), makes CREATE TABLE fail if the engine is unavailable, instead of changing it to the default engine

Not enforcing this mode has created issues in the past, like the recent T99941

ONLY_FULL_GROUP_BY

TRADITIONAL mode affects insertions and updates. There is also a mode that affects SELECTs that are non-deterministic, such as creating GROUP BYs and selecting columns that we have not grouped by:

MariaDB [test]> CREATE TABLE test2 (id int PRIMARY KEY, name varchar(22), value int);
Query OK, 0 rows affected (0.02 sec)

MariaDB [test]> insert into test2 VALUES (1, 'tree', 45), (2, 'tree', 32), (3, 'rock', 12), (4, 'rock', 67);
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

MariaDB [test]> SELECT name, value FROM test2 GROUP BY name;
+------+-------+
| name | value |
+------+-------+
| rock |    12 |
| tree |    45 |
+------+-------+
2 rows in set (0.00 sec)

What 'values' will we get? It is undetermined by SQL standard, and even if engines were deterministic (MySQL usually sends the first value found), it could change from slave to slave, as it actually happened in T102915#1538739

If we enforce sql_mode ONLY_FULL_GROUP_BY, this results in:

MariaDB [test]> SET sql_mode='ONLY_FULL_GROUP_BY';
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> SELECT name, value FROM test2 GROUP BY name;
ERROR 1055 (42000): 'test.test2.value' isn't in GROUP BY

Other undeterministic behaviours by unsafe SQL queries may actually compromise the data integrity of the slave, as producing lateral effects on slaves. While ROW-based replication will avoid that, there are several blockers that restrict that for now.

Primary keys

The full rationale for primary keys is at T17441#1420139 and the list of pending tables, just below that.

Primary keys are fundamental not only for consistency, but also for proper performance in MySQL's InnoDB and row based replication. They are also very important for DBA tools to work properly such as pt-table-checksum or pt-online-schema-change. Without them, we cannot check the integrity of the replicas and we cannot do schema changes online without a failover.

It is necessary to point out that InnoDB, MySQL's default storage engine since 5.5, and the one we use at the WMF, does always require a PRIMARY KEY to work internally, and if one is not provided explicitly, and there are not other candidates (I think unique keys not null), a hidden 6-bytes identifier is created for them, but that is not user-accessible.

Who

In summary, these changes need investment, but will save in the long term hundreds of developer hours and will allow MediaWiki to scale better.

DBAs, with current policy, need to review all schema changes. If those are adopted, I can reject unsafe code, but I need it in writing. Of course, code review should be a team effort, and I will be happy to help.

How

Leaving this intentionally open, I proposed a 1 year period of detecting this on existing code, while applying it immediately for new changes. MediaWiki core probably will need no change ("it works for me" under strict configuration), but lots of extensions need fixing.

Related Objects

View Standalone Graph
This task is connected to more than 200 other tasks. Only direct parents and subtasks are shown here. Use View Standalone Graph to show more of the graph.

Event Timeline

jcrespo claimed this task.
jcrespo raised the priority of this task from to Needs Triage.
jcrespo updated the task description. (Show Details)
jcrespo subscribed.

I fixed a minor typo in this. @jcrespo, I'm planning on moving this to mediawiki.org with a "{{draft}}" or some other template on it (assuming you don't beat me to it). I hope you (and everyone else here) are available in 70 minutes on #wikimedia-office to discuss this in more detail! (see E66 for meeting details)

Yes, please! I wrote this very quickly and didn't even proof read it, which combined with my lack of English writing skills, it is not precisely a piece of art :-)

I like the idea, but I have gotten pushback before when proposing similar (esp. ONLY_FULL_GROUP_BY) in the name of performance.

RobLa-WMF claimed this task.

@jcrespo - thanks for pushing this one through the process! Since TechCom approved this, I'm going to declare this one done.