
Investigate watchlist sizes (limiting or handling large ones properly)
Closed, ResolvedPublic

Description

Following from T245866: Watchlist Expiry: Create 20,000 cap and the general Watchlist Expiry project, there's a need to make sure that the new feature does not cause a significant increase in the size of the watchlist, especially since this is already an existing performance problem.

The watchlist table and functionality is, at the moment, not limited at all, allowing users to add as many items to it as they manage. This results in many technical issues, like users not being able to load their EditWatchlist page or having issues with displaying the content of the watchlist in general. Also, watching a million pages has limited use cases, and the number of users who watch significant amounts of pages is relatively low.

There are two probable ways to fix the issues:

  1. Limit the size of the watchlist. The limit can be large, to allow users to watch many items, but there still needs to be some sensible bound for the technical health and performance of the wikis in general.
  2. Fix all the places where having a large watchlist is causing problems (such as adding pagination and search to Special:Watchlist etc.)

This ticket will explore what these options mean and whether it is feasible to properly correct this problem as we introduce more features into the watchlist.

(MORE INFO COMING; this ticket is in the process of being hashed out. Please do not tag external teams just yet, until the phrasing, structure, and plan are written clearly.)

TODO for the ticket:

  • Check how many users have more than 50,000 watched items in enwiki, commons, and wikidata
  • Gather previous tickets related to the watchlist having no limit
  • Format/Frame the contents of the intro properly with the plan and solution statement

Event Timeline

Check how many users have more than 50,000 watched items in enwiki, commons, and wikidata

Probably not the most efficient query, but I believe this works (excludes current/former bots):

SELECT COUNT(*)
FROM (
    SELECT COUNT(wl_id)
    FROM watchlist
    WHERE NOT EXISTS (
        SELECT 1
        FROM user_groups
        WHERE ug_user = wl_user
            AND ug_group = 'bot'
    ) AND NOT EXISTS (
        SELECT 1
        FROM user_former_groups
        WHERE ufg_user = wl_user
            AND ufg_group = 'bot'
    )
    GROUP BY wl_user
    HAVING COUNT(wl_id) > 50000
) a;
database     | > 50,000 | > 100,000 | > 250,000 | > 500,000 | > 1,000,000
enwiki       |      242 |        80 |        16 |         3 |           1
commonswiki  |      351 |       158 |        50 |        22 |           7
wikidatawiki |       44 |        22 |         6 |         4 |           0
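The NOT EXISTS logic in the query above can be sanity-checked on a toy sqlite3 database. Table and column names follow MediaWiki's schema, but the data and the scaled-down threshold are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watchlist (wl_id INTEGER PRIMARY KEY, wl_user INTEGER);
CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
CREATE TABLE user_former_groups (ufg_user INTEGER, ufg_group TEXT);
""")
# User 1: human with 3 watched pages; user 2: current bot with 5;
# user 3: former bot with 4.
conn.executemany("INSERT INTO watchlist (wl_user) VALUES (?)",
                 [(1,)] * 3 + [(2,)] * 5 + [(3,)] * 4)
conn.execute("INSERT INTO user_groups VALUES (2, 'bot')")
conn.execute("INSERT INTO user_former_groups VALUES (3, 'bot')")

THRESHOLD = 2  # stands in for the 50,000 used in production
count = conn.execute("""
SELECT COUNT(*) FROM (
    SELECT COUNT(wl_id)
    FROM watchlist
    WHERE NOT EXISTS (SELECT 1 FROM user_groups
                      WHERE ug_user = wl_user AND ug_group = 'bot')
      AND NOT EXISTS (SELECT 1 FROM user_former_groups
                      WHERE ufg_user = wl_user AND ufg_group = 'bot')
    GROUP BY wl_user
    HAVING COUNT(wl_id) > ?
) a
""", (THRESHOLD,)).fetchone()[0]
print(count)  # 1: only user 1 qualifies; users 2 and 3 are excluded as bots
```

The correlated subqueries mean each watchlist row is kept only if its wl_user appears in neither bot table, matching the exclusion described above.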

Meanwhile it would seem most of the accounts with a very large watchlist are bots. Out of curiosity, I compared how many rows in the watchlist table belong to bots versus the total size of the table:

SELECT SUM(count) FROM (
  SELECT COUNT(*) AS count
  FROM watchlist
  JOIN user_groups ON wl_user = ug_user
  WHERE ug_group = 'bot'
  UNION ALL -- plain UNION would silently drop one half if both counts were equal
  SELECT COUNT(*) AS count
  FROM watchlist
  JOIN user_former_groups ON wl_user = ufg_user
  WHERE ufg_group = 'bot'
) a
database     | rows by bots | total watchlist rows | percentage
enwiki       |   10,376,417 |          222,506,373 |       4.7%
commonswiki  |   22,020,114 |          158,109,620 |        14%
wikidatawiki |    9,142,444 |           98,559,645 |      41.1%
arwiki       |   13,060,211 |           20,672,090 |        63%
eswiki       |    3,672,095 |           27,265,073 |      13.4%
mgwiktionary |   13,262,024 |           13,288,870 |      99.8%
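As a cross-check of the bot-row share, here is a sqlite3 sketch on invented data. This variant collects the set of current and former bot user IDs first, so an account appearing in both user_groups and user_former_groups is not counted twice:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watchlist (wl_id INTEGER PRIMARY KEY, wl_user INTEGER);
CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
CREATE TABLE user_former_groups (ufg_user INTEGER, ufg_group TEXT);
""")
# User 1 is human (3 rows); user 2 is a current bot (5 rows);
# user 3 is both a current and a former bot (4 rows).
conn.executemany("INSERT INTO watchlist (wl_user) VALUES (?)",
                 [(1,)] * 3 + [(2,)] * 5 + [(3,)] * 4)
conn.executemany("INSERT INTO user_groups VALUES (?, 'bot')", [(2,), (3,)])
conn.execute("INSERT INTO user_former_groups VALUES (3, 'bot')")

# IN over a UNION of the two group tables deduplicates user 3.
bot_rows = conn.execute("""
SELECT COUNT(*) FROM watchlist
WHERE wl_user IN (
    SELECT ug_user FROM user_groups WHERE ug_group = 'bot'
    UNION
    SELECT ufg_user FROM user_former_groups WHERE ufg_group = 'bot')
""").fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM watchlist").fetchone()[0]
print(f"{bot_rows}/{total} rows = {100 * bot_rows / total:.1f}%")  # 9/12 rows = 75.0%
```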

That might be of interest to the DBAs. These bots presumably are not actually using the watchlist. The "Add pages and files I edit to my watchlist" preference is not on by default, but I'm assuming it used to be, because there are a lot of bots on enwiki polluting the watchlist table. The "Add new files I upload to my watchlist" preference is on by default, which explains why the percentage is so high on Commons (e.g. bots like User:Flickr_upload_bot).

Samwilson renamed this task from [draft] Check into limiting Watchlist table sensibly to [draft] Check into watchlist sizes (limiting or handling large ones properly).May 19 2020, 1:49 AM
Samwilson updated the task description. (Show Details)

One of the assumptions has been that we don't want the watchlist_expiry or watchlist tables growing too big, but it sounds like this isn't actually the crux of it. If all the places that use the watchlist tables do so with proper limits etc., then is it correct to say that we don't mind these tables getting large? In T245866 we mention 'DB constraints', and T240094 has discussion about limiting to "prevent possible abuses" and to ensure we don't try to purge all at once (which we're doing), but it doesn't seem to be spelled out anywhere that the absolute size is the problem.

@Marostegui Hey! We were hoping you could give some insight as to the performance problems we have and/or anticipate with the watchlist in general (see T252812#6147259). In brief: From a DBA perspective, is the concern with the overall size of the watchlist table, or how many pages a single user is watching?

Thanks for poking me!
We definitely do care about the size of the watchlist, which is already pretty big on enwiki (54GB on disk) and on commonswiki (28GB).
Having such big and monolithic tables is dangerous and we should not add more stuff to them; they are already big enough that it could cause issues.
Big tables are hard to maintain (purge, run schema changes, backup, copy...) and there is a point where they are simply unmaintainable, i.e. we cannot even run schema changes on them. While we are not at that point with watchlist yet, throwing more stuff at it will definitely not help.

Big tables tend to confuse the optimizer and we can get unexpected query plans, which can end up in big regressions and long, long query times, full scans, large joins, and all that fun. Unfortunately we have experienced this so many times already that we even have a tag for it :-( mariadb-optimizer-bug. Catching these bugs is hard, and they can only be noticed _once_ the feature is in production; of course, fixing them involves quite a bunch of time and code changes.

Lastly, big tables can be harder to fit into the InnoDB buffer pool, which means that they might run slower than they should.

Using the following, repeated for each wiki (this includes all users, bots included):

SELECT 'enwiki' AS wiki, 50000 AS threshold, COUNT(*) FROM (SELECT COUNT(wl_id) count FROM watchlist GROUP BY wl_user HAVING count > 50000) a
UNION
SELECT 'enwiki' AS wiki, 100000 AS threshold, COUNT(*) FROM (SELECT COUNT(wl_id) count FROM watchlist GROUP BY wl_user HAVING count > 100000) a
UNION
SELECT 'enwiki' AS wiki, 250000 AS threshold, COUNT(*) FROM (SELECT COUNT(wl_id) count FROM watchlist GROUP BY wl_user HAVING count > 250000) a
UNION
SELECT 'enwiki' AS wiki, 500000 AS threshold, COUNT(*) FROM (SELECT COUNT(wl_id) count FROM watchlist GROUP BY wl_user HAVING count > 500000) a
UNION
SELECT 'enwiki' AS wiki, 1000000 AS threshold, COUNT(*) FROM (SELECT COUNT(wl_id) count FROM watchlist GROUP BY wl_user HAVING count > 1000000) a
database     | > 50,000 | > 100,000 | > 250,000 | > 500,000 | > 1,000,000
commonswiki  |      383 |       182 |        63 |        29 |          11
enwiki       |      265 |        98 |        27 |         5 |           3
wikidatawiki |       79 |        48 |        21 |        17 |          12
dewiki       |       67 |        25 |         5 |         1 |           0
frwiki       |       58 |        23 |         1 |         0 |           0
eswiki       |       38 |        14 |         5 |         3 |           1
ptwiki       |       37 |        14 |         4 |         0 |           0
itwiki       |       26 |        12 |         8 |         3 |           0
nlwiki       |       19 |         7 |         2 |         1 |           1
zhwiki       |       16 |         4 |         0 |         0 |           0
svwiki       |       11 |         6 |         1 |         0 |           0
hewiki       |        7 |         2 |         0 |         0 |           0

Note again that many of these users are not human, especially for the higher thresholds. More at T252812#6140681
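The UNIONed queries above scan the watchlist once per threshold; the same figures can be derived from a single GROUP BY pass. A sketch with sqlite3 on invented data, with the thresholds scaled down:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchlist (wl_id INTEGER PRIMARY KEY, wl_user INTEGER)")
# Invented per-user watchlist sizes: user id -> number of watched pages.
sizes = {1: 120, 2: 60, 3: 30, 4: 5}
conn.executemany("INSERT INTO watchlist (wl_user) VALUES (?)",
                 [(u,) for u, n in sizes.items() for _ in range(n)])

# One GROUP BY pass yields every user's count; each threshold is then just
# a tally over that list (stand-ins for the 50k/100k/250k/500k/1M buckets).
thresholds = [10, 50, 100]
per_user = [n for (n,) in conn.execute(
    "SELECT COUNT(wl_id) FROM watchlist GROUP BY wl_user")]
result = {t: sum(n > t for n in per_user) for t in thresholds}
print(result)  # {10: 3, 50: 2, 100: 1}
```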

@Marostegui (feel free to tag anyone else that may need to participate here?), we've discussed this extensively in the team after running into some challenges, and ran some analysis on the usage of the watchlist table. We want to see if we can reframe the problem and make sure we can find a viable way forward.

As an aside, I'll be moving to a new team, and @dmaza is stepping in to be the Tech Lead of Community Tech; I will be happy to provide insight and help out, but he will take this forward from now onwards (thanks for all the incredible help, @Marostegui !)

As a follow-up to previous discussions, we have discussed the new watchlist expiry table as a team. In particular, we’ve looked into several strategies to account for unknown user behavior.

  • One of these was limiting the timespan for which a page can be watched temporarily (i.e., 6 months). We have already implemented the 6-month limit.
  • The other was to set up a watch cap, per user, for the watchlist expiry table. However, when we began to think about the watch cap, we encountered many issues, as described below:
    1. Adding a cap to the watchlist expiry alone may not fix the problem long-term: when users encounter the cap, that may not stop them from adding more pages. Instead, they might just add the pages permanently to their watchlist. This may lead us back to the original problem: too many watched pages overall.
    2. Adding a cap to the general watchlist is an alternative, but it may not be an especially viable approach because as we think of more features for the watchlist (like expiry), we need to deal with the current (ongoing / root) problems:
      1. Adding a cap to the temporary watchlist can be a quick stop-gap measure but will likely not fix the problem.
      2. A general cap may introduce further issues, depending on user behavior (such as users continuing to add to the watchlist after expiry-cap is met, thereby increasing the size of the general watchlist again).
      3. Adding a cap on the general watchlist is a risky operation, and would require an extensive, especially hands-on campaign to work with users.
    3. There are good reasons to allow users to watch a large number of pages, regardless of the actual Special:Watchlist page itself: for example, some users like to see all pages they are watching in bold while reviewing RecentChanges. These reasons make us wonder whether we should find an alternative way to fix the load and performance problems rather than simply creating a cap.
    4. The amount of technical work required to implement a cap is not trivial: Whatever solution we may choose, the needed work is big enough for us to want to make sure that we do things properly with eyes towards the future.
    5. The amount of community consultation is non-trivial: We considered implementing a general cap on the watchlist, but this would require extensive community consultation way beyond the resources, scope, and capacity of the team and its plans for 2020-21.
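For context on the technical work a cap would entail, a per-user cap could in principle be enforced with a count check at insert time. A minimal sketch with sqlite3; the cap value, schema, and add_watch helper are all invented for illustration:

```python
import sqlite3

WATCHLIST_CAP = 3  # hypothetical limit; production would use something like 50,000

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchlist (wl_id INTEGER PRIMARY KEY, "
             "wl_user INTEGER, wl_title TEXT)")

def add_watch(user_id, title):
    """Insert a watchlist row unless the user is already at the cap."""
    (n,) = conn.execute("SELECT COUNT(*) FROM watchlist WHERE wl_user = ?",
                        (user_id,)).fetchone()
    if n >= WATCHLIST_CAP:
        return False  # the caller would surface a "watchlist full" error
    conn.execute("INSERT INTO watchlist (wl_user, wl_title) VALUES (?, ?)",
                 (user_id, title))
    return True

results = [add_watch(1, f"Page_{i}") for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Note that a check-then-insert pattern like this would need a transaction (or an enforced constraint) under concurrent writes; the sketch only illustrates the shape of the logic, not the consultation and migration work described above.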

Meanwhile, as part of our investigation, we have discovered many elements that, if directly addressed by the DBA team, could significantly reduce the primary challenges now faced by managing the table. They are the following:

  1. Based on our analysis (see T254528: Watchlist Expiry: Analyze Highest Number of Watched Pages on Wikis and T252812#6140681), the accounts with the highest number of watched pages are most often bots.
    1. By default, every page an account creates is automatically watched, which seems to explain why bots are watching so many pages. For example:
      1. On arwiki, bots consume about 63% of the total watchlist table.
      2. On commonswiki, bots consume about 14% of the total watchlist table. Most of the bots appear to mass-upload files, suggesting they’re watching the pages only because the preference to do so is on by default.
      3. On eswiki, bots consume about 13.4% of the total watchlist table.
      4. On mgwiktionary, a single bot has mass-created nearly every entry on the wiki. That bot is watching 13,259,382 pages, which is about 99.8% of the entire watchlist table.
    2. Many bots are also inactive or blocked and have been for many years.
    3. Based on this, if there is a pressing concern about the size of the watchlist table, it may be worthwhile to look into ways to purge the existing data, given that many of the accounts (particularly bots) don’t appear to be using the watchlist, either because pages were watched automatically or because the account has long since left the wiki (retired bots, Foundation-banned users, etc.).
  2. Based on the data we have collected, the majority of the users watch a relatively small number of pages. Meanwhile, a very small number of users watch a very high number of pages. If this small group can be addressed in the future by the DBA team, this can potentially optimize the system.
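The purge idea in point 1.3 boils down to a DELETE keyed on bot membership. A toy sqlite3 sketch on invented data (a production job would also cover former bots and would delete in small batches to limit replication lag):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE watchlist (wl_id INTEGER PRIMARY KEY, wl_user INTEGER);
CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
""")
# User 1 is human with 4 watched pages; user 2 is a bot with 6.
conn.executemany("INSERT INTO watchlist (wl_user) VALUES (?)",
                 [(1,)] * 4 + [(2,)] * 6)
conn.execute("INSERT INTO user_groups VALUES (2, 'bot')")

# Remove watchlist rows belonging to current bots.
conn.execute("""
DELETE FROM watchlist
WHERE wl_user IN (SELECT ug_user FROM user_groups WHERE ug_group = 'bot')
""")
remaining = conn.execute("SELECT COUNT(*) FROM watchlist").fetchone()[0]
print(remaining)  # 4: only the human user's rows survive
```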

Given the findings above, we are left with making some decisions about further development. We still believe that watchlist expiry may decrease the size of the watchlist table, but we also want to be sensible in protecting against use cases that may not.

So, we would like to ask:

  1. Is it still crucial that we implement a watchlist cap per user?
  2. If we need to implement a general cap, it would need to not impact any current users (given that any impact on users, per the Community Relations team, would require in-depth consultations). Right now, the largest number of watched pages is slightly over 28 million. Would such a cap still be useful?
aezell renamed this task from [draft] Check into watchlist sizes (limiting or handling large ones properly) to Check into watchlist sizes (limiting or handling large ones properly).Jun 30 2020, 11:12 PM
aezell renamed this task from Check into watchlist sizes (limiting or handling large ones properly) to Investigate watchlist sizes (limiting or handling large ones properly).
aezell claimed this task.
aezell updated the task description. (Show Details)

Thanks for the data and all the details provided @Mooeypoo!
May I ask about these sentences:

Meanwhile, as part of our investigation, we have discovered many elements that, if directly addressed by the DBA team, could significantly reduce the primary challenges now faced by managing the table. They are the following:
..
Based on the data we have collected, the majority of the users watch a relatively small number of pages. Meanwhile, a very small number of users watch a very high number of pages. If this small group can be addressed in the future by the DBA team, this can potentially optimize the system.

Addressed in which way? Not sure I get what you exactly mean with those.

Regarding the specific questions:

  1. Is it still crucial that we implement a watchlist cap per user?

Based on the figures you've given, if we can purge bots' watchlists, that's definitely something that will help to keep the table size under control. It would be interesting to know which % corresponds to bots on enwiki, as that is the biggest table. Is that something you guys can find out?
Commons is also big (29GB), so if 14% is bots, that's around 4GB of space we'd get back after we optimize the table.

I am not super worried about smaller wikis like arwiki (6GB on disk) or eswiki (6.2GB) as those would probably stay within a reasonable size. The big ones like commons or enwiki are the ones that worry me the most, as they are already quite big and, as I said previously, they can make the query optimizer behave in unpredictable ways and can become impossible to apply schema changes to.

  1. If we need to implement a general cap, it would need to not impact any current users. (Given that any impact on users, per the Community Relations team, would require in-depth consultations.) Right now, the largest number of watched pages is slightly over 28 million. Would this still be useful?

That is hard to know without really knowing any predictions. It would definitely help to have the current tables under control. A combination of purging bots data + a future cap might be a good idea to be fully sure that the tables won't grow massively.

Lastly, I would like to ask about:

Given the findings above, we are left with making some decisions about further development. We still believe that watchlist expiry may decrease the size of the watchlist table, but we also want to be sensible in protecting against use cases that may not.

Do you have some estimations on how much that'd be decreased?

Thank you!

I would like to hear Platform Engineering's opinion too, to see if they can foresee any unexpected issues.

@Marostegui Thank you so much for your reply. I'll try to answer your questions below

Meanwhile, as part of our investigation, we have discovered many elements that, if directly addressed by the DBA team, could significantly reduce the primary challenges now faced by managing the table. They are the following:
..
Based on the data we have collected, the majority of the users watch a relatively small number of pages. Meanwhile, a very small number of users watch a very high number of pages. If this small group can be addressed in the future by the DBA team, this can potentially optimize the system.

Addressed in which way? Not sure I get what you exactly mean with those.

What we mean is that bots are a problem on the watchlist table and purging the data can help. This will probably involve some community consultation and technical work, which is out of scope for our team and current project, but it might be something you or some other team could be interested in.


Based on the figures you've given, if we can purge bots' watchlists, that's definitely something that will help to keep the table size under control. It would be interesting to know which % corresponds to bots on enwiki, as that is the biggest table. Is that something you guys can find out?
Commons is also big (29GB), so if 14% is bots, that's around 4GB of space we'd get back after we optimize the table.

@MusikAnimal already queried for that here. On enwiki, bots take about 4.7% of total rows.


Given the findings above, we are left with making some decisions about further development. We still believe that watchlist expiry may decrease the size of the watchlist table, but we also want to be sensible in protecting against use cases that may not.

Do you have some estimations on how much that'd be decreased?

We can't know for sure, since this is a new feature that hasn’t been deployed yet. We have good reason to believe that at least 5% of new (non-bot) watchlist items will be added with an expiry and thus be purged after they've expired. Also, 5% is a conservative estimate, so a higher percentage of pages may be watched temporarily.


If we need to implement a general cap, it would need to not impact any current users. (Given that any impact on users, per the Community Relations team, would require in-depth consultations.) Right now, the largest number of watched pages is slightly over 28 million. Would this still be useful?

That is hard to know without really knowing any predictions. It would definitely help to have the current tables under control. A combination of purging bots data + a future cap might be a good idea to be fully sure that the tables won't grow massively.

Overall, we’re optimistic that our work will help improve the state of the watchlist. Although we don't know how this new feature will be adopted, it is unlikely that it will increase the rate at which the watchlist is currently growing. If anything, we believe that it will probably help contain the ever growing table by purging rows after the expiry has passed. We plan to release the feature incrementally, starting with 5 wikis (we haven't decided which ones yet). After the release, we will monitor usage of the feature, and we can share our findings.

Can we schedule a brief meeting to chat with you? This way, we can ensure that we reach a common agreement and know what steps to take next. If that works for you, our program manager can go ahead and schedule the meeting.

Thanks in advance.

BPirkle added subscribers: tstarling, BPirkle.

@tstarling , do you have any thoughts here? (Note to other people on this task - Tim is on vacation until July 20, so an answer may be delayed.)

Is there a specific piece of feedback we are waiting on? If it's just a general heads up, we should proceed as planned. This project can't wait another week on this topic if there's not something specifically being addressed.

Overall, we've been asked to develop a feature that we believe will actually decrease the size of the watchlist table. Our mandate has not and should not be to solve issues with the watchlist table that have existed for years. Or, to rephrase, our work on this new feature shouldn't be delayed to fix those issues.

Unless it can be proven that the expiry feature will materially contribute to degradation of the existing watchlist table, I would like to proceed with the consultation that @dmaza asked for above and the rest of the work on this feature.

What we mean is that bots are a problem on the watchlist table and purging the data can help. This will probably involve some community consultation and technical work, which is out of scope for our team and current project, but it might be something you or some other team could be interested in.

To be honest I would like to see a specific task filed, so all these thoughts do not get forgotten amongst comments in this task.


@MusikAnimal already queried for that here . On enwiki, bots take about 4.7% of total rows.

Thanks - I missed that. That's not a huge %, then, and purging it won't really save that much.


We can't know for sure, since this is a new feature that hasn’t been deployed yet. We have good reason to believe that at least 5% of new (non-bot) watchlist items will be added with an expiry and thus be purged after they've expired. Also, 5% is a conservative estimate, so a higher percentage of pages may be watched temporarily.

What's the plan if for any reason that goes out of control? I assume this can be disabled entirely and easily?


If we need to implement a general cap, it would need to not impact any current users. (Given that any impact on users, per the Community Relations team, would require in-depth consultations.) Right now, the largest number of watched pages is slightly over 28 million. Would this still be useful?

That is hard to know without really knowing any predictions. It would definitely help to have the current tables under control. A combination of purging bots data + a future cap might be a good idea to be fully sure that the tables won't grow massively.

Overall, we’re optimistic that our work will help improve the state of the watchlist. Although we don't know how this new feature will be adopted, it is unlikely that it will increase the rate at which the watchlist is currently growing. If anything, we believe that it will probably help contain the ever growing table by purging rows after the expiry has passed. We plan to release the feature incrementally, starting with 5 wikis (we haven't decided which ones yet). After the release, we will monitor usage of the feature, and we can share our findings.

I would suggest starting with testwiki first to make sure nothing obvious is broken; later, the following wikis may be good for getting some idea of the usage, the performance of the queries, and table growth:

testwiki
dewiki
fawiki
frwiki
metawiki

We should probably monitor the size of the tables too for a few weeks before going directly for the big wikis, especially enwiki.

This project can't wait another week on this topic if there's not something specifically being addressed.

If this doesn't go as planned (fingers crossed) I guess we are ready to roll back and keep investigating, right?
We are dealing with one of the most critical tables we have, and it can potentially cause lots of issues, so let's try to be careful. Further, it took me a few days to answer as I have had some days off.

Overall, we've been asked to develop a feature that we believe will actually decrease the size of the watchlist table. Our mandate has not and should not be to solve issues with the watchlist table that have existed for years. Or, to rephrase, our work on this new feature shouldn't be delayed to fix those issues.

As stated before, your team has found possible ways to improve the current issues we are facing on the watchlist table. Can we please create a task so those are somewhere for the Growth-Team and not forgotten? It will be useful to have them documented for future work.

Unless it can be proven that the expiry feature will materially contribute to degradation of the existing watchlist table, I'm would like to proceed with the consultation that @dmaza asked for above and the rest of the work on this feature.

As you have also communicated during the task, there is no data to prove that this will improve it either. I am happy that we are optimistic about it, but we also need to be careful: as I have said before, this table is already having lots of issues on the big wikis, hence all the discussion and the back and forth.
We are placing a new feature on top of an already misbehaving table which we deal with quite often, thus all the questions and comments.
We not only have disk space issues, but also query performance issues as a result of such a big and monolithic table. I want to make sure that we are going step by step and have all the gaps covered.

@Marostegui Thanks for that perspective. I was more replying to bringing in even more people into the conversation without a specific question for them to answer.

I think that you and @dmaza are working through how we can proceed safely. I defer to your knowledge of how risky this is in the database. I likely underestimated that risk and I apologize for that.

I guess what I was trying to communicate is that we want to make sure we do the work safely but not take on the project of "fixing the watchlist table."

Thanks for explaining what I was missing.

@Marostegui I just want to add that I apologize for being so flippant about the risks here. I was wrong. I appreciate your patience with explaining it to me.

@aezell no need to apologize at all! :-)
It is hard to have all the context when you are not dealing with these issues daily, and you aren't, as this is not part of your job! And I am grateful that I was asked to participate early in the discussion, rather than finding out when the feature is in production and enabled everywhere!

If this were a brand-new feature, we could totally proceed faster and I am sure we'd not have had that much back and forth, but as you guys are developing on top of an already dangerous table, I wanted to be extra careful - the ground you are building on top of isn't super solid. And above all, I wanted to make sure we are aligned on what we can potentially get into :-)
As I said at T252812#6162739 we could be facing all sort of unpredictable things, mostly on big wikis.

Hopefully I am just being careful for nothing and the feature will not cause any issues, but as we deal with this (and other problematic tables) almost on a daily basis... I wanted to raise awareness of it (which was also raised at T252812#6162739).
Again, it is almost impossible to be aware of those fires if you are not part of our team, so no need to apologize at all.

So before proceeding I would like to make sure we:

  • File a task with the data you guys found during your investigation and what can possibly help in future work towards improving the watchlist table.
  • Can rollback/revert/disable this feature easily if we start seeing issues.
  • As agreed, let's enable it on a few wikis before proceeding with the big ones. My suggestion is as follows:
    testwiki
    dewiki
    fawiki
    frwiki
    metawiki

After those, maybe we can go for some other slightly bigger ones. The small ones don't really worry me that much.

  • Monitor the size of those tables for at least a couple of weeks (I can take care of that), so we can gather some figures and make sure they are not growing faster than expected, before moving forward with more wikis.

Thoughts?
Thank you!

What we mean is that bots are a problem on the watchlist table and purging the data can help.

To be honest I would like to see a specific task filed, so all these thoughts do not get forgotten amongst comments in this task.

I've created T258098: Purge unused watchlist rows

Seems a bit more on topic in here than T258098...

But I wonder if we should be working on a way to stop these bots building these watchlists. Removing the rows now just means we've got to remove them again in future. Presuming, that is, it's due to the preference of "watch pages I edit" etc, rather than passing a watch= flag on API calls

Obviously, at point of account creation, we don't really know if it's a bot account... And just because it has bot rights later, doesn't mean it's actually a bot...

Maybe we force these preference changes on bot accounts, and wikis that need bot type rights on users are made to create/use proper usergroups if they need that functionality

T258108: Ignore auto-watchlist preferences for bots

Thank you @MusikAnimal

@Marostegui Thank you for your prompt replies and engagement in this.
It sounds like we agree that slowly rolling out to a few wikis while keeping everything behind a feature flag is a good way forward.

Our plan is to soon enable this on testwiki, and once we are feature-complete we'll roll out to the other four wikis, keeping an eye out for any negative impact. We will obviously let you know once we've picked a deployment date so you can monitor on your end and flag any potential issues.

Does that sound good to you?