TL;DR: The iOS team is planning to implement event logging (spec slides) for the reading list feature, and we would like your feedback.
As you all know, the iOS team has been using Piwik to track user behavior, but it doesn’t work very well because Piwik can’t handle the volume of events from the iOS app. We’re also sending data to some event logging schemas, but many of them haven’t been maintained for a long while (e.g. T192520) or are not collecting the data we want. As a first step toward sunsetting Piwik, we have decided to implement event logging (EL) for the synced reading list feature in the iOS app and adopt a format that is used by Piwik and Google Analytics. We will gradually implement EL for other features using the same format, and we will stop using Piwik and clean up unused EL schemas once we finish. Before proceeding, we want to reach out to interested and affected parties for feedback and suggestions.
The schema
We will implement several event tables and one user properties table.
Event Table
The event tables will record users’ interactions with the app using 4 fields: Category (on which screen), Label (optional, on which element of that screen), Action (what action the user performed), and Measure (optional, if there is a number associated). In addition to the standard event capsule, all app events will share a "meta schema" which provides app-specific context information. This capsule will include: app_install_id, primary_language, is_anon (whether the user is anonymous, i.e. not logged in), event_dt (client-side timestamp) and session_id.
Because recording all the events in one table would hurt query efficiency, we will split them by function (MobileWikiAppiOSReadingLists, MobileWikiAppiOSLoginAction, MobileWikiAppiOSSettingAction, MobileWikiAppiOSSessions), although all of the tables will have the same fields. Like other EL tables, the event tables will be purged after 90 days.
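To make the event shape concrete, here is a minimal sketch of one event record with its capsule. The class names, the payload layout, and the ISO 8601 timestamp format are assumptions for illustration only; the spec slides remain the reference for the actual schema.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Capsule:
    """App-specific context shared by all events (the "meta schema")."""
    app_install_id: str
    primary_language: str
    is_anon: bool
    event_dt: str   # client-side timestamp; ISO 8601 is an assumption
    session_id: str


@dataclass
class AppEvent:
    category: str                  # on which screen
    action: str                    # what action the user performed
    capsule: Capsule
    label: Optional[str] = None    # optional: which element of the screen
    measure: Optional[int] = None  # optional: an associated number

    def to_payload(self) -> dict:
        """Flatten the event and its capsule into a single dict."""
        payload = asdict(self)
        payload.update(payload.pop("capsule"))
        return payload
```

Leaving label and measure as None when they do not apply keeps every event table on the same set of fields.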
User Properties History Table
The user properties table (MobileWikiAppiOSUserHistory) records all the historical states of user properties. These properties include how many articles the user has saved, how many reading lists they have created, whether they have turned on reading list sync, primary language, text size choice, theme choice, etc. When users first open the app after an install or update, we record these property values (the initial state) locally and send them to the event logging server. At the end of each session, we take a snapshot of these properties: if ANY of the property values has changed compared to the user’s last snapshot, we send the new snapshot to the server with ALL property values, the session_end timestamp and the session_id. If NONE of the property values has changed, we won’t send the snapshot. As with the event tables, we will send a capsule with every user state, including app_install_id, event_dt and session_id.
Unlike the event tables and other EL tables, the user properties table will NOT be purged, except for the IP address and user agent fields (we will set IP and userAgent to NULL but keep OS version, app version and country). After discussing with Legal, and in order to keep users from being identified by this data, we are not going to track users in countries with small numbers of active users. Specifically, we will only collect data from users in the top 50 countries in this list (from US to Egypt), which ranks countries by average daily unique visitors from January 1 to March 31, 2018. We will keep monitoring these numbers; if some of these countries end up with a very small number of active users over a period of time, we will adjust the list of countries we collect data from accordingly.
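The end-of-session rule can be sketched as follows. This is only an illustration of the logic described above, not the app's implementation; the function name and the property keys used in the usage note are hypothetical.

```python
from typing import Optional


def snapshot_to_send(current: dict, last_sent: Optional[dict],
                     session_id: str, session_end: str) -> Optional[dict]:
    """Return the full snapshot to send, or None if nothing changed.

    last_sent is None on the first open after an install or update,
    so the initial state is always sent.
    """
    if last_sent is not None and current == last_sent:
        return None  # no property changed: skip the snapshot
    # Any change: send ALL property values, not just the changed ones.
    return {**current, "session_id": session_id, "session_end": session_end}
```

For example, if only a hypothetical theme property changes during a session, the returned snapshot still carries every other property, so each row in the history table is a complete state on its own.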
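As a rough sketch of these two privacy rules, the country check and the field scrubbing might look like the following. The ISO country codes, the record field names "ip" and "userAgent", and the two-entry allowlist are placeholders for illustration; the real list has 50 countries, and where the filtering actually happens (client or server) is not specified here.

```python
# Placeholder subset only; the real allowlist is the top 50 countries.
ALLOWED_COUNTRIES = {"US", "EG"}


def should_collect(country_code: str, allowed=ALLOWED_COUNTRIES) -> bool:
    """Collect data only from users in an allowlisted country."""
    return country_code in allowed


def scrub(record: dict) -> dict:
    """Return a copy with IP and user agent nulled; other fields are kept."""
    cleaned = dict(record)
    for field in ("ip", "userAgent"):
        if field in cleaned:
            cleaned[field] = None
    return cleaned
```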
Sampling
Because the number of iOS app users who agree to share their usage data with us is not very large, we will send all the data to the Hadoop cluster without sampling (nothing will be sent to MySQL).
See the spec slides for more details and examples.
Why don’t we use Android team’s EL schemas?
In short, Android’s EL schemas are tailored to Android's flow and are not immediately usable for iOS. Using the same schema would require adjusting the implementation in both apps. Even after the adjustment, we still couldn't use the same logic to consume the data, which leaves almost no benefit to us in sharing a schema.
Take the reading list EL schema as an example. At first we wanted to use MobileWikiAppReadingLists as Android did. But after reviewing the reading list flow in both apps, we found that we would have to add an 'addtodefault' event (see T190748#4098226 for more details). Even after this adjustment, if we want to count the number of articles added after the release, for iOS we need to count the 'addtodefault' events, while for Android we need to count the 'addtodefault', 'addtoexisting' and 'addtonew' events and then sum them up.
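The difference in consuming logic can be illustrated with a small sketch. The event names come from the example above; the function and constant names are hypothetical, and the input is assumed to be a plain list of action strings pulled from the table.

```python
from collections import Counter

# Which actions count as "article added" on each platform.
IOS_ADD_ACTIONS = {"addtodefault"}
ANDROID_ADD_ACTIONS = {"addtodefault", "addtoexisting", "addtonew"}


def count_articles_added(actions, add_actions):
    """Sum the occurrences of every action that means 'article added'."""
    counts = Counter(actions)
    return sum(counts[a] for a in add_actions)
```

Even with a shared schema, the caller would still have to pass a different set of actions per platform, which is the extra per-app logic described above.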
Why choose this format?
Using this format to store users’ events and properties can benefit us in the following ways:
- It fulfills our needs and conforms to the Analytics Engineering team’s guidelines (although not 100%, see the question section below), which means it can be piped into Druid easily so that we can use Superset to build dashboards.
- Since all the event tables have the same fields, we can union them easily and then analyze the conversion funnel.
- This format is flexible enough for adding events, moving certain events from one table to another, and supporting new features in the future.
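As an illustration of the union-and-funnel point, here is a minimal in-memory sketch. In practice this would more likely be a query on the Hadoop cluster; the row layout, the category and action values in the usage note, and the assumption that event_dt strings sort chronologically (e.g. ISO 8601 in a single timezone) are all hypothetical.

```python
from itertools import chain


def union_events(*tables):
    """Concatenate rows from tables that share the same fields."""
    return list(chain.from_iterable(tables))


def funnel(events, steps):
    """Count the sessions that reach each step, in order.

    steps is a list of (category, action) pairs. A session counts
    toward step i only after it has completed steps 0..i-1.
    """
    by_session = {}
    for e in sorted(events, key=lambda e: e["event_dt"]):
        by_session.setdefault(e["session_id"], []).append(
            (e["category"], e["action"]))
    reached = [0] * len(steps)
    for sequence in by_session.values():
        i = 0
        for item in sequence:
            if i < len(steps) and item == steps[i]:
                reached[i] += 1
                i += 1
    return reached
```

For instance, calling funnel on the union of the login and reading list tables with steps [("login", "success"), ("reading_list", "addtodefault")] would give the number of sessions that logged in, followed by the number that then saved an article.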