Background
Concerns have been raised about the schema-less nested custom data object in the Metrics Platform monoschema. To see whether we can reduce complexity on the consumer side, we can try a few different ways of flattening the schema so that the data is simpler to query.
User story
As a data scientist/product analyst, I want to see whether it's easier to query data from a flattened schema than from a schema with nested JSON.
As an engineer, I'd like to know where the complexity shifts if the Metrics Platform monoschema is adapted to remove the nested custom data object and provide flattened custom data properties instead.
Developer notes
To test the hypothesis that querying is easier with a flattened schema, we can try sending mock data through the pipelines using the following approaches:
- Create a table in which the custom data object is transformed into a fixed set of custom data fields, e.g. for 3 custom data points there will be 9 corresponding columns:
  - custom_data1_name, custom_data1_value, custom_data1_type
  - custom_data2_name, custom_data2_value, custom_data2_type
  - custom_data3_name, custom_data3_value, custom_data3_type
  - If client code only sends 1 custom data point, the other 2 sets of custom data columns will be null.
- Create a table in which the custom data object is transformed into a single column each for name, value, and type, so that the table scales vertically (one row per custom data point), i.e.:
  - custom_data_name, custom_data_value, custom_data_type
  - Monoschema core contextual attributes are matched to each row such that:
    - If an event sends 2 custom data points, there will be 2 rows in the table that are linked to the same event id.
    - If an event sends 3 custom data points, there will be 3 rows in the table that are linked to the same event id.
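The two approaches above can be sketched as a pair of small transformations on a single mock event. This is only an illustration, not the pipeline's actual implementation: it assumes the nested object arrives as a `custom_data` map of name to `{"data_type": ..., "value": ...}`, and the `event_id` and `action` fields, the helper names, and the cap of 3 data points are hypothetical stand-ins chosen for the example.

```python
from typing import Any

# Assumed shape of a mock event: core attributes plus a nested "custom_data" map.
# "event_id" and "action" are hypothetical stand-ins for core contextual attributes.
mock_event = {
    "event_id": "e1",
    "action": "click",
    "custom_data": {
        "is_full_width": {"data_type": "boolean", "value": "true"},
        "font_size": {"data_type": "number", "value": "12"},
    },
}


def core_attributes(event: dict[str, Any]) -> dict[str, Any]:
    """Return every top-level field except the nested custom data object."""
    return {k: v for k, v in event.items() if k != "custom_data"}


def flatten_wide(event: dict[str, Any], max_points: int = 3) -> dict[str, Any]:
    """Approach 1: one row per event with a fixed set of name/value/type columns."""
    custom = event.get("custom_data") or {}
    if len(custom) > max_points:
        # The fixed-width table cannot hold more than max_points data points.
        raise ValueError(f"event has more than {max_points} custom data points")
    row = core_attributes(event)
    items = list(custom.items())
    for i in range(max_points):
        # Unused column sets are filled with None (null in the table).
        name, entry = items[i] if i < len(items) else (None, {})
        row[f"custom_data{i + 1}_name"] = name
        row[f"custom_data{i + 1}_value"] = entry.get("value")
        row[f"custom_data{i + 1}_type"] = entry.get("data_type")
    return row


def flatten_long(event: dict[str, Any]) -> list[dict[str, Any]]:
    """Approach 2: one row per custom data point, each repeating the core attributes."""
    core = core_attributes(event)
    # In this sketch an event without custom data yields no rows at all.
    return [
        {
            **core,
            "custom_data_name": name,
            "custom_data_value": entry.get("value"),
            "custom_data_type": entry.get("data_type"),
        }
        for name, entry in (event.get("custom_data") or {}).items()
    ]


wide_row = flatten_wide(mock_event)
long_rows = flatten_long(mock_event)
```

With the mock event above, `wide_row` is a single row whose third set of custom data columns is null, while `long_rows` holds two rows that share the same `event_id`. The sketch also hints at where complexity moves: the wide form needs an agreed maximum number of data points, and the long form repeats the core attributes on every row.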
Acceptance criteria
- Test tables are created per the approaches above and populated with sufficient data. << see T340702#9014296
- Queries are run/captured for each approach. << see Jupyter notebook attached in T340702#9014296
- Custom data decision matrix elucidates pros/cons/risks of each approach << see spreadsheet link in T340702#9014296