Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

tobilg · 2020-08-27T07:39:59Z

Is your feature request related to a problem? Please describe.
Currently, the JSON responses of successful queries are very verbose, and contain a lot of duplicate information, which yields the following

Increased network traffic (which will imply costs with cloud providers)
Increased response times until the data is received
Increased time consumed on the client side to parse the data
Eventually, it's possible to hit hard limits like the API Gateway 6mb response limit, leading to obscure errors
"Consistency" problems, as it's not possible for us to anticipate what kind of on-demand queries our users produce

Describe the solution you'd like
Ideally, there would be the possibility to use a "compact mode" for the query responses when the production mode is used (via configuration flag or something similar)

This would then change the response data format from something like

{
  "query": {
    "dimensions": [
      "PV.path",
      "PV.edgeCity",
      "PV.edgeCountry",
      "PV.referrerDomainName"
    ],
    "timeDimensions": [
      {
        "dimension": "PV.eventDate",
        "dateRange": [
          "2020-07-20T00:00:00.000",
          "2020-07-20T23:59:59.999"
        ],
        "granularity": "day"
      }
    ],
    "measures": [
      "PV.pageViews"
    ],
    "filters": [
      {
        "dimension": "PV.domainName",
        "operator": "equals",
        "values": [
          "mydomain.io"
        ]
      },
      {
        "dimension": "PV.customerId",
        "operator": "equals",
        "values": [
          "ac2011e1-a2f3-4002-9999-66c0fb8d839f"
        ]
      }
    ],
    "timezone": "UTC",
    "order": []
  },
  "data": [
    {
      "PV.path": "/tutorials/test",
      "PV.edgeCity": "Sao Paulo",
      "PV.edgeCountry": "Brazil",
      "PV.referrerDomainName": null,
      "PV.eventDate.day": "2020-07-20T00:00:00.000",
      "PV.eventDate": "2020-07-20T00:00:00.000",
      "PV.pageViews": 2
    },
    {
      "PV.path": "/plugins/test",
      "PV.edgeCity": "Boston",
      "PV.edgeCountry": "United States",
      "PV.referrerDomainName": "www.npmjs.com",
      "PV.eventDate.day": "2020-07-20T00:00:00.000",
      "PV.eventDate": "2020-07-20T00:00:00.000",
      "PV.pageViews": 1
    }
  ],
  "lastRefreshTime": "2020-08-27T07:22:44.424Z",
  "annotation": {
    "measures": {
      "PV.pageViews": {
        "title": "P V Number of Page Views",
        "shortTitle": "Number of Page Views",
        "type": "number",
        "drillMembers": [],
        "drillMembersGrouped": {
          "measures": [],
          "dimensions": []
        }
      }
    },
    "dimensions": {
      "PV.path": {
        "title": "P V Path",
        "shortTitle": "Path",
        "type": "string"
      },
      "PV.edgeCity": {
        "title": "P V City",
        "shortTitle": "City",
        "type": "string"
      },
      "PV.edgeCountry": {
        "title": "P V Country",
        "shortTitle": "Country",
        "type": "string"
      },
      "PV.referrerDomainName": {
        "title": "P V Referrer Domain Name",
        "shortTitle": "Referrer Domain Name",
        "type": "string"
      }
    },
    "segments": {},
    "timeDimensions": {
      "PV.eventDate.day": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      },
      "PV.eventDate": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      }
    }
  }
}

to

{
  "query": {
    "dimensions": [
      "PV.path",
      "PV.edgeCity",
      "PV.edgeCountry",
      "PV.referrerDomainName"
    ],
    "timeDimensions": [
      {
        "dimension": "PV.eventDate",
        "dateRange": [
          "2020-07-20T00:00:00.000",
          "2020-07-20T23:59:59.999"
        ],
        "granularity": "day"
      }
    ],
    "measures": [
      "PV.pageViews"
    ],
    "filters": [
      {
        "dimension": "PV.domainName",
        "operator": "equals",
        "values": [
          "mydomain.io"
        ]
      },
      {
        "dimension": "PV.customerId",
        "operator": "equals",
        "values": [
          "ac2011e1-a2f3-4002-9999-66c0fb8d839f"
        ]
      }
    ],
    "timezone": "UTC",
    "order": []
  },
  "data": [
    [ "/tutorials/test",
      "Sao Paulo",
      "Brazil",
      null,
      "2020-07-20T00:00:00.000",
      "2020-07-20T00:00:00.000",
      2
    ],
    [
      "/plugins/test",
      "Boston",
      "United States",
      "www.npmjs.com",
      "2020-07-20T00:00:00.000",
      "2020-07-20T00:00:00.000",
      1
    ]
  ],
  "lastRefreshTime": "2020-08-27T07:22:44.424Z",
  "annotation": {
    "mapping": {
      "columns": [
        "PV.path",
        "PV.edgeCity",
        "PV.edgeCountry",
        "PV.referrerDomainName",
        "PV.eventDate.day",
        "PV.eventDate",
        "PV.pageViews"
      ]
    },
    "measures": {
      "PV.pageViews": {
        "title": "P V Number of Page Views",
        "shortTitle": "Number of Page Views",
        "type": "number",
        "drillMembers": [],
        "drillMembersGrouped": {
          "measures": [],
          "dimensions": []
        }
      }
    },
    "dimensions": {
      "PV.path": {
        "title": "P V Path",
        "shortTitle": "Path",
        "type": "string"
      },
      "PV.edgeCity": {
        "title": "P V City",
        "shortTitle": "City",
        "type": "string"
      },
      "PV.edgeCountry": {
        "title": "P V Country",
        "shortTitle": "Country",
        "type": "string"
      },
      "PV.referrerDomainName": {
        "title": "P V Referrer Domain Name",
        "shortTitle": "Referrer Domain Name",
        "type": "string"
      }
    },
    "segments": {},
    "timeDimensions": {
      "PV.eventDate.day": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      },
      "PV.eventDate": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      }
    }
  }
}

So, basically, send the data as an array of records, where a record is an array of ray values in the original data type, and additionally an annotation.mapping object containing a colums property which defined the sequence of "columns" in the record.

In the example above, this would shrink the data payload from 577 chars to 329 chars, which is a reduction of around 43%. I assume this will be proportional to more "columns" and more records.

This change would also need to be included in the cube.js frontend libraries.

tobilg · 2020-08-27T07:41:03Z

Relevant discussion in Slack was https://cube-js.slack.com/archives/CC0403RRR/p1598379807118300 cc @paveltiunov

ifokeev · 2020-08-27T16:46:03Z

Yeah, have the same problem. That's better to have pandas dataframe split format:

{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

That's the best format for rows data indeed. Also we will gain native pandas.read_json parsing

amitripshtos · 2020-08-27T18:43:54Z

I totally agree also that it is crucial to run cubejs faster, with less memory and less network bandwidth.

I think the main discussion here should be:

How do we retain backwards compatibility.
How do we mask this change for the client side (this one quite easy - just like Python's NamedTuple or pandas tuple)

I also up to the task to help with this one!

paveltiunov · 2020-09-02T02:05:26Z

Let's introduce compact flag for load method passed by client and stick with format proposed by @tobilg. We can check something like NODE_ENV on the client side to pass this flag by default so production builds always use compact format.

@amitripshtos Contributions are very welcomed here!

tobilg · 2020-09-04T13:44:35Z

Hi @paveltiunov, could you eventually point us to the relevant code portions which are used for generating the output? Thanks!

vasilev-alex · 2020-09-09T13:35:30Z

Hey @tobilg! Please see

cube.js/packages/cubejs-api-gateway/index.js

Line 610 in 18e48e1

data: transformData(

vasilev-alex assigned paveltiunov Aug 27, 2020

paveltiunov added the enhancement label Sep 2, 2020

paveltiunov added the help wanted label Sep 2, 2020

paveltiunov unassigned paveltiunov Sep 17, 2020

igorlukanin added good first issue hacktoberfest labels Oct 12, 2020

rpaik added pr:community and removed pr:community labels Oct 21, 2020

cube-js / cube.js

Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

tobilg commented Aug 27, 2020

tobilg commented Aug 27, 2020

ifokeev commented Aug 27, 2020 •

edited

amitripshtos commented Aug 27, 2020 •

edited

paveltiunov commented Sep 2, 2020

tobilg commented Sep 4, 2020

vasilev-alex commented Sep 9, 2020 •

edited

cube-js / cube.js

Join GitHub today

Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

Comments

tobilg commented Aug 27, 2020

tobilg commented Aug 27, 2020

ifokeev commented Aug 27, 2020 • edited

amitripshtos commented Aug 27, 2020 • edited

paveltiunov commented Sep 2, 2020

tobilg commented Sep 4, 2020

vasilev-alex commented Sep 9, 2020 • edited

Essential cookies

Always active

Analytics cookies

ifokeev commented Aug 27, 2020 •

edited

amitripshtos commented Aug 27, 2020 •

edited

vasilev-alex commented Sep 9, 2020 •

edited