Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizations for the response data format for speedup and cost-effectiveness in production mode #1030

Open
tobilg opened this issue Aug 27, 2020 · 6 comments

Comments

@tobilg
Copy link

@tobilg tobilg commented Aug 27, 2020

Is your feature request related to a problem? Please describe.
Currently, the JSON responses of successful queries are very verbose, and contain a lot of duplicate information, which yields the following

  • Increased network traffic (which will imply costs with cloud providers)
  • Increased response times until the data is received
  • Increased time consumed on the client side to parse the data
  • Eventually, it's possible to hit hard limits like the API Gateway 6mb response limit, leading to obscure errors
  • "Consistency" problems, as it's not possible for us to anticipate what kind of on-demand queries our users produce

Describe the solution you'd like
Ideally, there would be the possibility to use a "compact mode" for the query responses when the production mode is used (via configuration flag or something similar)

This would then change the response data format from something like

{
  "query": {
    "dimensions": [
      "PV.path",
      "PV.edgeCity",
      "PV.edgeCountry",
      "PV.referrerDomainName"
    ],
    "timeDimensions": [
      {
        "dimension": "PV.eventDate",
        "dateRange": [
          "2020-07-20T00:00:00.000",
          "2020-07-20T23:59:59.999"
        ],
        "granularity": "day"
      }
    ],
    "measures": [
      "PV.pageViews"
    ],
    "filters": [
      {
        "dimension": "PV.domainName",
        "operator": "equals",
        "values": [
          "mydomain.io"
        ]
      },
      {
        "dimension": "PV.customerId",
        "operator": "equals",
        "values": [
          "ac2011e1-a2f3-4002-9999-66c0fb8d839f"
        ]
      }
    ],
    "timezone": "UTC",
    "order": []
  },
  "data": [
    {
      "PV.path": "/tutorials/test",
      "PV.edgeCity": "Sao Paulo",
      "PV.edgeCountry": "Brazil",
      "PV.referrerDomainName": null,
      "PV.eventDate.day": "2020-07-20T00:00:00.000",
      "PV.eventDate": "2020-07-20T00:00:00.000",
      "PV.pageViews": 2
    },
    {
      "PV.path": "/plugins/test",
      "PV.edgeCity": "Boston",
      "PV.edgeCountry": "United States",
      "PV.referrerDomainName": "www.npmjs.com",
      "PV.eventDate.day": "2020-07-20T00:00:00.000",
      "PV.eventDate": "2020-07-20T00:00:00.000",
      "PV.pageViews": 1
    }
  ],
  "lastRefreshTime": "2020-08-27T07:22:44.424Z",
  "annotation": {
    "measures": {
      "PV.pageViews": {
        "title": "P V Number of Page Views",
        "shortTitle": "Number of Page Views",
        "type": "number",
        "drillMembers": [],
        "drillMembersGrouped": {
          "measures": [],
          "dimensions": []
        }
      }
    },
    "dimensions": {
      "PV.path": {
        "title": "P V Path",
        "shortTitle": "Path",
        "type": "string"
      },
      "PV.edgeCity": {
        "title": "P V City",
        "shortTitle": "City",
        "type": "string"
      },
      "PV.edgeCountry": {
        "title": "P V Country",
        "shortTitle": "Country",
        "type": "string"
      },
      "PV.referrerDomainName": {
        "title": "P V Referrer Domain Name",
        "shortTitle": "Referrer Domain Name",
        "type": "string"
      }
    },
    "segments": {},
    "timeDimensions": {
      "PV.eventDate.day": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      },
      "PV.eventDate": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      }
    }
  }
}

to

{
  "query": {
    "dimensions": [
      "PV.path",
      "PV.edgeCity",
      "PV.edgeCountry",
      "PV.referrerDomainName"
    ],
    "timeDimensions": [
      {
        "dimension": "PV.eventDate",
        "dateRange": [
          "2020-07-20T00:00:00.000",
          "2020-07-20T23:59:59.999"
        ],
        "granularity": "day"
      }
    ],
    "measures": [
      "PV.pageViews"
    ],
    "filters": [
      {
        "dimension": "PV.domainName",
        "operator": "equals",
        "values": [
          "mydomain.io"
        ]
      },
      {
        "dimension": "PV.customerId",
        "operator": "equals",
        "values": [
          "ac2011e1-a2f3-4002-9999-66c0fb8d839f"
        ]
      }
    ],
    "timezone": "UTC",
    "order": []
  },
  "data": [
    [ "/tutorials/test",
      "Sao Paulo",
      "Brazil",
      null,
      "2020-07-20T00:00:00.000",
      "2020-07-20T00:00:00.000",
      2
    ],
    [
      "/plugins/test",
      "Boston",
      "United States",
      "www.npmjs.com",
      "2020-07-20T00:00:00.000",
      "2020-07-20T00:00:00.000",
      1
    ]
  ],
  "lastRefreshTime": "2020-08-27T07:22:44.424Z",
  "annotation": {
    "mapping": {
      "columns": [
        "PV.path",
        "PV.edgeCity",
        "PV.edgeCountry",
        "PV.referrerDomainName",
        "PV.eventDate.day",
        "PV.eventDate",
        "PV.pageViews"
      ]
    },
    "measures": {
      "PV.pageViews": {
        "title": "P V Number of Page Views",
        "shortTitle": "Number of Page Views",
        "type": "number",
        "drillMembers": [],
        "drillMembersGrouped": {
          "measures": [],
          "dimensions": []
        }
      }
    },
    "dimensions": {
      "PV.path": {
        "title": "P V Path",
        "shortTitle": "Path",
        "type": "string"
      },
      "PV.edgeCity": {
        "title": "P V City",
        "shortTitle": "City",
        "type": "string"
      },
      "PV.edgeCountry": {
        "title": "P V Country",
        "shortTitle": "Country",
        "type": "string"
      },
      "PV.referrerDomainName": {
        "title": "P V Referrer Domain Name",
        "shortTitle": "Referrer Domain Name",
        "type": "string"
      }
    },
    "segments": {},
    "timeDimensions": {
      "PV.eventDate.day": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      },
      "PV.eventDate": {
        "title": "P V Date",
        "shortTitle": "Date",
        "type": "time"
      }
    }
  }
}

So, basically, send the data as an array of records, where a record is an array of ray values in the original data type, and additionally an annotation.mapping object containing a colums property which defined the sequence of "columns" in the record.

In the example above, this would shrink the data payload from 577 chars to 329 chars, which is a reduction of around 43%. I assume this will be proportional to more "columns" and more records.

This change would also need to be included in the cube.js frontend libraries.

@tobilg
Copy link
Author

@tobilg tobilg commented Aug 27, 2020

@ifokeev
Copy link

@ifokeev ifokeev commented Aug 27, 2020

Yeah, have the same problem. That's better to have pandas dataframe split format:

{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html

That's the best format for rows data indeed. Also we will gain native pandas.read_json parsing

@amitripshtos
Copy link

@amitripshtos amitripshtos commented Aug 27, 2020

I totally agree also that it is crucial to run cubejs faster, with less memory and less network bandwidth.

I think the main discussion here should be:

  1. How do we retain backwards compatibility.
  2. How do we mask this change for the client side (this one quite easy - just like Python's NamedTuple or pandas tuple)

I also up to the task to help with this one!

@paveltiunov
Copy link
Contributor

@paveltiunov paveltiunov commented Sep 2, 2020

Let's introduce compact flag for load method passed by client and stick with format proposed by @tobilg. We can check something like NODE_ENV on the client side to pass this flag by default so production builds always use compact format.

@amitripshtos Contributions are very welcomed here!

@tobilg
Copy link
Author

@tobilg tobilg commented Sep 4, 2020

Hi @paveltiunov, could you eventually point us to the relevant code portions which are used for generating the output? Thanks!

@vasilev-alex
Copy link
Member

@vasilev-alex vasilev-alex commented Sep 9, 2020

Hey @tobilg! Please see

data: transformData(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.