Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excess filtering when trying to join a BigQuery partitioned table #1165

Open
shai-caazam opened this issue Sep 30, 2020 · 1 comment
Open

Excess filtering when trying to join a BigQuery partitioned table #1165

shai-caazam opened this issue Sep 30, 2020 · 1 comment
Assignees
Labels

Comments

@shai-caazam
Copy link

@shai-caazam shai-caazam commented Sep 30, 2020

We have call center app that measures calls and orders made.
We use BigQuery for data warehouse:

  • calls table (partitioned by date - partition filter is required)
  • orders table (partitioned by date - partition filter is required)

Creating cubes for both these tables - everything is OK:
Calls:

cube(`Calls`,
{sql: `SELECT * FROM analytics.calls WHERE ${FILTER_PARAMS.Calls.timestamp.filter('timestamp')}`,
...

Orders:

cube(`Orders`,
{sql: `SELECT * FROM analytics.orders WHERE ${FILTER_PARAMS.Orders.timestamp.filter('timestamp')}`,
...

So far so good. Problem is when we want to LEFT JOIN calls with orders:

cube(`CallOrders`, {
 sql: `SELECT * FROM analytics.calls WHERE ${FILTER_PARAMS.CallOrders.timestamp.filter('timestamp')}`,
 joins: {
   Orders: {
      relationship: `hasMany`,
      sql: `DATE(${CallOrders.timestamp}) = DATE(${Orders.timestamp})`
   },
},
....

The problem is that the resulting SQL filters each table according to its required partition filter and in addition adds a filter on the joined table. This filter will remove any rows that have calls but don't have orders (right side of the join is null) because the joined row doesn't match the "outer" filter (its timestamp is NULL).
This results in partial results - cannot see rows that have only calls but no orders.

SELECT
DATETIME_TRUNC(DATETIME(`calls`.timestamp,
        'Europe/Warsaw'),
    DAY) `calls__timestamp_day`,
DATETIME_TRUNC(DATETIME(`orders`.timestamp,
        'Europe/Warsaw'),
    DAY) `orders__timestamp_day`,
COUNT(DISTINCT `calls`.call_id) `calls__count`,
COUNT(DISTINCT
    CASE
      WHEN (`calls`.status = 'completed') THEN `calls`.call_id
  END
    ) `calls__calls_completed`,
COUNT(DISTINCT
    CASE
      WHEN (`calls`.status = 'failed') THEN `calls`.call_id
  END
    ) `calls__calls_failed`,
COUNT(DISTINCT `orders`.call_id) `orders__count`
FROM (
  SELECT
    *
  FROM
    analytics.calls
  WHERE
    timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND timestamp <= TIMESTAMP('2020-09-28T23:59:59Z')
  LIMIT
    10000) AS `calls`
LEFT JOIN (
  SELECT
    *
  FROM
    analytics.orders
  WHERE
    timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND timestamp <= TIMESTAMP('2020-09-28T23:59:59Z')
  LIMIT
    10000) AS `orders`
ON DATE(calls.timestamp, 'Europe/Warsaw') = DATE(orders.timestamp, 'Europe/Warsaw')
WHERE
(`calls`.timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND `calls`.timestamp <= TIMESTAMP('2020-09-28T23:59:59Z'))
AND (`orders`.timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND `orders`.timestamp <= TIMESTAMP('2020-09-28T23:59:59Z'))
GROUP BY
1,
2
ORDER BY
1 DESC
LIMIT
10000

What we really need is filters to apply to the inner pre-joined tables only.
Any way to get around this?

Thanks

@vasilev-alex
Copy link
Member

@vasilev-alex vasilev-alex commented Oct 2, 2020

Hey @shai-caazam! I would go for the data blending approach here https://cube.dev/blog/introducing-data-blending-api. It will eliminate the filtering problem as there will be run two queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.