Excess filtering when trying to join a BigQuery partitioned table #1165

shai-caazam · 2020-09-30T12:41:01Z

We have call center app that measures calls and orders made.
We use BigQuery for data warehouse:

calls table (partitioned by date - partition filter is required)
orders table (partitioned by date - partition filter is required)

Creating cubes for both these tables - everything is OK:
Calls:

cube(`Calls`,
{sql: `SELECT * FROM analytics.calls WHERE ${FILTER_PARAMS.Calls.timestamp.filter('timestamp')}`,
...

Orders:

cube(`Orders`,
{sql: `SELECT * FROM analytics.orders WHERE ${FILTER_PARAMS.Orders.timestamp.filter('timestamp')}`,
...

So far so good. Problem is when we want to LEFT JOIN calls with orders:

cube(`CallOrders`, {
 sql: `SELECT * FROM analytics.calls WHERE ${FILTER_PARAMS.CallOrders.timestamp.filter('timestamp')}`,
 joins: {
   Orders: {
      relationship: `hasMany`,
      sql: `DATE(${CallOrders.timestamp}) = DATE(${Orders.timestamp})`
   },
},
....

The problem is that the resulting SQL filters each table according to its required partition filter and in addition adds a filter on the joined table. This filter will remove any rows that have calls but don't have orders (right side of the join is null) because the joined row doesn't match the "outer" filter (its timestamp is NULL).
This results in partial results - cannot see rows that have only calls but no orders.

SELECT
DATETIME_TRUNC(DATETIME(`calls`.timestamp,
        'Europe/Warsaw'),
    DAY) `calls__timestamp_day`,
DATETIME_TRUNC(DATETIME(`orders`.timestamp,
        'Europe/Warsaw'),
    DAY) `orders__timestamp_day`,
COUNT(DISTINCT `calls`.call_id) `calls__count`,
COUNT(DISTINCT
    CASE
      WHEN (`calls`.status = 'completed') THEN `calls`.call_id
  END
    ) `calls__calls_completed`,
COUNT(DISTINCT
    CASE
      WHEN (`calls`.status = 'failed') THEN `calls`.call_id
  END
    ) `calls__calls_failed`,
COUNT(DISTINCT `orders`.call_id) `orders__count`
FROM (
  SELECT
    *
  FROM
    analytics.calls
  WHERE
    timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND timestamp <= TIMESTAMP('2020-09-28T23:59:59Z')
  LIMIT
    10000) AS `calls`
LEFT JOIN (
  SELECT
    *
  FROM
    analytics.orders
  WHERE
    timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND timestamp <= TIMESTAMP('2020-09-28T23:59:59Z')
  LIMIT
    10000) AS `orders`
ON DATE(calls.timestamp, 'Europe/Warsaw') = DATE(orders.timestamp, 'Europe/Warsaw')
WHERE
(`calls`.timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND `calls`.timestamp <= TIMESTAMP('2020-09-28T23:59:59Z'))
AND (`orders`.timestamp >= TIMESTAMP('2020-08-28T00:00:00Z')
    AND `orders`.timestamp <= TIMESTAMP('2020-09-28T23:59:59Z'))
GROUP BY
1,
2
ORDER BY
1 DESC
LIMIT
10000

What we really need is filters to apply to the inner pre-joined tables only.
Any way to get around this?

Thanks

vasilev-alex · 2020-10-02T07:27:09Z

Hey @shai-caazam! I would go for the data blending approach here https://cube.dev/blog/introducing-data-blending-api. It will eliminate the filtering problem as there will be run two queries.

paveltiunov assigned vasilev-alex Sep 30, 2020

vasilev-alex added the question label Oct 2, 2020

cube-js / cube.js

Excess filtering when trying to join a BigQuery partitioned table #1165

Excess filtering when trying to join a BigQuery partitioned table #1165

shai-caazam commented Sep 30, 2020

vasilev-alex commented Oct 2, 2020 •

edited

cube-js / cube.js

Join GitHub today

Excess filtering when trying to join a BigQuery partitioned table #1165

Excess filtering when trying to join a BigQuery partitioned table #1165

Comments

shai-caazam commented Sep 30, 2020

vasilev-alex commented Oct 2, 2020 • edited

Essential cookies

Always active

Analytics cookies

vasilev-alex commented Oct 2, 2020 •

edited