
I have a situation here. I want to figure out the best way to ingest streaming API data from an application into GCP BigQuery while having data masking in place. However, some downstream admin users will also need to see the unmasked data.

What I am thinking is to implement event-based data ingestion: use Pub/Sub to trigger a Dataflow job as soon as a new file is published, with two branches inside the Dataflow pipeline.

Branch 1: Call DLP to mask the incoming data and load it into a table T1 in BigQuery.
Branch 2: Use the "Pub/Sub topic to BigQuery" template to load the unmasked (as-is) data from the source into another table T2 in BigQuery.

I can later use role-based access control to give general users access to T1 and admins access to T2.
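For example, dataset-level access could be granted along these lines (a sketch using the google-cloud-bigquery client; the project, dataset, and group names here are hypothetical):

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical dataset holding the masked table T1 for general users.
dataset = client.get_dataset("your-project.masked_dataset")

# Append a READER entry for the general-user group, then persist the change.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="general-users@example.com",  # hypothetical group
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])

The same pattern, pointed at the dataset holding T2 with an admin group, would cover the second table.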

My question to you is about the first branch in the Dataflow pipeline. Is there any template available that calls DLP and masks the incoming data row by row? How can this be done? Do I need to use Apache Beam here?

Or is my entire design wrong, and could a better approach be implemented as a whole? Please guide me.

I'd like direction for my next project so I can build the Dataflow pipeline accordingly.

2 Answers


Your approach seems reasonable. I do not think there is a template available for this, but it is quite easy to create one. For example, if you use Python, it will look roughly like this:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Pub/Sub sources require a streaming pipeline.
with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    raw_messages = (p
                | 'Read from PubSub' >> beam.io.ReadFromPubSub(
                      topic='projects/your-project/topics/your-pubsub-topic'))
    # Branch 1: mask with DLP, then load the masked table T1.
    _ = (raw_messages
                | 'Mask with DLP' >> beam.Map(process_message_using_dlp)
                | 'Write masked to BigQuery' >> beam.io.WriteToBigQuery(
                      'your-dataset.t1', schema='payload:STRING'))
    # Branch 2: load the unmasked data as-is into table T2
    # (WriteToBigQuery expects dict rows, so decode the raw bytes first).
    _ = (raw_messages
                | 'To row' >> beam.Map(lambda msg: {'payload': msg.decode('utf-8')})
                | 'Write raw to BigQuery' >> beam.io.WriteToBigQuery(
                      'your-dataset.t2', schema='payload:STRING'))
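The process_message_using_dlp step is left undefined above. A minimal sketch of it, assuming the google-cloud-dlp client and character masking of emails and phone numbers (the project ID, info types, and output field are placeholders to adapt):

import google.cloud.dlp_v2

def process_message_using_dlp(message):
    """Mask sensitive values in a Pub/Sub message using Cloud DLP."""
    dlp = google.cloud.dlp_v2.DlpServiceClient()
    response = dlp.deidentify_content(
        request={
            "parent": "projects/your-project/locations/global",
            # Replace every matched character with '#'.
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [{
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }]
                }
            },
            # Info types to look for; adjust to your data.
            "inspect_config": {
                "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]
            },
            "item": {"value": message.decode("utf-8")},
        }
    )
    # Return a dict row matching the T1 table schema.
    return {"payload": response.item.value}

In a real pipeline you would create the DLP client once per worker (for example in a DoFn's setup method) rather than per message, since client construction is relatively expensive.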

I believe that you could simplify your architecture and write directly to BigQuery without using DLP. DLP is a great product, but for your use case I believe BigQuery column-level data masking is enough. When you write to BigQuery with Dataflow, make sure you use the BigQuery Storage Write API: https://cloud.google.com/bigquery/docs/column-data-masking-intro#bigquery-storage-read-api

I haven't tested it, but I believe you could use this template: https://cloud.google.com/dataflow/docs/guides/templates/provided/pubsub-to-bigquery (set the useStorageWriteApi parameter to true).
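If you end up writing your own pipeline rather than using the template, Beam's WriteToBigQuery sink can be pointed at the Storage Write API directly. A minimal sketch, with placeholder topic, table, and schema:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    _ = (p
         | 'Read from PubSub' >> beam.io.ReadFromPubSub(
               topic='projects/your-project/topics/your-topic')
         | 'To row' >> beam.Map(lambda msg: {'payload': msg.decode('utf-8')})
         # Use the Storage Write API so column-level masking policies apply.
         | 'Write to BigQuery' >> beam.io.WriteToBigQuery(
               'your-project:your_dataset.your_table',
               schema='payload:STRING',
               method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API))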
