5,575 questions
0
votes
1
answer
24
views
Disable auto scaling for templated jobs
In Dataflow, you can run jobs without autoscaling. This is typically achieved by setting a pipeline_option called autoscaling_algorithm to NONE. Attempting the equivalent on Templated Dataflow Jobs ...
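For context, a minimal sketch of the non-templated case the excerpt describes, using Beam's Python SDK worker options; the project, region, and bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Worker options recognized by the Dataflow runner: with autoscaling
# disabled, num_workers pins the worker pool size for the whole job.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                # placeholder
    region="us-central1",                # placeholder
    temp_location="gs://my-bucket/tmp",  # placeholder
    autoscaling_algorithm="NONE",
    num_workers=3,
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(print)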
0
votes
1
answer
48
views
How to prevent deletions from source (GCP CloudSQL MySQL) reflecting in GCP BigQuery using Datastream?
Description:
We are currently using Google Cloud Datastream to replicate data from a CloudSQL (MySQL) instance into BigQuery in near real-time. The replication works perfectly for insert and update ...
0
votes
0
answers
33
views
Azure Data Factory / Data Flow, how to extract data from JSON where the ids are the keys?
In an Azure Data Factory Data Flow, I am using a REST endpoint as the data source to get a JSON of data. However, the data arrives in a strange format: it is a dictionary of keys where the key value is ...
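To pin down the reshape being asked about, here is a hypothetical Python sketch (field names invented) of turning a dict keyed by id into a list of rows with the id promoted to a column:

# The dictionary-of-keys shape the REST endpoint returns ...
payload = {
    "101": {"name": "alice", "score": 7},
    "102": {"name": "bob", "score": 9},
}
# ... flattened so each key becomes an ordinary "id" field.
rows = [{"id": key, **fields} for key, fields in payload.items()]
print(rows)  # [{'id': '101', 'name': 'alice', 'score': 7}, ...]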
0
votes
0
answers
49
views
+50 bounty
BigQuery Performance Issue After Switching Data Pipeline to Dataflow
Problem
I'm experiencing significant query performance degradation in BigQuery for recent partitions after switching our data pipeline from a sequential Talend approach to Apache Beam/Dataflow.
...
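One mitigation that commonly comes up for slow queries over recent partitions is writing via batch load jobs rather than streaming inserts, since streamed rows sit in the streaming buffer before being columnarized. A hedged Beam Python sketch; the table, schema, and bucket are placeholders:

import apache_beam as beam

with beam.Pipeline() as p:
    rows = p | beam.Create([{"id": 1, "ts": "2024-01-01"}])
    # FILE_LOADS writes through BigQuery load jobs instead of the
    # streaming API, so rows land directly in managed storage.
    _ = rows | beam.io.WriteToBigQuery(
        "my-project:dataset.events",  # placeholder
        schema="id:INTEGER,ts:STRING",
        method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
        custom_gcs_temp_location="gs://my-bucket/bq-tmp",  # placeholder
    )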
0
votes
1
answer
33
views
Azure DF error: Unable to parse expression
I am trying to use a dataset parameter set in the pipeline to make my blob path dynamic for each data flow I've created. However, just testing this first data flow, I keep getting an error saying '...
0
votes
1
answer
28
views
Cloud Scheduler to trigger dataflow flex template
I'm struggling to make my Flex Template work with Cloud Scheduler.
I was able to create it, and I can run it from my local machine, through Dataflow's "create job from template", or using a ...
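For reference, a Cloud Scheduler HTTP job for this typically just POSTs to the flexTemplates.launch REST endpoint. A sketch of that call in Python, handy for validating the request body before wiring it into Scheduler; the template path and parameters are placeholders:

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Application-default credentials with the cloud-platform scope.
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
session = AuthorizedSession(credentials)

region = "us-central1"  # placeholder
url = (f"https://dataflow.googleapis.com/v1b3/projects/{project}"
       f"/locations/{region}/flexTemplates:launch")
body = {
    "launchParameter": {
        "jobName": "scheduled-flex-run",
        "containerSpecGcsPath": "gs://my-bucket/templates/spec.json",  # placeholder
        "parameters": {"input": "gs://my-bucket/in/*.csv"},  # placeholder
    }
}
resp = session.post(url, json=body)
resp.raise_for_status()
print(resp.json()["job"]["id"])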
0
votes
1
answer
63
views
Dataflow Flex Template Docker issue: Cannot start an expansion service since neither Java nor Docker executables are available in the system
I'm trying to run a Dataflow job using a flex template in Docker. Here is what I have:
FROM python:3.11-slim
COPY --from=apache/beam_python3.11_sdk:2.54.0 /opt/apache/beam /opt/apache/beam
COPY --from=...
2
votes
1
answer
50
views
What’s the difference between regular Apache Beam connectors and Managed I/O?
Apache Beam recently introduced Managed I/O APIs for Java and Python. What is the difference between Managed I/O and the regular Apache Beam connectors (sources and sinks)?
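A rough sketch of what the Managed I/O style looks like in the Python SDK, contrasted with a classic connector call; the managed config keys vary by connector and SDK version, so treat them as assumptions:

import apache_beam as beam
from apache_beam.transforms import managed

with beam.Pipeline() as p:
    # Managed style: one generic, config-driven transform that the
    # runner can manage and upgrade on the pipeline's behalf.
    iceberg_rows = p | managed.Read(
        managed.ICEBERG,
        config={  # assumed keys, for illustration only
            "table": "db.events",
            "catalog_name": "my_catalog",
        },
    )
    # Classic connector style: a dedicated transform with its own
    # parameters, pinned to the SDK version it ships with.
    lines = p | beam.io.ReadFromText("gs://my-bucket/in/*.txt")  # placeholder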
0
votes
1
answer
44
views
Apache Beam Cross-language JDBC (MSSQL) - incorrect negative Integer type conversion
We use the JDBC cross-language transform to read data from MSSQL into BigQuery, and we noticed that negative integers are being converted incorrectly.
For example: if we have an INT column in the source with value (-1), ...
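Pending a root-cause fix, one hedged workaround for cross-language type-mapping surprises is to cast in the source query so the wire type is unambiguous. A sketch with Beam's JDBC read; connection details are placeholders:

from apache_beam.io.jdbc import ReadFromJdbc

read = ReadFromJdbc(
    table_name="dbo.events",  # placeholder; the query below drives the read
    driver_class_name="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    jdbc_url="jdbc:sqlserver://db-host:1433;databaseName=mydb",  # placeholder
    username="user",      # placeholder
    password="secret",    # placeholder
    # CAST to BIGINT so negative values survive the conversion intact.
    query="SELECT CAST(int_col AS BIGINT) AS int_col FROM dbo.events",
)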
0
votes
1
answer
31
views
Escaping non-delimiters in a large csv file in Power BI Dataflow
I am currently attempting to read a large csv file (2.05GB) into Power BI's dataflow. The csv file has 5 million rows and 38 columns (as read separately in a Jupyter notebook), and there are some cells ...
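For concreteness, the quoting rule at stake, shown with Python's csv module (CSV parsers, Power BI's included, generally follow the same RFC 4180 convention): a delimiter inside a quoted field must not split the row.

import csv
import io

raw = 'id,comment\n1,"contains, a comma"\n2,"embedded ""quote"""\n'
for row in csv.reader(io.StringIO(raw)):
    print(row)
# ['id', 'comment']
# ['1', 'contains, a comma']
# ['2', 'embedded "quote"']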
0
votes
1
answer
34
views
How does Dataflow charge for read operations from Cloud Storage?
I am trying to understand how Google Cloud Dataflow charges when reading a file with beam.io.ReadFromText. From my understanding, every time something is read from a Google Cloud bucket, it incurs ...
1
vote
2
answers
57
views
Vertical autoscaling Dataflow experiments args don't get properly parsed
We want to enable vertical autoscaling on our Dataflow Prime pipeline for a Python container:
https://cloud.google.com/dataflow/docs/vertical-autoscaling
We're trying to run our pipeline through this ...
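For context, a hedged sketch of how experiments flags are usually handed to Beam Python options; one frequent parsing pitfall is joining several values into a single comma-separated flag instead of repeating it. The experiment names below are placeholders for whatever the vertical-autoscaling docs prescribe:

from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    flags=[
        "--dataflow_service_options=enable_prime",
        # Repeat --experiments once per value; names are placeholders.
        "--experiments=experiment_one",
        "--experiments=experiment_two",
    ]
)
print(options.get_all_options()["experiments"])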
0
votes
1
answer
39
views
Leaving message unacknowledged in Benthos job with gcp_pubsub input
How does Benthos handle the acknowledgement of pubsub messages? How can we manage ack/unack based on custom if-else conditions?
Here is the scenario I'm trying to achieve:
I have written a Benthos job ...
0
votes
2
answers
64
views
GCP Batch Dataflow - Records Dropped while inserting to BigQuery
I'm using GCP Batch Dataflow to process data that I'm picking from a table. The input here is table data, where I'm using a query in Java to get the data.
After processing, when I'm trying to insert the ...
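The pipeline in question is Java, but the dead-letter pattern usually suggested for silently dropped rows translates directly; a Python sketch for consistency with the other examples here (attribute names follow recent Beam releases; the table is a placeholder): capture the writer's failed-rows output instead of discarding it.

import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

with beam.Pipeline() as p:
    rows = p | beam.Create([{"id": 1}, {"id": "not-an-int"}])
    result = rows | beam.io.WriteToBigQuery(
        "my-project:dataset.table",  # placeholder
        schema="id:INTEGER",
        method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
        insert_retry_strategy=RetryStrategy.RETRY_NEVER,
    )
    # Rows BigQuery rejected surface here; route them to a dead-letter
    # table or sink instead of letting them vanish.
    _ = result.failed_rows | beam.Map(print)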
0
votes
0
answers
35
views
Tracking FlowFile UUID Across Processors in Apache NiFi 2.1.0
How can I effectively track the original FlowFile UUID across different processors in Apache NiFi, especially after using the SplitJson processor, which creates new FlowFiles with different UUIDs? I ...