How To Replicate Oracle Data to BigQuery With Google Cloud Datastream
Set up Oracle 19c to BigQuery replication using Google Cloud Datastream, covering prerequisites, networking, CDC setup, and job validation.
This technical guide outlines the steps to set up data replication using Google Cloud Datastream. Specifically, it details the process of replicating data from an Oracle 19c database hosted on a Google Compute Engine virtual machine into Google BigQuery. The tutorial covers all necessary steps: enabling APIs and configuring firewalls, setting up the Oracle source environment, establishing secure networking, creating connection profiles for both Oracle and BigQuery in Datastream, preparing the Oracle database for Change Data Capture (CDC), and finally, creating and validating the Datastream replication job.
Prerequisites
Before proceeding, ensure the following Google Cloud APIs are enabled within your project. This is done through the Google Cloud Console under APIs & Services.
- Compute Engine API
- Service Health API
- Datastream API
Create a Firewall rule:
- Name: dzone-firewallrule
- Source IPv4 Ranges: 10.0.101.0/24 (Adjust to match the internal IP range of your Oracle VM's subnet).
- Specified Ports: 22 (for SSH access), 1521 (Oracle Listener default port), TCP protocol.
- Click on "CREATE."
Create another firewall rule:
- Name: dzone-datastream
- Network: dzone-custom-vpc
- Direction of traffic: Ingress (Incoming connections)
- Action on match: Allow
- Targets: All instances in the network
- Source Filters: IPV4 Ranges
- Source IPV4 range: 10.0.103.0/29 (Datastream's private connectivity IP range; this may vary depending on your setup)
- Protocols and Ports: Allow All
- Click on "CREATE."
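If you prefer the command line, the two rules above can also be created with gcloud. This is a minimal sketch using the names, network, and IP ranges from this guide; adjust them to match your environment before running.

```shell
# Rule 1: allow SSH (22) and the Oracle listener (1521) from the VM subnet's range.
gcloud compute firewall-rules create dzone-firewallrule \
  --network=dzone-custom-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:22,tcp:1521 \
  --source-ranges=10.0.101.0/24

# Rule 2: allow all traffic from Datastream's private connectivity range.
gcloud compute firewall-rules create dzone-datastream \
  --network=dzone-custom-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=all \
  --source-ranges=10.0.103.0/29
```

Both commands require an authenticated gcloud session with permissions on the target project.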
Setting up Oracle Database in Google Compute Engine
In today's data-driven world, database systems are essential. Oracle Database offers a Relational Database Management System (RDBMS) for handling critical business information reliably and securely.
To gain flexibility and scalability, organizations can choose to host these databases in the cloud. Google Compute Engine (GCE), a core part of Google Cloud Platform, provides high-performance Virtual Machines (VMs). This allows businesses to run their Oracle Database workloads on GCE to address their specific needs.
- Navigate to Compute Engine and click on "CREATE INSTANCE" to create the Compute Engine VM where we will install the Oracle Database.
- Specify the VM name as oracle19c and leave the rest of the settings at their defaults.
- Under OS and storage, click on "CHANGE."
- For demonstration purposes, we are using Red Hat Enterprise Linux 8 as the operating system; update the settings according to your environment. Click on "SELECT."
- Select the VPC (or create a custom VPC) and click on "CREATE."
- Once the VM is created successfully, click on "SSH" to log in to the VM.
- Click on "Authorize" and verify that you are able to log in to the Compute Engine VM.
- In case wget is not installed on your VM, run:
sudo yum install wget
- Go to Oracle and download Oracle Database 19c (19.3) for Linux x86-64 (RPM).
- Cancel the download, then copy the download URL and run wget with that URL in your VM to fetch the Enterprise Edition RPM.
- Download the preinstall package:
curl -o oracle-database-preinstall-19c-1.0-2.el8.x86_64.rpm https://yum.oracle.com/repo/OracleLinux/OL8/appstream/x86_64/getPackage/oracle-database-preinstall-19c-1.0-2.el8.x86_64.rpm
- Install the preinstall package:
sudo yum -y localinstall oracle-database-preinstall-19c-1.0-2.el8.x86_64.rpm
- Remove the preinstall RPM once it is installed:
rm oracle-database-preinstall-19c-1.0-2.el8.x86_64.rpm
- Install the Oracle Database Enterprise Edition RPM:
sudo yum -y localinstall oracle-database-ee-19c-1.0-1.x86_64.rpm
- Notice the message confirming the successful installation of the Oracle Database.
- Configure the database instance:
sudo /etc/init.d/oracledb_ORCLCDB-19c configure
- Log in to the oracle user account:
sudo su - oracle
- Edit .bash_profile (vi .bash_profile) and add:
export ORACLE_HOME=/opt/oracle/product/19c/dbhome_1
export PATH=$PATH:$ORACLE_HOME/bin
- Capture the VM's private IP, copy the hostname from $ORACLE_HOME/network/admin/tnsnames.ora, and update the /etc/hosts file.
- Update tnsnames.ora, setting the HOST to the internal IP address of the Compute Engine VM.
MY_PDB_CONNECTION_ALIAS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 10.0.101.26)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ORCLPDB1)
    )
  )
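The tnsnames.ora entry above can also be generated from a script, which is handy if you rebuild the VM. This is a minimal sketch that writes the entry to a scratch file for review; the alias, port, and service name match the listing above, and the IP address is an assumption you should replace with your VM's internal IP.

```shell
#!/bin/sh
# Internal IP of the Oracle VM (assumption: replace with your VM's actual internal IP).
ORACLE_VM_IP="10.0.101.26"
# Write to a scratch file first; review it before appending to $ORACLE_HOME/network/admin/tnsnames.ora.
TNS_FILE="/tmp/tnsnames_entry.ora"

# Generate the TNS entry with the IP substituted in.
cat > "$TNS_FILE" <<EOF
MY_PDB_CONNECTION_ALIAS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = ${ORACLE_VM_IP})(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = ORCLPDB1)
    )
  )
EOF

# Show the generated ADDRESS line as a quick sanity check.
grep "HOST = ${ORACLE_VM_IP}" "$TNS_FILE"
```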
- Connect to the database as SYSDBA:
sqlplus / as sysdba
- Run the command:
SELECT name FROM v$pdbs;
Capture the PDB name, then run:
ALTER SESSION SET CONTAINER=ORCLPDB1;
- Install the sample HR schema:
@/opt/oracle/product/19c/dbhome_1/demo/schema/human_resources/hr_main.sql
Specify the password you want to use, the tablespace as users, the temporary tablespace as temp, and the log directory as $ORACLE_HOME/demo/schema/log/.
- If there are no errors and the commit completes, you should see a success message.
Setting up Google Datastream
Google Datastream is a fully managed, serverless Change Data Capture (CDC) and replication service that enables you to synchronize data across heterogeneous data sources reliably and with minimal latency. It simplifies building data pipelines from databases like MySQL, Oracle, and PostgreSQL into Google Cloud services like BigQuery and Cloud Storage, empowering real-time analytics and data warehousing.
- Navigate to Datastream and click on "Private connectivity."
- Click on "CREATE CONFIGURATION."
- Specify the Configuration name and Configuration ID as dzone-oracle-configuration, Region as us-central1, VPC dzone-custom-vpc and IP address range as 10.0.103.0/29 then click on "CREATE."
- If you notice an error referencing "constraints/compute.restrictVpcPeering," the VPC peering organization policy is blocking Datastream; override it as follows.
- Navigate to Organization policies, search for constraints/compute.restrictVpcPeering and click on "Restrict VPC peering usage."
- Click on "Manage Policy."
- Click on "Override parent's policy," set Policy enforcement to "Replace" and Policy values to "Allow All," click on "Done," and then click on "Set policy."
- Navigate back to Datastream and delete the old configuration and click on "CREATE CONFIGURATION."
- Specify the Configuration name and Configuration ID as dzone-oracle-configuration, Region as us-central1, VPC dzone-custom-vpc and IP address range as 10.0.103.0/29 and click on "CREATE."
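The private connectivity configuration can also be created from the CLI. This is a hedged sketch using the gcloud Datastream command group; flag names may vary slightly between SDK versions, so verify them with gcloud datastream private-connections create --help before running.

```shell
# Create the Datastream private connectivity configuration in the custom VPC,
# using the same name, region, and /29 range as the console steps above.
gcloud datastream private-connections create dzone-oracle-configuration \
  --location=us-central1 \
  --display-name=dzone-oracle-configuration \
  --vpc=dzone-custom-vpc \
  --subnet=10.0.103.0/29
```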
- Navigate to Connection profiles and click on "CREATE PROFILE."
- Click on "BigQuery."
- Specify the Connection profile name and Connection profile ID as dzone-bigquery-profile, Region as us-central1 and click on "CREATE."
- Click on "CREATE PROFILE."
- Click on "Oracle."
- Specify the Connection profile ID as dzone-oracle-profile, Region as us-central1, Hostname as 10.0.101.26, Port as 1521, username as HR, Password as Oracle123, System identifier as ORCL.
- Click on "CONTINUE."
- Click on "Connectivity method," set the Private connectivity configuration to dzone-oracle-configuration, and click on "CONTINUE."
- Navigate to the Oracle Database running in the Compute Engine VM. Run the commands:
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;
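Before moving on, it is worth confirming that the database really is in ARCHIVELOG mode, since Datastream's LogMiner-based CDC depends on it. A quick sanity check from the oracle OS user (a sketch; run it in the same SQL*Plus session if you prefer):

```shell
# Query the current log mode (run as the oracle OS user on the VM).
sqlplus -S / as sysdba <<EOF
SELECT log_mode FROM v\$database;
EOF
# Expected: LOG_MODE shows ARCHIVELOG after the ALTER DATABASE ARCHIVELOG step.
```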
- Exit out of the Oracle Database, then run the commands:
rman target /
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 4 DAYS;
exit
- Connect back to the Oracle database with the command:
sqlplus / as sysdba
- Run the commands:
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
ALTER DATABASE ADD SUPPLEMENTAL LOG DATA (ALL) COLUMNS;
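You can verify that supplemental logging took effect with a query against v$database; this is a small sketch run from the oracle OS user:

```shell
# Both columns should report YES after the two ALTER DATABASE statements above.
sqlplus -S / as sysdba <<EOF
SELECT supplemental_log_data_min, supplemental_log_data_all FROM v\$database;
EOF
```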
- Run the commands:
CREATE USER C##dstream IDENTIFIED BY welcome1 CONTAINER=ALL;
GRANT CREATE SESSION TO C##dstream;
GRANT SET CONTAINER TO C##dstream;
GRANT SELECT ON SYS.V_$DATABASE TO C##dstream;
GRANT SELECT ON SYS.V_$LOGMNR_CONTENTS TO C##dstream;
GRANT EXECUTE ON DBMS_LOGMNR TO C##dstream;
GRANT EXECUTE ON DBMS_LOGMNR_D TO C##dstream;
GRANT LOGMINING TO C##dstream;
GRANT EXECUTE_CATALOG_ROLE TO C##dstream;
- Run the commands:
ALTER PLUGGABLE DATABASE ORCLPDB1 OPEN;
ALTER SESSION SET CONTAINER=ORCLPDB1;
GRANT CREATE SESSION TO C##dstream;
GRANT SET CONTAINER TO C##dstream;
GRANT SELECT ANY TABLE TO C##dstream;
GRANT SELECT ON SYS.V_$DATABASE TO C##dstream;
GRANT SELECT ON SYS.V_$ARCHIVED_LOG TO C##dstream;
GRANT SELECT ON DBA_SUPPLEMENTAL_LOGGING TO C##dstream;
GRANT SELECT ON DBA_EXTENTS TO C##dstream;
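Before pointing Datastream at the new user, a quick connectivity check from the VM can save a failed test later. This is a hypothetical sketch: it assumes the service name registered by the listener is ORCL, matching the System identifier used in the connection profile (depending on how you configured the instance, it may be ORCLCDB instead).

```shell
# Connect over TCP as the new Datastream user and confirm the session works.
# Host, port, and service name are assumptions from this guide's setup.
sqlplus C##dstream/welcome1@//10.0.101.26:1521/ORCL <<EOF
SHOW USER;
EXIT;
EOF
```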
- Navigate to Connection profiles and update the dzone-oracle-profile profile with the new username and password (C##dstream / welcome1). Click on "SAVE."
- Create the Data Stream.
- Click on "CONTINUE."
- Select the Source connection profile as dzone-oracle-profile and click on "RUN TEST." Once you get the "Test passed" message, click on "CONTINUE."
- Select objects to include as Specific schemas and tables, select the HR schema.
- Under "Specify the CDC method," select LogMiner. Click on "CONTINUE."
- Select the dzone-bigquery-profile and click on "CONTINUE."
- Select the Schema Grouping as Dataset for each schema and Stream mode as Append-only.
- Click on "RUN VALIDATION"; once all the validations are successful, click on "CREATE & START."
- Notice that the HR schema and the tables are created successfully in BigQuery.
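You can confirm the replication from the command line with the bq tool. This is a hedged sketch: with "Dataset for each schema" grouping, the dataset is assumed to be named HR, and EMPLOYEES is one of the sample schema's tables; adjust the names if your dataset was created differently.

```shell
# List the tables Datastream created in the HR dataset (dataset name assumed).
bq ls HR

# Count replicated rows in one of the HR tables (table name assumed).
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM HR.EMPLOYEES'
```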
Summary
We successfully configured a data replication pipeline using Google Cloud Datastream. We created specific firewall rules within the custom VPC to allow traffic for SSH (port 22), Oracle (port 1521), and Datastream's private connectivity IP range. We provisioned a Compute Engine VM, installed Oracle Database 19c Enterprise Edition, configured the database instance and network settings and populated it with the sample HR schema. We established a Datastream private connectivity configuration linked to the custom VPC.
We created Datastream connection profiles for the Oracle source and for the BigQuery destination. We configured the Oracle database for CDC by enabling ARCHIVELOG mode, setting supplemental logging, and creating a dedicated Datastream user with the necessary LogMiner and object access privileges. We successfully created and tested a stream, selecting the Oracle and BigQuery connection profiles, targeting the HR schema, and using the LogMiner method for CDC.
Opinions expressed by DZone contributors are their own.