How to copy Hadoop data from On-Premise to Google Cloud Storage (GCS)?

Cloud Storage Connector

Cloud Storage Connector is an open-source Java library developed by Google. It lets us copy data from an On-Premise Hadoop cluster to Google Cloud Storage.

With this connector, we can access Cloud Storage data from an On-Premise machine. Apache Hadoop and Spark jobs can also access files in Cloud Storage through it.

Configure the connector in the On-Premise Hadoop cluster

To access Google Cloud Storage from On-Premise, we need to configure the Cloud Storage Connector in our Hadoop cluster. The following steps are required:

  • Copy the connector JAR file to the $HADOOP_COMMON_LIB_JARS_DIR directory (e.g. hadoop/3.3.3/libexec/share/hadoop/common).
  • Create a service account for the GCP project and download its key file in JSON format.
  • Copy the service account key file to every node in the on-prem Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop).
  • Add the connector properties to the core-site.xml file of the Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop/core-site.xml).
    • In the property google.cloud.auth.service.account.json.keyfile, replace /path/to/keyfile with the actual path of the service account key file. For more details, refer to the connector’s GitHub repository.
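The properties mentioned in the last step can be sketched as follows. This is a minimal sketch, assuming the 2.x-style authentication properties of the connector (the google.cloud.auth.* names that match the keyfile property above). The project ID your-gcp-project-id, the keyfile path, and the scratch file name gcs-connector-properties.xml are placeholders, not values from this setup. The script only writes the fragment to a scratch file, so it can be reviewed and then merged by hand into the <configuration> element of core-site.xml.

```shell
# Write the connector properties to a scratch file for review.
# Placeholders (assumptions): your-gcp-project-id and /path/to/keyfile.
cat > gcs-connector-properties.xml <<'EOF'
<!-- Merge these entries into the <configuration> element of core-site.xml -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-gcp-project-id</value>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
</property>
EOF
echo "Wrote gcs-connector-properties.xml"
```

Newer connector releases may use different authentication property names, so it is worth checking the connector documentation for the version of the JAR in use.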

Command to copy the HDFS file from On-Premise to a GCS bucket

Now we can copy an HDFS file to a GCS bucket. We have a file, Online_Retail_Dataset.csv, in the on-prem Hadoop cluster.

Source location

Destination location

We want to copy this file to the Google Cloud Storage bucket rc_projects, which is currently empty.

GCS bucket

Let’s run the HDFS copy command to copy the file Online_Retail_Dataset.csv to the GCS bucket rc_projects.
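The copy can be sketched as below, assuming the hdfs dfs -cp form of the copy command. The file and bucket names come from the example above; the HDFS directory /path/to is a placeholder, since the actual source directory depends on the cluster. The guard around the command is an addition, so that the snippet only prints what it would run on a machine without the Hadoop client.

```shell
# Hypothetical HDFS source directory; replace /path/to with the real one.
SRC="/path/to/Online_Retail_Dataset.csv"
# Destination bucket, addressed through the connector's gs:// scheme.
DEST="gs://rc_projects/"

if command -v hdfs >/dev/null 2>&1; then
  # Copy the file from HDFS into the GCS bucket.
  hdfs dfs -cp "$SRC" "$DEST"
else
  echo "hdfs client not found; would run: hdfs dfs -cp $SRC $DEST"
fi
```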

The command executed successfully. We can check the file in GCS by running the hdfs dfs -ls command against the bucket.
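A listing of the bucket might look like the following sketch. It assumes the connector is configured as described earlier, so that gs:// paths resolve from the Hadoop client. Only the bucket name is taken from the example; the variable name is illustrative.

```shell
# Bucket from the example, addressed through the connector's gs:// scheme.
BUCKET="gs://rc_projects/"

if command -v hdfs >/dev/null 2>&1; then
  # List the bucket contents; the copied file should appear here.
  hdfs dfs -ls "$BUCKET"
else
  echo "hdfs client not found; would run: hdfs dfs -ls $BUCKET"
fi
```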

Let’s verify the same in the Google Cloud console. As shown below, the file Online_Retail_Dataset.csv is present in the GCS bucket.

Data migration from On-prem Hadoop to GCS

If the HDFS data is large, we can copy it with the hadoop distcp command instead. DistCp runs the copy as a distributed MapReduce job, which typically speeds up the copy process.
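A DistCp invocation for the same file could look like this sketch. As before, /path/to is a placeholder for the real HDFS directory, and the check for the hadoop client is an addition to keep the snippet harmless on machines without Hadoop.

```shell
# Hypothetical HDFS source directory; replace /path/to with the real one.
DISTCP_SRC="/path/to/Online_Retail_Dataset.csv"
DISTCP_DEST="gs://rc_projects/"

if command -v hadoop >/dev/null 2>&1; then
  # Launch the distributed copy from HDFS to the GCS bucket.
  hadoop distcp "$DISTCP_SRC" "$DISTCP_DEST"
else
  echo "hadoop client not found; would run: hadoop distcp $DISTCP_SRC $DISTCP_DEST"
fi
```

The source can also be a directory, in which case DistCp spreads the files across its map tasks.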

Finally, we have successfully migrated the on-prem Hadoop data to Google Cloud Storage.
