How to copy Hadoop data from On-Premise to Google Cloud Storage (GCS)?

Cloud Storage Connector

The Cloud Storage Connector is an open-source Java library developed by Google. It allows us to copy data from an On-Premise Hadoop cluster to Google Cloud Storage.

With the help of this connector, we can access Cloud Storage data from the On-Premise machine. Apache Hadoop and Spark jobs can also read and write files in Cloud Storage through it.
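For example, once the connector is configured (see the next section), gs:// URIs can be used wherever HDFS paths are accepted. Below is a rough sketch; my-bucket and the object paths are placeholders and will differ in your environment.

# list and read objects in a Cloud Storage bucket from the on-prem cluster
hadoop fs -ls gs://my-bucket/
hadoop fs -cat gs://my-bucket/sample.csv

# run the bundled wordcount example with input and output in Cloud Storage
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount gs://my-bucket/input gs://my-bucket/output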

Configure the connector in the On-Premise Hadoop cluster

To access Google Cloud Storage from On-Premise, we need to configure the Cloud Storage Connector in our Hadoop cluster. The following steps need to be done.

  • Copy the connector JAR file to the $HADOOP_COMMON_LIB_JARS_DIR directory (e.g. hadoop/3.3.3/libexec/share/hadoop/common).
  • Create a service account for the GCP project and download its key file in JSON format.
  • Copy the service account key file to every node in the on-prem Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop). A shell sketch of these first three steps is shown after the XML snippet below.
  • Add the below properties to the core-site.xml file in the Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop/core-site.xml).
    • Replace /path/to/keyfile with the actual service account key file path in the property google.cloud.auth.service.account.json.keyfile. For more details, please refer to the connector's GitHub documentation.
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for 'gs:' URIs.</description>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value></value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>google.cloud.auth.type</name>
  <value>SERVICE_ACCOUNT_JSON_KEYFILE</value>
  <description>
    Authentication type to use for GCS access.
  </description>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
  <description>
    The JSON keyfile of the service account used for GCS
    access when google.cloud.auth.type is SERVICE_ACCOUNT_JSON_KEYFILE.
  </description>
</property>
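The first three steps above can be scripted roughly as below. This is only a minimal sketch: the JAR file name, service account e-mail, project and node names are placeholders to replace with your own values, and the gcloud and scp commands assume the Cloud SDK and SSH access to the cluster nodes are available.

# 1. Put the connector JAR on the Hadoop classpath (repeat on every node)
cp gcs-connector-hadoop3-latest.jar $HADOOP_COMMON_LIB_JARS_DIR/

# 2. Create a JSON key for the service account (placeholder account and project names)
gcloud iam service-accounts keys create /path/to/keyfile \
    --iam-account=my-sa@my-project.iam.gserviceaccount.com

# 3. Distribute the key file to every node of the cluster
scp /path/to/keyfile hadoop-node-1:/path/to/keyfile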

Command to copy an HDFS file from On-Premise to a GCS bucket

Now we can copy an HDFS file to the GCS bucket. We have a file Online_Retail_Dataset.csv in the on-prem Hadoop cluster, as shown below.

Source location

hdfs dfs -ls /rc_prod

Found 1 items
-rw-r--r--   1 rc_user_1 supergroup  838 2022-07-22 16:22 /rc_prod/Online_Retail_Dataset.csv

Destination location

We want to copy this file to the Google Cloud Storage bucket rc-projects. Currently the bucket is empty.

Screenshot: the GCS bucket rc-projects (currently empty)
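If the destination bucket does not exist yet, it can be created and listed from the command line with gsutil. A quick sketch, assuming the Cloud SDK is installed and <project-id> is replaced with your GCP project ID:

# create the destination bucket (skip if it already exists) and confirm it is empty
gsutil mb -p <project-id> gs://rc-projects
gsutil ls gs://rc-projects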

Let’s run the HDFS copy command to copy the HDFS file Online_Retail_Dataset.csv to the GCS bucket rc-projects.

hdfs dfs -cp /rc_prod/Online_Retail_Dataset.csv gs://rc-projects

The command executed successfully. We can check the file in GCS by running the hdfs dfs -ls command, as shown below.

hdfs dfs -ls gs://rc-projects

-rwx------  3 rc_user_1 rc_user_1  838 2022-07-22 16:38 gs://rc-projects/Online_Retail_Dataset.csv

Let’s verify the same in GCS using the Google Cloud console. As shown below, the file Online_Retail_Dataset.csv is present in the GCS bucket.

Screenshot: data migration from on-prem Hadoop to GCS

If the HDFS data is large, we can copy it using the hadoop distcp command as below; it will speed up the copy process.

hadoop distcp /rc_prod/Online_Retail_Dataset.csv gs://rc-projects
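distcp runs the copy as a distributed MapReduce job, so for a large dataset the number of parallel map tasks can be tuned with the -m option. A minimal sketch, assuming we want to copy the whole /rc_prod directory into the bucket:

# copy the whole directory with up to 20 parallel map tasks,
# skipping files that are already present and unchanged at the destination
hadoop distcp -m 20 -update /rc_prod gs://rc-projects/rc_prod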

Finally, we have successfully migrated the on-prem Hadoop data to Google Cloud Storage.
