How to copy Hadoop data from on-premises to Google Cloud Storage (GCS)?
Cloud Storage Connector
Cloud Storage Connector is an open-source Java library developed by Google. It allows us to copy data from an on-premises Hadoop cluster to Google Cloud Storage.
With the help of this connector, we can access Cloud Storage data from the on-premises machines. In addition, Apache Hadoop and Spark jobs can read and write files in Cloud Storage through gs:// URIs.
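For example, once the connector is configured and on the Hadoop/Spark classpath (see the next section), a gs:// URI can be used wherever an HDFS path would go. This is only a sketch; the bucket name, job class, and jar below are placeholders, not part of this setup.

# List and read Cloud Storage objects with the usual HDFS shell commands.
hdfs dfs -ls gs://my-bucket/
hdfs dfs -cat gs://my-bucket/path/to/file.csv

# A Spark job can read from and write to Cloud Storage in the same way
# (the class, jar, and paths are illustrative).
spark-submit --class com.example.MyJob my-job.jar gs://my-bucket/input gs://my-bucket/output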
Configure the connector in the on-premises Hadoop cluster
To access Google Cloud Storage from on-premises, we need to configure the Cloud Storage Connector in our Hadoop cluster. The following steps need to be done.
- Download the Cloud Storage Connector that matches our Hadoop version. The connector is available on Google’s official site.
- Copy the connector jar file to the $HADOOP_COMMON_LIB_JARS_DIR directory (e.g. hadoop/3.3.3/libexec/share/hadoop/common).
- Create a service account for the GCP project and download its key file in JSON format (see the gcloud sketch after the configuration block below).
- Copy the service account key file to every node in the on-premises Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop).
- Add the below properties to the core-site.xml file in the Hadoop cluster (e.g. hadoop/3.3.3/libexec/etc/hadoop/core-site.xml).
- Replace /path/to/keyfile with the actual service account key file path in the property google.cloud.auth.service.account.json.keyfile. For more details, please refer to the connector documentation on GitHub.
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for 'gs:' URIs.</description>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value></value>
  <description>
    Optional. Google Cloud Project ID with access to GCS buckets.
    Required only for list buckets and create bucket operations.
  </description>
</property>
<property>
  <name>google.cloud.auth.type</name>
  <value>SERVICE_ACCOUNT_JSON_KEYFILE</value>
  <description>
    Authentication type to use for GCS access.
  </description>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
  <description>
    The JSON keyfile of the service account used for GCS access
    when google.cloud.auth.type is SERVICE_ACCOUNT_JSON_KEYFILE.
  </description>
</property>
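As a rough sketch of the service account and file distribution steps above, the commands below use the gcloud CLI and scp. The project ID, service account name, jar file name, node hostnames, and paths are placeholders, and the IAM role should be adjusted to your own access policy.

# Create a service account in the GCP project (names are placeholders).
gcloud iam service-accounts create hadoop-gcs-access --project=my-gcp-project

# Grant it access to Cloud Storage (choose a role that fits your policy).
gcloud projects add-iam-policy-binding my-gcp-project \
  --member="serviceAccount:hadoop-gcs-access@my-gcp-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectAdmin"

# Download the key file in JSON format.
gcloud iam service-accounts keys create /path/to/keyfile \
  --iam-account=hadoop-gcs-access@my-gcp-project.iam.gserviceaccount.com

# Copy the connector jar and the key file to every node (hostnames and paths are placeholders).
for node in hadoop-node-1 hadoop-node-2 hadoop-node-3; do
  scp gcs-connector-hadoop3-latest.jar $node:/path/to/hadoop/share/hadoop/common/
  scp /path/to/keyfile $node:/path/to/hadoop/etc/hadoop/
done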
Command to copy an HDFS file from on-premises to a GCS bucket
Now we can copy an HDFS file to a GCS bucket. We have a file Online_Retail_Dataset.csv in the on-premises Hadoop cluster, as shown below.
Source location
hdfs dfs -ls /rc_prod
Found 1 items
-rw-r--r--   1 rc_user_1 supergroup        838 2022-07-22 16:22 /rc_prod/Online_Retail_Dataset.csv
Destination location
We want to copy this file to the Google Cloud Storage bucket rc-projects, which currently doesn’t have any data.
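If the connector is configured correctly, listing the empty bucket from the on-premises cluster should succeed and simply return no entries. This is a quick sanity check before copying.

# Should return no output since the bucket is still empty.
hdfs dfs -ls gs://rc-projects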
Let’s run the HDFS copy command to copy the HDFS file Online_Retail_Dataset.csv to the GCS bucket rc-projects.
hdfs dfs -cp /rc_prod/Online_Retail_Dataset.csv gs://rc-projects
The command executed successfully. We can check the file in GCS by running the hdfs dfs -ls command as below.
hdfs dfs -ls gs://rc-projects
-rwx------   3 rc_user_1 rc_user_1        838 2022-07-22 16:38 gs://rc-projects/Online_Retail_Dataset.csv
Let’s verify the same in GCS using the Google Cloud console. As shown in the console, the file Online_Retail_Dataset.csv is present in the GCS bucket.
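If the Google Cloud SDK is installed, the same check can also be done from a terminal with gsutil (this tool is separate from the Hadoop setup above).

# List the objects in the bucket from any machine where gsutil is configured.
gsutil ls gs://rc-projects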
If the HDFS file is large, we can copy it using the hadoop distcp command as below. DistCp runs the copy as a distributed MapReduce job, so the data is transferred in parallel and the copy process is faster.
hadoop distcp /rc_prod/Online_Retail_Dataset.csv gs://rc-projects
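DistCp can also copy an entire directory and lets us control the parallelism with the -m option (maximum number of simultaneous map tasks). The example below is only illustrative; the target path in the bucket is a placeholder.

# Copy the whole /rc_prod directory with up to 20 parallel map tasks.
hadoop distcp -m 20 /rc_prod gs://rc-projects/rc_prod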
Finally, we have successfully migrated the on-premises Hadoop data to Google Cloud Storage.
Recommended Articles
- How to export data from BigQuery table to a file in Cloud Storage?
- Create a Hive External table on Google Cloud Storage(GCS)
- How to load JSON data from Cloud storage to BigQuery?