Create a Hive External table on Google Cloud Storage(GCS)

Google Cloud Storage (GCS)

Google Cloud Storage is a secure and scalable storage option offered by Google. It allows us to store the objects (images, videos, document and so on) in Google Cloud. It store the objects in container which is called as Bucket. The buckets are associated with the project that is created in Google cloud platform. In this tutorial , we will try to create a Hive External table in GCS bucket.

Prerequisites to create a Hive External table in GCS

  • Dataproc cluster – To perform the Big data processing, GCP providing a service called as Dataproc where many open source tools(Hadoop, Hive, Spark,.etc) are configured. We need to create a cluster with the name in Dataproc. We will be submitting a job to that cluster using gcloud command line tool.
Dataproc cluster
Dataproc cluster
  • Bucket in GCS – In the Cloud storage, we need to create a bucket with the name. It will be used as a location in the Hive create table statement. Example : gs://<bucket_name>
Buckets in Goole Cloud Storage
Buckets in Goole Cloud Storage

gcloud command to submit a Hive job

There are multiple ways to submit a job to Data proc cluster. Please refer Google Cloud documentation for more details.

  • Dataproc API jobs submit
  • Programmatic request
  • Google Cloud CLI gcloud command-line tool
  • Cloud Shell
  • Google Cloud Console in browser

Google Cloud SDK providing a set of command line tools such as gcloud, gsutil and bq. The Google Cloud CLI (gcloud) is used to create and manage Google Cloud resources. In this example, we will submit a Hive job using gcloud command line tool.

Create a Hive external table using gcloud

Syntax

Consider that we already created the Dataproc cluster with the name of cluster-rc-test. Similarly we created the GCS bucket rc_customers_bucket. In GCP Hive, we want to create an external table driver_details in the database cabs_db. Let’s write the CREATE TABLE statement with these values

gcloud command execution

Create External Hive table in GCP
Create External Hive table in GCP

We executed the gcloud command in our local machine where Google Cloud SDK is configured. So the command is executed successfully and created the external table driver_details in Hive. The location of the table is GCS bucket gs://rc_customers_bucket/.

Next we logged into our Dataproc cluster cluster-rc-test using SSH in browser option. We can confirm that the table is created successfully on GCS bucket.

Hive table created on GCS bucket
Hive table created on GCS bucket

Insert records in Hive table

Let’s try to insert below records into the external table. We are directly running these queries in Dataproc cluster cluster-rc-test.

After executing the inserts, we can check those data in the gcs bucket. Here we used gsutil CLI command to check the files in GCS. For each record, separate file is created under the GCS bucket.

Finally we can see the data inside those files using gsutil cat command.

Recommended Articles

References from GCP official documentation