Create a Dataproc cluster, submit a Hive job, and delete the cluster using gcloud commands

Cloud Dataproc in Google Cloud Platform

Cloud Dataproc is Google's managed service for running Hadoop and Spark jobs. It allows us to create and manage clusters easily, and we can turn them off when they are not in use to reduce costs. It provides open source data tools (Hadoop, Spark, Hive, Pig, etc.) for batch processing, querying, streaming, and machine learning.

In this tutorial, we are going to perform the following steps using the gcloud command:

  • Create a Dataproc cluster
  • Submit a Hive job
  • Delete the Dataproc cluster

Create a Dataproc cluster using the gcloud command

The command gcloud dataproc clusters create creates a Dataproc cluster in GCP. Let's add a few arguments to that command to specify the cluster configuration.

We are creating a single-node cluster named rc-test-1 in the region us-east1. The image version bundles the operating system, big data components, and GCP connectors into one package that is deployed on the cluster.
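
A minimal sketch of the create command; the image version shown here (2.1-debian11) is an assumption, and any supported Dataproc image version can be substituted:

    # Create a single-node Dataproc cluster named rc-test-1 in us-east1
    gcloud dataproc clusters create rc-test-1 \
        --region=us-east1 \
        --single-node \
        --image-version=2.1-debian11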

We run this command on a machine where the Google Cloud SDK is configured. The Dataproc cluster rc-test-1 is created successfully in GCP.

[Image: Dataproc cluster creation in Google Cloud Platform]

We can also see the cluster details in the Google Cloud console.

[Image: Cluster details in Google Cloud console]

Submit a Hive job to the Dataproc cluster

In this step, we are going to create a new database in Hive. Let's create the database bank_db using the gcloud command.

The command gcloud dataproc jobs submit hive submits a Hive job to the cluster. The cluster and region are specified so that the job runs on the intended cluster. To submit a Hive job with an inline query, we pass the -e argument with the query string.
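
A sketch of the submission for our bank_db example:

    # Submit an inline Hive query to the cluster rc-test-1
    gcloud dataproc jobs submit hive \
        --cluster=rc-test-1 \
        --region=us-east1 \
        -e "CREATE DATABASE bank_db;"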

The Hive job has been submitted successfully, and it created the new Hive database bank_db on the Dataproc cluster rc-test-1.

[Image: Create database in Hive using gcloud command]

Let's check the Hive database on the cluster. First, we need to get to the cluster's VM instances as shown below.

[Image: Connect Dataproc cluster using Compute Engine]

We connected to the machine using the SSH-in-browser option. Using the Hive CLI, we can see the database bank_db as shown below.
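
The same check can be done from a local terminal; this is a sketch that assumes the default master VM name (<cluster-name>-m) and a hypothetical zone:

    # SSH into the cluster's master node (the zone here is an assumption)
    gcloud compute ssh rc-test-1-m --zone=us-east1-b

    # On the master node, list the Hive databases
    hive -e "SHOW DATABASES;"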

[Image: Hive database in Dataproc cluster]

Submit multiple inline queries as a Hive job

In the previous example, we passed a single inline query to create the database. Similarly, we can run multiple Hive queries using the gcloud command. For that, we just need to separate the queries with the semicolon (;) delimiter.

To redirect the query results, including stdout and stderr, to a file, we append &> results.txt to the gcloud command. If we want to redirect stderr alone, we use 2> results.txt instead.
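
Putting both together, here is a sketch; the queries shown are placeholders, not the exact statements from the screenshots:

    # Run multiple semicolon-separated queries and capture all output
    gcloud dataproc jobs submit hive \
        --cluster=rc-test-1 \
        --region=us-east1 \
        -e "USE bank_db; CREATE TABLE customer_details (id INT, name STRING); SHOW TABLES;" \
        &> results.txt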

Submit Hive queries from a file

In a production environment, we usually run the queries from a file. To do so, we can use the --file (-f) option of the gcloud dataproc jobs submit hive command.

The file hive-test-queries.q contains the list of queries that will be executed as a Hive job.
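
A sketch of the file-based submission; the Cloud Storage bucket name here is hypothetical:

    # Submit a Hive job from a query file stored in Cloud Storage
    gcloud dataproc jobs submit hive \
        --cluster=rc-test-1 \
        --region=us-east1 \
        --file=gs://my-bucket/hive-test-queries.q \
        &> results.txt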

[Image: execute hive queries from file using gcloud command]

The gcloud command executed successfully. The query output can be verified in the results.txt file that we specified in the command. As shown below, the Hive queries were executed sequentially.

[Image: redirect gcloud command output to a file]

Verify the Hive jobs in the Dataproc cluster

We can check the status of the Hive jobs in the Dataproc cluster. As we can see below, the jobs have succeeded.
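
The job status can also be checked from the command line:

    # List the jobs submitted to the cluster along with their status
    gcloud dataproc jobs list --cluster=rc-test-1 --region=us-east1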

Let's verify the Hive tables using the SSH-in-browser option in the Google Cloud console. The newly created tables contain the records from our insert queries.

Hive table: customer_details

[Image: Hive CLI using SSH in browser]

Hive table: transaction

[Image: Hive CLI using SSH in browser]

Delete the Dataproc cluster

Once the jobs are completed, we can delete the Dataproc cluster using the gcloud command. The command gcloud dataproc clusters delete is used to delete the cluster. Please note that it will delete everything on the cluster, including our Hive tables.

The cluster name rc-test-1 and its region us-east1 are specified in the command. When we execute the command, it asks for confirmation before deleting the cluster, as shown below.
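
The delete command looks like this:

    # Delete the cluster (prompts for confirmation by default)
    gcloud dataproc clusters delete rc-test-1 --region=us-east1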

[Image: Delete dataproc cluster using gcloud command]

We want to delete the cluster without the confirmation prompt. For that, we need to specify --quiet or -q in the delete command. Let's do that.
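
The non-interactive version of the command:

    # Delete the cluster without asking for confirmation
    gcloud dataproc clusters delete rc-test-1 --region=us-east1 --quiet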

[Image: Delete dataproc cluster without prompt]

Finally, the Dataproc cluster rc-test-1 is deleted from Google Cloud Platform. We can verify this in the Google Cloud console as well.

[Image: Cloud DataProc in Google Cloud Console]
