Chapter 3. Querying AWS S3 using Prestodb

This lab is all about executing queries in PrestoDB with Amazon S3 as the back end. We will execute DDL (Data Definition Language) and DML (Data Manipulation Language) statements. For configuring the Presto cluster, follow Chapter 2.

Query Execution Flow

Scenario 1:

In this, we will be creating a table that will act as a logical separation for the data, and then we will insert some records, define the format in which we want to insert the data, and finally will query the data.

Create a table:

presto-cli --server localhost:8080 --catalog hive
presto> use default;
USE
presto:default> CREATE TABLE presto_table ( name varchar, age varchar ) WITH (external_location = 's3a://<s3Bucket_name>/<folder_name>', format = 'CSV' );
CREATE TABLE
presto:default> desc presto_table;
 Column |  Type   | Extra | Comment 
--------+---------+-------+---------
 name   | varchar |       |         
 age    | varchar |       |         
(2 rows)

Query 20210414_022835_00009_zk47q, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0.35 [2 rows, 162B] [5 rows/s, 460B/s]

Now, let’s insert some data into this table using the INSERT command:

presto> use default;
USE
presto:default> insert into presto_table(name, age) values ('Jack','25');
INSERT: 1 row
presto:default> insert into presto_table(name, age) values ('Deniel','25');
INSERT: 1 row

Let us now fetch these records using the SELECT command:

presto:default> select count(*) from presto_table;
 _col0 
-------
     2 
(1 row)

Query 20210414_023336_00011_zk47q, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0.41 [2 rows, 1.03KB] [4 rows/s, 2.52KB/s]
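Beyond count(*), a plain SELECT returns the rows themselves. As a sketch (not captured from the original session, and row order is not guaranteed), querying the two records inserted above would look roughly like this:

```sql
presto:default> select * from presto_table;
  name  | age
--------+-----
 Jack   | 25
 Deniel | 25
(2 rows)
```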

This is one way to insert data into S3 using the INSERT command in PrestoDB.
Alternatively, we can upload a CSV file directly to S3 and query it the same way. Presto supports
the Delimited, CSV, RCFile, JSON, SequenceFile, ORC, Avro, and Parquet file formats.
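As an illustration of the other supported formats (this table is not part of the original lab), switching formats only changes the format property in the WITH clause; the bucket and folder placeholders are the same as above:

```sql
-- Hypothetical Parquet-backed table; <s3Bucket_name>/<folder_name> are placeholders.
CREATE TABLE presto_table_parquet (
    name varchar,
    age  varchar
)
WITH (
    external_location = 's3a://<s3Bucket_name>/<folder_name>',
    format = 'PARQUET'
);
```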

Scenario 2:

In this scenario, we will create a table that acts as a logical separation to query the data. Here we will first upload a CSV file, containing some records to query, to S3.

Upload the .csv file
You can use the command below to upload your CSV file, or go directly to the AWS console and upload your file(s):

aws s3 cp /path_of_csv_file/presto.csv s3://<bucket_name>/directory_name/ 
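If you don't have a CSV handy, here is a minimal sketch for producing one locally before the upload. The two rows mirror the data from Scenario 1; note that Presto's CSV format reads every column as varchar and expects no header row:

```shell
# Create a two-column (name, age) CSV with no header row,
# matching the table schema used in this lab.
cat > presto.csv <<'EOF'
"Jack","25"
"Deniel","25"
EOF

# Sanity check: two data rows.
wc -l presto.csv
```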

Create a Table:

presto-cli --server localhost:8080 --catalog hive
presto> use default;
USE
presto:default> CREATE TABLE presto_table_csv ( name varchar, age varchar ) WITH (external_location = 's3a://<s3Bucket_name>/<folder_name>', format = 'CSV' );
CREATE TABLE
presto:default> desc presto_table_csv;
 Column |  Type   | Extra | Comment 
--------+---------+-------+---------
 name   | varchar |       |         
 age    | varchar |       |         
(2 rows)

Query 20210414_022835_00009_zk47q, FINISHED, 2 nodes
Splits: 19 total, 19 done (100.00%)
0.35 [2 rows, 162B] [5 rows/s, 460B/s]

Let's now run the SELECT command on this table:

presto:default> select count(*) from presto_table_csv;
 _col0 
-------
     2 
(1 row)

Query 20210414_023336_00011_zk47q, FINISHED, 1 node
Splits: 19 total, 19 done (100.00%)
0.41 [2 rows, 1.03KB] [4 rows/s, 2.52KB/s]

You can run your query against one CSV file or multiple CSV files; just upload your CSVs to the path mentioned in the CREATE TABLE command.

Note: Your DML queries won't work if you haven't configured the flag below in your hive.properties file:

hive.non-managed-table-writes-enabled=true
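For context, a sketch of how that flag might sit in etc/catalog/hive.properties alongside the S3 settings from Chapter 2; the metastore host and credentials here are placeholders, not values from this lab:

```properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://<metastore_host>:9083
hive.s3.aws-access-key=<access_key>
hive.s3.aws-secret-key=<secret_key>
hive.non-managed-table-writes-enabled=true
```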

For more flags that can be used in Presto, see
https://prestodb.io/docs/current/connector/hive.html#hive-configuration-properties

Hope this was helpful!
See you in the next chapter!
Happy Learning!
Shivani S.
