Chapter 1. Introduction to PrestoDB: Massively Parallel Processing


What is PrestoDB?

PrestoDB (Presto) is an open-source, distributed SQL query engine built for fast analytics queries against data of any size.
It works with relational databases such as MySQL, PostgreSQL, Amazon Redshift, Microsoft SQL Server, and Teradata, as well as non-relational data sources such as the Hadoop Distributed File System (HDFS), Amazon S3, Cassandra, MongoDB, and HBase.
Presto queries data where it is stored, so there is no need to move data into a separate analytics system. Query execution runs in parallel over a memory-based architecture, which lets most queries return results in seconds.
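Because Presto queries data in place, a single statement can even join tables that live in different systems. The following is an illustrative sketch; the catalog, schema, and table names (`mysql.sales.orders`, `hive.web.clickstream`) are assumptions, not real objects:

```sql
-- Hypothetical federated query: joins orders in MySQL with clickstream
-- data in Hive, in place, with no ETL into a separate analytics system.
-- All catalog, schema, and table names are illustrative.
SELECT o.order_id, o.order_total, c.page_views
FROM mysql.sales.orders AS o
JOIN hive.web.clickstream AS c
  ON o.customer_id = c.customer_id
WHERE o.order_date >= DATE '2021-01-01';
```

Each table is addressed as `catalog.schema.table`, so the same query text can span any data sources the cluster has catalogs for.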

How does PrestoDB work?

Presto can run on a Hadoop cluster and uses an architecture similar to a classic massively parallel processing (MPP) database management system. A cluster generally has one coordinator node and multiple worker nodes that operate in sync with the coordinator.

When a user submits a SQL query to the coordinator, a custom query and execution engine parses the statement, plans the query, and schedules a distributed query plan across the worker nodes.

Once the query is compiled, Presto breaks the request into multiple stages that run on the worker nodes. All processing happens in memory and is pipelined across the network between stages, avoiding unnecessary I/O overhead. This pipelined parallelism is what makes query execution fast: the more worker nodes, the more parallelism, and the faster the results.
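You can see these stages for yourself with Presto's `EXPLAIN` statement, which prints the distributed plan without executing the query. The table name below is illustrative:

```sql
-- EXPLAIN (TYPE DISTRIBUTED) shows the plan broken into fragments --
-- the stage boundaries that Presto pipelines across worker nodes.
-- The table hive.web.orders is a placeholder, not a real table.
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, count(*) AS order_count
FROM hive.web.orders
GROUP BY region;
```

The output lists plan fragments; data is exchanged between fragments over the network while each fragment runs in parallel on the workers.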

Presto and Hadoop

Presto is designed for fast, interactive queries on data in HDFS and other sources.
Unlike Hadoop, Presto does not have its own storage system, so it acts as a complement to Hadoop rather than a replacement. Presto can be used with any implementation of Hadoop and is packaged in the Amazon EMR Hadoop distribution.

Presto Architecture

Components:

Coordinator:
The coordinator is the main component of a Presto installation, and every installation must have one. It parses statements, plans queries, and manages the worker nodes, keeping track of each worker's activity in order to coordinate query execution. The coordinator collects results from the workers and returns the final result to the client. Clients and workers communicate with the coordinator over a REST API.

Worker:
The worker runs tasks and processes data. Workers fetch data from data sources through connectors and exchange intermediate data with one another. Once a worker node is up and running, it registers with the coordinator and makes itself available for task execution.

Presto has several important components that manage the data itself.

Catalog:
A catalog holds the information Presto needs to locate data: it references a data source and contains its schemas. When a SQL statement is executed in Presto, it runs against one or more catalogs.
Catalogs are defined in properties files stored in the Presto configuration directory.
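As a minimal sketch, a catalog for a MySQL database could be defined in a file like `etc/catalog/mysql.properties` (the file name becomes the catalog name). The host and credentials below are placeholders:

```properties
# etc/catalog/mysql.properties -- defines a catalog named "mysql".
# connector.name selects the connector; the connection settings
# here are illustrative placeholders, not real credentials.
connector.name=mysql
connection-url=jdbc:mysql://example-host:3306
connection-user=presto_user
connection-password=secret
```

After the coordinator and workers are restarted with this file in place, tables in that database become queryable under the `mysql` catalog.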

Connector:
Connectors integrate Presto with external data sources such as object stores, relational databases, or Hive. Each catalog is associated with a specific connector, which adapts Presto's query engine to that source.
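You can inspect which catalogs (and therefore connectors) a cluster has configured directly from SQL. The catalog name below is an assumption:

```sql
-- List the catalogs configured on this cluster, then the schemas
-- available inside one of them. "mysql" is an illustrative catalog
-- name that would exist only if a matching properties file is present.
SHOW CATALOGS;
SHOW SCHEMAS FROM mysql;
```

This is often the quickest way to confirm that a newly added catalog properties file was picked up.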

This was an overview of PrestoDB and its architecture.

Hope this was helpful!
See you in next Chapter!
Happy Learning!
Shivani S.
