
Impala architecture


May 26, 2021 impala




Impala is an MPP (massively parallel processing) query execution engine that runs on a number of nodes in a Hadoop cluster. Unlike traditional storage systems, Impala is decoupled from its storage engine. It has three main components: the Impala daemon (Impalad), the Impala Statestore, and the Impala metadata or metastore.

Impala architecture

Impala daemon (Impalad)

The Impala daemon (also known as impalad) runs on each node where Impala is installed. It accepts queries from various interfaces, such as the impala-shell, the Hue browser, and so on, and processes them.

Each time a query is submitted to the impalad on a particular node, that node acts as the "coordinator node" for the query. Impalads running on the other nodes serve multiple queries in the same way. After accepting a query, the impalad reads and writes the data files and parallelizes the query by distributing the work to the other Impala nodes in the cluster. While a query is being processed by the various impalad instances, all of them return their partial results to the central coordinating node.

Depending on your needs, you can submit queries to a dedicated impalad or spread them across the impalads in the cluster in a load-balanced manner.
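As a rough illustration, the Python sketch below spreads queries across several impalads using the third-party impyla client (which is not part of Impala itself); the host names, port, and table name are placeholders for your own cluster. Whichever daemon the client connects to becomes the coordinator for that query. In a real deployment this role is usually handled by a load balancer in front of the impalads rather than by client-side random choice.

    # Illustrative sketch only: spreading coordinator duty across several impalads.
    # Host names and the table are placeholders; 21050 is impalad's default client port.
    import random
    from impala.dbapi import connect

    IMPALAD_HOSTS = ['impalad-1.example.com', 'impalad-2.example.com', 'impalad-3.example.com']

    def run_query(sql):
        # The impalad we happen to connect to acts as the coordinator for this query.
        host = random.choice(IMPALAD_HOSTS)
        conn = connect(host=host, port=21050)
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            return cursor.fetchall()
        finally:
            conn.close()

    print(run_query('SELECT COUNT(*) FROM web_logs'))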

Impala Statestore

Impala has another important component called the Impala Statestore, which checks the health of each impalad and frequently relays each daemon's health status to the other daemons. It can run on the same node as the Impala server or on another node in the cluster.
The Impala Statestore daemon process is named statestored; each impalad reports its health status to this statestored process.
If a node fails for any reason, the Statestore updates all the other nodes about the failure, and once that notification reaches the other impalads, no further queries are assigned to the affected node.
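The toy Python sketch below illustrates this idea only conceptually; it is not Impala's actual protocol or code, and the class and node names are invented for the example.

    # Toy illustration of the statestore idea (not Impala's real protocol or API):
    # daemons register with a central statestore, and when one fails the statestore
    # broadcasts the change so the survivors stop sending work to that node.
    class ToyStatestore:
        def __init__(self):
            self.healthy = {}                      # node name -> health flag

        def register(self, node):
            self.healthy[node] = True

        def report_failure(self, node):
            self.healthy[node] = False
            for other, ok in self.healthy.items():
                if ok:
                    print(f'{other}: will stop assigning query fragments to {node}')

    store = ToyStatestore()
    for name in ('impalad-1', 'impalad-2', 'impalad-3'):
        store.register(name)
    store.report_failure('impalad-2')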

Impala metadata and metastore

Impala metadata and the metastore are another important component. Impala uses a traditional MySQL or PostgreSQL database to store table definitions. Important details such as table and column information and table definitions are kept in a centralized database known as the metastore.
Each Impala node caches all of this metadata locally. When dealing with very large amounts of data and/or many partitions, obtaining table-specific metadata can take a significant amount of time, so the locally stored metadata cache helps provide such information immediately.
When a table definition or the table's data is updated, the other Impala daemons must update their metadata caches by retrieving the latest metadata before issuing a new query against the table in question.
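In practice this cache refresh is triggered with Impala's REFRESH and INVALIDATE METADATA statements. The sketch below issues them through the third-party impyla client; the host and table names are placeholders.

    # Sketch: refreshing a daemon's cached metadata after data or schema changes.
    # REFRESH and INVALIDATE METADATA are standard Impala statements; the host and
    # table names below are placeholders.
    from impala.dbapi import connect

    conn = connect(host='impalad-1.example.com', port=21050)
    cursor = conn.cursor()

    # New data files were added outside Impala (e.g. by Hive or a batch job):
    cursor.execute('REFRESH web_logs')

    # The table definition changed, or the table is new to this daemon's cache:
    cursor.execute('INVALIDATE METADATA web_logs')

    # Subsequent queries now see the up-to-date metadata.
    cursor.execute('SELECT COUNT(*) FROM web_logs')
    print(cursor.fetchall())
    conn.close()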

Query processing interfaces

To handle queries, Impala provides three interfaces, as shown below.

  • Impala-shell - After setting up Impala using the Cloudera VM, you can start the Impala shell by typing the impala-shell command in the terminal. We'll talk more about the Impala shell in the next chapter.

  • Hue interface - You can process Impala queries using the Hue browser. The Hue browser provides an Impala query editor where you can type and execute Impala queries. To access this editor, you first need to sign in to the Hue browser.

  • ODBC/JDBC drivers - Like other databases, Impala provides ODBC/JDBC drivers. Using these drivers, you can connect to Impala from any programming language that supports them and build applications that process Impala queries, as sketched after this list.
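For example, the short Python sketch below connects through the third-party impyla client, an alternative to the ODBC/JDBC drivers mentioned above; the host, port, and table names are placeholders for your own cluster.

    # Minimal programmatic access from Python via the impyla package (pip install impyla).
    # Host, port, and table are placeholders.
    from impala.dbapi import connect

    conn = connect(host='impala.example.com', port=21050)
    cursor = conn.cursor()
    cursor.execute('SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page LIMIT 10')
    for row in cursor.fetchall():
        print(row)
    conn.close()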

The query execution process

Whenever a user passes a query using any interface provided, one of the Impalads in the cluster accepts the query. This Impalad is considered a coordinator for that particular query.
After receiving the query, the query coordinator verifies that the query is valid using the table schema from the Hive metastore. Later, it collects information from the HDFS NameNode about the location of the data required to execute the query and sends that information to the other impalads so they can execute the query.
All the other Impala daemons read the specified blocks of data and process the query. As soon as all the daemons complete their tasks, the query coordinator collects the results and delivers them to the user.
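You can inspect this distributed plan yourself with Impala's EXPLAIN statement, which prints the scan, exchange, and aggregation steps the coordinator would hand out. The sketch below runs it through the third-party impyla client; the host and table names are placeholders.

    # Sketch: printing the distributed plan Impala would use for a query.
    # EXPLAIN is a standard Impala statement; host and table names are placeholders.
    from impala.dbapi import connect

    conn = connect(host='impalad-1.example.com', port=21050)
    cursor = conn.cursor()
    cursor.execute('EXPLAIN SELECT COUNT(*) FROM web_logs')
    for (line,) in cursor.fetchall():
        print(line)          # one line of the plan per returned row
    conn.close()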