Impala overview


May 26, 2021 18:00 impala


What is Impala?

Impala is an MPP (massively parallel processing) SQL query engine for processing large amounts of data stored in Hadoop clusters. It is open source software written in C++ and Java. Compared to other Hadoop SQL engines, it offers high performance and low latency.

In other words, Impala is a high-performance SQL engine that provides an RDBMS-like experience and fast access to data stored in the Hadoop Distributed File System (HDFS).

Why Impala?

Impala combines the SQL support and multi-user performance of a traditional analytic database with the scalability and flexibility of Apache Hadoop, using standard components such as HDFS, HBase, the Hive Metastore, YARN, and Sentry.

  • With Impala, users can query data in HDFS or HBase with SQL faster than with other SQL engines such as Hive.

  • Impala can read almost any file format used by Hadoop, such as Parquet, Avro, and RCFile.

Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch or real-time queries.

Unlike Apache Hive, Impala is not based on MapReduce. It implements a distributed architecture built on daemon processes that handle all aspects of query execution and run on the same machines that store the data.

As a result, it avoids the latency of launching MapReduce jobs, which makes Impala faster than Apache Hive.
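
The idea of daemons working in parallel on the data they hold locally can be pictured with a toy scatter-gather aggregation. This is only a sketch of the general MPP pattern, not Impala's actual implementation: the node names, the sample rows, and the functions `local_aggregate`, `merge_partials`, and `run_query` are all hypothetical, and threads stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds its own local block of (region, amount) rows.
NODE_BLOCKS = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8), ("north", 2)],
}


def local_aggregate(rows):
    """Work done by one node: SUM(amount) GROUP BY region over its local rows only."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial


def merge_partials(partials):
    """Work done by the coordinating node: combine the per-node partial sums."""
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total


def run_query(node_blocks):
    # Every node aggregates its local data in parallel; only the small partial
    # results are sent to the coordinator, not the raw rows.
    with ThreadPoolExecutor(max_workers=len(node_blocks)) as pool:
        partials = list(pool.map(local_aggregate, node_blocks.values()))
    return merge_partials(partials)


if __name__ == "__main__":
    print(run_query(NODE_BLOCKS))
```

The point of the sketch is that the work is pushed to where the data already sits, so there is no separate job-startup and data-shuffling phase before the first result can be computed.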

The advantages of Impala

Below is a list of some notable benefits of Cloudera Impala.

  • With Impala, you can use traditional SQL knowledge to process data stored in HDFS at great speed.

  • Because data is processed where it resides (on the Hadoop cluster), there is no need to transform or move data stored on Hadoop when using Impala.

  • With Impala, you can access data stored in HDFS, HBase, and Amazon S3 without having to write MapReduce jobs. A basic knowledge of SQL queries is enough.

  • To write queries in business tools, data normally has to go through a complex extract-transform-load (ETL) cycle. With Impala, this process is shortened. The time-consuming stages of loading and reorganizing data are replaced by newer techniques, such as exploratory data analysis and data discovery, which make the process faster.

  • Impala pioneered the use of the Parquet file format, a columnar storage layout optimized for the large-scale queries typical of data warehouse scenarios.
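
Why a columnar layout suits analytic queries can be seen in a small sketch. It does not use Parquet itself; it only contrasts a row-oriented and a column-oriented in-memory layout for a hypothetical three-column table, and counts how many values each layout has to touch to answer a query that needs a single column. The table contents and function names are illustrative assumptions.

```python
# Hypothetical table with three columns: id, country, amount.
ROWS = [
    (1, "DE", 10.0),
    (2, "FR", 20.0),
    (3, "DE", 5.5),
]


def to_columns(rows, names):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


COLUMNS = to_columns(ROWS, ["id", "country", "amount"])


def sum_amount_row_layout(rows):
    """Row layout: every full record is read even though one field is needed."""
    values_read = 0
    total = 0.0
    for row in rows:
        values_read += len(row)  # the whole record is touched
        total += row[2]
    return total, values_read


def sum_amount_column_layout(columns):
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)


if __name__ == "__main__":
    print(sum_amount_row_layout(ROWS))
    print(sum_amount_column_layout(COLUMNS))
```

Both functions return the same sum, but the column layout touches one value per row instead of three. On wide tables with many columns, that difference is what makes aggregate queries over a few columns cheaper.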

Impala's features

Cloudera Impala has the following features -

  • Impala is freely available as open source under the Apache license.

  • Impala supports in-memory data processing: it accesses and analyzes data stored on Hadoop data nodes without moving the data.

  • You can use Impala to access data using SQL-like queries.

  • Impala provides faster access to data in HDFS than other SQL engines.

  • With Impala, you can store data in storage systems such as HDFS, Apache HBase, and Amazon S3.

  • You can integrate Impala with business intelligence tools such as Tableau, Pentaho, MicroStrategy, and Zoomdata.

  • Impala supports a variety of file formats such as LZO, SequenceFile, Avro, RCFile, and Parquet.

  • Impala uses the metadata, ODBC driver, and SQL syntax of Apache Hive.

Relational databases and Impala

Impala uses a query language similar to SQL and HiveQL. The following table describes some of the key differences between relational databases and Impala.

Impala | Relational database
Impala uses an SQL-like query language similar to HiveQL. | A relational database uses the SQL language.
In Impala, you cannot update or delete individual records. | In a relational database, you can update or delete individual records.
Impala does not support transactions. | A relational database supports transactions.
Impala does not support indexing. | A relational database supports indexes.
Impala stores and manages large amounts of data (petabytes). | A relational database handles smaller amounts of data (terabytes) than Impala.

Hive, HBase and Impala

Although Cloudera Impala uses the same query language, metadata, and user interface as Hive, it differs from Hive and HBase in some respects. The following table compares HBase, Hive, and Impala.

HBase | Hive | Impala
HBase is a wide-column store database based on Apache Hadoop. It uses the concepts of BigTable. | Hive is data warehouse software. With it, we can access and manage large distributed datasets built on Hadoop. | Impala is a tool for managing and analyzing data stored on Hadoop.
HBase's data model is a wide-column store. | Hive follows the relational model. | Impala follows the relational model.
HBase was developed in Java. | Hive was developed in Java. | Impala was developed in C++.
HBase's data model is schema-free. | Hive's data model is schema-based. | Impala's data model is schema-based.
HBase offers Java, RESTful, and Thrift APIs. | Hive offers JDBC, ODBC, and Thrift APIs. | Impala offers JDBC and ODBC APIs.
HBase supports programming languages such as C, C#, C++, Groovy, Java, PHP, Python, and Scala. | Hive supports programming languages such as C++, Java, PHP, and Python. | Impala supports all languages that support JDBC/ODBC.
HBase provides support for triggers. | Hive does not provide any support for triggers. | Impala does not provide any support for triggers.

All three databases -

  • Are NoSQL databases.

  • Are available as open source.

  • Support server-side scripting.

  • Follow ACID properties such as durability and concurrency.

  • Use sharding for partitioning.

Impala's shortcomings

Some of the disadvantages of using Impala are as follows -

  • Impala does not provide any support for serialization and deserialization.
  • Impala can only read text files, not custom binary files.
  • Whenever new records or files are added to the data directory in HDFS, the table needs to be refreshed.
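
The refresh mentioned in the last point is done in Impala with the REFRESH statement. A minimal sketch of issuing it from Python follows, assuming you already have a DB-API style cursor connected to Impala (how that connection is obtained is outside this sketch). The helper names `refresh_statement` and `refresh_table`, the `RecordingCursor` stand-in, and the table and database names are hypothetical; the identifier check is a deliberately conservative safety measure, not Impala's full naming rule.

```python
import re

# Conservative pattern for table/database names (an assumption of this sketch).
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def refresh_statement(table, database=None):
    """Build an Impala REFRESH statement for a table, optionally database-qualified."""
    parts = [database, table] if database else [table]
    for part in parts:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    return "REFRESH " + ".".join(parts)


def refresh_table(cursor, table, database=None):
    """Run REFRESH through any object that offers a DB-API style execute()."""
    statement = refresh_statement(table, database)
    cursor.execute(statement)
    return statement


class RecordingCursor:
    """Stand-in cursor for demonstration: records statements instead of running them."""

    def __init__(self):
        self.executed = []

    def execute(self, statement):
        self.executed.append(statement)


if __name__ == "__main__":
    cursor = RecordingCursor()
    refresh_table(cursor, "web_logs", database="analytics")
    print(cursor.executed)
```

In a real deployment the stand-in cursor would be replaced by a cursor from an Impala client connection, and the call would typically run right after the job that writes new files into the table's HDFS directory.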