Query ElasticSearch: Replace DSL with SQL

May 31, 2021 Article blog

This article is from the public number: Code Farmer Knowledge Point, author Monica2333

The ES7.x version of x-pack comes with ElasticSearch SQL which we can use directly through SQL REST APIs, SQL CLI, and more.

SQL REST API

Enter in Kibana Console

POST /_sql?format=txt
{
  "query": "SELECT * FROM library ORDER BY page_count DESC LIMIT 5"
}

Just replace the SQL above with your own SQL statement. The return format is as follows:

    author      |        name        |  page_count   | release_date
-----------------+--------------------+---------------+------------------------
Peter F. Hamilton|Pandora's Star      |768            |2004-03-02T00:00:00.000Z
Vernor Vinge     |A Fire Upon the Deep|613            |1992-06-01T00:00:00.000Z
Frank Herbert    |Dune                |604            |1965-06-01T00:00:00.000Z

SQL CLI

elasticsearch-sql-cli is a script file for the bin directory when ES is installed and can also be downloaded separately. We run in the ES directory

./bin/elasticsearch-sql-cli https://some.server:9200

Enter SQL to query

sql> SELECT * FROM library WHERE page_count > 500 ORDER BY page_count DESC;
     author      |        name        |  page_count   | release_date
-----------------+--------------------+---------------+---------------
Peter F. Hamilton|Pandora's Star      |768            |1078185600000
Vernor Vinge     |A Fire Upon the Deep|613            |707356800000
Frank Herbert    |Dune                |604            |-144720000000

SQL To DSL

Enter in Kibana

POST /_sql/translate
{
  "query": "SELECT * FROM library ORDER BY page_count DESC",
  "fetch_size": 10
}

To get the converted DSL query:

{
  "size": 10,
  "docvalue_fields": [
    {
      "field": "release_date",
      "format": "epoch_millis"
    }
  ],
  "_source": {
    "includes": [
      "author",
      "name",
      "page_count"
    ],
    "excludes": []
  },
  "sort": [
    {
      "page_count": {
        "order": "desc",
        "missing": "_first",
        "unmapped_type": "short"
      }
    }
  ]
}

Because query-related statements have been generated, we can happily use DSLs with or without modification on this basis.

Let's go into more detail about the SQL statements supported by ES SQL and how to avoid misuse.

First you need to understand the correspondence between SQL terms and ES terms in the SQL statements supported by ES SQL:

Query ElasticSearch: Replace DSL with SQL1

Most of ES SQL's syntax support follows ANSI SQL standards, with supported SQL statements such as DML queries and partial DDL queries. DDL queries such DESCRIBE table SHOW COLUMNS IN table slightly chicken ribs, we mainly look at SELECT,Function DML query support.

SELECT

The syntax structure is as follows:

SELECT [TOP [ count ] ] select_expr [, ...]
[ FROM table_name ]
[ WHERE condition ]
[ GROUP BY grouping_element [, ...] ]
[ HAVING condition]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count ] ]
[ PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) ) ]

Indicates that row data is obtained from 0-N tables. SQL is executed in the following order:

Get all the keywords in FROM to determine the table name.
If there is a WHERE condition, filter out all non-compliant rows.
If there is a GROUP BY condition, the aggregation is grouped, and if there is a HAVING condition, the results of the aggregation are filtered.
The results obtained in the previous step are select_expr to determine the specific data returned.
If there is ORDER BY condition, the returned data is sorted.
If there is a LIMIT or TOP condition, a subset of the results of the previous step are returned.

Unlike common SQL, ES SQL supports TOP [ count ] and PIVOT ( aggregation_expr FOR column IN ( value [ [ AS ] alias ] [, ...] ) ) c lause. TOP [ count ] : If SELECT TOP 2 first_name FROM emp indicates that up to two pieces of data are returned and cannot be shared with LIMIT condition. T he PIVOT clause rows the results of its aggregation criteria for further operation. I haven't used this, I don't introduce it.

FUNCTION

Based on the above SQL we are actually able to have filtering, aggregation, sorting, paging function of SQL too. B ut we need to learn more about function support in ES SQL to write a rich SQL with full-text search, aggregation, and grouping capabilities. Show SHOW FUNCTIONS allows you to list the supported function names and types.

SHOW FUNCTIONS;


      name       |     type
-----------------+---------------
AVG              |AGGREGATE
COUNT            |AGGREGATE
FIRST            |AGGREGATE
FIRST_VALUE      |AGGREGATE
LAST             |AGGREGATE
LAST_VALUE       |AGGREGATE
MAX              |AGGREGATE
MIN              |AGGREGATE
SUM              |AGGREGATE
........

We mainly look at aggregation, grouping, full-text search related common functions.

Full-text matching function

MATCH Equivalent to match and multi_match query in the DSL.

MATCH(
    field_exp,       --字段名称
    constant_exp,       --字段的匹配值
    [, options])       --可选项

Examples of use:

SELECT author, name FROM library WHERE MATCH(author, 'frank');


    author     |       name
---------------+-------------------
Frank Herbert  |Dune
Frank Herbert  |Dune Messiah
SELECT author, name, SCORE() FROM library WHERE MATCH('author^2,name^5', 'frank dune');


    author     |       name        |    SCORE()
---------------+-------------------+---------------
Frank Herbert  |Dune               |11.443176
Frank Herbert  |Dune Messiah       |9.446629

QUERY Equivalent to query_string query in query_string DSL.

QUERY(
    constant_exp      --匹配值表达式
    [, options])       --可选项

Examples of use:

SELECT author, name, page_count, SCORE() FROM library WHERE QUERY('_exists_:"author" AND page_count:>200 AND (name:/star.*/ OR name:duna~)');


      author      |       name        |  page_count   |    SCORE()
------------------+-------------------+---------------+---------------
Frank Herbert     |Dune               |604            |3.7164764
Frank Herbert     |Dune Messiah       |331            |3.4169943

SCORE() Returns the relevance of the input data and the returned relevance Examples of use:

SELECT SCORE(), * FROM library WHERE MATCH(name, 'dune') ORDER BY SCORE() DESC;


    SCORE()    |    author     |       name        |  page_count   |    release_date
---------------+---------------+-------------------+---------------+--------------------
2.2886353      |Frank Herbert  |Dune               |604            |1965-06-01T00:00:00Z
1.8893257      |Frank Herbert  |Dune Messiah       |331            |1969-10-15T00:00:00Z

The aggregation function

AVG(numeric_field) Calculates the average of fields of a numeric type.

SELECT AVG(salary) AS avg FROM emp;

COUNT(expression) the total number of input data, including data with a value of null field_name when COUNT(). COUNT(ALL field_name) Returns the total number of input data, excluding data with field_name corresponding value of null. COUNT(DISTINCT field_name) Returns the total number of values field_name in the input data that are not null. SUM(field_name) Returns the sum of the corresponding values field_name numeric fields in the input data. MIN(field_name) Returns the minimum value of the corresponding value for the numeric field field_name in the input data. MAX(field_name) Returns the maximum value of the corresponding value for the numeric field field_name in the input data.

Group functions

The grouping function here is the bucket grouping in the corresponding DSL.

HISTOGRAM The syntax is as follows:

HISTOGRAM(
           numeric_exp,    --数字表达式，通常是一个field_name
           numeric_interval    --数字的区间值
)


HISTOGRAM(
           date_exp,      --date/time表达式，通常是一个field_name
           date_time_interval      --date/time的区间值
)

The following returns data on births in the early hours of January 1st each year:

ELECT HISTOGRAM(birth_date, INTERVAL 1 YEAR) AS h, COUNT(*) AS c FROM emp GROUP BY h;




           h            |       c
------------------------+---------------
null                    |10
1952-01-01T00:00:00.000Z|8
1953-01-01T00:00:00.000Z|11
1954-01-01T00:00:00.000Z|8
1955-01-01T00:00:00.000Z|4
1956-01-01T00:00:00.000Z|5
1957-01-01T00:00:00.000Z|4
1958-01-01T00:00:00.000Z|7
1959-01-01T00:00:00.000Z|9
1960-01-01T00:00:00.000Z|8
1961-01-01T00:00:00.000Z|8
1962-01-01T00:00:00.000Z|6
1963-01-01T00:00:00.000Z|7
1964-01-01T00:00:00.000Z|4
1965-01-01T00:00:00.000Z|1

ES SQL limitations

Because ES SQL and ES DSL do not exactly match functionally, the SQL limitations mentioned in the official documentation are:

Large queries may throw ParsingException

During the parsing phase, a large query consumes too much memory, in which case Elasticsearch SQL engine aborts parsing and throws an error.

The representation of the nested type field

Fields of the nested type are not nested in SQL and can only be used

[nested_field_name].[sub_field_name]

This form refers to inline subfields. Examples of use:

SELECT dep.dep_name.keyword FROM test_emp GROUP BY languages;

The nested type fields cannot be used on the Scalar functions ofwhere and order by

SQL such as the following is all errors

SELECT * FROM test_emp WHERE LENGTH(dep.dep_name.keyword) > 5;


SELECT * FROM test_emp ORDER BY YEAR(dep.start_date);

Simultaneous queries for multiple nested fields are not supported

Such as nested fields nested_A and nested_B cannot be used at the same time.

Nested inner field pagination limit

When a paginated query has a nested field, the pagination results may be incorrect. This is because a paginated query in ES occurs on Root nested document not on its inner field.

The field of the keyword type does not support normalizer

Fields of array types are not supported

This is because there is only one value for a field in SQL, in which case we can convert the API of SQL To DSL described above into a DSL statement, and it would be nice to query with DSL.

The limit of aggregate sorting

The sort field must be a field in the aggregation bucket, and the ES SQL CLI breaks this limit, but the upper limit cannot exceed 512 rows, otherwise an exception will be thrown during the sorting phase. It is recommended to use with Limit clause, such as:

SELECT * FROM test GROUP BY age ORDER BY COUNT(*) LIMIT 100;

Sort conditions for aggregate sorting do not support Scalar functions or simple operator operations. Complex fields after aggregation, such as containing aggregation functions, cannot be used on sort criteria.

Here's an example of an error:

SELECT age, ROUND(AVG(salary)) AS avg FROM test GROUP BY age ORDER BY avg;


SELECT age, MAX(salary) - MIN(salary) AS diff FROM test GROUP BY age ORDER BY diff;

The limit of the subquery

Subquery contains GROUP BY or HAVING or than SELECT X FROM (SELECT ...) WHERE [simple_condition] This structure is complex and can be unsuccessful.

The time data type fields do not support GROUP BY conditions and HISTOGRAM functions

The following query is wrong:

SELECT count(*) FROM test GROUP BY CAST(date_created AS TIME);


SELECT HISTOGRAM(CAST(birth_date AS TIME), INTERVAL '10' MINUTES) as h, COUNT(*) FROM t GROUP BY h

However, wrapping a field of the TIME type as a Scalar function return is supported by GROUP BY, such as:

SELECT count(*) FROM test GROUP BY MINUTE((CAST(date_created AS TIME));

The limit of the return field If a field is not stored in source, it cannot be queried. keyword, date, scaled_float, geo_point, geo_shape These types of fields are not subject to this restriction because they are not returned from _source but from docvalue_fields

The above is W3Cschool编程狮 about query ElasticSearch: using SQL instead of DSL related to the introduction, I hope to help you.