Get Started with SQL Over Files
In this tutorial, you learn the basics of querying files in SQL by creating and querying a CSV file.
Before You Begin
To complete this tutorial, you need the following:
| Prerequisites | Useful resources |
|---|---|
A Hazelcast cluster in client/server mode and an instance of Management Center running on your local network. |
|
A connection to the SQL shell |
Step 1. Create a Data Source to Query
SQL can query data in maps, Kafka topics and the local cluster’s file system.
In this step, you create a CSV file that you can use to query.
-
Create a file named
likes.csv. -
Add the following data to your file.
msg_id,likes,dislikes 1,20,13 2,108,25 3,122,73 4,9,88 5,51,42This example file contains a record of the number of likes each message has.
Step 2. Create a Mapping to the File
To allow Hazelcast to find and recognize the data in your CSV file, you need to create a mapping to it.
In the SQL shell, use the CREATE MAPPING statement to configure the file connector and give Hazelcast access to the data in the likes.csv file.
CREATE MAPPING csv_likes (msg_id INT, likes INT, dislikes INT)
TYPE File
OPTIONS ('format'='csv',
'path'='/absolute/path/to/current/directory', 'glob'='likes.csv');
Make sure you replace the path option with the absolute path to your CSV file.
|
Step 3. Run Ad-Hoc Queries
Ad-hoc queries allow you to retrieve a small subset of data. Usually these queries are simple and you can have many of them running concurrently in a Hazelcast cluster.
-
Use a
SELECTstatement to query all the data in thelikes.csvfile.SELECT * FROM csv_likes;You should see the following:
+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 1| 20| 13| | 4| 9| 88| | 3| 122| 73| | 2| 108| 25| | 5| 51| 42| +------------+------------+------------+ -
Query only the
msg_idandlikescolumns, by adding them as a comma-separated list after theSELECTstatement.SELECT msg_id, likes FROM csv_likes;+------------+------------+ | msg_id| likes| +------------+------------+ | 1| 20| | 3| 122| | 4| 9| | 2| 108| | 5| 51| +------------+------------+ -
Use a filter to display only the message numbers with more than 50 likes.
SELECT msg_id FROM csv_likes WHERE likes > 50;+------------+ | msg_id| +------------+ | 2| | 5| | 3| +------------+ -
Give the
msg_idcolumn an alias for the query results.This clause does not rename the column in the table. SELECT msg_id AS message_number, likes, dislikes FROM csv_likes WHERE likes > 20;+--------------+------------+------------+ |message_number| likes| dislikes| +--------------+------------+------------+ | 2| 108| 25| | 3| 122| 73| | 5| 51| 42| +--------------+------------+------------+ -
To filter rows on more than one condition, you can join conditions with the
AND,OR, andNOToperators.SELECT * FROM csv_likes WHERE likes > 100 OR dislikes < 30;+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 1| 20| 13| | 2| 108| 25| | 3| 122| 73| +------------+------------+------------+SELECT * FROM csv_likes WHERE likes > 100 AND dislikes < 30;+------------+------------+------------+ | msg_id| likes| dislikes| +------------+------------+------------+ | 2| 108| 25| +------------+------------+------------+
If you need more control over how your data is being transformed and aggregated, you may want to build a pipeline with the Jet API.
Step 4. Run Federated Queries
Federated queries are those that join tables from different datasets.
Normally, SQL queries are executed on one particular database or dataset. However, with Hazelcast, you can pull information from different sources and present a more complete picture of the data.
-
Configure the map connector to create a new table called
dislikes.CREATE MAPPING names TYPE IMap OPTIONS ('keyFormat'='int', 'valueFormat'='varchar');This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is a string.
-
Use the
SINK INTOstatement to add some entries to the map.SINK INTO names VALUES (1, 'Greg'), (2, 'Jerry'), (3, 'Mary'), (4, 'Jerry'), (5, 'Joe'); -
Use the
JOINclause to merge results from themessagesandnamestables so you can see who has the most likes and dislikes.
SELECT names.this, csv_likes.likes, csv_likes.dislikes
FROM csv_likes
JOIN names
ON csv_likes.msg_id = names.__key;
+
+--------------------+------------+------------+
|this | likes| dislikes|
+--------------------+------------+------------+
|Jerry | 108| 25|
|Greg | 20| 13|
|Jerry | 9| 88|
|Joe | 51| 42|
|Mary | 122| 73|
+--------------------+------------+------------+
-
Use the
ORDER BYclause to order the results by name and use theLIMITclause to limit them so that only the first two are displayed. Change the header of thenamescolumn toname.SELECT names.this AS name, csv_likes.likes, csv_likes.dislikes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key ORDER BY names.this LIMIT 2;+--------------------+------------+------------+ |name | likes| dislikes| +--------------------+------------+------------+ |Greg | 20| 13| |Jerry | 9| 88| +--------------------+------------+------------+ -
Use the
SUM()function to aggregate the total number of likes for each person and group the results by name.SELECT names.this AS name, sum(csv_likes.likes) AS total_likes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key GROUP BY name;You should see the following:
+--------------------+--------------------+ |name | total_likes| +--------------------+--------------------+ |Greg | 20| |Mary | 122| |Joe | 51| |Jerry | 117| +--------------------+--------------------+The results do not include a row for each Jerry because the
GROUP BYstatement groups the results by name. -
Filter for the names that have more than 100 likes combined, using the
HAVINGclause. This clause is equivalent to theWHEREclause but for aggregate results.SELECT names.this AS most_liked FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key GROUP BY names.this HAVING SUM(likes) > 100;+--------------------+ |most_liked | +--------------------+ |Jerry | |Mary | +--------------------+
For a list of available aggregations, see Expressions, Functions and Operators.
Step 5. Ingest Query Results into a Hazelcast Map
To save your query results as a view, you can cache them in Hazelcast by ingesting them into a map.
-
Configure the map connector to create a new table called
likes_and_dislikes.CREATE MAPPING likes_and_dislikes ( __key INT, name VARCHAR, likes INT, dislikes INT ) TYPE IMap OPTIONS ('keyFormat'='int', 'valueFormat'='json-flat');This table is mapped to a distributed map in Hazelcast where the key is an integer and the value is an object that’s serialized to JSON.
-
Run the
JOINquery to merge results from the CSV file and thedislikesmap and insert them into thelikes_and_dislikesmap.INSERT INTO likes_and_dislikes SELECT csv_likes.msg_id, names.this, csv_likes.likes, csv_likes.dislikes FROM csv_likes JOIN names ON csv_likes.msg_id = names.__key; -
Make sure that the query results were added to the map.
SELECT * FROM likes_and_dislikes ORDER BY __key;+------------+--------------------+------------+------------+ | __key|name | likes| dislikes| +------------+--------------------+------------+------------+ | 1|Greg | 20| 13| | 2|Jerry | 108| 25| | 3|Mary | 122| 73| | 4|Jerry | 9| 88| | 5|Joe | 51| 42| +------------+--------------------+------------+------------+
Next Steps
Learn how to query maps with SQL.
Explore all available SQL statements.