Amazon S3 Select
Overview
Amazon S3 Select is an add-on feature available in the Amazon S3 service for querying single objects using standard SQL. Amazon S3, or Simple Storage Solution, is the storage solution provided by AWS where we can store data, commonly referred to as objects, in the S3 bucket. Customers can implement Amazon S3 Select using code with SQL expressions in their application. An Amazon S3 select is mainly used to quickly retrieve an object from an Amazon S3 bucket.
What is S3 Select?
Amazon S3 Select is a feature introduced by Amazon in the year 2018.
Amazon S3 Select is used to retrieve the subset of data from an object residing in S3 based on the SQL expression that we have given. Customers should have s3
For example
Consider if we have a 1 GB size object and customers want to query the data for a particular row in an object.
Before Amazon S3 Select
- Customers need to download the object from S3 and export it to some application and do the query. Depending upon the size and file, the result will return.
- Optionally, they can use Amazon Athena, but it will take some time to query the data and return the result.
After Amazon S3 Select
- Customers can query the objects with predefined values such as row and column as parameters in the S3 select command in the code.
- It will scan the object or data, filter it quickly, and return the expected result fastly.
Filtering and Retrieving Data Using Amazon S3 Select
- The data returned by the Amazon S3 console is limited to 40 MB only. So if the return data size is supposed to be more than 40 MB, use the AWS SDK or AWS CLI.
- Amazon S3 Select supports CSV, JSON, and Apache Parquet as input formats.
- Amazon S3 Select supports only CSV or JSON as output formats. It does not support Parquet output.
- The maximum length of the SQL expression supported by Amazon S3 select is 256 KB.
- We should use HTTPS and an encryption key in the request if the objects are encrypted with a customer-provided encryption key (SSE-C).
- If any issue occurs while running or performing the query, Amazon S3 Select returns an error code
- Customers can scan a subset of the object by specifying bytes and a query range with the ScanRange parameter.
- Amazon S3 select uses an input-serialization method for how the object should be queried and an output-serialization method for how the object should return the results.
Features of S3 Select
General Availability
- S3 Select is generally available in all the AWS regions. All AWS customers can use the S3 Select service. Depending on the scanned and returned data, the cost will vary.
Serverless
- S3 Select is a serverless service, meaning that customers don’t have to provide any resources to run the query.
- Customers can run the query on the S3 console. AWS will handle all the backend work and return the results.
Supported Format
-
S3 Select supports three formats. They are CSV, JSON, and Parquet.
-
Input-Serialization Setting
Format: CSV, JSON, Apache Parquet.
Delimiter-Comma, Tab, and custom.
Compression-None, Gzip, Bzip2.
-
Output-Serialization Setting
Format—JSON or CSV.
Delimiter - CSV or Tab or custom.
-
Integration of AWS Services
-
Customers can run the S3 Select command over a code in a lambda function and invoke the lambda. The lambda will return the s3 select output as a result.
-
S3 Select also works with Amazon EMR Service.
-
S3 Select helps customers reduce the query time in a data-intensive application.
Effective Performance
- Customers can accelerate their application performance and reduce their costs by querying an object using Amazon S3 select.
Data Types
- The Amazon S3 select treats all the input data as a string unless the CAST function is defined.
CAST: It converts one data type to another.
Supported Data Type
No | S3 Select Supported Data Types | Parquet Supported Data Types |
---|---|---|
1 | bool | DATE |
2 | int, integer | INT(8), INT(16), INT(32), INT(64) |
3 | string | STRING |
4 | float | LIST,ENUM |
5 | decimal, numeric | DECIMAL |
6 | Timestamp | TIMESTAMP |
Examples
We have two options to perform an S3 Select query on S3 objects. They are the AWS SDK and the Rest APIs.
Rest API’s
If any of their applications are required, customers can send REST requests directly to S3.
AWS SDKs
AWS provides AWS SDK support for four languages: Java, Javascript, Ruby, and Python.
Illustration
Perform S3 Select Query Using the AWS SDK With the Help of AWS Lambda
In this example, we are using the AWS SDK for Python in Lambda.
Requirement
CSV file uploaded to an S3 bucket AWS account with Lambda, IAM, and S3 access.
Step 1: Create a lambda function with appropriate permissions.
Go to the AWS Lambda console and select functions in the left navigation pane, then click Create a Function, which is showing on the right side.
Step 2: Enter the below-mentioned details
Option 1: Author From Scratch Name: s3-select-demo Runtime: Python 3.9 Architecture: X86_64 Permission: Create a new role from AWS policy templates
Role Name: S3-bucket-permission
Policy Name: Amazon S3 object read-only permissions
Select and enter the above value and select the Create function button.
Step 3: Enter the following code and run it.
Click the deploy button and run it after the lambda function is deployed.
Go to the executive result tab, and we can see that the required output is shown below.
Troubleshooting
Check the following items if an error occurs.
-
S3 Get object permission in the lambda role permission
-
S3 Bucket policy granting get object permission.
-
Correct the bucket name and key name of the object mentioned in the lambda code.
-
Check whether the IAM user has had lambda invoke permission or not.
-
Check whether the SQL expression in the lambda code is valid or not.
Using S3 Select to Perform a Query
We can perform the query using the console, AWS SDK, or REST APIs.
Illustration
Amazon S3 Select using AWS Console
Requirement: CSV file uploaded to an S3 bucket.
Go to the AWS S3 console and upload a CSV file.
Once the file is successfully uploaded, click the query with the S3 select option mentioned in the action button.
Select the below options as mentioned in the below image.
Input Setting
Format: CSV
Delimiter-Comma
Compression-None
Output Setting
Format: CSV
Delimiter-CVV
Enter the below query
The above query will return the result as shown in the below image.
Furthermore, AWS provides some predefined queries to support basic use cases as well.
Cost Of S3 Select
Amazon S3 Select the cost for scanning the data across the standard storage class costs $0.0025. The cost for the data returned by the Amazon S3 Select will vary.
The below price is based on the AWS Mumbai region.
Amazon S3 Select = $0.0004 per 1000 SELECT commands
S3 Class | Data scanned (price per GB) | Data returned (price per GB) |
---|---|---|
S3 Standard | $0.0025 | $0.0009 |
S3 Intelligent - Tiering | $0.0025 | $0.0009 |
S3 Standard - Infrequent Access | $0.0025 | $0.01 |
S3 One Zone - Infrequent Access | $0.0025 | $0.0 |
An example scenario of Amazon S3 Select's cost
Assume that the customer has 1 GB of storage in S3 Standard. The object size is also 1 GB. After querying the data using S3 select, the data scanned and returned by S3 select is also 1GB. Calculate the total cost incurred by S3 Select.
The calculation is as follows:
S3 Select command= $0.0004 per 1000 commands Data scanned by S3 Select is $0.0025 per GB. Data returned by S3 Select= $0.0009 per GB
Total = 0.0025 + 0.0038
Benefits and Limitations of S3 Select
- Amazon S3 Select helps customers filter the data and perform a scan of the data in the storage layer where the data lives.
- Amazon S3 Select provides the solution to the querying job quicker and more cost-effectively.
- Amazon S3 Select plays an important role in accelerating performance in Big Data.
- Amazon S3 Select helps customers and developers access the data in a way that is much more optimized for analytics and data lakes environments.
- Customers can store data in S3 as tables and scan across those tables using Amazon S3 select. It will return the relevant results.
- The larger the size of the object, the higher the time for the query.
- It is not suitable for complex analysis and joins expression.
- Amazon S3 Select is serverless and highly available, so customers don’t have to worry about managing the infrastructure for querying in S3.
- AWS suggests that customers go with S3 Select instead of the S3 Get Action API for better results and application performance.
Amazon S3 Select vs Athena
No | S3 Select | Athena |
---|---|---|
1 | Query one object at a time | Query multiple objects or an entire bucket at a time |
2 | Limited to SELECT Clause | Supports multiple clauses such as DDL and DML. |
3 | It does not support Joins and Groupings | It supports Joins and Grouping |
4 | It provides Optimized performance for larger dataset Objects in a single query time. | It is used in various domains such as data analytics and Data pipelines. |
5 | It provides better results for querying a single large object than Athena. | It is considered one of the big data tools. It typically supports storage capacities ranging from gigabytes to petabytes. |
6 | Only the S3 data source is supported. | It supports various data sources. |
7 | It is a serverless service, which means there is no requirement to manage any infrastructure. | It is also serverless. To perform big data tasks, it uses a presto engine under the hood. |
Amazon S3 Select vs Presto
No | S3 Select | Presto |
---|---|---|
1 | It is a proprietary service provided by AWS | It is an Open Source Distributed Query Engine governed by Linux Foundation and some member companies |
2 | It supports data sources as Amazon S3 only | It supports nearly 26 open-source connectors |
3 | It has limited SQL dialect | It has a comprehensive SQL Dialect |
4 | It supports CSV, JSON, and Apache Parquet format only. | It supports many formats such as SequenceFile, RCFile, ORC, Delimited, CSV JSON, Parquet, and Avro. |
5 | It works for a single large object. | It usually works with large-scale data applications and data domains such as Big Data and Data Analytics |
6 | The S3 select command can be used to query AWS S3 and AWS Outpost S3. | It supports running on-premises or in the cloud (AWS, Azure, GCP, or OCI) |
7 | The pricing depends on the scanning and returning size of the object. | There is no cost for downloading and using the engine. |
Conclusion
- The Amazon S3 Select feature is simple to use, faster in performance, and cheaper when compared to doing the query manually.
- Amazon S3 Select SQL expressions are highly optimized. As a result, it will take much less time than any other tool, such as Amazon Athena.
- Customers must have S3 Object permission to query with the S3 Select.
- Customers can utilize S3 Select using the AWS SDK for Java, AWS SDK for Python, AWS Console, Rest APIs, and AWS CLI.
- Amazon S3 Select is widely implemented in the domains of data analytics, data warehouse, data streaming, data pipelines, and Big Data.