Amazon Redshift: Best practices for optimizing query performance
by Rackspace Technical Staff
Originally published on June 17, 2020 at the Onica.com blog.
Organizations across a plethora of industries look to use data analytics for operations and other functions critical to success. However, as data volumes grow, management and value extraction can become increasingly complex.
Amazon Redshift
Amazon® Redshift® is a powerful data warehouse service from Amazon Web Services® (AWS) that simplifies data management and analytics. Let’s take a look at Amazon Redshift and some best practices you can implement to optimize data querying performance.
Data lakes versus Data warehouse
Before digging into Amazon Redshift, it's important to know the differences between data lakes and warehouses. A data lake, like Amazon S3, is a centralized data repository that stores structured and unstructured data, at any scale and from many sources, without altering the data. On the other hand, Data warehouses store data in a reconciled state optimized to perform ongoing analytics and only load the data needed for analytics from data lakes.
Amazon Redshift takes storage for data analytics one level further, amalgamating the qualities of data lakes and warehouses into a “lake house” approach. It allows the querying of large exabyte-scale data lakes while remaining cost-effective, minimizing data redundancy, and minimizing maintenance overhead and operational costs.
Amazon Redshift architecture
To rapidly process complex queries on big data sets, Amazon Redshift architecture supports massively parallel processing (MPP) that distributes the job across many compute nodes for concurrent processing. These nodes are grouped into clusters, and each cluster consists of three types of nodes:
- Leader Node: These manage connections, act as the SQL endpoint, and coordinate parallel SQL processing.
- Compute Nodes: Composed of *slices*, these execute queries in parallel on data stored in a columnar format and in 1 MB immutable blocks. An Amazon Redshift cluster can contain between 1 and 128 compute nodes, portioned into slices that contain the table data and act as a local processing zone.
- Amazon Redshift Spectrum Nodes: These execute queries against an Amazon S3 data lake.
Optimizing query performance
By bringing the physical layout of data in the cluster into congruence with your query patterns, you can extract optimal querying performance. If Amazon Redshift is not performing optimally, consider reconfiguring workload management.
Reconfigure workload management (WLM)
Often left in its default setting, tuning WLM can improve performance. You can automate this task or perform it manually. When automated, Amazon Redshift manages memory usage and concurrency based on cluster-resource usage. It allows you to set up eight priority-designated queues. When done manually, you can adjust the number of concurrent queries, memory allocation, and targets.
You can also optimize querying performance through the following WLM configuration parameters:
- Query monitoring rules help you manage expensive or runaway queries.
- Short query acceleration helps you prioritize short-running queries over longer-running queries by using machine learning algorithms to predict querying execution time.
- Concurrency scaling helps you add multiple transient clusters in seconds to speed up concurrent read queries.
WLM best practices
Some WLM tuning best practices include:
- Creating different WLM queries for different types of workloads.
- Limiting maximum total concurrency for the main cluster to 15 or less to maximize throughput.
- Enabling concurrency scaling.
- Minimizing the number of resources in a queue.
Refining data distribution
The rows of a table are automatically distributed by Amazon Redshift across node slices, based on the following distribution styles:
- `AUTO`: Starts with ALL and switches to EVEN as the table grows.
- `ALL`: Consists of small, frequently joined, and infrequently modified tables placed on the first slice of each compute node.
- `EVEN`: Consists of large, standalone fact tables that are not frequently joined or aggregated in a round-robin distribution across the slices.
- `KEY`: Consists of frequently joined fact tables or large dimension tables. In this style, a column value is hashed, and the same hash value is placed on the same slice.
Using the right distribution patterns can maximize the performance of `JOIN`, `GROUP BY`, and `INSERT INTO SELECT` operations.
Refining data sorting
Sort keys define the physical order of data on a disk. Table columns used in `WHERE` clause predicates are a good choice for sort keys and commonly use date or time-related columns. Use Zone maps, stored in memory and generated automatically, to define the value extremes for each block of data. Effectively using sort keys and zone maps together can help you restrict scans to the minimum required number of blocks. The following diagram illustrates how table sorting focuses on scanning targets for time-based queries, thereby improving query performance.
Optimal query performance best practices
Using the previously mentioned Amazon Redshift changes can improve query performance and improve cost and resource
efficiency. Here are some more best practices you can implement for further performance improvement:
- Use `SORT` keys on columns that are often used in `WHERE` clause filters.
- Use `DISTKEY` on columns that are often used in `JOIN` predicates.
- Compress all columns except the first sort-key column.
- Partition data in the data lake based on query filters such as access patterns.
To explore some more best practices, take a deeper dive into the Amazon Redshift changes, and see an example of an in-depth query analysis, read the AWS Partner Network (APN) Blog.
Recent Posts
Building a Google Cloud Landing Zone in a Scalable, Repeatable and Secure Way (Part 1)
August 28th, 2024
Building a Google Cloud Landing Zone in a Scalable, Repeatable and Secure Way (Part 2)
August 28th, 2024
Rackspace Technology response to Crowdstrike July Incident
July 19th, 2024