Have you heard about the new super-efficient Pivotal Query Optimizer developed by the Greenplum engineering team? Previously codenamed “Orca”, this new feature has been released as part of the HAWQ query engine in Pivotal HD, Pivotal’s commercially-supported distribution of Apache Hadoop®.
This new optimizer has been undergoing months of performance testing and improvements and is nearly ready for market. Pivotal will be showcasing a peer-reviewed paper at the ACM SIGMOD Conference 2014, June 22 – 27, on the results of this performance study. Titled “Orca: A Modular Query Optimizer Architecture for Big Data”, the paper explains how the team built the query optimizer and shows the results seen so far in customer usage and ongoing testing. If you would like a copy of the paper and want to see the detailed benchmark results, ask at the Pivotal booth (booth S32) at this week’s Hadoop® Summit in San Jose.
Developing a query optimizer involves some very sophisticated computer science. The team wanted to create a new SQL-compliant query technology that was better suited to the trends we are seeing in big data.
This technology is laser-focused on providing fast SQL query results on petabytes of data and on being portable across data architectures, such as Pivotal HD and Pivotal Greenplum.
Along with further enhancements in the release of Pivotal HD 2.0, this new query optimizer allows customers to run fully ANSI SQL-compliant queries against Hadoop® up to 1000X faster than they could with Pivotal HD 1.0. Not only does it speed up your queries, it also makes Apache Hadoop® more practical for serious data science work. Faster queries in HAWQ open up more analytics use cases on Apache Hadoop®, and HAWQ comes with support for GraphLab, MADlib, languages such as R, Python and Java, and all-new support for Parquet files.
I’m pleased to be able to preview some of these test results with you in this blog, and for a specific purpose: Pivotal is looking for a few customers of Greenplum DB to help with final testing and validation of the new query optimizer. We’d love for you to join the early access program and experience for yourself the performance benefits and new use cases you can achieve with the new Pivotal Query Optimizer on Greenplum DB.
Validating the new Pivotal Query Optimizer includes performance testing against the TPC-DS benchmark. As mentioned, testing of Pivotal HD 2.0 versus Pivotal HD 1.0 against the benchmark showed that some of the queries had up to a 1000X improvement. More importantly, with the new query optimizer, Pivotal HD 2.0 is able to complete the entire benchmark of 111 queries. For the first time in the market, a commercially supported Apache Hadoop® stack can be used effectively for ad hoc analytical use cases and can leverage existing applications and expertise.
We ran similar performance tests of the new version of Greenplum DB against the prior version using the TPC-DS benchmark. GPDB configured with the Pivotal Query Optimizer, compared with GPDB configured to use the legacy query planner, showed an overall 5X improvement in running the entire benchmark of 111 queries. For some specific queries we see as much as a 1000X improvement; we timed out the test at that point.
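An overall figure and a per-query figure measure different things, which is why a benchmark can show a modest overall speedup alongside a very large speedup on individual queries. The short Python sketch below illustrates the arithmetic only. The query names and timings are made up for illustration and are not results from the benchmark runs described here.

```python
# Illustration only: these timings are invented, NOT benchmark results.
# Each entry maps a hypothetical query name to
# (seconds with the legacy planner, seconds with the new optimizer).
timings = {
    "q_simple": (10.0, 8.0),
    "q_join": (120.0, 30.0),
    "q_nested_window": (3000.0, 3.0),
}


def speedup(legacy_seconds, new_seconds):
    """Return how many times faster the new run is than the legacy run."""
    return legacy_seconds / new_seconds


# Per-query speedup: each query is compared only with itself.
per_query = {name: speedup(old, new) for name, (old, new) in timings.items()}

# Overall speedup: total legacy runtime divided by total new runtime.
total_legacy = sum(old for old, _ in timings.values())
total_new = sum(new for _, new in timings.values())
overall = speedup(total_legacy, total_new)

for name, factor in per_query.items():
    print(f"{name}: {factor:.2f}X")
print(f"overall: {overall:.2f}X")
```

Because the overall number is a ratio of total runtimes, queries that see little improvement pull it well below the best per-query figure.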
What many of these significantly improved queries have in common is layers of nested queries, often with window functions. We find these kinds of queries occur when users are working with advanced analytics packages such as SAS on top of Greenplum DB. We expect to see significant improvement in analysis times for users of these tools as we ramp up early access.
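For readers who have not met this query shape before, here is a minimal sketch of a nested query with a window function. It is not taken from the benchmark or from any customer workload: the `sales` table, its columns, and its rows are hypothetical, and the query runs against an in-memory SQLite database from Python's standard library (SQLite 3.25 or later is assumed for window function support) purely to show the structure, not Greenplum DB or HAWQ behavior.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "a", 100.0),
        ("east", "b", 300.0),
        ("east", "c", 200.0),
        ("west", "a", 50.0),
        ("west", "b", 400.0),
    ],
)

# Three layers: the innermost query aggregates, the middle query ranks
# products within each region using a window function, and the outer
# query keeps only the top-ranked product per region.
QUERY = """
SELECT region, product, amount
FROM (
    SELECT region, product, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM (
        SELECT region, product, SUM(amount) AS amount
        FROM sales
        GROUP BY region, product
    ) AS totals
) AS ranked
WHERE rnk = 1
ORDER BY region
"""

top_products = conn.execute(QUERY).fetchall()
print(top_products)
```

Analytics tools often generate queries like this automatically, stacking several such layers, which is where the choice of query plan starts to matter a great deal.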
Editor’s Note: Apache, Apache Hadoop, Hadoop, and the yellow elephant logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.