Big data is becoming a big headache, real fast. Traditional approaches to data have been shattered across distributed data systems that can handle the volume. But now, instead of querying in one place, we need to figure out how to use these distributed systems effectively to get a single answer from many places. There are a variety of protocols and methods for tackling this big data issue.
At SpringOne 2GX, the chief architect of vFabric GemFire and SQLFire, Jags Ramnarayan, and Senior Systems Engineer Guillermo Tantachuco spoke specifically about NewSQL data strategies and how they can be used at scale.
The talk started off with Ramnarayan’s foundational premise of NewSQL: a combination of traditional SQL and NoSQL. NoSQL is a response to the changing structure of data and of data structures themselves. NewSQL shares NoSQL’s goal of high-scale, high-performance transactions achieved through memory orientation and distribution of data, but it adds the ability to query by traditional SQL standards. Both models are emerging because they meet a challenge that the traditional RDBMS does not. The original RDBMS architectures and designs were based on scenarios where networks were unreliable and memory was scarce. At the time, the approach was sound and focused on issues around I/O, ACID requirements, and disk sync bottlenecks. While these systems were optimized for I/O, they wasted memory. While ACID was important, it created locking scenarios under concurrent transactions. Buffers, too, were primarily tuned for I/O, first writing to a log and then to data files.
In the real world, particularly with web-based applications, we don’t have consistent transaction profiles. This is why it is so hard to co-locate independent databases on a single server. Usage is not predictable—spikes happen, lulls come, query popularity changes, etc. When resource consumption varies, our performance degrades because we cannot optimize for things we cannot predict. For example, a massive scan can wipe out our buffers and cause a performance hit.
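The buffer wipe-out mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `LRUBufferPool` class, its capacity, and the page numbers are hypothetical and do not model any particular database's buffer manager. It assumes a simple least-recently-used eviction policy.

```python
from collections import OrderedDict


class LRUBufferPool:
    """Toy buffer pool (hypothetical): holds `capacity` pages, evicts the least recently used."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_id)
            self.hits += 1
        else:
            # Cache miss: load the page and evict the oldest one if over capacity.
            self.misses += 1
            self.pages[page_id] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)


def hit_ratio_for(pool, page_ids):
    """Read the given pages and return the hit ratio for just this batch."""
    hits_before, misses_before = pool.hits, pool.misses
    for page_id in page_ids:
        pool.read(page_id)
    hits = pool.hits - hits_before
    misses = pool.misses - misses_before
    return hits / (hits + misses)


pool = LRUBufferPool(capacity=100)
hot_pages = list(range(50))                 # the small working set of a web application

hit_ratio_for(pool, hot_pages)              # warm-up pass: every read is a miss
warm_ratio = hit_ratio_for(pool, hot_pages * 10)

hit_ratio_for(pool, range(1000, 2000))      # one massive scan over 1,000 cold pages
after_scan_ratio = hit_ratio_for(pool, hot_pages)

print(warm_ratio, after_scan_ratio)
```

In this toy model the working set is served entirely from memory until the scan runs, after which every hot page has to be loaded again. That is the kind of unpredictable response time the talk argues against.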
Ramnarayan proposed that, instead of looking purely at scalability as the solution, we make a slight adjustment to our perspective and look primarily at predictable, consistent, and highly available query response times. After all, a query is a service that should have a service level. Right? We should not have to design the addition of scale into a data store architecture, predict spikes, and then optimize around them. Instead, we should expect response times to simply be consistent, with scale as something that just happens in the background. This model of a cloud data fabric fits an analogy that has been in use since the term “application service provider” was coined back in the 90s, and since the “service bureaus” of a bygone era decades before: your data service should work like a power utility. With a power utility, you plug in a cord, and the power comes on. You don’t have to wait for power to show up. Even if more power is being used in your neighborhood or city, you still have power available on demand. In the same way, your query should just respond without waiting, regardless of other demand.
With this shift in perspective, we can begin to see how themes in the next generation of databases emerge. These systems use shared-nothing architectures on commodity clusters, in-memory data storage, and distributed data. When partitioning the data across nodes, we are able to optionally move the behavior (the processing logic) to where the data is. To keep distributed data highly available, we must keep data in sync across the cluster and even across data centers. This means high availability and network optimization must become the norm instead of the exception. When data availability spans both racks and data centers pervasively, the data engine can move data, writes, and reads across the network to different nodes in different situations and without constraints.
While the shift in perspective opens the door to a different way of solving problems, these are not easy problems to solve—imagine moving a terabyte of data in large chunks, staying consistent, and avoiding bottlenecks due to locking. These are the capabilities needed to allow us to add and remove capacity through horizontal scale on the data tier.
Ramnarayan also pointed out how sharding impacts elasticity. In many NoSQL approaches, people address scale problems by sharding. Sharding certainly works, but it has drawbacks. With sharding, you still have to address single points of failure, and you also need appropriate failover, backups, and operational management. For example, let’s say you spread data into shards by keying on customer name, where A-J is shard 1, K-S is shard 2, and T-Z is shard 3. Each of these shards still needs to be replicated for availability, backed up, and managed. While these considerations add complexity to sharded data, one of the most significant costs over time is the complexity of the SQL development needed to handle sharding logic. The application layer begins to take on responsibility for logic involving data management. In our example, the application has to know which customer is in which shard for all queries and use cases. What happens if you need to add a shard? Do you add new code too?
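To make the coupling concrete, here is a minimal sketch of what that application-side routing might look like for the A-J / K-S / T-Z example. The names `SHARD_RANGES` and `shard_for_customer` are hypothetical and chosen for illustration only. They are not part of any product discussed in the talk.

```python
# Hypothetical routing table living in application code:
# (first letter low bound, first letter high bound, shard name).
SHARD_RANGES = [
    ("A", "J", "shard1"),
    ("K", "S", "shard2"),
    ("T", "Z", "shard3"),
]


def shard_for_customer(name):
    """Return the shard that holds a customer, keyed on the first letter of the name."""
    first = name.strip()[:1].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= first <= high:
            return shard
    raise ValueError(f"no shard configured for customer name {name!r}")


print(shard_for_customer("Alice"))  # shard1
print(shard_for_customer("Kim"))    # shard2
print(shard_for_customer("Zed"))    # shard3
```

Adding a fourth shard means editing `SHARD_RANGES`, redeploying every application that embeds it, and migrating the rows whose range changed. This is the "do you add new code too?" problem in miniature.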
This approach can be a problem for development teams because no one can predict in advance how apps and data will change. Again, we should assume that life evolves, markets change, features get added, new data comes into the mix, and so on. With that in mind, the challenge is querying across partitions or shards. On day one, you might have your shards clearly organized, data separated intelligently, and application queries to match. Then someone says the application should query the data another way, and you find your team writing a very expensive nested loop join. Any new capabilities around cross-shard aggregation, ordering, and grouping will also have to be hand-coded. If small data views or marts are stood up to compensate, these intermediate data sets become a problem for cost-effective stewardship of the application and data.
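As an illustration of the hand-coding involved, the sketch below answers one cross-shard question, "total order amount per region, largest first", over data that was sharded by customer name. Everything here is assumed for the example: the in-memory `SHARDS` dictionary stands in for three separate databases, and the rows, regions, and amounts are made up.

```python
from collections import Counter

# Hypothetical stand-in for three separate shard databases, sharded by customer name.
SHARDS = {
    "shard1": [
        {"customer": "Alice", "region": "EU", "amount": 120},
        {"customer": "Bob", "region": "US", "amount": 80},
    ],
    "shard2": [
        {"customer": "Kim", "region": "EU", "amount": 50},
        {"customer": "Omar", "region": "APAC", "amount": 210},
    ],
    "shard3": [
        {"customer": "Tara", "region": "US", "amount": 70},
        {"customer": "Zed", "region": "EU", "amount": 30},
    ],
}


def partial_totals(rows):
    """Per-shard step: what a GROUP BY region would compute on one shard."""
    totals = Counter()
    for row in rows:
        totals[row["region"]] += row["amount"]
    return totals


def total_by_region(shards):
    """Application-side step: merge the partial results, then order them."""
    merged = Counter()
    for rows in shards.values():
        merged.update(partial_totals(rows))  # Counter.update adds the amounts together
    # The global ORDER BY also has to be done by hand after the merge.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


result = total_by_region(SHARDS)
print(result)  # [('APAC', 210), ('EU', 200), ('US', 150)]
```

A single SQL statement with `GROUP BY` and `ORDER BY` would express the same query against one database. With shards, the merge and sort logic moves into the application and has to be rewritten for each new query shape.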
So, if we know these types of challenges will eventually become reality, we should remove the need for developers to write code that compensates for them, and we should also address the expense of cross-partition transactions, where we might lose atomicity and isolation. Instead of coupling the application tightly to the data layout, you just need to be able to expand, contract, or move data (and its associated locks) within clusters on demand. In this manner, sharding just happens in the background, and the application can stay independent of the data layer.
In other words, adding a new node doesn’t mean you need to add new code!
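One common way to get this property is to hash keys into a fixed number of buckets and let the cluster, not the application, decide which node owns each bucket. The sketch below is a toy model of that idea under stated assumptions. The `Cluster` class, `NUM_BUCKETS`, and the rebalancing rule are all hypothetical, and this is not a description of how SQLFire is implemented internally.

```python
import zlib

NUM_BUCKETS = 12  # assumed fixed bucket count; keys map to buckets, buckets map to nodes


class Cluster:
    """Toy data tier (hypothetical): the application only ever calls put() and get()."""

    def __init__(self, node_names):
        names = list(node_names)
        self.nodes = {name: {} for name in names}  # node -> {bucket_id: {key: value}}
        self.bucket_owner = {}                     # bucket_id -> node
        for bucket in range(NUM_BUCKETS):
            owner = names[bucket % len(names)]
            self.bucket_owner[bucket] = owner
            self.nodes[owner][bucket] = {}

    @staticmethod
    def _bucket(key):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

    def put(self, key, value):
        bucket = self._bucket(key)
        self.nodes[self.bucket_owner[bucket]][bucket][key] = value

    def get(self, key):
        bucket = self._bucket(key)
        return self.nodes[self.bucket_owner[bucket]][bucket].get(key)

    def add_node(self, name):
        """Rebalance by moving whole buckets from the fullest nodes to the new node."""
        self.nodes[name] = {}
        target = NUM_BUCKETS // len(self.nodes)
        while len(self.nodes[name]) < target:
            donor = max(self.nodes, key=lambda n: len(self.nodes[n]))
            bucket, data = self.nodes[donor].popitem()
            self.nodes[name][bucket] = data
            self.bucket_owner[bucket] = name


cluster = Cluster(["node1", "node2", "node3"])
customers = ["Alice", "Bob", "Kim", "Omar", "Tara", "Zed"]
for customer in customers:
    cluster.put(customer, {"name": customer})

cluster.add_node("node4")  # scale out; no application code changes

still_readable = all(cluster.get(c) == {"name": c} for c in customers)
buckets_per_node = sorted(len(buckets) for buckets in cluster.nodes.values())
print(still_readable, buckets_per_node)
```

The calling code before and after `add_node` is identical. Only the bucket-to-node table inside the data tier changes. A real system would also have to handle replication, in-flight transactions, and locks during the move, which this sketch ignores.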
In the demonstration, Tantachuco shows how vFabric SQLFire addresses elastic scale and intelligent sharding, and how it can even handle distributed transactions. Most importantly, the demonstration shows SQLFire moving and rebalancing data as new nodes are added. Here is the video from SpringOne 2GX, and the key points from the demonstration portions are summarized further below.
The demonstration walks through an example of a star schema for the airline industry and a 3 VM cluster. The following capabilities are explained in detail:
When scaling the data tier, adding compute nodes is not enough. One must also figure out how to distribute the data and how the application software itself will scale. This takes an investment in smart engineers, architects, developers, and technologies.
For more information on SQLFire: