Unlike traditional disk-based database architectures, SQLFire manages its data structures and indexes in memory, optimizing for distributed data access rather than for disk I/O. SQLFire throughput can be measured as the number of operations on a single node, or as the number of aggregated operations across an entire cluster. Even when persisting data to disk, SQLFire does not seek to a disk location or flush data all the way to disk. The SQLFire architecture introduces no single point of contention, whether in disk seeks, CPU cycles, or network access.
In comparison, relational databases typically seek and flush every transaction to disk, limiting the maximum throughput to approximately 200 writes per second on commodity hard disks. Even solid state drives cannot improve throughput significantly, because of the lock contention that is inherent in traditional RDBMS architectures.
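The 200-writes-per-second ceiling follows from simple arithmetic, assuming roughly 5 ms of combined seek and rotational latency per synchronously flushed write (an assumed round figure for a commodity drive):

```python
# Assumed per-write cost for a commodity hard disk: combined seek plus
# rotational latency of about 5 ms for each synchronously flushed write.
per_write_ms = 5.0
writes_per_second = 1000.0 / per_write_ms
print(writes_per_second)  # -> 200.0
```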
All SQLFire data can be managed in memory, with one or more copies spread across the cluster. Tables can be replicated in memory across several SQLFire instances, or partitioned horizontally across instances with redundancy. SQLFire replicates data in memory using a distribution protocol with minimal overhead. Clustered applications can access data concurrently on multiple nodes when the data is infrequently updated. For large datasets, or datasets that are frequently updated, you can partition a table across many server instances and configure redundant copies for high read and write throughput.

When cluster capacity (CPU or memory) exceeds a certain threshold, you simply add more servers. The additional servers can automatically rebalance the data and behavior so that the entire application scales linearly. In applications where data access is balanced across the entire data set, doubling the cluster capacity usually doubles the aggregate throughput. A linear increase in throughput is limited only by the backbone network capacity (typically the network switch, rather than a node's full-duplex Gigabit Ethernet link).

SQLFire uses highly optimized data structures designed to minimize context switches between threads and processes. Contention points are minimized by managing data in concurrent structures. Communication to peer members is synchronous, which minimizes the latency of data distribution.
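The idea of partitioning with redundancy can be sketched as follows. This is an illustrative hash-partitioning scheme, not SQLFire's actual implementation; the names `bucket_for` and `owners_of`, the bucket count, and the server list are all hypothetical:

```python
import hashlib

SERVERS = ["server1", "server2", "server3", "server4"]
NUM_BUCKETS = 113   # fixed bucket count: buckets move between servers, hashes don't
REDUNDANCY = 1      # one extra copy of each bucket (assumed configuration)

def bucket_for(key: str) -> int:
    # Deterministically map a row key to a bucket.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def owners_of(bucket: int, servers=SERVERS):
    # Primary owner plus REDUNDANCY secondaries, each on a distinct server.
    start = bucket % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(REDUNDANCY + 1)]

primary, secondary = owners_of(bucket_for("order-1001"))
assert primary != secondary   # the redundant copy lives on a different server
```

Because the key-to-bucket mapping is fixed, adding capacity only requires moving whole buckets to new servers, which is what makes near-linear scaling possible.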
As with common relational databases, SQLFire enables applications to execute stored procedures in the database to move data-intensive logic close to where that data resides. SQLFire extends this functionality by making application logic, as well as data, highly available. Procedure execution is routed transparently to the appropriate data nodes, which avoids moving data across the network and increases application throughput.
Parallel execution is implemented in the following ways:
When stored procedures are invoked, the application can supply a hint: a single key, a set of keys, or a set of filters ("where" clauses). If the data set is partitioned across several servers, the procedure invocation is parallelized to the servers that host the hinted data. Each parallel execution can stream its results back to a coordinating server, which aggregates the results using an application-supplied callback. This process is similar to the "map-reduce" pattern, in which data-intensive operations execute in parallel and a reduction phase computes a single result. The application that invokes the stored procedure is abstracted from the details of parallel execution and reduction, and any failure in procedure execution (such as a server crash) automatically results in idempotent re-execution of the procedure.

SQLFire servers execute in a Java virtual machine, and all stored procedures must be written in Java. Stored procedures execute in-process with the data, providing direct access to data in the process heap. SQLFire can also be embedded as a data server within popular containers such as Spring or Java EE. When used within a Spring container, for example, aspects can be developed so that bean method invocations are routed to the node that manages the data set the bean requires.
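The map-reduce shape of data-aware procedure execution can be sketched like this. The sketch fans a call out only to the servers hosting the hinted keys, then reduces the partial results on a coordinator. All names here (`NODE_DATA`, `call_procedure`, `local_procedure`) are hypothetical; real SQLFire procedures are Java stored procedures:

```python
NODE_DATA = {                       # server -> its local partition of a table
    "server1": {"a": 10, "b": 5},
    "server2": {"c": 7, "d": 3},
}

def local_procedure(partition, keys):
    # "Map" phase: runs where the data lives, touching only local rows.
    return sum(v for k, v in partition.items() if k in keys)

def call_procedure(hint_keys, reducer=sum):
    # Route the invocation only to nodes that host at least one hinted key.
    targets = [n for n, part in NODE_DATA.items()
               if any(k in part for k in hint_keys)]
    partials = [local_procedure(NODE_DATA[n], hint_keys) for n in targets]
    return reducer(partials)        # "Reduce" phase on the coordinating server

print(call_procedure({"a", "c"}))   # -> 17
```

The caller sees an ordinary function call; routing, parallel execution, and reduction happen behind it.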
SQLFire achieves scalability by partitioning data dynamically across many member nodes, distributing the load uniformly across servers, and supporting "elastic" data clusters. Servers can go online and offline as necessary to serve a growing or shrinking demand pattern. When a new server announces its availability, SQLFire initiates a non-blocking algorithm to determine the subset of data that should be moved to the new server. This is achieved without introducing contention points for current application clients. Data is first copied, and any updates to the data being copied are applied simultaneously to the new member to ensure data coherency. By using SQLFire's built-in instrumentation, which captures throughput, latency metrics, and CPU/network/memory utilization, you can program applications to detect changing performance patterns, proactively add resources, and rebalance the data. This enables you to provide predictable data access throughput and latency without having to over-provision the cluster.
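The rebalancing idea can be sketched as moving just enough buckets to even out the load when a member joins, rather than reshuffling everything. The `rebalance` helper below is hypothetical, not a SQLFire API:

```python
def rebalance(assignment, new_server):
    # assignment: server name -> list of bucket ids it hosts
    assignment[new_server] = []
    while True:
        biggest = max(assignment, key=lambda s: len(assignment[s]))
        smallest = min(assignment, key=lambda s: len(assignment[s]))
        if len(assignment[biggest]) - len(assignment[smallest]) <= 1:
            return assignment       # load is as even as it can get
        # Copy one bucket to the lightest member, then retire the old copy.
        # In SQLFire, updates arriving during the copy are applied to both
        # copies, so no client blocks while data moves.
        assignment[smallest].append(assignment[biggest].pop())

a = rebalance({"s1": [0, 1, 2, 3], "s2": [4, 5, 6, 7]}, "s3")
print(sorted(len(v) for v in a.values()))   # -> [2, 3, 3]
```

Only a minority of buckets move, which is why adding a server does not disturb traffic to data that stays put.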
You can manage replicated and partitioned tables completely in memory, or both in memory and on disk. In SQLFire, each member node persists data in disk files independently of the other members. Disk or cache failures in one node do not prevent other instances from operating safely on their own disk files. This "shared nothing" persistence architecture allows applications to be configured so that tables are persisted on different nodes across the cluster, reducing the impact of disk latencies.
SQLFire data management on disk is fundamentally different from disk-oriented databases where each update results in at least two writes to disk: one write to a "redo" or "undo" log file and one write to the data files themselves. SQLFire does not manage data and transaction logs in separate files. Instead, all data updates are appended to files that are similar to the transaction logs of traditional databases.
Table update operations are only appended to rolling log files, using an algorithm that requires neither a flush all the way to disk nor a seek to a disk location. Query operations primarily use in-memory indexes and in most cases do not require disk access. Writes are buffered by the kernel's I/O subsystem, allowing the I/O scheduler to merge and sort disk writes to achieve the highest possible disk throughput. You mitigate the risk of catastrophic machine failure by ensuring that at least two machines write synchronously to the disk buffers. The result is throughput that approaches the maximum disk transfer rate on each node in the cluster.
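A minimal sketch of this append-only persistence model: every update is appended to a log (no in-place data-file write, no separate redo log), and recovery replays the log into the in-memory table. The class, the file layout, and the JSON record format are illustrative only, not SQLFire's actual on-disk format:

```python
import json
import os
import tempfile

class AppendOnlyStore:
    def __init__(self, path):
        self.path = path
        self.table = {}                      # in-memory view of the data

    def put(self, key, value):
        with open(self.path, "a") as log:    # append only: no seek required
            log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.table[key] = value

    def recover(self):
        self.table = {}
        with open(self.path) as log:
            for line in log:                 # replay the log; last write wins
                rec = json.loads(line)
                self.table[rec["k"]] = rec["v"]

path = os.path.join(tempfile.mkdtemp(), "oplog.jsonl")
store = AppendOnlyStore(path)
store.put("row1", "alpha")
store.put("row1", "beta")                    # an update appends, never rewrites
fresh = AppendOnlyStore(path)
fresh.recover()
print(fresh.table)                           # -> {'row1': 'beta'}
```

Because every write is a sequential append, the kernel's I/O scheduler can batch them, which is what pushes throughput toward the disk's transfer rate rather than its seek rate.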
You can use SQLFire as a data store or as a middle-tier cache for an existing database system. SQLFire provides a caching framework that has the following capabilities:
SQLFire enables you to replicate or partition any table while maintaining one or more redundant copies. To provide this level of availability without slowing down applications, SQLFire quickly detects and processes machine failures, unresponsiveness, and even network partitions. SQLFire uses multiple TCP and UDP-based protocols to detect failure conditions within a predictable time period. It uses a coordinator to maintain a consistent view of the distributed system. When a server fails or leaves the cluster unexpectedly, the other servers recognize the change and maintain coherency across the system.
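A toy illustration of timeout-based failure detection: a member is suspected after a missed-heartbeat window and declared failed after a longer confirmation window, which bounds how long detection can take. The intervals, thresholds, and `member_state` function are assumptions for illustration; SQLFire's real protocols run over TCP and UDP:

```python
HEARTBEAT_INTERVAL = 1.0    # seconds between expected heartbeats (assumed)
SUSPECT_AFTER = 3           # missed beats before a member is suspected
CONFIRM_AFTER = 5           # missed beats before it is declared failed

def member_state(last_heartbeat, now):
    missed = (now - last_heartbeat) / HEARTBEAT_INTERVAL
    if missed >= CONFIRM_AFTER:
        return "failed"     # coordinator removes it; survivors stay coherent
    if missed >= SUSPECT_AFTER:
        return "suspect"    # verified over another channel before removal
    return "alive"

print(member_state(last_heartbeat=0.0, now=2.0))   # -> alive
print(member_state(last_heartbeat=0.0, now=4.0))   # -> suspect
print(member_state(last_heartbeat=0.0, now=6.0))   # -> failed
```

The two-stage suspect/confirm progression is what makes the detection time predictable: a member is never removed faster than the suspicion window, nor later than the confirmation window.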
SQLFire attempts to maintain the configured level of redundancy for tables automatically even when clustered servers fail. If a failed server does not reappear within a configurable time period, the lost copy of data is restored among the other cluster members if sufficient capacity is available. SQLFire uses three main features to enable continuous availability:
Unlike traditional solutions for disaster recovery, where the backup site is in standby mode, SQLFire supports active-active clustering over WAN boundaries. This means that application clients can read and write to either cluster, and can also fail over to either cluster. Given the asynchronous nature of replication, applications in such an environment should be designed to handle the potential for some inconsistency during a failover.
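A sketch of why active-active WAN replication implies a consistency window: each cluster applies writes locally and ships them to the remote site asynchronously through a queue. The `Cluster` class and queue mechanics here are illustrative, not SQLFire's WAN gateway implementation:

```python
from collections import deque

class Cluster:
    def __init__(self):
        self.data = {}
        self.wan_queue = deque()    # local updates awaiting async shipment

    def write(self, key, value):
        self.data[key] = value      # applied locally right away
        self.wan_queue.append((key, value))

    def drain_to(self, remote):
        # Ship queued updates to the other site (normally a background task).
        while self.wan_queue:
            k, v = self.wan_queue.popleft()
            remote.data[k] = v

east, west = Cluster(), Cluster()
east.write("price", 100)
print(west.data.get("price"))   # -> None: not replicated yet
east.drain_to(west)
print(west.data.get("price"))   # -> 100: consistent once the queue drains
```

A client that fails over from east to west between the write and the drain would briefly read stale data, which is exactly the inconsistency window the application must tolerate.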