Lessons learned from grid computing: Moving data is expensive

The sweet spot for grid computing is pure in-memory processing, moving business logic to distributed data instead of moving data to business logic. Such applications crunch millions of entries per second and can scale horizontally. However, this requires that data is present and partitioned in memory. There are several pitfalls while moving data to grid computing heaven.

Consider this example configuration:

  • The documents to be processed arrives as files, typically in XML, EDIFACT or CSV format.
  • Resilience by replication between two different data centers.
  • Data is archived in a relational database.
  • The result from the processing is transferred to adjacent applications such as the data warehouse.

 

The first operation is to parse the incoming data and get it into the grid. Putting single entities is limited to ~10 per second; a 20M dataset would take 23 days. Bulk inserts using an API call such as putAll(…) provides thousands of entities per second; with 5000/sec the 20M dataset is done in a good hour. Tens of thousands per second would really take us off the chart, but unfortunately loading data is not straightforward to scale horizontally.

As entities arrive in the grid, they are replicated across the data centers for redundancy. To avoid backups from detaining inserts, they need to have the same throughput as the inserts themselves. Synchronous backups guarantee that both the primary copy and backups are complete before returning control to the client. This will kill your insert performance. Asynchronous backups provide a best effort alternative, with no guarantees on consistency or timing. Look for a grid platform that has optimized the backups similar to bulk loading, or support changing from asynchronous backups while bulk loading to synchronous backups during normal operation.

If you need to make sure all data is persisted to an RDBMS, archiving data also need to be optimized. Grid platforms support transparent persistence via a pluggable MapStore API. Configured as write-through the entities are persisted as they arrive. No surprise, this will kill your performance for large volumes. Write-behind is an asynchronous mechanism, persisting all entities marked as dirty. Implemented with the JDBC batch API, a MapStore throughput of thousands per second is possible. Additionally, parallel execution is attainable as each grid node persists its own entities. The cost of write-behind is reduced consistency control; no guarantees are given that all data is persisted.

Finally, we need to get data out of the grid. Streaming data to clients is the best way to achieve high throughput, while an API call such as getAll(keys) can be used for reading local data per node. Instead of pushing data to client endpoints, an Atom event feed can be used as described in REST in Practice: How to use Atom for Event-Driven Systems. This approach improves robustness and reduces coupling as well as providing high throughput. Each node can produce its own event feed, which is concatenated and published as an Atom feed for clients to consume. Again, we need to use bulk transfer techniques to exceed thousands of entities per second; fetching items one by one does not scale.

Concluding remarks:

  • Moving data is expensive, whether it is through a network, to a file or a database. Is the cost really worth the benefit?
  • Bulk transfer is necessary to exceed a throughput of thousands of entities per second.
  • Asynchronous operations is one way to achieve bulk transfer, sacrificing control and consistency.
  • Some operations can be hard to scale horizontally.
  • Look for a grid platform that supports bulk operations for your needs.
  • Grid computing is a good fit when data is to live in memory for a long time and processed several times. Otherwise, take a look at streaming platforms like Twitter Storm.