SparkSQL
SparkSQL adds the ability to specify a schema for RDDs, run SQL queries over them, and even create, delete, and query tables in Hive, the original SQL tool for Hadoop. Support was also added recently for parsing JSON records, inferring their schema, and writing RDDs back out in JSON format.
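As a rough illustration of those pieces together, here is a minimal sketch assuming the Spark 1.x `SQLContext` API (a `HiveContext` would be used instead for Hive tables). The file names, table name, and query are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQLSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SparkSQL sketch"))
    val sqlContext = new SQLContext(sc)

    // Parse JSON records (one object per line) and infer their schema.
    // "people.json" is a hypothetical input file.
    val people = sqlContext.jsonFile("people.json")

    // Register the result as a temporary table so SQL can be run against it.
    people.registerTempTable("people")

    // An ordinary SQL query over the inferred schema.
    val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 21")

    // Write the query results back out as JSON records.
    adults.toJSON.saveAsTextFile("adults.json")

    sc.stop()
  }
}
```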
Several years ago, the Spark team ported the Hive query engine to Spark, calling it Shark. That port is now deprecated; SparkSQL will replace it once it is feature-compatible with Hive. The new query planner is called Catalyst.
SparkStreaming
Spark Streaming uses a clever hack: it runs more or less the same Spark API (or code that at least looks conceptually the same) on deltas of data, say, all the events received within each one-second interval (the interval we used here). Deltas of one second to several minutes are most common. Each delta of events is stored in its own RDD, encapsulated in a DStream ("Discretized Stream").
A StreamingContext wraps the normal SparkContext.
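The following sketch shows both ideas: the StreamingContext wraps the SparkContext with a one-second batch interval, and the familiar RDD-style operations are applied to each delta of events. The socket source on port 9999 is just an assumed example input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Streaming sketch"))

    // Wrap the normal SparkContext; each one-second delta of events
    // becomes its own RDD inside a DStream.
    val ssc = new StreamingContext(sc, Seconds(1))

    // A hypothetical source: text lines arriving on a local socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The same conceptual API as batch Spark, applied per delta:
    // a word count over each one-second batch of events.
    val counts = lines.flatMap(line => line.split("""\W+"""))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```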