Why is MLlib so much faster than Hadoop MapReduce? One reason is that Hadoop MapReduce is not optimized for iterative computation, while Spark is. Another is likely the highly efficient RDD data structure together with Spark's Vector data type. As the performance experiments below show, MLlib's Vector is optimized to the point that it can keep up with dedicated scientific libraries such as Breeze.
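
The benchmark code that produced the numbers below is not shown here, so the following Scala snippet is only a minimal sketch of how one Spark-versus-Breeze timing could be set up. The object name VectorAddSketch, the addSpark helper, and the timing loop are assumptions rather than the actual harness; in particular, the real experiments also spread 100000 rows over 3 RDD partitions, and the "Hadoop Vector" variant is omitted because its type is not specified in this section. Since MLlib's DenseVector is essentially a thin wrapper around an Array[Double] (and MLlib converts to Breeze internally for its own linear algebra), timings close to Breeze's are what one would expect.

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.mllib.linalg.{Vector => SparkVector, Vectors}

object VectorAddSketch {
  // MLlib's public Vector API exposes no arithmetic operators, so this
  // hypothetical helper adds two dense vectors via their element values.
  def addSpark(a: SparkVector, b: SparkVector): SparkVector = {
    val out = new Array[Double](a.size)
    var i = 0
    while (i < a.size) { out(i) = a(i) + b(i); i += 1 }
    Vectors.dense(out)
  }

  def main(args: Array[String]): Unit = {
    val dim  = 100   // matches the "Dimension: 100" runs below
    val reps = 25    // matches "Execution Times: 25 times"

    val xs = Array.fill(dim)(scala.util.Random.nextDouble())
    val ys = Array.fill(dim)(scala.util.Random.nextDouble())
    val (sa, sb) = (Vectors.dense(xs), Vectors.dense(ys))
    val (ba, bb) = (BDV(xs), BDV(ys))

    // Crude wall-clock timer; the original experiments additionally scale
    // this over 100000 rows in 3 RDD partitions, which is not reproduced here.
    def time(label: String)(body: => Any): Unit = {
      val t0 = System.nanoTime()
      var i = 0
      while (i < reps) { body; i += 1 }
      println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
    }

    time("Spark Vector addition")(addSpark(sa, sb))
    time("Breeze Vector addition")(ba + bb)
  }
}
```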

Results

Full results

===== Experiment: Addition
CPU cores: 8
Rows: 100000
Dimension: 100
Partitions: 3
Execution Times: 25 times
Spark Vector: 404 [msec]
Breeze Vector: 240 [msec]
Hadoop Vector: 703 [msec]
===== Experiment: Multiplication
CPU cores: 8
Rows: 100000
Dimension: 100
Partitions: 3
Execution Times: 25 times
Spark Vector: 180 [msec]
Breeze Vector: 162 [msec]
Hadoop Vector: 303 [msec]
===== Experiment: Addition
CPU cores: 8
Rows: 100000
Dimension: 100
Partitions: 3
Execution Times: 25 times
Spark Vector: 165 [msec]
Breeze Vector: 154 [msec]
Hadoop Vector: 303 [msec]
===== Experiment: Addition
CPU cores: 8
Rows: 100000
Dimension: 1000
Partitions: 3
Execution Times: 25 times
Spark Vector: 1165 [msec]
Breeze Vector: 879 [msec]
Hadoop Vector: 3485 [msec]
===== Experiment: Multiplication
CPU cores: 8
Rows: 100000
Dimension: 1000
Partitions: 3
Execution Times: 25 times
Spark Vector: 834 [msec]
Breeze Vector: 827 [msec]
Hadoop Vector: 2785 [msec]
===== Experiment: Division
CPU cores: 8
Rows: 100000
Dimension: 1000
Partitions: 3
Execution Times: 25 times
Spark Vector: 843 [msec]
Breeze Vector: 845 [msec]
Hadoop Vector: 2703 [msec]