See the GitHub folder "wordcount".

Results (local)

Hadoop MapReduce: 11:42.41
Pig: 10:09.30
PySpark: 4:07.72
JavaSpark: 1:00.63
Spark: 58.861
Spark (8 cores): 53.247

(1.4 GB, 32,116,000 lines)

Results (ec2, standalone mode)

JavaSpark: 10m40.727s
Spark: 10m14.702s

(6.5 GB, 160,580,000 lines)

Results (ec2, with 3 nodes)

ScalaSpark: 2m57.832s
JavaSpark: 3m5.509s

(6.5 GB, 160,580,000 lines)

Results (local, 8 cores)

ScalaSpark: 1m40.07s
JavaSpark: 2m12.98s
PySpark: 11m01.30s

1 core:

JavaSpark: 5m12.84s
ScalaSpark: 5m11.33s
PySpark: 21m26.94s

(6.5 GB, 160,580,000 lines)

Generate dummy data

for i in {1..1000}; do cat input/pg5000.txt >> input/dummy ; done
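
A quick sanity check that the generated file matches the sizes quoted in the results above (paths assumed to match the loop):

# verify size and line count of the generated input
du -h input/dummy
wc -l input/dummy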

Pig

export JAVA_HOME=/usr/lib/jvm/default
pig -f wordcount.py
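
The local benchmarks above presumably ran against the local filesystem rather than HDFS; if so, Pig's local execution mode can be selected explicitly:

# run in local mode (no HDFS/MapReduce cluster required)
pig -x local -f wordcount.py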

Hadoop

# compile
mkdir wc_classes
javac -cp /path/to/hadoop-core-0.20.205.0.jar -d wc_classes WordCount.java
jar -cvf wordcount.jar -C wc_classes .

# submit it to hadoop
hadoop jar wordcount.jar WordCount ../input/dummy out
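
To inspect the result, the reducer output can be sorted by count. This is only a sketch: it assumes Hadoop ran in local (standalone) mode and wrote a single reducer file, whose name may be part-00000 with the old API:

# show the 10 most frequent words
sort -k2 -nr out/part-r-00000 | head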

PySpark

spark-submit wordcount.py input/dummy out/py

or directly via

./wordcount.py input/dummy out/py
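
Running the script directly assumes it carries a suitable shebang and has the executable bit set:

chmod +x wordcount.py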

JavaSpark

mvn package
spark-submit --class WordCount target/word-counting-1.0.jar ../input/dummy out
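
The 1-core and 8-core local results above were presumably obtained by pinning Spark's local master to a fixed number of threads; a hedged example using local[N]:

# run on exactly one local core (use local[8] for eight)
time spark-submit --master "local[1]" --class WordCount target/word-counting-1.0.jar ../input/dummy out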

ScalaSpark

sbt package
spark-submit --class WordCount target/scala-2.11/word-counting_2.11-1.0.jar ../input/dummy out
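
For the EC2 standalone-cluster runs, the job additionally needs the cluster's master URL; a sketch assuming the default standalone port and a hypothetical host name spark-master:

# submit to a standalone cluster (replace spark-master with the actual master host)
spark-submit --master spark://spark-master:7077 --class WordCount target/scala-2.11/word-counting_2.11-1.0.jar ../input/dummy out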