What I like about Hadoop

A list of things I have found interesting in Hadoop MapReduce.

Hadoop MapReduce has been around for quite some time now, but I would disagree with the popular belief that we should always favor Spark over Hadoop.

1. Why choose Hadoop over Spark?
In my opinion there are a few things one should think about before jumping into Spark: first, how much memory you will be able to get for your Spark cluster; second, how time-sensitive your jobs are; and lastly, how well you know Java/Scala.
If you are okay with a batch approach, do not have much money to spend on resources, and are already comfortable with Java, I would recommend trying Hadoop before Spark. First and foremost, Hadoop focuses on linear batch processing with carefully tuned partitioners, combiners, and reducers. Spark, on the other hand, is more generic and gives up some of the low-level control that Hadoop still offers.
The other thing that might convince you to use Hadoop over Spark is your application itself. If you do not need a large catalog of machine learning libraries and only want to run map and reduce operations in the most cost-efficient way, Hadoop is probably the way to go. Spark is the more modern option and, given enough resources, is far superior to Hadoop, but if you are not Google or a Fortune 500 company, you are probably short on money, and Hadoop lets you work with that: it leans on read/write operations, so most of your processing depends on disk speed rather than on how much RAM you can afford. Additionally, Hadoop is easy to set up initially and integrates well with Kafka and Java, which also gives you some speed-up compared to Spark used from Python.

2. So what makes Hadoop so nice?
I personally like that you can control the order in which values are processed within a single key during the reduce phase. With a simple trick known as a secondary sort you can have the values for each key arrive already sorted, which lets you build a fast streaming pipeline inside the reducer (see the sketch below). To achieve the same result in Spark you would typically have to group the values by key and sort them in memory, which uses considerably more space and RAM.
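Here is a minimal sketch of how such a secondary sort can be wired up. The EventKey class, its userId/timestamp fields, and the comparator names are hypothetical and only for illustration; the pattern itself is the standard Hadoop one: put the "order by" field into the key, partition and group on the natural key only, and the shuffle delivers each reducer's values already sorted.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSetup {

    // Composite key: natural key (userId) plus the field the values should be ordered by.
    public static class EventKey implements WritableComparable<EventKey> {
        public long userId;
        public long timestamp;

        public void write(DataOutput out) throws IOException {
            out.writeLong(userId);
            out.writeLong(timestamp);
        }

        public void readFields(DataInput in) throws IOException {
            userId = in.readLong();
            timestamp = in.readLong();
        }

        // Full sort order used by the shuffle: by userId first, then by timestamp.
        public int compareTo(EventKey o) {
            int c = Long.compare(userId, o.userId);
            return c != 0 ? c : Long.compare(timestamp, o.timestamp);
        }
    }

    // Partition on userId only, so every record of one user ends up on the same reducer.
    public static class UserPartitioner extends Partitioner<EventKey, Text> {
        @Override
        public int getPartition(EventKey key, Text value, int numPartitions) {
            return (Long.hashCode(key.userId) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reducer input on userId only, so one reduce() call sees the whole
    // timestamp-ordered stream of values for that user.
    public static class UserGroupingComparator extends WritableComparator {
        public UserGroupingComparator() {
            super(EventKey.class, true);
        }

        @Override
        @SuppressWarnings("rawtypes")
        public int compare(WritableComparable a, WritableComparable b) {
            return Long.compare(((EventKey) a).userId, ((EventKey) b).userId);
        }
    }

    // Wiring it into a job driver.
    static void configure(Job job) {
        job.setMapOutputKeyClass(EventKey.class);
        job.setPartitionerClass(UserPartitioner.class);
        job.setGroupingComparatorClass(UserGroupingComparator.class);
    }
}
```

The nice part is that the reducer never has to buffer a user's values: they simply stream in already ordered, while Spark's groupByKey would have to materialize the whole group first.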
The other thing I really like about Hadoop is how it forces you to think about the low-level operations: maps, partitions, combiners, and reducers. You really understand what your code is doing. And because you are more or less forced to use Java, you tend to write more efficient code.
The next thing I really like about Hadoop is SequenceFiles. These are simple binary files that store key/value records in sequence. The best thing about them is that they can be heavily optimized for on-disk storage and sequential processing. If you design your pipeline well, the only thing throttling you will be the file system, which can itself be tuned further with connected pipelines and RAM disks (given enough memory).
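As a quick illustration, here is roughly what writing and then reading a SequenceFile looks like with the standard org.apache.hadoop.io.SequenceFile API (the file path and record contents are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("demo.seq"); // hypothetical path, local or on HDFS

        // Write key/value records as a flat, block-compressed binary stream.
        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK));
        try {
            for (long i = 0; i < 1000; i++) {
                writer.append(new LongWritable(i), new Text("record-" + i));
            }
        } finally {
            IOUtils.closeStream(writer);
        }

        // Read the records back in the same sequence they were written.
        SequenceFile.Reader reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
        try {
            LongWritable key = new LongWritable();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```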
Additionally, Hadoop strongly encourages the use of HDFS, which is a great file system for big data, especially if you want Java/Python interoperability, which you can get through Apache Parquet files. HDFS also gives you replication guarantees, so you do not have to set up any RAID replication yourself.
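As a small illustration of that interoperability, here is roughly what writing a Parquet file from Java looks like using the parquet-avro bindings (the schema, path, and values here are made up); the resulting file can then be read from Python with, for example, pyarrow or pandas.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        // Avro schema describing the records; Parquet stores them in a columnar layout.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
                + "{\"name\":\"userId\",\"type\":\"long\"},"
                + "{\"name\":\"value\",\"type\":\"double\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Write a couple of records; the same file is readable from Python tooling.
        try (ParquetWriter<GenericRecord> writer =
                     AvroParquetWriter.<GenericRecord>builder(new Path("events.parquet"))
                             .withSchema(schema)
                             .build()) {
            for (long i = 0; i < 10; i++) {
                GenericRecord record = new GenericData.Record(schema);
                record.put("userId", i);
                record.put("value", i * 0.5);
                writer.write(record);
            }
        }
    }
}
```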

3. Writing jobs in Hadoop MapReduce
At my previous job (Cloud Technologies) we depended heavily on Hadoop, and most of our jobs were so easy to configure that even someone not very good at Java could write new extraction jobs. Most of this comes from the fact that Hadoop MapReduce is so well structured: you have at most four methods to implement, and the configuration is minimal and hard to mess up. With Spark, by contrast, a misconfigured job will simply die from lack of memory on an executor or the driver, or from some random exception that is not handled well. A minimal job looks roughly like the classic WordCount below.
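This is essentially the canonical WordCount example: one map method, one reduce method (reused as a combiner), and a small driver. Everything else is defaults.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word; also used as a map-side combiner.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```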

4. Summary
To quickly summarize, I would recommend Hadoop MapReduce to anyone who is already proficient in Java and does not have the resources to use Spark efficiently (Spark handles disk-heavy workloads rather poorly).
If you do not have the money to invest in Spark infrastructure, it is probably better to spend it on developers or training so you can still get the benefits of Hadoop and distributed computing.
If neither of the two points above applies to you, you should probably use Spark, as it is much easier to use from Python and can be much faster for fully in-memory pipelines.
Anyway, if you have any questions about how to use Hadoop MapReduce or Spark efficiently, feel free to contact me.