I'm learning more and more about distributed computing as I'm playing with larger and larger volumes of data. I understand a lot of the concepts behind the popular distributed computing platform
Hadoop, but have no actual experience deploying it. I'm diving into all the building blocks of
Hadoop one by one as I have time:
- Hadoop Common - The common utilities that support the other Hadoop subprojects.
- Chukwa - A data collection system for managing large distributed systems.
- HBase - A scalable, distributed database that supports structured data storage for large tables.
- HDFS - A distributed file system that provides high throughput access to application data.
- Hive - A data warehouse infrastructure that provides data summarization and ad hoc querying.
- MapReduce - A software framework for distributed processing of large data sets on compute clusters.
- Pig - A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper - A high-performance coordination service for distributed applications.
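Of all of those, MapReduce is the piece that clicked for me conceptually. As a sketch (not how you'd actually deploy a Hadoop job, which is typically written in Java against the Hadoop API), here's the map/shuffle/reduce shape of the classic word-count example, simulated locally in Python with made-up sample documents:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

The key insight for me was that each phase is embarrassingly parallel: maps run on separate chunks of input, and reduces run on separate keys, so the framework can spread the work across a cluster without the programmer thinking about coordination.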
Tonight, a post that really opened my eyes to some of the features and the approach Hadoop takes was
A High Level Comparison of Hadoop and Dryad. Not sure why, but the comparison of Hadoop's approach using
MapReduce and Microsoft's approach using a
Directed Acyclic Graph has got me reading more. I think it's because of my history with Windows Server and SQL Server, and being a recovering Microsoft Kool-Aid drinker, that this comparison is shedding light on Hadoop for me.
It appears that Microsoft is attempting to take on the
Hadoop community with its
Dryad Project. I'm way behind in understanding and applying
Hadoop, but Josh Patterson's comparison has pushed me into learning more. Thanks Josh.
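What helped the DAG model sink in for me is that it generalizes MapReduce's fixed map → shuffle → reduce shape: an engine like Dryad can wire arbitrary processing stages together and run any stage whose inputs are ready. Here's a conceptual toy (hypothetical stage names, nothing Dryad-specific) showing a job expressed as a dependency graph and scheduled in a valid execution order:

```python
from graphlib import TopologicalSorter

# Hypothetical stages of a query plan: each key depends on the stages
# in its set. A DAG engine can run filter_a and filter_b in parallel,
# then join, then aggregate -- a shape MapReduce can't express directly
# without chaining multiple jobs.
dag = {
    "filter_a": {"read_a"},
    "filter_b": {"read_b"},
    "join": {"filter_a", "filter_b"},
    "aggregate": {"join"},
}

order = list(TopologicalSorter(dag).static_order())
print(order)  # reads first, filters next, then join, then aggregate
```

MapReduce trades that flexibility for simplicity, which (as I read it) is part of what Josh's comparison is getting at.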