Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is an open-source project under the Apache Software Foundation. It makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
Hadoop comprises many different tools. Two of them are core parts of Hadoop:
Hadoop Distributed File System (HDFS) is a virtual file system that looks like any other file system, except that when you move a file onto HDFS, the file is split into many smaller blocks, each of which is replicated and stored on three servers by default (the replication factor is configurable) for fault tolerance.
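The split-and-replicate idea can be pictured with a minimal Java sketch. This is a conceptual model only, not the HDFS client API: the block size, replication factor, node names, and round-robin placement are illustrative assumptions (real HDFS placement is rack-aware).

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of how HDFS splits a file into blocks and replicates
// each block across several DataNodes. Not a real HDFS API.
public class HdfsBlockSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB per block (illustrative)
    static final int REPLICATION = 3;                   // copies kept of each block

    // Returns, for each block of the file, the DataNodes holding a replica.
    static List<List<String>> placeBlocks(long fileSize, List<String> dataNodes) {
        List<List<String>> placement = new ArrayList<>();
        long blocks = (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        for (long b = 0; b < blocks; b++) {
            List<String> replicas = new ArrayList<>();
            for (int r = 0; r < REPLICATION; r++) {
                // Round-robin placement; real HDFS placement is rack-aware.
                replicas.add(dataNodes.get((int) ((b + r) % dataNodes.size())));
            }
            placement.add(replicas);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        // A 300 MB file needs 3 blocks: two full 128 MB blocks plus the remainder.
        List<List<String>> layout = placeBlocks(300L * 1024 * 1024, nodes);
        System.out.println(layout.size() + " blocks, first block on " + layout.get(0));
    }
}
```

Because every block lives on several servers, losing one node leaves at least two replicas of each of its blocks available elsewhere.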
Hadoop MapReduce is a way to split every request into smaller requests that are sent to many small servers, allowing truly scalable use of CPU power (describing MapReduce would be worth a dedicated post).
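To make the split-and-combine idea concrete, here is the classic word count expressed as map and reduce phases in plain Java, with no Hadoop dependencies. It is a single-process sketch of the pattern a real MapReduce job distributes: the map phase runs independently on each input split, and the reduce phase combines the grouped results.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Word count as map + reduce phases, single-process sketch.
public class WordCountSketch {
    // Map phase: each line independently emits (word, 1) pairs,
    // so lines can be processed in parallel on many servers.
    static List<SimpleEntry<String, Integer>> map(String line) {
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce phase: pairs grouped by key are summed per word.
    static Map<String, Integer> reduce(List<SimpleEntry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (SimpleEntry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hadoop stores data", "hadoop processes data");
        List<SimpleEntry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line)); // in Hadoop: one mapper per split
        System.out.println(reduce(pairs));
    }
}
```

In a real Hadoop job the same two functions are implemented as Mapper and Reducer classes, and the framework handles distributing the splits and shuffling the pairs between them.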
Some other components are often installed on Hadoop solutions:
HBase is inspired by Google’s BigTable. HBase is a non-relational, scalable, and fault-tolerant database layered on top of HDFS. It is written in Java. Each row is identified by a key and consists of an arbitrary number of columns that can be grouped into column families.
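The row-key / column-family layout can be pictured as nested sorted maps. This is a sketch of the data model only, not the HBase client API; the table contents (a hypothetical users table) are made up for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

// Conceptual model of an HBase table:
//   rowKey -> columnFamily -> columnQualifier -> value
// Rows are kept sorted by key; each row may hold a different set of columns.
public class HBaseModelSketch {
    static Map<String, Map<String, Map<String, String>>> buildTable() {
        Map<String, Map<String, Map<String, String>>> table = new TreeMap<>();

        Map<String, Map<String, String>> row = new TreeMap<>();
        row.put("info", new TreeMap<>(Map.of("name", "Alice", "email", "alice@example.com")));
        row.put("activity", new TreeMap<>(Map.of("lastLogin", "2013-05-01")));
        table.put("user#42", row);

        // Rows are sparse: another row may have entirely different columns
        // within the same column family.
        Map<String, Map<String, String>> sparse = new TreeMap<>();
        sparse.put("info", new TreeMap<>(Map.of("name", "Bob")));
        table.put("user#7", sparse);

        return table;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Map<String, String>>> t = buildTable();
        System.out.println(t.get("user#42").get("info").get("name")); // prints Alice
    }
}
```

Column families are declared when the table is created, but the columns inside a family are not, which is why each row can carry an arbitrary number of them.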
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. ZooKeeper is used by HBase, and can be used by MapReduce programs.
Solr / Lucene serve as the search engine. Apache has been developing this query-engine library for more than 10 years.
Languages. Two languages are identified as original Hadoop languages: Pig and Hive. They let you develop MapReduce processes at a higher level of abstraction than raw MapReduce procedures. Other languages may be used, such as C, Java, or JAQL. SQL can also be used, through JDBC or ODBC connectors (or directly within those languages).
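To illustrate the abstraction gain, the word count that takes a full MapReduce program in Java collapses to a single declarative HiveQL statement, shown here as a Java string constant as it might be sent over a JDBC connection. The table name `docs` and column `line` are hypothetical, and the connection code itself is omitted.

```java
public class HiveQuerySketch {
    // One HiveQL statement replacing an entire MapReduce word-count job;
    // Hive compiles it down to MapReduce behind the scenes.
    // Table `docs` and column `line` are hypothetical.
    static final String WORD_COUNT_HQL =
        "SELECT word, COUNT(*) AS cnt "
      + "FROM (SELECT explode(split(line, ' ')) AS word FROM docs) words "
      + "GROUP BY word";

    public static void main(String[] args) {
        // In practice this string would be passed to a JDBC/ODBC connection,
        // e.g. via java.sql.Statement.executeQuery(WORD_COUNT_HQL).
        System.out.println(WORD_COUNT_HQL);
    }
}
```

This declarative style is what the MicroStrategy integration described below builds on: the BI tool generates the HiveQL so the user never has to.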
Integration of MicroStrategy with Hadoop
Cloudera and MicroStrategy have collaborated to develop a powerful and easy-to-use BI framework for Apache Hadoop by creating a connection between MicroStrategy 9 and CDH (Cloudera’s Distribution Including Apache Hadoop). This connection is established via an Open Database Connectivity (ODBC) driver for Apache Hive and is available as the Cloudera Connector for MicroStrategy.
The connector allows business users to perform sophisticated point-and-click analytics on data stored in Hadoop directly from MicroStrategy applications – just as they do on data stored in data warehouses, data marts, and operational databases. MicroStrategy has developed Very Large Database (VLDB) drivers specifically for Cloudera that generate optimized queries for Cloudera’s Distribution Including Apache Hadoop. This lets users run queries via MicroStrategy’s visual interface without writing unfamiliar HiveQL or MapReduce scripts. In essence, any user, without Hadoop programming skills, can ask questions against vast volumes of structured and unstructured data to gain valuable business insights.