What is Hadoop?
Hadoop is a system for storing and processing data, but it is different from, and more scalable than, a traditional database such as a SQL database
A normal SQL database requires “schema on write”
If database B is expecting data from database A – database B must know the data types etc. that are being transferred
In Hadoop, there is no “preconfiguring”.
The data from A to B is just sent; the rules are applied by the code that reads the data.
This is called “schema on read”
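The difference can be sketched in a few lines of Python (the record layout and field names here are hypothetical, purely for illustration): the data is stored as plain text with no declared schema, and the schema lives entirely in the code that reads it.

```python
# "Schema on read" sketch: raw data is stored exactly as it arrived.
raw_records = [
    "david,42,london",
    "alice,37,paris",
]

def read_with_schema(line):
    """Apply the schema only at read time: the reading code decides
    the field names and types, not the storage layer."""
    name, age, city = line.split(",")
    return {"name": name, "age": int(age), "city": city}

parsed = [read_with_schema(line) for line in raw_records]
print(parsed[0]["age"] + 1)  # types exist only after reading: prints 43
```

In a schema-on-write system, the `int` conversion and the three-column layout would have to be declared before any data could be loaded at all.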
In a SQL database, the data is stored in a logical form, with related tables and columns.
In Hadoop, the data is stored as compressed files, distributed and replicated among different nodes/servers.
To search 100 servers for all data containing the word “David”:
The query is sent via a Java program that runs across the replicated data
Instead of each copy conducting the same search, each node searches only its own portion
The partial answers are then sent separately to a ‘reducer’, which combines them into one result
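The search described above can be simulated in a few lines of plain Python (a toy model with no actual Hadoop involved; the sample data, the word “David”, and the function names are purely illustrative):

```python
# Each "node" holds a portion of the data.
nodes = [
    ["David went home", "The weather is fine"],  # portion on node 1
    ["David and Goliath"],                       # portion on node 2
    ["Nothing to see here"],                     # portion on node 3
]

def map_search(portion, word):
    # Map step: each node counts matches in its own portion only.
    return sum(line.count(word) for line in portion)

def reduce_counts(partial_counts):
    # Reduce step: combine the separate partial answers into one result.
    return sum(partial_counts)

partials = [map_search(portion, "David") for portion in nodes]
total = reduce_counts(partials)
print(total)  # prints 2
```

In real Hadoop the map steps run in parallel on the machines that already hold the data, so no bulk data has to move across the network; only the small partial answers travel to the reducer.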
If one server broke, Hadoop would still send an answer to the user.
If one server broke across a SQL network, SQL would not send any data: it adheres to “two-phase commit”, which does not allow a partial result.
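The contrast can be sketched as a toy model in Python (node names and counts are illustrative; real Hadoop actually re-runs a failed task on another node holding a replica of the same data, rather than simply dropping it):

```python
# Partial search results per node; node2 has failed mid-query.
partial_counts = {"node1": 1, "node2": None, "node3": 0}

# Hadoop-style: route around the broken node and still answer
# from the surviving replicas.
hadoop_answer = sum(v for v in partial_counts.values() if v is not None)

# Two-phase-commit-style: if any participant cannot vote "commit",
# the whole operation aborts and nothing is returned.
sql_answer = (None if any(v is None for v in partial_counts.values())
              else sum(partial_counts.values()))

print(hadoop_answer, sql_answer)  # prints: 1 None
```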
SQL is better suited to transactions that need 100% accuracy all of the time
Hadoop is great for searching massive datasets that may be less structured
It is more complicated to retrieve data from Hadoop, as the data is retrieved via a Java program. However, there are now interfaces (such as Apache Hive) that accept SQL
Hadoop can use a number of machines at once to retrieve data more quickly.
It harnesses the power of many machines working together
Hadoop comes from Google white papers (on the Google File System and MapReduce)
It has master components – the NameNode and the JobTracker –
and slave components – the DataNode and the TaskTracker
For more detailed information, check out this Hadoop playlist on YouTube: