Hadoop HDFS dependency


In the Hadoop MapReduce programming model, when processing files, is it mandatory to keep the files in the HDFS file system, or can they be kept in other file systems while still getting the benefits of the MapReduce programming model?

Mappers read their input data through an implementation of InputFormat. Most implementations descend from FileInputFormat, which reads data from the local machine or from HDFS. (By default, data is read from HDFS, and the results of a MapReduce job are stored in HDFS as well.) You can write a custom InputFormat when you want the data to be read from an alternative data source rather than HDFS.
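For illustration, here is a minimal sketch of the default, HDFS-backed setup: a map-only job driver configured with TextInputFormat (a FileInputFormat subclass), reading from and writing to hypothetical HDFS paths.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class HdfsInputJob {

    // Pass-through mapper: emits each input line unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "hdfs-input-example");
        job.setJarByClass(HdfsInputJob.class);

        // TextInputFormat descends from FileInputFormat and reads from HDFS
        // by default; the paths below are hypothetical HDFS locations.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job for brevity
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```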

For example, TableInputFormat reads data records directly from HBase, and DBInputFormat accesses data in relational databases. You could even imagine a system where data is streamed to each machine over the network on a particular port; an InputFormat would read the data from that port and parse it into individual records for mapping.
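As a rough sketch of the relational case, the following configures a job to read rows through DBInputFormat. The JDBC driver, connection string, credentials, table, and column names are all hypothetical, and LogRecord is a made-up DBWritable class that maps one row to Java fields.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbInputJobConfig {

    // Hypothetical record class: maps one row of an "access_log" table.
    public static class LogRecord implements Writable, DBWritable {
        String url;
        long bytes;

        public void readFields(ResultSet rs) throws SQLException {
            url = rs.getString("url");
            bytes = rs.getLong("bytes");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setString(1, url);
            ps.setLong(2, bytes);
        }
        public void readFields(DataInput in) throws IOException {
            url = in.readUTF();
            bytes = in.readLong();
        }
        public void write(DataOutput out) throws IOException {
            out.writeUTF(url);
            out.writeLong(bytes);
        }
    }

    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, connection string, and credentials (all hypothetical).
        DBConfiguration.configureDB(conf,
                "com.mysql.jdbc.Driver",
                "jdbc:mysql://db.example.com/logs",
                "username", "password");

        Job job = Job.getInstance(conf, "db-input-example");
        job.setInputFormatClass(DBInputFormat.class);

        // Read the "url" and "bytes" columns of access_log, ordered by url.
        DBInputFormat.setInput(job, LogRecord.class,
                "access_log",    // table name
                null,            // optional WHERE conditions
                "url",           // ORDER BY column
                "url", "bytes"); // columns to read
        return job;
    }
}
```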

In your case, however, the data sits in an ext4 filesystem on one or more servers. To access it conveniently within Hadoop, you would have to copy it into HDFS first. That way you also benefit from data locality when the file chunks are processed in parallel, as in the sketch below.
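A small sketch of that copy step using the Hadoop FileSystem API (the local and HDFS paths are hypothetical; on the command line, `hadoop fs -put` does the same thing):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Uses fs.defaultFS from the cluster configuration, e.g. hdfs://namenode:8020.
        FileSystem hdfs = FileSystem.get(conf);

        // Hypothetical paths: a directory on the local ext4 filesystem and
        // a target directory in HDFS.
        Path localData = new Path("file:///data/logs");
        Path hdfsTarget = new Path("/user/example/logs");

        // Copy the local data into HDFS (don't delete the source, overwrite the target).
        hdfs.copyFromLocalFile(false, true, localData, hdfsTarget);
        hdfs.close();
    }
}
```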

I suggest reading the Yahoo! tutorial on this topic for detailed information. For collecting log files for MapReduce processing, also take a look at Flume.

