HDFS stores a file as a sequence of blocks, where every block is the same size (except the last block, which may be smaller). The replication factor dictates how many copies of each block are kept in the cluster.
Interview Q1> What is the default replication factor?
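The block layout described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not HDFS code: the helper name `split_into_blocks` is hypothetical, and the 128 MB block size is an assumption (it is the usual default in Hadoop 2 and later and is configurable through dfs.blocksize).

```python
# Sketch: how a file of a given size maps onto fixed-size blocks.
# Assumption: 128 MB blocks; the real value comes from dfs.blocksize.
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of block lengths (in bytes) for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # Only the last block may be smaller than the block size.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300 * MB)
print(len(blocks), [size // MB for size in blocks])  # 3 [128, 128, 44]
```

A 300 MB file therefore occupies two full blocks and one 44 MB block; the short last block does not take up a full 128 MB on disk.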
The default is 3.
Interview Q2> Why do we need to keep copies of blocks in the cluster?
HDFS is built to run on commodity hardware, where failures are expected and could result in loss of data. To keep data highly available, blocks are replicated so that if one replica is lost the data can be fetched from another.
Interview Q3> Can we change the replication factor?
Yes, we can.
(i) To change the replication factor of a single file, use the command:
hdfs dfs -setrep -w 4 <File Path>   # replication factor set to 4
(ii) To change the replication factor of all files inside a folder:
hdfs dfs -setrep -R -w 2 <Folder Path>   # replication factor set to 2
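(iii) To change the default replication factor for the entire HDFS cluster, set dfs.replication in the hdfs-site.xml file, putting the desired factor between the value tags. The new default applies to files created afterwards; existing files keep their current factor until it is changed with setrep.
<property>
  <name>dfs.replication</name>
  <value></value>
  <description>Block Replication</description>
</property>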
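For scripting the per-file and per-folder cases, the setrep call can be wrapped in a small Python helper. This is a minimal sketch: the function name `setrep_command` and the example path are hypothetical, and actually running the command assumes the `hdfs` client is installed and on the PATH. It relies on the documented behaviour that setrep applied to a directory changes every file under it.

```python
import subprocess


def setrep_command(path, factor, wait=True):
    """Build the argument list for 'hdfs dfs -setrep'.

    Options are placed before the replication factor, which is where the
    HDFS shell expects them. With wait=True the -w flag makes the command
    block until the target replication is reached.
    """
    if factor < 1:
        raise ValueError("replication factor must be at least 1")
    cmd = ["hdfs", "dfs", "-setrep"]
    if wait:
        cmd.append("-w")
    cmd += [str(factor), path]
    return cmd


# Hypothetical path, used only for illustration.
cmd = setrep_command("/data/logs", 2)
print(" ".join(cmd))  # hdfs dfs -setrep -w 2 /data/logs

# To execute it on a machine with the hdfs client available:
# subprocess.run(cmd, check=True)
```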
Interview Q4> Can we change the replication factor of a file once it has been assigned?
Yes. The replication factor can be set when the file is created and changed later.
Interview Q5> Are all the replicas stored in the same rack?
No. With the default replication factor of 3, the replicas are spread over two racks: two replicas are placed on different nodes of one rack and the third is placed on a node in a different rack. Replicas are not all placed on different racks because that would increase inter-rack write traffic, and they are not all placed on one rack because a single rack failure would then lose every copy.
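The two-rack placement can be illustrated with a toy simulation. This is a simplified sketch, not the NameNode's actual placement code: it assumes the variant in which the first replica goes on the writer's node and the other two go on different nodes of one remote rack, and the function name, rack names, and node names are all hypothetical.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes for one block's replicas.

    topology maps a rack name to the list of node names on that rack.
    Assumed policy: replica 1 on the writer's node, replicas 2 and 3 on
    two different nodes of a single remote rack.
    """
    node_rack = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = node_rack[writer_node]
    # A remote rack needs at least two nodes to hold replicas 2 and 3.
    remote_racks = [rack for rack, nodes in topology.items()
                    if rack != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack1": ["node1", "node2", "node3"],
    "rack2": ["node4", "node5", "node6"],
}
replicas = place_replicas(topology, "node1")
print(replicas)  # three distinct nodes: node1 plus two nodes from rack2
```

Only one of the three copies has to cross racks on its own; the third replica is written to a neighbour on the same remote rack, which is what keeps inter-rack traffic down while still surviving the loss of a whole rack.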
Interview Q6> Is there any impact if we increase the replication factor to a higher number such as 5?
1. There could be unnecessary strain on the cluster, since every extra replica consumes disk space and network bandwidth when it is written.
2. More replicas give better reliability, but the metadata for every replica has to be tracked by the NameNode, which may affect NameNode performance.
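The cost of a higher factor can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions: a 128 MB block size, a hypothetical helper named `replication_cost`, and a model that counts only raw bytes and block replicas while ignoring other overheads.

```python
MB = 1024 * 1024
GB = 1024 * MB


def replication_cost(file_size, replication, block_size=128 * MB):
    """Estimate the storage and block-replica count for one file."""
    blocks = -(-file_size // block_size)  # ceiling division on integers
    return {
        "blocks": blocks,                       # logical blocks in the file
        "replicas": blocks * replication,       # block replicas the NameNode tracks
        "raw_bytes": file_size * replication,   # disk space used across the cluster
    }


for factor in (3, 5):
    cost = replication_cost(1 * GB, factor)
    print(factor, cost["replicas"], cost["raw_bytes"] // GB)
# 3 24 3
# 5 40 5
```

Under these assumptions, moving a 1 GB file from a factor of 3 to 5 raises its footprint from 3 GB to 5 GB of raw disk and from 24 to 40 block replicas to track.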