When a file is stored in HDFS, Hadoop breaks it into blocks before storing it. In other words, when you store a large file, Hadoop splits it into smaller chunks based on a predefined block size and stores those chunks on Data Nodes across the cluster. The default block size is 128 MB, but this can be configured.
how files are split into Blocks in HDFS
We just read that when HDFS receives a big file, it breaks the file into blocks based on the predefined block size. Let's say the predefined block size is 128 MB; in that case, let's see how a file of size 600 MB is stored.
File Size : 600 MB
Block Size : 128 MB
Number of blocks : UpperLimit(File Size / Block Size)
UpperLimit(600/128) = UpperLimit(4.69) = 5 blocks
Size of each block : Block1 (128 MB), Block2 (128 MB), Block3 (128 MB), Block4 (128 MB), Block5 (88 MB)
Note that HDFS uses only as much space as needed. In this example Block5 is just 88 MB, so HDFS allocates only 88 MB of space for it, not 128 MB.
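The splitting arithmetic above can be sketched in a few lines of Python (split_into_blocks is an illustrative helper name, not part of any Hadoop API):

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=128):
    """Sizes (in MB) of the HDFS blocks for a file, mirroring the
    UpperLimit(File Size / Block Size) rule described above."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # 5 for a 600 MB file
    sizes = [block_size_mb] * (num_blocks - 1)
    # the last block occupies only as much space as is left over
    sizes.append(file_size_mb - block_size_mb * (num_blocks - 1))
    return sizes

print(split_into_blocks(600))  # → [128, 128, 128, 128, 88]
```

Running this for a 600 MB file reproduces the five blocks shown above, with the last block taking only the remaining 88 MB.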
If your file is smaller than the HDFS block size, it will not be split at all. This does not happen often, since Hadoop is typically used to process Big Data files that run into terabytes.
how to change Default Block size in HDFS
The default block size in HDFS was 64 MB in Hadoop 1.0 and is 128 MB from Hadoop 2.0 onwards. The block size can be changed for an entire cluster or configured for specific files. We will look at both scenarios below.
To change Block Size settings for a Cluster
To change the HDFS block size for the entire cluster, we need to update the dfs.block.size property (dfs.blocksize in newer versions) in the hdfs-site.xml file. Once this change is done, the cluster needs to be restarted for it to take effect. For multi-node clusters this needs to be done on each node (Name Node and Data Nodes). Also note that the size in this property must be specified in bytes, so 128 MB is written as 128 * 1024 * 1024, i.e. 134217728.
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
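As a quick sanity check on the byte value in the property above, the conversion is just a multiplication by powers of 1024 (a small sketch, not Hadoop code):

```python
def mb_to_bytes(mb):
    """Convert a block size in MB to the byte value expected in hdfs-site.xml."""
    return mb * 1024 * 1024

print(mb_to_bytes(128))  # 134217728, the value used in the property above
print(mb_to_bytes(256))  # 268435456, for a 256 MB block
```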
This, however, has no impact on existing blocks. To rewrite existing blocks with the new size, we need to use DistCp (distributed copy), a tool used for large inter/intra-cluster copying. You can learn more about this here.
To change Block Size settings for a specific file
While placing a file in HDFS, we can choose to provide a different block size for that specific file. For example, if we need to place the 600 MB file in an HDFS location where the default block size is 128 MB but we want blocks of size 256 MB for just this file, it can be done as below.
hdfs dfs -Ddfs.blocksize=268435456 -put /home/myfolder/data/test1.text /destpath/destfolder
Remember that block size and block replication factor are two different things. You can read more about Replication Factor here.
why choose bigger block size in HDFS
As we have seen so far, the default block size is 128 MB, which looks huge compared to a typical Linux filesystem block of 4 KB. So this raises the question: why does HDFS go for such large block sizes?
- First, remember that we expect terabytes or petabytes of data to be stored and processed in HDFS. If we used a small block size [maybe in KBs], just imagine the number of blocks that would be created. On top of that, each block's metadata is maintained by the Name Node. As the number of blocks increases, the amount of metadata the Name Node must maintain also increases, which adversely impacts performance.
- Secondly, we want to reduce the relative cost of seek time. Seek time is the time taken to move the disk head to a particular place on the disk to read or write. Let's say the disk transfer rate is 100 MB/sec; then it is preferable to transfer 100 MB of data after each seek rather than 10 MB of data after spending the same amount of seek time.
- Thirdly, we know that for each block a map task runs. If the file is 10 TB in size and the block size is 128 MB, then 81920 blocks will be created. Since there are 81920 unique blocks, the MapReduce framework by default launches 81920 map tasks to process the file. So we need to choose the block size carefully, keeping in mind the computing power of the cluster.
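The trade-off in the third point can be sketched numerically. This is a rough illustration of the default behaviour, where the number of map tasks roughly equals the number of blocks (actual task counts also depend on input-split settings):

```python
import math

TB_IN_MB = 1024 * 1024  # number of MB in one TB

def map_tasks(file_size_mb, block_size_mb):
    """Approximate default number of map tasks = number of blocks."""
    return math.ceil(file_size_mb / block_size_mb)

# how the block size choice changes the task count for a 10 TB file
for block_mb in (64, 128, 256):
    print(block_mb, "MB blocks ->", map_tasks(10 * TB_IN_MB, block_mb), "map tasks")
```

For the 10 TB file above, 128 MB blocks give the 81920 tasks mentioned in the text, while doubling the block size to 256 MB halves the task count.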
So today we learnt what HDFS data blocks are, how files are split into blocks, and what block size is and how to modify its default settings. We also saw that we can set a block size while loading specific files. And lastly, we understood why Hadoop uses a bigger block size.
🙂 kudos for learning something new 🙂