After finishing a fully distributed Hadoop deployment, I ran start-dfs.sh to start the NameNode, DataNode, and SecondaryNameNode. Running jps on the master node showed that NameNode and SecondaryNameNode were up, and running jps on the slave node showed that DataNode was up as well. (I was quite pleased at this point: my first fully distributed deployment had worked on the first try. But...)
1 Discovering the problem
I prepared the input files for the WordCount test case and uploaded them with hadoop fs -put file /, only to get an error. This was a command I had run countless times in single-node pseudo-distributed mode >_<
14/10/13 14:40:25 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /input/test._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1441)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2702)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:584)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:440)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
    at org.apache.hadoop.ipc.Client.call(Client.java:1410)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:361)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1439)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1261)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:525)
put: File /input/test._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
Look closely at the error: "could only be replicated to 0 nodes". Zero nodes, of all things.
Checking node status in the web UI at http://master:50070 showed Live Nodes at 0 and Dead Nodes at 0 as well. So although the DataNode process on the slave node had started normally, it was failing to communicate with master. The DataNode log on the slave confirmed this with repeated connection failures:
2014-10-13 14:04:46,379 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
2014-10-13 14:04:46,496 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020
2014-10-13 14:04:46,531 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2014-10-13 14:04:46,856 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2014-10-13 14:04:46,905 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (Datanode Uuid unassigned) service to master/192.168.1.132:9000 starting to offer service
2014-10-13 14:04:46,970 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2014-10-13 14:04:46,967 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2014-10-13 14:04:51,373 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-13 14:04:52,375 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-13 14:04:53,376 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
2014-10-13 14:05:00,394 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-13 14:05:00,396 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: master/192.168.1.132:9000
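When a DataNode log is long, it can help to pull out just the endpoint that the IPC client keeps retrying. The following is a minimal sketch, not part of the original troubleshooting; the function name failing_endpoints and the regular expression are illustrative assumptions based only on the log format shown above.

```python
import re

# Matches the "Retrying connect to server" lines in the format shown in the log above.
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)\. "
    r"Already tried (?P<tried>\d+) time\(s\)"
)

def failing_endpoints(log_text):
    """Return {(host, ip, port): highest 'Already tried' count seen}."""
    worst = {}
    for m in RETRY_RE.finditer(log_text):
        key = (m.group("host"), m.group("ip"), int(m.group("port")))
        worst[key] = max(worst.get(key, 0), int(m.group("tried")))
    return worst

# Two lines copied from the log above.
sample = (
    "2014-10-13 14:04:51,373 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: "
    "master/192.168.1.132:9000. Already tried 0 time(s); retry policy is "
    "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)\n"
    "2014-10-13 14:05:00,394 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: "
    "master/192.168.1.132:9000. Already tried 9 time(s); retry policy is "
    "RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)\n"
)

print(failing_endpoints(sample))
# prints {('master', '192.168.1.132', 9000): 9}
```

Here the only endpoint that ever fails is master on port 9000, which is the port to focus on in the analysis.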
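2 Analyzing the problem
The log shows that once the DataNode on the slave had started, it tried to connect to master on port 9000 to register and send heartbeats. Each round made 10 attempts, all of which failed, after which it kept retrying.
I rechecked all the configuration: the /etc/hosts file, the slaves file, passwordless SSH login, the environment variables in .bashrc, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. Everything checked out; the configuration was correct.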
Analyzing further: the start request from master reached the slave and the slave started its DataNode, yet the slave could not get its heartbeats back to master. I verified with ping that master and slave could reach each other, so basic network connectivity between the two nodes was fine.
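A successful ping only shows that ICMP gets through; it says nothing about whether a particular TCP port accepts connections. A more direct check is to attempt a TCP connection to the NameNode RPC port from the slave. The sketch below is an illustration rather than something the original troubleshooting used, and the function name tcp_port_reachable is hypothetical.

```python
import socket

def tcp_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, unreachable hosts,
        # and name-resolution failures.
        return False
```

Run on the slave, tcp_port_reachable("master", 9000) would be expected to return False while the problem persists, even though ping succeeds.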
That left only one possibility: the firewall. As Sherlock Holmes said, once you have eliminated everything that is impossible, whatever remains is the answer.
3 Solving the problem
As root, check whether the firewall on the master node is running:
/etc/init.d/iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
2    ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
4    ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           state NEW tcp dpt:22
5    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination
1    REJECT     all  --  0.0.0.0/0            0.0.0.0/0           reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
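Reading the INPUT chain from top to bottom explains the symptoms: iptables applies the first rule that matches, so ICMP (ping) and new SSH connections are accepted, while a new TCP connection to port 9000 falls through to the final REJECT. The toy model below sketches that first-match logic. It is a simplification, not a real packet filter. It also assumes that rule 3 applies only to the loopback interface, which this listing format does not show but which is typical of the stock rule set on such systems, and the interface name eth0 is likewise an assumption.

```python
def first_match(packet, rules, policy="ACCEPT"):
    """Return (rule number, target) of the first matching rule, else the chain policy."""
    for num, target, matches in rules:
        if matches(packet):
            return num, target
    return None, policy

# Toy model of the INPUT chain listed above.
INPUT_CHAIN = [
    (1, "ACCEPT", lambda p: p["state"] in ("RELATED", "ESTABLISHED")),
    (2, "ACCEPT", lambda p: p["proto"] == "icmp"),
    # Assumption: this rule is limited to the loopback interface ("-i lo").
    (3, "ACCEPT", lambda p: p["iface"] == "lo"),
    (4, "ACCEPT", lambda p: p["proto"] == "tcp" and p["state"] == "NEW" and p["dport"] == 22),
    (5, "REJECT", lambda p: True),
]

# Hypothetical packets arriving at master from a slave.
ping = {"proto": "icmp", "state": "NEW", "dport": None, "iface": "eth0"}
ssh = {"proto": "tcp", "state": "NEW", "dport": 22, "iface": "eth0"}
namenode_rpc = {"proto": "tcp", "state": "NEW", "dport": 9000, "iface": "eth0"}

for name, pkt in [("ping", ping), ("ssh", ssh), ("namenode_rpc", namenode_rpc)]:
    print(name, first_match(pkt, INPUT_CHAIN))
# prints:
# ping (2, 'ACCEPT')
# ssh (4, 'ACCEPT')
# namenode_rpc (5, 'REJECT')
```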
I simply stopped the firewall with /etc/init.d/iptables stop and then checked the DataNode log on the slave again:
2014-10-13 14:56:23,646 INFO org.apache.hadoop.hdfs.server.common.Storage: Data-node version: -55 and name-node layout version: -56
2014-10-13 14:56:23,662 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/lxh/hadoop/hdfs/data/in_use.lock acquired by nodename 2737@slave2
2014-10-13 14:56:23,790 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-480719120-192.168.1.132-1413164135404
2014-10-13 14:56:23,791 INFO org.apache.hadoop.hdfs.server.common.Storage: Locking is disabled
2014-10-13 14:56:23,791 INFO org.apache.hadoop.hdfs.server.common.Storage: Restored 0 block files from trash.
2014-10-13 14:56:23,793 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Setting up storage: nsid=161264755;bpid=BP-480719120-192.168.1.132-1413164135404;lv=-55;nsInfo=lv=-56;cid=CID-84fef657-142a-46c4-8329-3df7025286a0;nsid=161264755;c=0;bpid=BP-480719120-192.168.1.132-1413164135404;dnuuid=62322351-4f82-42eb-baa4-f3b38675234e
2014-10-13 14:56:23,815 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Added volume - /home/lxh/hadoop/hdfs/data/current, StorageType: DISK
2014-10-13 14:56:23,951 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Registered FSDatasetState MBean
2014-10-13 14:56:23,979 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Periodic Directory Tree Verification scan starting at 1413202418979 with interval 21600000
2014-10-13 14:56:23,984 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding block pool BP-480719120-192.168.1.132-1413164135404
2014-10-13 14:56:23,986 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Scanning block pool BP-480719120-192.168.1.132-1413164135404 on volume /home/lxh/hadoop/hdfs/data/current...
2014-10-13 14:56:24,047 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time taken to scan block pool BP-480719120-192.168.1.132-1413164135404 on /home/lxh/hadoop/hdfs/data/current: 59ms
2014-10-13 14:56:24,047 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to scan all replicas for block pool BP-480719120-192.168.1.132-1413164135404: 62ms
2014-10-13 14:56:24,049 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Adding replicas to map for block pool BP-480719120-192.168.1.132-1413164135404 on volume /home/lxh/hadoop/hdfs/data/current...
2014-10-13 14:56:24,050 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Time to add replicas to map for block pool BP-480719120-192.168.1.132-1413164135404 on volume /home/lxh/hadoop/hdfs/data/current: 0ms
2014-10-13 14:56:24,050 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Total time to add all replicas to map: 3ms
2014-10-13 14:56:24,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-480719120-192.168.1.132-1413164135404 (Datanode Uuid null) service to master/192.168.1.132:9000 beginning handshake with NN
2014-10-13 14:56:24,109 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-480719120-192.168.1.132-1413164135404 (Datanode Uuid null) service to master/192.168.1.132:9000 successfully registered with NN
2014-10-13 14:56:24,109 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode master/192.168.1.132:9000 using DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec CACHEREPORT_INTERVAL of 10000msec Initial delay: 0msec; heartBeatInterval=3000
2014-10-13 14:56:24,249 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Namenode Block pool BP-480719120-192.168.1.132-1413164135404 (Datanode Uuid 62322351-4f82-42eb-baa4-f3b38675234e) service to master/192.168.1.132:9000 trying to claim ACTIVE state with txid=430
2014-10-13 14:56:24,249 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Acknowledging ACTIVE Namenode Block pool BP-480719120-192.168.1.132-1413164135404 (Datanode Uuid 62322351-4f82-42eb-baa4-f3b38675234e) service to master/192.168.1.132:9000
2014-10-13 14:56:24,350 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Sent 1 blockreports 0 blocks total. Took 1 msec to generate and 100 msecs for RPC and NN processing. Got back commands org.apache.hadoop.hdfs.server.protocol.FinalizeCommand@563b100c
2014-10-13 14:56:24,354 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-480719120-192.168.1.132-1413164135404
2014-10-13 14:56:24,369 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
2014-10-13 14:56:24,370 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2014-10-13 14:56:24,373 INFO org.apache.hadoop.util.GSet: 0.5% max memory 966.7 MB = 4.8 MB
2014-10-13 14:56:24,375 INFO org.apache.hadoop.util.GSet: capacity = 2^19 = 524288 entries
2014-10-13 14:56:24,376 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-480719120-192.168.1.132-1413164135404
2014-10-13 14:56:24,389 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-480719120-192.168.1.132-1413164135404 to blockPoolScannerMap, new size=1
Problem solved.
To be sure, I turned the firewall on master back on with /etc/init.d/iptables start and checked the DataNode log on the slave once more:
2014-10-13 14:57:45,119 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.net.SocketTimeoutException: Call From slave2/192.168.1.133 to master:9000 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.133:39010 remote=master/192.168.1.132:9000]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
    at org.apache.hadoop.ipc.Client.call(Client.java:1414)
    at org.apache.hadoop.ipc.Client.call(Client.java:1363)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy14.sendHeartbeat(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:190)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:103)
    at com.sun.proxy.$Proxy14.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:178)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:838)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.133:39010 remote=master/192.168.1.132:9000]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at java.io.FilterInputStream.read(FilterInputStream.java:116)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:510)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
    at java.io.DataInputStream.readInt(DataInputStream.java:370)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1054)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:949)
2014-10-13 14:57:46,128 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-13 14:57:47,129 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: master/192.168.1.132:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Sure enough, the error was caused by the firewall on master.
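4 Summary
Because basic connectivity between master and slave was fine and the two could ping each other, commands from master could be delivered to the slave. The slave received the start command, started its DataNode, and once it was up began sending heartbeats to the master node. Under normal conditions, master receives the heartbeats, records the DataNode on that node as healthy, can then replicate HDFS data to that node, and can run computation on it.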
But because the firewall blocked them, the heartbeats from the slave never reached master. As far as master was concerned the node did not exist, so it was listed under neither Live Nodes nor Dead Nodes. From the point of view of master, the node had become a vanished ghost node.
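The "ghost node" behaviour follows from how liveness is tracked: a node can be classed as live or dead only after the NameNode has heard from it at least once. The toy model below sketches that idea and is not the NameNode implementation. The 3-second heartbeat interval comes from heartBeatInterval=3000 in the DataNode log; the 300-second recheck interval and the resulting 630-second expiry are assumed stock defaults; the class ToyNameNode and the node slave1 are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 3    # heartBeatInterval=3000 ms in the DataNode log
RECHECK_INTERVAL_S = 300    # assumed default of dfs.namenode.heartbeat.recheck-interval
# Assumed expiry rule: 2 * recheck interval + 10 * heartbeat interval = 630 s.
DEAD_AFTER_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S

class ToyNameNode:
    """Tracks only nodes whose heartbeats have actually arrived."""

    def __init__(self):
        self.last_heartbeat = {}  # node name -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def report(self, now):
        """Return (live, dead) node lists as of time `now`."""
        live = sorted(n for n, t in self.last_heartbeat.items() if now - t <= DEAD_AFTER_S)
        dead = sorted(n for n, t in self.last_heartbeat.items() if now - t > DEAD_AFTER_S)
        return live, dead

nn = ToyNameNode()
nn.heartbeat("slave1", now=0)   # hypothetical node whose heartbeat got through
# slave2's heartbeats are blocked by the firewall, so heartbeat() is never called for it.

print(nn.report(now=10))    # prints (['slave1'], [])
print(nn.report(now=700))   # prints ([], ['slave1'])
```

In both reports slave2 appears in neither list, which matches the Live Nodes 0 and Dead Nodes 0 seen in the web UI when no DataNode could get through.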
For security reasons, the proper fix is to open port 9000 on master rather than turn the firewall off. Since this was a test environment running in virtual machines, I simply turned it off, and also ran chkconfig iptables off so that the firewall no longer starts at boot.
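For the safer alternative of opening only port 9000, the rule has to be inserted before the catch-all REJECT, which is rule 5 in the INPUT chain listed earlier. The sketch below only builds and prints the commands; it does not execute anything. The helper name open_port_commands is hypothetical, the position 5 is taken from that listing and would differ on another system, and other Hadoop ports used in a given setup might need the same treatment.

```python
def open_port_commands(port, before_rule=5):
    """Build (not run) commands that accept new TCP connections on `port`."""
    insert = [
        "iptables", "-I", "INPUT", str(before_rule),   # insert ahead of the REJECT rule
        "-p", "tcp", "-m", "state", "--state", "NEW",
        "--dport", str(port), "-j", "ACCEPT",
    ]
    save = ["service", "iptables", "save"]             # persist the rule across restarts
    return [insert, save]

for argv in open_port_commands(9000):
    print(" ".join(argv))
# prints:
# iptables -I INPUT 5 -p tcp -m state --state NEW --dport 9000 -j ACCEPT
# service iptables save
```

The printed lines would be run as root on master, after which /etc/init.d/iptables status should show the new ACCEPT rule above the REJECT.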