Prerequisite
Hadoop 2.2 is already installed. Apply the installation steps below on every Hadoop node.
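Because each step has to be repeated on every node, a small helper can save typing. The sketch below is only an illustration: the function name run_on_nodes, the NODE_RUNNER variable and the hosts file are assumptions made for this example, not part of Hadoop or RHadoop, and it presumes passwordless ssh from the current machine to each node.

```shell
# Run one command on every node listed in a hosts file (one hostname per line).
# NODE_RUNNER is a hypothetical knob: it defaults to ssh, and can be set to
# "echo" for a dry run that only prints what would be executed.
run_on_nodes() {
  local hosts_file="$1"
  shift
  local runner="${NODE_RUNNER:-ssh}"
  local host
  while IFS= read -r host; do
    # Skip blank lines and comment lines in the hosts file.
    case "$host" in
      ''|'#'*) continue ;;
    esac
    # Redirect stdin so ssh does not swallow the rest of the hosts file.
    "$runner" "$host" "$@" < /dev/null
  done < "$hosts_file"
}
```

For example, NODE_RUNNER=echo run_on_nodes nodes.txt R --version prints the command that would be sent to each host, and dropping NODE_RUNNER=echo runs it for real. Commands that need sudo may require a terminal on the remote side, depending on the sudoers configuration.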
Step 1. Install R (with yum)
[hadoop@c0046220 yum.repos.d]$ sudo yum update
[hadoop@c0046220 yum.repos.d]$ yum search r-project
[hadoop@c0046220 yum.repos.d]$ sudo yum install R
...
Installed:
R.x86_64 0:3.0.2-1.el6
Dependency Installed:
R-core.x86_64 0:3.0.2-1.el6 R-core-devel.x86_64 0:3.0.2-1.el6 R-devel.x86_64 0:3.0.2-1.el6 R-java.x86_64 0:3.0.2-1.el6
R-java-devel.x86_64 0:3.0.2-1.el6 bzip2-devel.x86_64 0:1.0.5-7.el6_0 fontconfig-devel.x86_64 0:2.8.0-3.el6 freetype-devel.x86_64 0:2.3.11-14.el6_3.1
java-1.6.0-openjdk-devel.x86_64 1:1.6.0.0-1.62.1.11.11.90.el6_4 kpathsea.x86_64 0:2007-57.el6_2 libRmath.x86_64 0:3.0.2-1.el6 libRmath-devel.x86_64 0:3.0.2-1.el6
libXft-devel.x86_64 0:2.3.1-2.el6 libXmu.x86_64 0:1.1.1-2.el6 libXrender-devel.x86_64 0:0.9.7-2.el6 libicu.x86_64 0:4.2.1-9.1.el6_2
netpbm.x86_64 0:10.47.05-11.el6 netpbm-progs.x86_64 0:10.47.05-11.el6 pcre-devel.x86_64 0:7.8-6.el6 psutils.x86_64 0:1.17-34.el6
tcl.x86_64 1:8.5.7-6.el6 tcl-devel.x86_64 1:8.5.7-6.el6 tex-preview.noarch 0:11.85-10.el6 texinfo.x86_64 0:4.13a-8.el6
texinfo-tex.x86_64 0:4.13a-8.el6 texlive.x86_64 0:2007-57.el6_2 texlive-dvips.x86_64 0:2007-57.el6_2 texlive-latex.x86_64 0:2007-57.el6_2
texlive-texmf.noarch 0:2007-38.el6 texlive-texmf-dvips.noarch 0:2007-38.el6 texlive-texmf-errata.noarch 0:2007-7.1.el6 texlive-texmf-errata-dvips.noarch 0:2007-7.1.el6
texlive-texmf-errata-fonts.noarch 0:2007-7.1.el6 texlive-texmf-errata-latex.noarch 0:2007-7.1.el6 texlive-texmf-fonts.noarch 0:2007-38.el6 texlive-texmf-latex.noarch 0:2007-38.el6
texlive-utils.x86_64 0:2007-57.el6_2 tk.x86_64 1:8.5.7-5.el6 tk-devel.x86_64 1:8.5.7-5.el6 zlib-devel.x86_64 0:1.2.3-29.el6
Complete!
Validation:
[hadoop@c0046220 yum.repos.d]$ R

R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
Copyright (C) 2013 The R Foundation for Statistical Computing
Platform: x86_64-redhat-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
Step 2. Install RHadoop
2.1 Getting RHadoop Packages
Download the rhdfs, rhbase and rmr2 packages listed at https://github.com/RevolutionAnalytics/RHadoop/wiki/Downloads, for example with the wget commands below.
[hadoop@c0046220 RHadoop]$ cd /tmp
[hadoop@c0046220 tmp]$ mkdir RHadoop
[hadoop@c0046220 tmp]$ cd RHadoop
[hadoop@c0046220 RHadoop]$ wget https://raw.githubusercontent.com/RevolutionAnalytics/rhdfs/master/build/rhdfs_1.0.8.tar.gz
[hadoop@c0046220 RHadoop]$ wget https://raw.githubusercontent.com/RevolutionAnalytics/rmr2/3.1.0/build/rmr2_3.1.0.tar.gz
[hadoop@c0046220 RHadoop]$ wget https://raw.githubusercontent.com/RevolutionAnalytics/rhbase/master/build/rhbase_1.2.0.tar.gz
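A download that fails part-way, or that returns an error page instead of the archive, only shows up later as a confusing install.packages error. As a hedged sanity check, the sketch below tests that each file is a readable gzip-compressed tar archive. The function name check_tarballs is an assumption made for this example, not an RHadoop tool.

```shell
# Check that each downloaded package is a readable gzip-compressed tar archive.
# Prints one OK/BAD line per file and returns non-zero if any file is bad.
check_tarballs() {
  local f
  local status=0
  for f in "$@"; do
    # tar -tzf lists the archive contents without extracting anything.
    if tar -tzf "$f" > /dev/null 2>&1; then
      echo "OK   $f"
    else
      echo "BAD  $f"
      status=1
    fi
  done
  return "$status"
}
```

With the files from this step it could be called as check_tarballs /tmp/RHadoop/*.tar.gz before moving on to the installation.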
2.2 Install R packages that RHadoop depends on.
[hadoop@c0046220 java]$ echo $JAVA_HOME
/usr/java/jdk1.8.0_05
[hadoop@c0046220 java]$ sudo -i
[root@c0046220 ~]# export JAVA_HOME=/usr/java/jdk1.8.0_05
[root@c0046220 ~]# R CMD javareconf
[root@c0046220 ~]# R
...
> .libPaths();
[1] "/usr/lib64/R/library" "/usr/share/R/library"
> install.packages(c("rJava", "Rcpp", "RJSONIO", "bitops", "digest", "functional", "stringr", "plyr", "reshape2", "caTools"))
> # caTools (already included in the list above) is needed by rmr2
2.3 Install RHadoop
Set environment variables
[hadoop@c0046220 ~]$ vi ~/.bashrc
# set HADOOP locations for RHADOOP
export HADOOP_CMD=$HADOOP_HOME/bin/hadoop
export HADOOP_STREAMING=/opt/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar
[hadoop@c0046220 ~]$ source .bashrc
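rhdfs and rmr2 find Hadoop through these variables, so a typo in either path tends to surface only when the packages are loaded. The sketch below is one possible pre-flight check: the function name check_rhadoop_env is an assumption made for this example, and it only tests that the files exist, not that they belong to a working Hadoop installation.

```shell
# Report whether the variables RHadoop reads are set and point at real files.
# Returns non-zero if either variable is missing or points nowhere.
check_rhadoop_env() {
  local status=0
  # HADOOP_CMD should be the executable hadoop launcher script.
  if [ -n "${HADOOP_CMD:-}" ] && [ -x "$HADOOP_CMD" ]; then
    echo "HADOOP_CMD ok: $HADOOP_CMD"
  else
    echo "HADOOP_CMD is unset or not executable: ${HADOOP_CMD:-<unset>}"
    status=1
  fi
  # HADOOP_STREAMING should be the streaming jar, a regular file.
  if [ -n "${HADOOP_STREAMING:-}" ] && [ -f "$HADOOP_STREAMING" ]; then
    echo "HADOOP_STREAMING ok: $HADOOP_STREAMING"
  else
    echo "HADOOP_STREAMING is unset or not a file: ${HADOOP_STREAMING:-<unset>}"
    status=1
  fi
  return "$status"
}
```

Running check_rhadoop_env right after sourcing ~/.bashrc shows which of the two paths, if any, needs fixing.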
[hadoop@c0040084 R]$ sudo -i
[root@c0040084 ~]# R
...
> Sys.setenv(HADOOP_HOME="/opt/hadoop/hadoop-2.2.0");
> Sys.setenv(HADOOP_CMD="/opt/hadoop/hadoop-2.2.0/bin/hadoop");
> Sys.setenv(HADOOP_STREAMING="/opt/hadoop/hadoop-2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar");
> install.packages(pkgs="/tmp/RHadoop/rhdfs_1.0.8.tar.gz",repos=NULL);
> install.packages(pkgs="/tmp/RHadoop/rmr2_3.1.0.tar.gz",repos=NULL);
Step 3. Validation
Load and initialize the rhdfs package, then run a few simple commands:
library(rhdfs)
hdfs.init()
hdfs.ls("/")
[hadoop@c0046220 ~]$ R
...
> library(rhdfs)
Loading required package: rJava
...
Be sure to run hdfs.init()
> hdfs.init()
14/05/15 10:02:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> hdfs.ls("/")
permission owner group size modtime file
1 drwxr-xr-x hadoop supergroup 0 2014-05-14 03:05 /apps
2 drwxr-xr-x hadoop supergroup 0 2014-05-12 09:40 /data
3 drwxr-xr-x hadoop supergroup 0 2014-05-12 09:45 /output
4 drwxrwx--- hadoop supergroup 0 2014-05-15 10:02 /tmp
5 drwxr-xr-x hadoop supergroup 0 2014-05-14 05:48 /user
6 drwxr-xr-x hadoop supergroup 0 2014-05-13 06:43 /usr
Load the rmr2 package, then run a few simple commands:
library(rmr2)
from.dfs(to.dfs(1:100))
from.dfs(mapreduce(to.dfs(1:100)))
[hadoop@c0046220 ~]$ R
...
> library(rmr2)
Loading required package: Rcpp
Loading required package: RJSONIO
Loading required package: bitops
Loading required package: digest
Loading required package: functional
Loading required package: reshape2
Loading required package: stringr
Loading required package: plyr
Loading required package: caTools
> from.dfs(to.dfs(1:100))
...
$key
NULL
$val
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
> from.dfs(mapreduce(to.dfs(1:100)))
...
$key
NULL
$val
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
Finally, run a word count job on a text file stored in HDFS:
library(rmr2)
input <- '/user/hadoop/tmp.txt'
wordcount = function(input, output = NULL, pattern = " ") {
  # map: split each line into words and emit a (word, 1) pair per word
  wc.map = function(., lines) {
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }

  # reduce: sum the counts collected for each word
  wc.reduce = function(word, counts) {
    keyval(word, sum(counts))
  }

  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = T)
}

wordcount(input)
> library(rmr2)
> input<- '/user/hadoop/tmp.txt'
> wordcount = function(input, output = NULL, pattern = " "){
+ wc.map = function(., lines) {
+ keyval(unlist( strsplit( x = lines,split = pattern)),1)
+ }
+
+ wc.reduce =function(word, counts ) {
+ keyval(word, sum(counts))
+ }
+
+ mapreduce(input = input ,output = output, input.format = "text",
+ map = wc.map, reduce = wc.reduce,combine = T)
+ }
>
> wordcount(input)
...
14/05/15 10:18:40 INFO mapreduce.Job: Job job_1399887026053_0013 completed successfully
14/05/15 10:18:40 INFO mapreduce.Job: Counters: 45
File System Counters
FILE: Number of bytes read=11018
FILE: Number of bytes written=278566
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2004
HDFS: Number of bytes written=11583
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Failed reduce tasks=1
Launched map tasks=2
Launched reduce tasks=2
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=23412
Total time spent by all reduces in occupied slots (ms)=13859
Map-Reduce Framework
Map input records=24
Map output records=112
Map output bytes=10522
Map output materialized bytes=11024
Input split bytes=208
Combine input records=112
Combine output records=114
Reduce input groups=105
Reduce shuffle bytes=11024
Reduce input records=114
Reduce output records=112
Spilled Records=228
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=569
CPU time spent (ms)=3700
Physical memory (bytes) snapshot=574214144
Virtual memory (bytes) snapshot=6258499584
Total committed heap usage (bytes)=365953024
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1796
File Output Format Counters
Bytes Written=11583
rmr
reduce calls=110
14/05/15 10:18:40 INFO streaming.StreamJob: Output directory: /tmp/file612355aa2e35
function ()
{
fname
}
<environment: 0x37d70d0>
>
>
> from.dfs("/tmp/file612355aa2e35")
$key
[1] "-"
[2] "of"
[3] "Hong"
[4] "Paul's"
[5] "School"
[6] "College"
[7] "Graduate"
...
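To cross-check the result of a small job like this one, the same count can be reproduced locally on a copy of the input file. The sketch below is an approximation rather than an exact replica of the rmr2 job: the function name local_wordcount is an assumption made for this example, and awk splits on runs of whitespace, whereas the R code splits on single spaces and can therefore emit empty strings as keys.

```shell
# Count words in a local text file, printing "word<TAB>count" sorted by word.
local_wordcount() {
  awk '
    {
      # map: split the line into words
      n = split($0, words, " ")
      # reduce: keep a running total per word
      for (i = 1; i <= n; i++) count[words[i]]++
    }
    END {
      for (w in count) printf "%s\t%d\n", w, count[w]
    }
  ' "$1" | LC_ALL=C sort
}
```

One way to use it is to copy the input out of HDFS with hadoop fs -cat /user/hadoop/tmp.txt > /tmp/tmp.txt and then run local_wordcount /tmp/tmp.txt, comparing a few words against the keys and values returned by from.dfs.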
References
https://s3.amazonaws.com/RHadoop/RHadoop2.0.2u2_Installation_Configuration_for_RedHat.pdf
http://cran.r-project.org/doc/manuals/r-devel/R-admin.html#Installing-R-under-Unix_002dalikes
http://www.rdatamining.com/tutorials/rhadoop
http://blog.fens.me/rhadoop-rhadoop/
http://datamgmt.com/installing-r-and-rstudio-on-redhat-or-centos-linux/
https://github.com/RevolutionAnalytics/RHadoop/wiki
https://github.com/RevolutionAnalytics/RHadoop/wiki/Which-Hadoop-for-rmr