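It all started when someone tallied the number of R jobs on Indeed: http://www.datasciencecentral.com/profiles/blogs/sas-dominates-analytics-job-market-r-up-42
Then another fellow complained on Weibo that the R count was inflated, so I went to Indeed to get to the bottom of it.
Searching Indeed for R returns more than 37,000 results, but most of them have nothing to do with the R language, which makes analyzing R job postings rather troublesome. Fortunately, Indeed offers exact phrase search, and R usually shows up next to other software in job requirements. Combining these two facts makes it possible to count R-language job postings.
For example, when a posting asks for SAS or R, the two typically appear in one of four forms: "SAS, R", "SAS or R", "R, SAS", "R or SAS". With Indeed's exact phrase search we can count how often SAS and R appear next to each other, and the same method gives the counts for every other package that appears right before or after R. Adding these up gives the total number of R job postings.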
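To make the four patterns concrete, here is a minimal sketch that builds the four exact-phrase search URLs for a single tool. The helper name `phrase_urls` and its `base` argument are made up for illustration and are not part of the original script. The comma in "SAS, R" is left out because the URL templates used in this post omit it, presumably on the assumption that the search ignores punctuation.

```r
# Hypothetical helper (not in the original script): build the four
# exact-phrase query URLs for one tool.
phrase_urls <- function(tool,
                        base = "http://www.indeed.com/jobs?q=%s&l=united+states&radius=0") {
  # The four ways a tool can sit right next to R in a posting.
  phrases <- c(paste(tool, "R"), paste(tool, "or R"),
               paste("R", tool), paste("R or", tool))
  # Wrap each phrase in URL-encoded quotes (%22) and turn spaces into "+".
  encoded <- paste0("%22", gsub(" ", "+", phrases, fixed = TRUE), "%22")
  setNames(sprintf(base, encoded), phrases)
}

urls <- phrase_urls("SAS")
print(urls)
```

Tool names containing special characters would have to be passed in already URL-encoded, for example C%2B%2B rather than C++.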
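After tallying the major packages, R comes out at roughly 3116 job postings. That is far lower than what a plain search for R returns, but still higher than the 1693 that the first fellow got.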
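
library(RCurl)   # provides getURL()
library(XML)     # loaded by the original script; the code shown here only calls RCurl
rm(list = ls())  # clears the workspace; skip this line if you have objects to keep
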
RJobs <- function(x) {
  # One row per tool: four exact-phrase counts plus the combined query.
  jobs <- matrix(0, length(x), 5)
  # URL templates; "AnotherTool" is a placeholder replaced by each tool name.
  url1 <- "http://www.indeed.com/jobs?q=%22AnotherTool+R%22&l=united+states&radius=0"
  url2 <- "http://www.indeed.com/jobs?q=%22AnotherTool+or+R%22&l=united+states&radius=0"
  url3 <- "http://www.indeed.com/jobs?q=%22R+AnotherTool%22&l=united+states&radius=0"
  url4 <- "http://www.indeed.com/jobs?q=%22R+or+AnotherTool%22&l=united+states&radius=0"
  url5 <- "http://www.indeed.com/jobs?q=%22AnotherTool+R%22+or+%22AnotherTool+or+R%22+or+%22R+AnotherTool%22+or+%22R+or+AnotherTool%22&l=United+States&radius=0"
  url <- c(url1, url2, url3, url4, url5)
  # One row of five URLs per tool.
  url.new <- t(sapply(x, function(tool) gsub("AnotherTool", tool, url, fixed = TRUE)))
  # Fetch one results page and extract the total from the "Jobs 1 to ..." line.
  count.func <- function(page) {
    webpage <- getURL(page)
    webpage <- readLines(tc <- textConnection(webpage)); close(tc)
    aa <- grep("Jobs 1 to ", webpage)
    # No such line means the search returned nothing.
    if (length(aa) == 0) return(0L)
    # The 7th space-separated token carries the total; this depends on the page layout.
    as.integer(gsub("[^0-9]", "", strsplit(webpage[aa[1]], " ")[[1]][7]))
  }
  for (i in seq_along(x)) {
    for (j in 1:5) jobs[i, j] <- count.func(url.new[i, j])
  }
  colnames(jobs) <- c("* R", "* or R", "R *", "R or *", "All")
  rownames(jobs) <- x
  return(jobs)
}

soft <- c("SAS", "SPSS", "Minitab", "Stata", "JMP", "Statistica", "Systat", "BDMP",
          "Python", "Matlab", "Excel", "SQL", "java", "javascript", "perl", "PHP",
          "Fortran", "S-Plus", "Linux", "C%2B%2B", "Access", "Ruby", "Shell", "Coffeescript",
          "Gauss")
## C%2B%2B is the URL-encoded form of C++. "AnotherTool" must be replaced with
## C%2B%2B rather than C++, because a literal "+" in the URL is read as a space.
system.time(jobs <- RJobs(soft))
jobs <- jobs[order(-jobs[, 'All']), ]  # sort by the combined count, largest first
jobs
              * R  * or R  R *  R or *  All
SAS           467     120  311      79  961
Matlab        200      52  319      64  628
SPSS          216      42  145      34  434
Python        100       6   77      29  209
SQL           105       6   89      10  209
Stata          58      15   79      10  160
S-Plus         45      23   72       7  147
C%2B%2B        58       0    7       2   67
java           33       2   28       2   65
Excel          34       0   19       1   52
perl           21       2   25       4   52
JMP            24       6   20       2   50
Minitab        10       0   13       7   28
Ruby            6       2   14       0   22
Linux           7       0    6       1   14
Access         10       0    3       0   13
PHP             4       0    6       0   10
Statistica      3       0    5       0    8
javascript      3       0    2       0    5
Shell           1       0    3       0    4
Fortran         2       0    1       0    3
Systat          2       0    0       0    2
Gauss           1       0    0       0    1
BDMP            0       0    0       0    0
Coffeescript    0       0    0       0    0

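But when several packages are listed in a row, this approach double-counts some postings. For example, "SAS, R, Matlab" matches both "SAS, R" and "R, Matlab", so that one posting is counted twice. The true number of postings should therefore be below 3116 but above 1558: assuming a posting is counted at most twice (once for the tool before R and once for the tool after it), the lower bound is 3116 / 2 = 1558. So how do we get rid of the duplicates? One workable approach is to put all of these exact phrases into a single URL, so that the results come back with the duplicates already removed.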
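Typing such a query by hand is error-prone, so here is a minimal sketch of assembling it programmatically. The function name `combined_url` is hypothetical and not part of the original script. It assumes the same URL layout as the templates in this post, with phrases joined by "+or+".

```r
# Hypothetical helper (not in the original script): join every exact
# phrase for every tool into one "or" query, so each posting is
# returned only once.
combined_url <- function(tools) {
  phrases <- unlist(lapply(tools, function(tool) {
    c(paste(tool, "R"), paste(tool, "or R"),
      paste("R", tool), paste("R or", tool))
  }))
  # Wrap each phrase in URL-encoded quotes (%22) and turn spaces into "+".
  quoted <- paste0("%22", gsub(" ", "+", phrases, fixed = TRUE), "%22")
  paste0("http://www.indeed.com/jobs?q=",
         paste(quoted, collapse = "+or+"),
         "&l=united+states&radius=0")
}

# Tool names must already be URL-encoded (C%2B%2B rather than C++).
url_all <- combined_url(c("SAS", "Matlab"))
print(url_all)
```

Passing the 17 tools from SAS through PHP, in the order they appear in the table, should yield a query of the same shape as the long URL in this post.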
So we open this monster of a URL to get the total number of R job postings: http://www.indeed.com/jobs?q=%22SAS+R%22+or+%22SAS+or+R%22+or+%22R+SAS%22+or+%22R+or+SAS%22+or+%22Matlab+R%22+or+%22Matlab+or+R%22+or+%22R+Matlab%22+or+%22R+or+Matlab%22+or+%22SPSS+R%22+or+%22SPSS+or+R%22+or+%22R+SPSS%22+or+%22R+or+SPSS%22+or+%22Python+R%22+or+%22Python+or+R%22+or+%22R+Python%22+or+%22R+or+Python%22+or+%22SQL+R%22+or+%22SQL+or+R%22+or+%22R+SQL%22+or+%22R+or+SQL%22+or+%22Stata+R%22+or+%22Stata+or+R%22+or+%22R+Stata%22+or+%22R+or+Stata%22+or+%22S-Plus+R%22+or+%22S-Plus+or+R%22+or+%22R+S-Plus%22+or+%22R+or+S-Plus%22+or+%22C%2B%2B+R%22+or+%22C%2B%2B+or+R%22+or+%22R+C%2B%2B%22+or+%22R+or+C%2B%2B%22+or+%22java+R%22+or+%22java+or+R%22+or+%22R+java%22+or+%22R+or+java%22+or+%22Excel+R%22+or+%22Excel+or+R%22+or+%22R+Excel%22+or+%22R+or+Excel%22+or+%22perl+R%22+or+%22perl+or+R%22+or+%22R+perl%22+or+%22R+or+perl%22+or+%22JMP+R%22+or+%22JMP+or+R%22+or+%22R+JMP%22+or+%22R+or+JMP%22+or+%22Minitab+R%22+or+%22Minitab+or+R%22+or+%22R+Minitab%22+or+%22R+or+Minitab%22+or+%22Ruby+R%22+or+%22Ruby+or+R%22+or+%22R+Ruby%22+or+%22R+or+Ruby%22+or+%22Linux+R%22+or+%22Linux+or+R%22+or+%22R+Linux%22+or+%22R+or+Linux%22+or+%22Access+R%22+or+%22Access+or+R%22+or+%22R+Access%22+or+%22R+or+Access%22+or+%22PHP+R%22+or+%22PHP+or+R%22+or+%22R+PHP%22+or+%22R+or+PHP%22&l=united+states&radius=0
So how many is it?
The answer: 2513!
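As a quick sanity check on the numbers quoted in this post, the sketch below confirms that the deduplicated count falls between the two bounds and works out how many postings were counted twice. The variable names are made up for illustration, and the lower bound rests on the assumption that a posting is counted at most twice.

```r
# Figures quoted in this post.
naive_total <- 3116  # pairwise phrase counts added up
deduped     <- 2513  # combined "or" query
earlier     <- 1693  # count reported in the linked article

# Assumption: a posting is counted at most twice, so halving the
# naive total gives a lower bound.
lower_bound <- naive_total / 2  # 1558
stopifnot(lower_bound <= deduped, deduped <= naive_total)

# Extra counts removed by the combined query.
duplicates <- naive_total - deduped  # 603
cat("Lower bound:", lower_bound, "\n")
cat("Duplicate counts removed:", duplicates, "\n")
cat("Share of the naive total:", round(100 * duplicates / naive_total, 1), "%\n")
cat("Ratio to the earlier count:", round(deduped / earlier, 2), "\n")
```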