I then ran an actual test and found that this code is more efficient:
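N = 1e6      # number of rows
ncol1 = 1e5  # number of distinct values in column 1
ncol2 = 10   # distinct values in column 2; you can raise this to 1e5, but then most counts will be 1
Dta = cbind(sample(1:ncol1, N, replace=TRUE), sample(1:ncol2, N, replace=TRUE))
# On Windows, if the shell has no sort command, you can sort in R instead:
# Dta = Dta[order(Dta[,1], Dta[,2]),]
write.table(x=Dta, file="dta410880", sep=" ", quote=FALSE, col.names=FALSE, row.names=FALSE)
# Sort the rows so duplicates become adjacent, then count each distinct row
system("sort dta410880 | uniq -c > re.log")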
# Example contents of re.log (columns: count, column 1, column 2)
# 1 100000 1
# 2 100000 3
# 1 100000 4
# 2 100000 5
# 1 100000 6
# 4 100000 7
# 1 100000 9
# 1 10000 1
# 2 10000 10
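The 1e5 × 1e5 case takes about 2 seconds on an SSD, 16 GB RAM, and an i7-2600K. The hardware actually matters little; the key is using a sorting algorithm.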
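To illustrate why the sort step is what makes this work, here is a minimal sketch on a tiny made-up file (the five rows and the temporary file are assumptions for illustration, not data from the test above). `uniq -c` only merges adjacent identical lines, so without sorting first the counts come out wrong:

```shell
#!/bin/sh
# Hypothetical demo data: two distinct rows, interleaved so no duplicates are adjacent.
tmp=$(mktemp)
printf '2 5\n1 3\n2 5\n1 3\n2 5\n' > "$tmp"

# Without sorting, uniq -c sees no adjacent duplicates and reports every line separately.
unsorted_groups=$(uniq -c "$tmp" | wc -l | tr -d ' ')

# After sorting, identical rows are adjacent, so each distinct row is counted once.
sorted_groups=$(LC_ALL=C sort "$tmp" | uniq -c | wc -l | tr -d ' ')

echo "groups without sort: $unsorted_groups"
echo "groups with sort:    $sorted_groups"

# Show the actual counts (count first, then the row, as in re.log).
LC_ALL=C sort "$tmp" | uniq -c
rm -f "$tmp"
```

On this input the unsorted pipeline reports five groups, while the sorted one reports two (row `1 3` twice, row `2 5` three times). `LC_ALL=C` is used here only to make the sort order byte-wise and reproducible; the original command relies on the system locale.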