Look for a barcode in a data frame and count the number of times it occurs; very slow for loop?

I have a for loop that is awfully slow and doesn't work properly. It looks up a barcode in one data.frame and then searches for that barcode in another data.frame. A barcode can appear multiple times in the 2nd data.frame. For each barcode, a counter should count how many times it occurs there, and that count should be written to the 1st data.frame.

My try:

    for (i in 1:length(tcgadataUniek$Tumor_Sample_Barcode)) {
      for (j in 1:length(hprdDataSorted$Samples.Int1)) {
        count <- 0
        if (i == j) {
          count <- count + 1
        } else {
          count <- count + 0
        }
        hprdDataSorted$Samples.Int2 <- count[j]
      }
    }

The 1st data.frame looks as follows (csv):

    HUGO.Int1,HUGO.Int2,barcode.Int1
    A1CF,APOBEC1,TCGA-B6-A0RS-01A-11D-A099-09
    A1CF,TNPO2,TCGA-B6-A0RS-01A-11D-A099-09
    A1CF,SYNCRIP,TCGA-B6-A0RS-01A-11D-A099-09
    A1CF,KHSRP,TCGA-B6-A0RS-01A-11D-A099-09
    A2M,SHBG,TCGA-D8-A1JK-01A-11D-A13L-09
    A2M,C11orf58,TCGA-D8-A1JK-01A-11D-A13L-09
    A2M,ATF7IP,TCGA-D8-A1JK-01A-11D-A13L-09
    AAMP,TH1L,TCGA-A8-A08S-01A-11W-A050-09
    AARS,EEF1B2,TCGA-AO-A0JC-01A-11W-A071-09

The 2nd data.frame, which holds the duplicated barcodes (csv):

    Sample_Barcode
    TCGA-A8-A08G-01A-11W-A019-09
    TCGA-AO-A03O-01A-11W-A019-09
    TCGA-AO-A03O-01A-11W-A019-09
    TCGA-B6-A0RS-01A-11D-A099-09
    TCGA-BH-A0HP-01A-12D-A099-09
    TCGA-BH-A0HP-01A-12D-A099-09
    TCGA-BH-A18H-01A-11D-A12B-09
    TCGA-BH-A18H-01A-11D-A12B-09
    TCGA-BH-A18J-01A-11D-A12B-09
    TCGA-D8-A1JK-01A-11D-A13L-09
    TCGA-E2-A1BC-01A-11D-A14G-09
    TCGA-E2-A1BC-01A-11D-A14G-09
    TCGA-E9-A1NH-01A-11D-A14G-09
    TCGA-E9-A22B-01A-11D-A159-09

If a barcode from barcode.Int1 (data.frame 1) occurs 3 times in Sample_Barcode, the script should add a 3 next to that barcode.Int1 value. For example:

    HUGO.Int1,HUGO.Int2,barcode.Int1,number_of_times
    A1CF,APOBEC1,TCGA-B6-A0RS-01A-11D-A099-09,5

Best answer

Paul's comment is very appropriate; it will speed up the merge step significantly. I would use table to get the counts of the unique barcodes in your second data.frame and merge them onto your first, see below:

    # Example data, as produced by dput()
    dat <- structure(list(
      HUGO.Int1 = c("A1CF", "A1CF", "A1CF", "A1CF", "A2M", "A2M", "A2M",
                    "AAMP", "AARS"),
      HUGO.Int2 = c("APOBEC1", "TNPO2", "SYNCRIP", "KHSRP", "SHBG",
                    "C11orf58", "ATF7IP", "TH1L", "EEF1B2"),
      barcode.Int1 = c("TCGA-B6-A0RS-01A-11D-A099-09",
                       "TCGA-B6-A0RS-01A-11D-A099-09",
                       "TCGA-B6-A0RS-01A-11D-A099-09",
                       "TCGA-B6-A0RS-01A-11D-A099-09",
                       "TCGA-D8-A1JK-01A-11D-A13L-09",
                       "TCGA-D8-A1JK-01A-11D-A13L-09",
                       "TCGA-D8-A1JK-01A-11D-A13L-09",
                       "TCGA-A8-A08S-01A-11W-A050-09",
                       "TCGA-AO-A0JC-01A-11W-A071-09")),
      .Names = c("HUGO.Int1", "HUGO.Int2", "barcode.Int1"),
      class = "data.frame", row.names = c(NA, -9L))

    dat2 <- structure(list(
      Sample_Barcode = c("TCGA-A8-A08G-01A-11W-A019-09",
                         "TCGA-AO-A03O-01A-11W-A019-09",
                         "TCGA-AO-A03O-01A-11W-A019-09",
                         "TCGA-B6-A0RS-01A-11D-A099-09",
                         "TCGA-BH-A0HP-01A-12D-A099-09",
                         "TCGA-BH-A0HP-01A-12D-A099-09",
                         "TCGA-BH-A18H-01A-11D-A12B-09",
                         "TCGA-BH-A18H-01A-11D-A12B-09",
                         "TCGA-BH-A18J-01A-11D-A12B-09",
                         "TCGA-D8-A1JK-01A-11D-A13L-09",
                         "TCGA-E2-A1BC-01A-11D-A14G-09",
                         "TCGA-E2-A1BC-01A-11D-A14G-09",
                         "TCGA-E9-A1NH-01A-11D-A14G-09",
                         "TCGA-E9-A22B-01A-11D-A159-09")),
      .Names = "Sample_Barcode",
      class = "data.frame", row.names = c(NA, -14L))

    # Count each barcode in dat2. Naming the table dimension explicitly gives
    # the resulting column a predictable name ("Sample_Barcode") to merge on.
    foo <- as.data.frame(table(Sample_Barcode = dat2$Sample_Barcode))

    # Left join: keep every row of dat; Freq is NA where the barcode
    # does not occur in dat2.
    merge(dat, foo, by.x = 'barcode.Int1', by.y = 'Sample_Barcode', all.x = TRUE)
    #                   barcode.Int1 HUGO.Int1 HUGO.Int2 Freq
    # 1 TCGA-A8-A08S-01A-11W-A050-09      AAMP      TH1L   NA
    # 2 TCGA-AO-A0JC-01A-11W-A071-09      AARS    EEF1B2   NA
    # 3 TCGA-B6-A0RS-01A-11D-A099-09      A1CF     TNPO2    1
    # 4 TCGA-B6-A0RS-01A-11D-A099-09      A1CF   SYNCRIP    1
    # 5 TCGA-B6-A0RS-01A-11D-A099-09      A1CF   APOBEC1    1
    # 6 TCGA-B6-A0RS-01A-11D-A099-09      A1CF     KHSRP    1
    # 7 TCGA-D8-A1JK-01A-11D-A13L-09       A2M  C11orf58    1
    # 8 TCGA-D8-A1JK-01A-11D-A13L-09       A2M    ATF7IP    1
    # 9 TCGA-D8-A1JK-01A-11D-A13L-09       A2M      SHBG    1
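The merge reports NA for barcodes that never occur in the second data.frame, and merge() sorts the result by the key. If a count of 0 and the original row order are preferred, one option is to look the counts up with match() instead of merging. The following is only a minimal sketch on made-up toy data; the names `first` and `second` and the single-letter barcodes are hypothetical stand-ins, not the asker's tables:

```r
# Toy stand-ins for the two data.frames (hypothetical values)
first <- data.frame(
  barcode.Int1 = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)
second <- data.frame(
  Sample_Barcode = c("A", "B", "B", "B", "D"),
  stringsAsFactors = FALSE
)

# Count every distinct barcode in the second data.frame exactly once
counts <- table(second$Sample_Barcode)

# For each row of `first`, find the position of its barcode among the
# counted barcodes (NA when the barcode never occurs in `second`)
idx <- match(first$barcode.Int1, names(counts))

# Pull the counts out as a plain integer vector, in the row order of `first`
first$number_of_times <- as.integer(counts)[idx]

# Treat "not found" as zero occurrences
first$number_of_times[is.na(first$number_of_times)] <- 0L

print(first)
```

Because the counting happens once and the lookup is vectorised, this avoids the nested loop over both data.frames entirely.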

The data.table version:

    library(data.table)
    foo <- data.table(as.data.frame(table(dat2)))
    setnames(foo, c('barcode.Int1', 'Freq'))
    setkey(foo, barcode.Int1)
    dat <- data.table(dat, key='barcode.Int1')
    foo[dat]
    #                    barcode.Int1 Freq HUGO.Int1 HUGO.Int2
    # 1:                           NA   NA      AAMP      TH1L
    # 2:                           NA   NA      AARS    EEF1B2
    # 3: TCGA-B6-A0RS-01A-11D-A099-09    1      A1CF   APOBEC1
    # 4: TCGA-B6-A0RS-01A-11D-A099-09    1      A1CF     TNPO2
    # 5: TCGA-B6-A0RS-01A-11D-A099-09    1      A1CF   SYNCRIP
    # 6: TCGA-B6-A0RS-01A-11D-A099-09    1      A1CF     KHSRP
    # 7: TCGA-D8-A1JK-01A-11D-A13L-09    1       A2M      SHBG
    # 8: TCGA-D8-A1JK-01A-11D-A13L-09    1       A2M  C11orf58
    # 9: TCGA-D8-A1JK-01A-11D-A13L-09    1       A2M    ATF7IP

In "pure" data.table:

    dat <- data.table(dat, key='barcode.Int1')
    dat2 <- data.table(dat2)
    setnames(dat2, 'barcode.Int1')
    setkey(dat2, barcode.Int1)
    # .N is the number of rows in each barcode group
    counts <- dat2[, list(count = .N), by=barcode.Int1]
    counts[dat]
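If the count should end up as a new column of the first table, with rows left in their original order and 0 for barcodes that are never found, an update join is one way to do it in data.table. This is a sketch only, assuming a data.table version that supports the `on=` argument (1.9.6 or later); `first`, `second` and the toy barcodes are hypothetical:

```r
library(data.table)

# Toy stand-ins for the two tables (hypothetical values)
first  <- data.table(barcode.Int1 = c("A", "A", "B", "C"))
second <- data.table(Sample_Barcode = c("A", "B", "B", "B", "D"))

# One row per distinct barcode with its number of occurrences (.N);
# the grouping column is renamed so it matches the column in `first`
counts <- second[, .(number_of_times = .N), by = .(barcode.Int1 = Sample_Barcode)]

# Update join: add the count to `first` by reference, matching on the barcode.
# Rows of `first` without a match keep NA in the new column.
first[counts, on = "barcode.Int1", number_of_times := i.number_of_times]

# Barcodes that never occur in `second` get a count of 0
first[is.na(number_of_times), number_of_times := 0L]

print(first)
```

Using `on=` means neither table needs a key, so the row order of `first` is untouched.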
