Swapping IDs & Python Performance

I'm hoping I can get help making my code run more efficiently. The purpose of my code is to take out the first ID (RUID) and replace it with a de-identified ID (RESPID) based on a key file of IDs. The input data file is a large tab-delimited text file of about 2.5GB. The data is very wide: each row has thousands of columns. I have a function that works, but on the actual data it is incredibly slow. My first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I build the row back together and write each row individually. Any advice on how to improve this would be greatly appreciated; 4 days is way too long for processing! Thank you!

    import datetime

    def swap():
        #input files
        infile1 = open(r"Z:\ped_test.txt", 'rb')
        keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')

        #output file
        outfile=open(r"Z:\ped_testRESPID.txt", 'wb')

        # create dictionary of RUID-RESPID
        COLUMN = 1 #Column containing RUID
        RESPID={}
        for k in keyfile:
            kList = k.rstrip('\r\n').split('\t')
            if kList[0] not in RESPID and kList[0] != "":
                RESPID[kList[0]]=kList[1]
        #print RESPID
        print "creating RESPID-RUID xwalk dictionary is done"

        print "Start creating new file"
        print str(datetime.datetime.now())
        count=0
        for line in infile1:
            #if not re.match('#', line): #if there is a header
            sline = line.split()
            #slen = len(sline)
            RUID = sline[COLUMN]
            #print RUID
            C0 = sline[0]
            #print C0
            DAT=sline[2:]
            for key in RESPID:
                if key==RUID:
                    NewID=RESPID[key]
            row=str(C0+'\t'+NewID)
            for a in DAT:
                row=row+'\t'+a
            #print row
            outfile.write(row)
            outfile.write('\n')

        infile1.close()
        keyfile.close()
        outfile.close()
        print "All Done: RESPID replacement is complete"
        print str(datetime.datetime.now())

Best answer
You have several places where you can speed things up. Primarily, the problem is enumerating all of the keys in RESPID for every line when you can just use the dictionary's 'get' method to look up the value. But since you have very wide lines, there are a couple of other tweaks that will make a difference: splitting off only the first two columns avoids breaking each line into thousands of fields, and a single join replaces the repeated string concatenation.

    import datetime

    def swap():
        # input files
        infile1 = open(r"Z:\ped_test.txt", 'rb')
        keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')

        # output file
        outfile = open(r"Z:\ped_testRESPID.txt", 'wb')

        # create dictionary of RUID-RESPID
        COLUMN = 1  # Column containing RUID
        RESPID = {}
        for k in keyfile:
            # minor: just grab what you need (strip the line ending first so
            # it does not end up inside the stored RESPID value)
            kList = k.rstrip('\r\n').split('\t', 2)
            # minor: do the cheap test first
            if kList[0] and kList[0] not in RESPID:
                RESPID[kList[0]] = kList[1]
        print "creating RESPID-RUID xwalk dictionary is done"

        print "Start creating new file"
        print str(datetime.datetime.now())
        for line in infile1:
            #if not re.match('#', line): #if there is a header
            # minor: just grab what you need; sline[2] is the untouched rest
            # of the line, including its trailing newline
            sline = line.split('\t', 2)
            RUID = sline[COLUMN]

            # the biggie, just use a lookup
            #for key in RESPID:
            #    if key==RUID:
            #        NewID=RESPID[key]
            # (an RUID missing from the key file keeps its original value)
            row = '\t'.join([sline[0], RESPID.get(RUID, sline[1]), sline[2]])

            #row=str(C0+'\t'+NewID)
            #for a in DAT:
            #    row=row+'\t'+a
            # sline[2] already ends with the newline, so write the row as is
            outfile.write(row)

        infile1.close()
        keyfile.close()
        outfile.close()
        print "All Done: RESPID replacement is complete"
        print str(datetime.datetime.now())
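For anyone on Python 3, here is a minimal sketch of the same idea. It is an illustration under some assumptions, not the original poster's code: the names `load_key` and `swap_ids` are made up for this example, the files are read as bytes so no text encoding has to be guessed, and lines with fewer than three columns are copied through unchanged. It keeps the fallback used above, where an RUID that is missing from the key file is left as is; for real de-identification you may prefer to raise an error instead.

```python
def load_key(key_path):
    """Build the RUID -> RESPID dictionary from a tab-delimited key file."""
    respid = {}
    with open(key_path, "rb") as keyfile:
        for line in keyfile:
            fields = line.rstrip(b"\r\n").split(b"\t", 2)
            # Skip blank or one-column lines; the first mapping wins for a
            # repeated RUID, as in the code above.
            if len(fields) >= 2 and fields[0] and fields[0] not in respid:
                respid[fields[0]] = fields[1]
    return respid


def swap_ids(in_path, key_path, out_path):
    """Copy in_path to out_path, replacing the second column (RUID)."""
    respid = load_key(key_path)
    with open(in_path, "rb") as infile, open(out_path, "wb") as outfile:
        for line in infile:
            # Split off only the first two columns; `rest` is the remainder
            # of the line, trailing newline included.
            parts = line.split(b"\t", 2)
            if len(parts) < 3:
                # Assumption: short or blank lines are passed through as is.
                outfile.write(line)
                continue
            first, ruid, rest = parts
            # One dictionary lookup per line instead of a scan over all keys;
            # unknown RUIDs fall back to the original value.
            outfile.write(b"\t".join((first, respid.get(ruid, ruid), rest)))
```

With the paths from the question, the call would be `swap_ids(r"Z:\ped_test.txt", r"Z:\ruid_respid_test.txt", r"Z:\ped_testRESPID.txt")`. The `with` blocks close the files even if an error occurs partway through, which matters on a job that runs for a long time.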