Swapping IDs & Python Performance

Updated: 2025-06-17 14:13:14

I'm hoping to get help making my code run more efficiently. Its purpose is to take the first ID (RUID) in each row and replace it with a de-identified ID (RESPID), based on a key file of IDs. The input data file is a large tab-delimited text file of about 2.5GB. The data is very wide: each row has thousands of columns. I have a function that works, but on the actual data it is incredibly slow. My first file has been running for 4 days and is only at 1.4GB. I don't know which part of my code is the most problematic, but I suspect it is where I build the row back together and write each row individually. Any advice on how to improve this would be greatly appreciated; 4 days is way too long for processing! Thank you!

    import datetime

    def swap():
        #input files
        infile1 = open(r"Z:\ped_test.txt", 'rb')
        keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
        #output file
        outfile=open(r"Z:\ped_testRESPID.txt", 'wb')

        # create dictionary of RUID-RESPID
        COLUMN = 1 #Column containing RUID
        RESPID={}
        for k in keyfile:
            kList = k.rstrip('\r\n').split('\t')
            if kList[0] not in RESPID and kList[0] != "":
                RESPID[kList[0]]=kList[1]
        #print RESPID
        print "creating RESPID-RUID xwalk dictionary is done"

        print "Start creating new file"
        print str(datetime.datetime.now())
        count=0
        for line in infile1:
            #if not re.match('#', line): #if there is a header
            sline = line.split()
            #slen = len(sline)
            RUID = sline[COLUMN]
            #print RUID
            C0 = sline[0]
            #print C0
            DAT=sline[2:]
            for key in RESPID:
                if key==RUID:
                    NewID=RESPID[key]
            row=str(C0+'\t'+NewID)
            for a in DAT:
                row=row+'\t'+a
            #print row
            outfile.write(row)
            outfile.write('\n')
        infile1.close()
        keyfile.close()
        outfile.close()
        print "All Done: RESPID replacement is complete"
        print str(datetime.datetime.now())

Best answer

You have several places where you can speed things up. The main problem is enumerating all of the keys in RESPID for every row when you can just use the dict's 'get' method to read the value. But since you have very wide lines, there are a couple of other tweaks that will make a difference.
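To see the two ideas in isolation, here is a minimal sketch in Python 3 syntax (the thread's own code is Python 2). The sample IDs and the helper names `lookup_by_scan` and `lookup_by_get` are made up for illustration and are not part of the original code. It assumes nothing beyond built-in `dict` and `str` behavior: a direct lookup returns the same value as scanning every key, and `split('\t', 2)` leaves the wide remainder of the line as one untouched string.

```python
# Hypothetical sample mapping, for illustration only.
RESPID = {"R001": "X9", "R002": "X7"}

def lookup_by_scan(mapping, ruid):
    """Mimics the question's inner loop: compares every key in turn."""
    new_id = None
    for key in mapping:
        if key == ruid:
            new_id = mapping[key]
    return new_id

def lookup_by_get(mapping, ruid):
    """Single hash lookup; returns None when the ID is unknown."""
    return mapping.get(ruid)

line = "fam1\tR002\t0\t1\t2\t2\n"
# maxsplit=2 yields at most three pieces: the two leading fields and
# everything after them as a single string (line ending included).
first, ruid, rest = line.split("\t", 2)
```

The scan does work proportional to the number of keys for every input row, while the `get` call does one lookup per row regardless of how large the key file is.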

    import datetime

    def swap():
        #input files
        infile1 = open(r"Z:\ped_test.txt", 'rb')
        keyfile = open(r"Z:\ruid_respid_test.txt", 'rb')
        #output file
        outfile=open(r"Z:\ped_testRESPID.txt", 'wb')

        # create dictionary of RUID-RESPID
        COLUMN = 1 #Column containing RUID
        RESPID={}
        for k in keyfile:
            # minor: just grab what you need (strip the line ending first,
            # otherwise it ends up inside the RESPID value)
            kList = k.rstrip('\r\n').split('\t', 2)
            # minor: do the cheap test first
            if kList[0] and kList[0] not in RESPID:
                RESPID[kList[0]]=kList[1]
        #print RESPID
        print "creating RESPID-RUID xwalk dictionary is done"

        print "Start creating new file"
        print str(datetime.datetime.now())
        for line in infile1:
            #if not re.match('#', line): #if there is a header
            # minor: just grab what you need; sline[2] is the whole rest
            # of the row as a single string
            sline = line.rstrip('\r\n').split('\t', 2)
            RUID = sline[COLUMN]
            # the biggie, just use a lookup
            #for key in RESPID:
            #    if key==RUID:
            #        NewID=RESPID[key]
            # an RUID with no entry in the key file is written unchanged
            row = '\t'.join([sline[0], RESPID.get(RUID, sline[1]), sline[2]])
            #row=str(C0+'\t'+NewID)
            #for a in DAT:
            #    row=row+'\t'+a
            #print row
            outfile.write(row)
            outfile.write('\n')
        infile1.close()
        keyfile.close()
        outfile.close()
        print "All Done: RESPID replacement is complete"
        print str(datetime.datetime.now())
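For readers on Python 3, the same approach can be sketched as below. This is an illustrative port under stated assumptions, not the answerer's code: the function names `load_key` and `swap_ids` are hypothetical, the files are passed in as already-open text streams, and in-memory `io.StringIO` objects with made-up IDs stand in for the real files on `Z:\`.

```python
import io

def load_key(keyfile):
    """Build {RUID: RESPID} from tab-delimited lines; the first mapping wins."""
    mapping = {}
    for line in keyfile:
        fields = line.rstrip("\r\n").split("\t")
        # Skip blank or malformed lines and keep the first RESPID per RUID.
        if len(fields) >= 2 and fields[0] and fields[0] not in mapping:
            mapping[fields[0]] = fields[1]
    return mapping

def swap_ids(infile, outfile, mapping):
    """Copy infile to outfile, replacing the second column where a mapping exists."""
    for line in infile:
        # Split off only the first two fields; the wide tail stays one string.
        parts = line.split("\t", 2)
        if len(parts) == 3:
            # Assumption carried over from the answer: an unknown RUID is kept.
            parts[1] = mapping.get(parts[1], parts[1])
            line = "\t".join(parts)
        outfile.write(line)

# Usage with in-memory files standing in for the real ones (sample data).
key = io.StringIO("R001\tX9\nR002\tX7\n")
data = io.StringIO("fam1\tR001\t0\t1\nfam2\tR404\t2\t2\n")
out = io.StringIO()
swap_ids(data, out, load_key(key))
```

Because the tail of each line is never split, it keeps its original line ending and is written back untouched. Note that the fallback in `mapping.get` leaves an unmatched RUID in the output; if every row must be de-identified, raising an error for a missing key may be the safer choice.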

Published: 2023-08-28

Article link: http://www.torson.com.cn/wangzhan/1693238331a701934.html
