Extract and process information between two strings that are repeated multiple times in a file

I have a file with this structure:

    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 122.771603 - DENSITY 2.704 g/cm^3
    A B C ALPHA BETA GAMMA
    6.32540491 6.32540491 6.32540491 46.774144 46.774144 46.774144
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.912600492192E-01 -8.739950780750E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
    7 F 8 O -8.739950780750E-03 2.500000000000E-01 -4.912600492193E-01
    8 F 8 O 4.912600492193E-01 8.739950780750E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.912600492193E-01 8.739950780750E-03
    10 F 8 O 8.739950780750E-03 -2.500000000000E-01 4.912600492193E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 368.31480902)
    A B C ALPHA BETA GAMMA
    5.02162261 5.02162261 16.86554607 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA 0.000000000000E+00 0.000000000000E+00 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
    7 F 8 O 7.459338255258E-02 4.079267158859E-01 -8.333333333333E-02
    8 F 8 O 4.079267158859E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.459338255258E-02 8.333333333333E-02
    10 F 8 O -7.459338255258E-02 -4.079267158859E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    more lines
    more lines
    more lines
    FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
    (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
    *******************************************************************************
    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3
    A B C ALPHA BETA GAMMA
    6.28373604 6.28373604 6.28373604 46.646397 46.646397 46.646397
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.924094276183E-01 -7.590572381674E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
    7 F 8 O -7.590572381674E-03 2.500000000000E-01 -4.924094276183E-01
    8 F 8 O 4.924094276183E-01 7.590572381674E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.924094276183E-01 7.590572381674E-03
    10 F 8 O 7.590572381674E-03 -2.500000000000E-01 4.924094276183E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 359.47009054)
    A B C ALPHA BETA GAMMA
    4.97568007 4.97568007 16.76591397 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
    7 F 8 O 7.574276095166E-02 4.090760942850E-01 -8.333333333333E-02
    8 F 8 O 4.090760942850E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.574276095166E-02 8.333333333333E-02
    10 F 8 O -7.574276095166E-02 -4.090760942850E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
    more lines
    more lines
    more lines
    FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
    (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
    *******************************************************************************
    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3
    A B C ALPHA BETA GAMMA
    6.32229536 6.32229536 6.32229536 46.436583 46.436583 46.436583
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA 5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.927088991116E-01 -7.291100888437E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
    7 F 8 O -7.291100888437E-03 2.500000000000E-01 -4.927088991116E-01
    8 F 8 O 4.927088991116E-01 7.291100888437E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.927088991116E-01 7.291100888437E-03
    10 F 8 O 7.291100888437E-03 -2.500000000000E-01 4.927088991116E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 363.43040599)
    A B C ALPHA BETA GAMMA
    4.98494429 4.98494429 16.88768068 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
    7 F 8 O 7.604223244490E-02 4.093755657782E-01 -8.333333333333E-02
    8 F 8 O 4.093755657782E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.604223244490E-02 8.333333333333E-02
    10 F 8 O -7.604223244490E-02 -4.093755657782E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
    more lines
    more lines
    more lines

I would like to extract the CRYSTALLOGRAPHIC CELL information, but only from the blocks that belong to a FINAL OPTIMIZED GEOMETRY.

The following three patterns:

    initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
    middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
    end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

let me locate the information.

First, I define a flag, passed_mid_point = False. The following part of the program then extracts the VOLUME from the PRIMITIVE CELL line of each FINAL OPTIMIZED GEOMETRY:

    VOLUMES = []
    with open('g.out') as file:
        passed_mid_point = False
        for line in file:
            if re.match(initial_pattern, line):
                passed_mid_point = False
                print file.next()
                print file.next()
                print file.next()
                volume_line = file.next()
                print volume_line
                aux = volume_line.split()
                each_volume = aux[7]
                print each_volume
                VOLUMES.append(each_volume)
    print 'VOLUMES = ', VOLUMES

which is correct, because VOLUMES = ['119.823364', '121.143469']. Note that the initial 122.771603 (see original file) is not gathered in the list, as expected.

The problem appears when I extract the A and C parameters (P0 and P2 in my program) of the FINAL OPTIMIZED GEOMETRY's CRYSTALLOGRAPHIC CELL, together with the coordinates:

            if re.match(middle_pattern, line):
                passed_mid_point = True
                print line
                print file.next()
                parameters_line = file.next()
                aux = parameters_line.split()
                p0 = aux[0]
                p1 = aux[1]
                p2 = aux[2]
                p3 = aux[3]
                p4 = aux[4]
                p5 = aux[5]
                # print p0
                print p2
                P0.append(p0)
                P2.append(p2)
                print file.next()
                print file.next()
                print file.next()
                print file.next()
            if re.match(end_pattern, line):
                passed_mid_point = False
            elif passed_mid_point:
                # parse the coordinates
                print 'line2 =', line
                terms = line.split()
                print 'terms =', terms
                # print 'terms[1] =', terms[1]
                if terms and terms[1] == 'T':
                    print terms[1]
                    atomic_number = terms[2]
                    print 'atomic_number = ', atomic_number
                    ATOMIC_NUMBERS.append(atomic_number)
                    x = terms[4]
                    print 'x =', x
                    Xs.append(x)
                    y = terms[5]
                    print 'y = ', y
                    Ys.append(y)
                    z = terms[6]
                    print 'z = ', z
                    Zs.append(z)
    print 'VOLUMES = ', VOLUMES
    print 'P0 = ', P0
    print 'P2 = ', P2
    print 'Xs = ', Xs
    print 'Ys = ', Ys
    print 'Zs = ', Zs
    print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

The result is the following:

P0 = ['5.02162261', '4.97568007', '4.98494429']

which is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY (see file).

Also the coordinates are wrong:

    Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
    ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8', '20', '6', '8']

This would be the desired result:

    VOLUMES = ['119.823364', '121.143469']
    P0 = ['4.97568007', '4.98494429']
    P2 = ['16.76591397', '16.88768068']
    Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
    ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']

I would appreciate it if you could help me.

Entire code:

    import sys
    import re
    import os

    initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
    middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
    end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

    global N_atom_irreducible_unit
    N_atom_irreducible_unit = 3

    VOLUMES = []
    P0 = []
    P2 = []
    ATOMIC_NUMBERS = []
    Xs = []
    Ys = []
    Zs = []

    with open('g.out') as file:
        passed_mid_point = False
        for line in file:
            if re.match(initial_pattern, line):
                passed_mid_point = False
                print file.next()
                print file.next()
                print file.next()
                volume_line = file.next()
                print volume_line
                aux = volume_line.split()
                each_volume = aux[7]
                print each_volume
                VOLUMES.append(each_volume)
            if re.match(middle_pattern, line):
                passed_mid_point = True
                print line
                print file.next()
                parameters_line = file.next()
                aux = parameters_line.split()
                p0 = aux[0]
                p1 = aux[1]
                p2 = aux[2]
                p3 = aux[3]
                p4 = aux[4]
                p5 = aux[5]
                # print p0
                print p2
                P0.append(p0)
                P2.append(p2)
                print file.next()
                print file.next()
                print file.next()
                print file.next()
            if re.match(end_pattern, line):
                passed_mid_point = False
            elif passed_mid_point:
                # parse the coordinates
                print 'line2 =', line
                terms = line.split()
                print 'terms =', terms
                # print 'terms[1] =', terms[1]
                if terms and terms[1] == 'T':
                    print terms[1]
                    atomic_number = terms[2]
                    print 'atomic_number = ', atomic_number
                    ATOMIC_NUMBERS.append(atomic_number)
                    x = terms[4]
                    print 'x =', x
                    Xs.append(x)
                    y = terms[5]
                    print 'y = ', y
                    Ys.append(y)
                    z = terms[6]
                    print 'z = ', z
                    Zs.append(z)

    print 'VOLUMES = ', VOLUMES
    print 'P0 = ', P0
    print 'P2 = ', P2
    print 'Xs = ', Xs
    print 'Ys = ', Ys
    print 'Zs = ', Zs
    print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

Best answer

I wrote a simplified version of your script, which seems to work. I hope it can serve as a starting point for your final script:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    VOLUMES = []
    P0 = []
    P2 = []
    ATOMIC_NUMBERS = []
    Xs = []
    Ys = []
    Zs = []

    with open('g.out') as gout:
        # True while we are inside a FINAL OPTIMIZED GEOMETRY block
        final_optimized_geometry = False
        for line in gout:
            if 'FINAL OPTIMIZED GEOMETRY' in line:
                final_optimized_geometry = True
            elif 'PRIMITIVE CELL' in line:
                if not final_optimized_geometry:
                    continue
                volume = line.split()[7]
                VOLUMES.append(volume)
            elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
                if not final_optimized_geometry:
                    continue
                gout.readline()  # skip the "A B C ALPHA BETA GAMMA" header
                line = gout.readline()
                p0, p2 = line.split()[0:3:2]  # A and C
                P0.append(p0)
                P2.append(p2)
            elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
                if not final_optimized_geometry:
                    continue
                gout.readline()  # skip the "ATOM X/A Y/B Z/C" header
                gout.readline()  # skip the row of asterisks
                while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                    line = gout.readline()
                    if not line:
                        break  # end of file: avoid looping forever on a truncated file
                    atomdata = line.split()
                    if not atomdata or atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False

    print(VOLUMES)
    print(P0)
    print(P2)
    print(ATOMIC_NUMBERS)
    print(Xs)
    print(Ys)
    print(Zs)

This generates the following output:

    ['119.823364', '121.143469']
    ['4.97568007', '4.98494429']
    ['16.76591397', '16.88768068']
    ['20', '6', '8', '20', '6', '8']
    ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']

In fact it's a very simple finite state machine with only two states. Warning: it will not work if there are multiple crystallographic cells in one final optimized geometry. In that case it will only capture the first cell's information.
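The same two-state idea can also be written as a generator that keeps the values of each FINAL OPTIMIZED GEOMETRY together in one record instead of spreading them over parallel lists. This is only a sketch: the function name parse_final_geometries and the record keys are made up for illustration, the values are converted to numbers, and it assumes the layout shown in the question. It has the same limitation as above: only the first crystallographic cell of each final geometry is captured.

```python
def parse_final_geometries(lines):
    """Yield one dict per FINAL OPTIMIZED GEOMETRY block (hypothetical helper)."""
    lines = iter(lines)
    record = None  # None: we are outside a final optimized geometry
    for line in lines:
        if 'FINAL OPTIMIZED GEOMETRY' in line:
            record = {'volume': None, 'a': None, 'c': None, 'atoms': []}
        elif record is None:
            continue  # ignore everything that is not part of a final geometry
        elif 'PRIMITIVE CELL' in line:
            record['volume'] = float(line.split()[7])
        elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            next(lines, None)                 # skip the "A B C ALPHA ..." header
            values = next(lines, '').split()  # the six cell parameters
            if len(values) >= 3:
                record['a'], record['c'] = float(values[0]), float(values[2])
        elif 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' in line:
            yield record  # the block is complete
            record = None
        elif record['a'] is not None:
            # Only reached after the crystallographic cell header was seen,
            # so the primitive-cell atom table is skipped.
            terms = line.split()
            # Assumed atom row: index, T/F flag, atomic number, symbol, x, y, z
            if len(terms) == 7 and terms[1] == 'T':
                xyz = tuple(float(t) for t in terms[4:7])
                record['atoms'].append((int(terms[2]),) + xyz)
```

It can be fed an open file directly, for example by looping over parse_final_geometries(gout) inside the with block, because it only needs an iterable of lines.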

The code also makes other assumptions about the file (for example, that the volume is always the eighth whitespace-separated field of the PRIMITIVE CELL line), which should of course be verified.
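One cheap way to check some of those assumptions is to test the collected lists against each other after parsing. The sketch below is an illustration only: check_consistency is a hypothetical helper, and the default of 3 atoms per geometry is taken from N_atom_irreducible_unit in the question, so it has to be adapted for other systems.

```python
def check_consistency(volumes, p0, p2, atomic_numbers, xs, ys, zs,
                      atoms_per_geometry=3):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    n = len(volumes)

    # One A and one C parameter are expected per extracted volume.
    if not (len(p0) == len(p2) == n):
        problems.append('cell parameter count differs from volume count')

    # Every geometry should contribute the same number of asymmetric-unit atoms.
    expected = n * atoms_per_geometry
    per_atom = (('ATOMIC_NUMBERS', atomic_numbers), ('Xs', xs), ('Ys', ys), ('Zs', zs))
    for name, column in per_atom:
        if len(column) != expected:
            problems.append('%s has %d entries, expected %d'
                            % (name, len(column), expected))

    # Every extracted string should be convertible to a number.
    numeric = (('VOLUMES', volumes), ('P0', p0), ('P2', p2),
               ('Xs', xs), ('Ys', ys), ('Zs', zs))
    for name, column in numeric:
        for value in column:
            try:
                float(value)
            except ValueError:
                problems.append('%s contains a non-numeric value: %r' % (name, value))
    return problems
```

Run against the lists printed by the question's original script, a check like this would report that P0 has three entries for two volumes, which is exactly the symptom described there.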

I avoided the use of regular expressions.

This code will only run in Python 3 (tested against Python 3.6.2). Python 2.7 will choke on using readline() inside the file iteration block (which kind of makes sense, but it's great to see Python 3 is okay with it). We are using readline() as a little hack to skip lines from the input file we know for certain have to be skipped, without going through the whole loop again (which would require more flag variables).
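If the script has to run on both Python 2 and Python 3, one alternative to readline() is the built-in next(), which exists in Python 2.6+ and Python 3 and advances the same iterator that the for loop is consuming. The following is a minimal sketch under that assumption; skip_lines and crystallographic_parameters are hypothetical helper names, not part of the script above.

```python
from itertools import islice


def skip_lines(iterator, count):
    """Advance `iterator` by up to `count` items (fewer if the input ends first)."""
    for _ in islice(iterator, count):
        pass


def crystallographic_parameters(lines):
    """Yield the split A, B, C, ALPHA, BETA, GAMMA row after each cell header."""
    lines = iter(lines)  # works for an open file or a list of strings
    for line in lines:
        if 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
            skip_lines(lines, 1)            # the "A B C ALPHA BETA GAMMA" header
            yield next(lines, '').split()   # empty list if the file ends early
```

Passing a default to next() avoids a StopIteration escaping when the file is truncated right after a header.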

By the way, if your sole task is to parse text files, it might be interesting to check out dedicated languages, such as Lex for example. Also, Perl was designed for doing things like this, more than Python was.

Hope this helps!

2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03 7 F 8 O -7.291100888437E-03 2.500000000000E-01 -4.927088991116E-01 8 F 8 O 4.927088991116E-01 7.291100888437E-03 -2.500000000000E-01 9 F 8 O -2.500000000000E-01 4.927088991116E-01 7.291100888437E-03 10 F 8 O 7.291100888437E-03 -2.500000000000E-01 4.927088991116E-01 TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL 1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000 ******************************************************************************* CRYSTALLOGRAPHIC CELL (VOLUME= 363.43040599) A B C ALPHA BETA GAMMA 4.98494429 4.98494429 16.88768068 90.000000 90.000000 120.000000 COORDINATES IN THE CRYSTALLOGRAPHIC CELL ATOM X/A Y/B Z/C ******************************************************************************* 1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00 2 F 20 CA -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01 3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02 4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02 5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02 6 F 8 O 3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02 7 F 8 O 7.604223244490E-02 4.093755657782E-01 -8.333333333333E-02 8 F 8 O 4.093755657782E-01 3.333333333333E-01 8.333333333333E-02 9 F 8 O -3.333333333333E-01 7.604223244490E-02 8.333333333333E-02 10 F 8 O -7.604223244490E-02 -4.093755657782E-01 8.333333333333E-02 T = ATOM BELONGING TO THE ASYMMETRIC UNIT INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE more lines more lines more lines

我想提取CRYSTALLOGRAPHIC CELL的信息; 但只有来自FINAL OPTIMIZED GEOMETRY那个。

以下3场比赛:

initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$' middle_pattern = '^ CRYSTALLOGRAPHIC CELL ' end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

允许搜索信息。

首先,我定义了一个标志passed_mid_point = False ,

然后程序的以下部分提取FINAL OPTIMIZED GEOMETRY的CRYSTALLOGRAPHIC CELL的VOLUME :

VOLUMES = [] with open('g.out') as file: passed_mid_point = False for line in file: if re.match(initial_pattern, line): passed_mid_point = False print file.next() print file.next() print file.next() volume_line = file.next() print volume_line aux = volume_line.split() each_volume = aux[7] print each_volume VOLUMES.append(each_volume) print 'VOLUMES = ', VOLUMES

这是正确的,因为VOLUMES = ['119.823364', '121.143469'] 。 请注意,初始122.771603 (请参阅原始文件)未按预期收集在列表中。

当提取A和C (在我的程序中, P0和P1 )时, FINAL OPTIMIZED GEOMETRY的CRYSTALLOGRAPHIC CELL参数以及坐标:

if re.match(middle_pattern, line): passed_mid_point = True print line print file.next() parameters_line = file.next() aux = parameters_line.split() p0 = aux[0] p1 = aux[1] p2 = aux[2] p3 = aux[3] p4 = aux[4] p5 = aux[5] # print p0 print p2 P0.append(p0) P2.append(p2) print file.next() print file.next() print file.next() print file.next() if re.match(end_pattern, line): passed_mid_point = False elif passed_mid_point: # parse the coordinates print 'line2 =', line terms = line.split() print 'terms =', terms # print 'terms[1] =', terms[1] if terms and terms[1] == 'T': print terms[1] atomic_number = terms[2] print 'atomic_number = ', atomic_number ATOMIC_NUMBERS.append(atomic_number) x = terms[4] print 'x =', x Xs.append(x) y = terms[5] print 'y = ', y Ys.append(y) z = terms[6] print 'z = ', z Zs.append(z) print 'VOLUMES = ', VOLUMES print 'P0 = ', P0 print 'P2 = ', P2 print 'Xs = ', Xs print 'Ys = ', Ys print 'Zs = ', Zs print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

结果如下:

P0 = ['5.02162261', '4.97568007', '4.98494429']

这是错误的,因为5.02162261不是来自FINAL OPTIMIZED GEOMETRY (参见文件)。

坐标也是错误的:

Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01'] Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01'] Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02'] ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8', '20', '6', '8']

这将是理想的结果:

VOLUMES = ['119.823364', '121.143469'] P0 = ['4.97568007', '4.98494429'] P1 = [16.76591397, '16.88768068'] Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01'] Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01'] Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02'] ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']

如果你能帮助我,我将不胜感激

整个代码:

import sys import re import os initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$' middle_pattern = '^ CRYSTALLOGRAPHIC CELL ' end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$' global N_atom_irreducible_unit N_atom_irreducible_unit = 3 VOLUMES = [] P0 = [] P2 = [] ATOMIC_NUMBERS = [] Xs = [] Ys = [] Zs = [] with open('g.out') as file: passed_mid_point = False for line in file: if re.match(initial_pattern, line): passed_mid_point = False print file.next() print file.next() print file.next() volume_line = file.next() print volume_line aux = volume_line.split() each_volume = aux[7] print each_volume VOLUMES.append(each_volume) if re.match(middle_pattern, line): passed_mid_point = True print line print file.next() parameters_line = file.next() aux = parameters_line.split() p0 = aux[0] p1 = aux[1] p2 = aux[2] p3 = aux[3] p4 = aux[4] p5 = aux[5] # print p0 print p2 P0.append(p0) P2.append(p2) print file.next() print file.next() print file.next() print file.next() if re.match(end_pattern, line): passed_mid_point = False elif passed_mid_point: # parse the coordinates print 'line2 =', line terms = line.split() print 'terms =', terms # print 'terms[1] =', terms[1] if terms and terms[1] == 'T': print terms[1] atomic_number = terms[2] print 'atomic_number = ', atomic_number ATOMIC_NUMBERS.append(atomic_number) x = terms[4] print 'x =', x Xs.append(x) y = terms[5] print 'y = ', y Ys.append(y) z = terms[6] print 'z = ', z Zs.append(z) print 'VOLUMES = ', VOLUMES print 'P0 = ', P0 print 'P2 = ', P2 print 'Xs = ', Xs print 'Ys = ', Ys print 'Zs = ', Zs print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

I have a file with this structure:

    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 122.771603 - DENSITY 2.704 g/cm^3
    A B C ALPHA BETA GAMMA
    6.32540491 6.32540491 6.32540491 46.774144 46.774144 46.774144
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.912600492192E-01 -8.739950780750E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.912600492193E-01 -8.739950780750E-03
    7 F 8 O -8.739950780750E-03 2.500000000000E-01 -4.912600492193E-01
    8 F 8 O 4.912600492193E-01 8.739950780750E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.912600492193E-01 8.739950780750E-03
    10 F 8 O 8.739950780750E-03 -2.500000000000E-01 4.912600492193E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 368.31480902)
    A B C ALPHA BETA GAMMA
    5.02162261 5.02162261 16.86554607 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA 0.000000000000E+00 0.000000000000E+00 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.079267158859E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.459338255258E-02 -8.333333333333E-02
    7 F 8 O 7.459338255258E-02 4.079267158859E-01 -8.333333333333E-02
    8 F 8 O 4.079267158859E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.459338255258E-02 8.333333333333E-02
    10 F 8 O -7.459338255258E-02 -4.079267158859E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    more lines
    more lines
    more lines
    FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
    (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
    *******************************************************************************
    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 119.823364 - DENSITY 2.770 g/cm^3
    A B C ALPHA BETA GAMMA
    6.28373604 6.28373604 6.28373604 46.646397 46.646397 46.646397
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.924094276183E-01 -7.590572381674E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.924094276183E-01 -7.590572381674E-03
    7 F 8 O -7.590572381674E-03 2.500000000000E-01 -4.924094276183E-01
    8 F 8 O 4.924094276183E-01 7.590572381674E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.924094276183E-01 7.590572381674E-03
    10 F 8 O 7.590572381674E-03 -2.500000000000E-01 4.924094276183E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 359.47009054)
    A B C ALPHA BETA GAMMA
    4.97568007 4.97568007 16.76591397 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.491739570355E-17 -2.745869785177E-17 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.090760942850E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.574276095166E-02 -8.333333333333E-02
    7 F 8 O 7.574276095166E-02 4.090760942850E-01 -8.333333333333E-02
    8 F 8 O 4.090760942850E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.574276095166E-02 8.333333333333E-02
    10 F 8 O -7.574276095166E-02 -4.090760942850E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
    more lines
    more lines
    more lines
    FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3
    (NON PERIODIC DIRECTION: LATTICE PARAMETER FORMALLY SET TO 500)
    *******************************************************************************
    LATTICE PARAMETERS (ANGSTROMS AND DEGREES) - BOHR = 0.5291772083 ANGSTROM
    PRIMITIVE CELL - CENTRING CODE 7/0 VOLUME= 121.143469 - DENSITY 2.740 g/cm^3
    A B C ALPHA BETA GAMMA
    6.32229536 6.32229536 6.32229536 46.436583 46.436583 46.436583
    *******************************************************************************
    ATOMS IN THE ASYMMETRIC UNIT 3 - ATOMS IN THE UNIT CELL: 10
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA 5.000000000000E-01 -5.000000000000E-01 -5.000000000000E-01
    3 T 6 C 2.500000000000E-01 2.500000000000E-01 2.500000000000E-01
    4 F 6 C -2.500000000000E-01 -2.500000000000E-01 -2.500000000000E-01
    5 T 8 O -4.927088991116E-01 -7.291100888437E-03 2.500000000000E-01
    6 F 8 O 2.500000000000E-01 -4.927088991116E-01 -7.291100888437E-03
    7 F 8 O -7.291100888437E-03 2.500000000000E-01 -4.927088991116E-01
    8 F 8 O 4.927088991116E-01 7.291100888437E-03 -2.500000000000E-01
    9 F 8 O -2.500000000000E-01 4.927088991116E-01 7.291100888437E-03
    10 F 8 O 7.291100888437E-03 -2.500000000000E-01 4.927088991116E-01
    TRANSFORMATION MATRIX PRIMITIVE-CRYSTALLOGRAPHIC CELL
    1.0000 0.0000 1.0000 -1.0000 1.0000 1.0000 0.0000 -1.0000 1.0000
    *******************************************************************************
    CRYSTALLOGRAPHIC CELL (VOLUME= 363.43040599)
    A B C ALPHA BETA GAMMA
    4.98494429 4.98494429 16.88768068 90.000000 90.000000 120.000000

    COORDINATES IN THE CRYSTALLOGRAPHIC CELL
    ATOM X/A Y/B Z/C
    *******************************************************************************
    1 T 20 CA 0.000000000000E+00 0.000000000000E+00 0.000000000000E+00
    2 F 20 CA -5.471726358381E-17 -2.735863179191E-17 -5.000000000000E-01
    3 T 6 C 3.333333333333E-01 -3.333333333333E-01 -8.333333333333E-02
    4 F 6 C -3.333333333333E-01 3.333333333333E-01 8.333333333333E-02
    5 T 8 O -4.093755657782E-01 -3.333333333333E-01 -8.333333333333E-02
    6 F 8 O 3.333333333333E-01 -7.604223244490E-02 -8.333333333333E-02
    7 F 8 O 7.604223244490E-02 4.093755657782E-01 -8.333333333333E-02
    8 F 8 O 4.093755657782E-01 3.333333333333E-01 8.333333333333E-02
    9 F 8 O -3.333333333333E-01 7.604223244490E-02 8.333333333333E-02
    10 F 8 O -7.604223244490E-02 -4.093755657782E-01 8.333333333333E-02
    T = ATOM BELONGING TO THE ASYMMETRIC UNIT
    INFORMATION **** fort.34 **** GEOMETRY OUTPUT FILE
    more lines
    more lines
    more lines

I would like to extract CRYSTALLOGRAPHIC CELL's information; but only the one that comes from a FINAL OPTIMIZED GEOMETRY.

The following three patterns mark the sections I am interested in:

    initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
    middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
    end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

First, I define a flag, passed_mid_point = False.

The following part of the program then extracts the VOLUME reported on the PRIMITIVE CELL line of each FINAL OPTIMIZED GEOMETRY:

    VOLUMES = []
    with open('g.out') as file:
        passed_mid_point = False
        for line in file:
            if re.match(initial_pattern, line):
                passed_mid_point = False
                print file.next()
                print file.next()
                print file.next()
                volume_line = file.next()
                print volume_line
                aux = volume_line.split()
                each_volume = aux[7]
                print each_volume
                VOLUMES.append(each_volume)
    print 'VOLUMES = ', VOLUMES

which is correct, because VOLUMES = ['119.823364', '121.143469']. Note that the initial 122.771603 (see original file) is not gathered in the list, as expected.

The problem appears when I extract the A and C parameters (in my program, P0 and P2) of the FINAL OPTIMIZED GEOMETRY's CRYSTALLOGRAPHIC CELL, together with the coordinates:

            if re.match(middle_pattern, line):
                passed_mid_point = True
                print line
                print file.next()
                parameters_line = file.next()
                aux = parameters_line.split()
                p0 = aux[0]
                p1 = aux[1]
                p2 = aux[2]
                p3 = aux[3]
                p4 = aux[4]
                p5 = aux[5]
                # print p0
                print p2
                P0.append(p0)
                P2.append(p2)
                print file.next()
                print file.next()
                print file.next()
                print file.next()
            if re.match(end_pattern, line):
                passed_mid_point = False
            elif passed_mid_point:
                # parse the coordinates
                print 'line2 =', line
                terms = line.split()
                print 'terms =', terms
                # print 'terms[1] =', terms[1]
                if terms and terms[1] == 'T':
                    print terms[1]
                    atomic_number = terms[2]
                    print 'atomic_number = ', atomic_number
                    ATOMIC_NUMBERS.append(atomic_number)
                    x = terms[4]
                    print 'x =', x
                    Xs.append(x)
                    y = terms[5]
                    print 'y = ', y
                    Ys.append(y)
                    z = terms[6]
                    print 'z = ', z
                    Zs.append(z)
    print 'VOLUMES = ', VOLUMES
    print 'P0 = ', P0
    print 'P2 = ', P2
    print 'Xs = ', Xs
    print 'Ys = ', Ys
    print 'Zs = ', Zs
    print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

The result is the following:

P0 = ['5.02162261', '4.97568007', '4.98494429']

which is wrong, because 5.02162261 does not come from a FINAL OPTIMIZED GEOMETRY (see file).

Also the coordinates are wrong:

    Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.079267158859E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
    ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8', '20', '6', '8']

This would be the desired result:

    VOLUMES = ['119.823364', '121.143469']
    P0 = ['4.97568007', '4.98494429']
    P2 = ['16.76591397', '16.88768068']
    Xs = ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    Ys = ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    Zs = ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']
    ATOMIC_NUMBERS = ['20', '6', '8', '20', '6', '8']

I would appreciate it if you could help me.

Entire code:

    import sys
    import re
    import os

    initial_pattern = '^ FINAL OPTIMIZED GEOMETRY - DIMENSIONALITY OF THE SYSTEM 3$'
    middle_pattern = '^ CRYSTALLOGRAPHIC CELL '
    end_pattern = '^ T = ATOM BELONGING TO THE ASYMMETRIC UNIT$'

    global N_atom_irreducible_unit
    N_atom_irreducible_unit = 3

    VOLUMES = []
    P0 = []
    P2 = []
    ATOMIC_NUMBERS = []
    Xs = []
    Ys = []
    Zs = []

    with open('g.out') as file:
        passed_mid_point = False
        for line in file:
            if re.match(initial_pattern, line):
                passed_mid_point = False
                print file.next()
                print file.next()
                print file.next()
                volume_line = file.next()
                print volume_line
                aux = volume_line.split()
                each_volume = aux[7]
                print each_volume
                VOLUMES.append(each_volume)
            if re.match(middle_pattern, line):
                passed_mid_point = True
                print line
                print file.next()
                parameters_line = file.next()
                aux = parameters_line.split()
                p0 = aux[0]
                p1 = aux[1]
                p2 = aux[2]
                p3 = aux[3]
                p4 = aux[4]
                p5 = aux[5]
                # print p0
                print p2
                P0.append(p0)
                P2.append(p2)
                print file.next()
                print file.next()
                print file.next()
                print file.next()
            if re.match(end_pattern, line):
                passed_mid_point = False
            elif passed_mid_point:
                # parse the coordinates
                print 'line2 =', line
                terms = line.split()
                print 'terms =', terms
                # print 'terms[1] =', terms[1]
                if terms and terms[1] == 'T':
                    print terms[1]
                    atomic_number = terms[2]
                    print 'atomic_number = ', atomic_number
                    ATOMIC_NUMBERS.append(atomic_number)
                    x = terms[4]
                    print 'x =', x
                    Xs.append(x)
                    y = terms[5]
                    print 'y = ', y
                    Ys.append(y)
                    z = terms[6]
                    print 'z = ', z
                    Zs.append(z)
    print 'VOLUMES = ', VOLUMES
    print 'P0 = ', P0
    print 'P2 = ', P2
    print 'Xs = ', Xs
    print 'Ys = ', Ys
    print 'Zs = ', Zs
    print 'ATOMIC_NUMBERS = ', ATOMIC_NUMBERS

Best answer

I wrote a simplified version of your script, which seems to work. I hope it can serve as a starting point for your final script:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    VOLUMES = []
    P0 = []
    P2 = []
    ATOMIC_NUMBERS = []
    Xs = []
    Ys = []
    Zs = []

    with open('g.out') as gout:
        # True from a FINAL OPTIMIZED GEOMETRY line until its coordinates are read
        final_optimized_geometry = False
        for line in gout:
            if 'FINAL OPTIMIZED GEOMETRY' in line:
                final_optimized_geometry = True
            elif 'PRIMITIVE CELL' in line:
                if not final_optimized_geometry:
                    continue
                volume = line.split()[7]
                VOLUMES.append(volume)
            elif 'CRYSTALLOGRAPHIC CELL (VOLUME=' in line:
                if not final_optimized_geometry:
                    continue
                gout.readline()  # skip the "A B C ALPHA BETA GAMMA" header
                line = gout.readline()
                p0, p2 = line.split()[0:3:2]  # A and C
                P0.append(p0)
                P2.append(p2)
            elif 'COORDINATES IN THE CRYSTALLOGRAPHIC CELL' in line:
                if not final_optimized_geometry:
                    continue
                gout.readline()  # skip the column header
                gout.readline()  # skip the row of asterisks
                while 'T = ATOM BELONGING TO THE ASYMMETRIC UNIT' not in line:
                    line = gout.readline()
                    if not line:
                        break  # end of file: do not loop forever
                    atomdata = line.split()
                    if not atomdata or atomdata[1] != 'T':
                        continue
                    atomicnumber = atomdata[2]
                    x, y, z = atomdata[4:7]
                    ATOMIC_NUMBERS.append(atomicnumber)
                    Xs.append(x)
                    Ys.append(y)
                    Zs.append(z)
                final_optimized_geometry = False

    print(VOLUMES)
    print(P0)
    print(P2)
    print(ATOMIC_NUMBERS)
    print(Xs)
    print(Ys)
    print(Zs)

This generates the following output:

    ['119.823364', '121.143469']
    ['4.97568007', '4.98494429']
    ['16.76591397', '16.88768068']
    ['20', '6', '8', '20', '6', '8']
    ['0.000000000000E+00', '3.333333333333E-01', '-4.090760942850E-01', '0.000000000000E+00', '3.333333333333E-01', '-4.093755657782E-01']
    ['0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01', '0.000000000000E+00', '-3.333333333333E-01', '-3.333333333333E-01']
    ['0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02', '0.000000000000E+00', '-8.333333333333E-02', '-8.333333333333E-02']

In fact it's a very simple finite state machine with only two states. Warning: it will not work if there are multiple crystallographic cells in one final optimized geometry. In that case it will only capture the first cell's information.
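The state-machine idea can also be spelled out with named states, so that every line is handled by the main loop and nothing has to be skipped by hand. The following is only a sketch under the assumption that the file looks like the sample in the question; the names `State` and `parse_final_geometries` are hypothetical, and it uses more than two states to replace the line skipping. Like the script above, it stops listening at the first `T = ATOM BELONGING TO THE ASYMMETRIC UNIT` line of each final geometry.

```python
from enum import Enum


class State(Enum):
    OUTSIDE = 0      # not inside a FINAL OPTIMIZED GEOMETRY block
    IN_FINAL = 1     # inside one, waiting for the crystallographic cell
    CELL_HEADER = 2  # the next line is the "A B C ALPHA BETA GAMMA" header
    CELL_PARAMS = 3  # the next line holds the six cell parameters
    IN_COORDS = 4    # reading lines up to the "T = ATOM ..." marker


def parse_final_geometries(lines):
    """Return one dict per final optimized geometry (hypothetical helper)."""
    cells = []
    volume = None
    state = State.OUTSIDE
    for line in lines:
        if "FINAL OPTIMIZED GEOMETRY" in line:
            state = State.IN_FINAL
            volume = None
        elif state is State.IN_FINAL and "PRIMITIVE CELL" in line:
            volume = line.split()[7]
        elif state is State.IN_FINAL and "CRYSTALLOGRAPHIC CELL (VOLUME=" in line:
            state = State.CELL_HEADER
        elif state is State.CELL_HEADER:
            state = State.CELL_PARAMS      # header line carries no data
        elif state is State.CELL_PARAMS:
            a, _b, c = line.split()[:3]
            cells.append({"volume": volume, "a": a, "c": c, "atoms": []})
            state = State.IN_COORDS
        elif state is State.IN_COORDS:
            if "T = ATOM BELONGING TO THE ASYMMETRIC UNIT" in line:
                state = State.OUTSIDE
                continue
            fields = line.split()
            # atom rows have 7 fields; keep only the asymmetric-unit ones
            if len(fields) == 7 and fields[1] == "T":
                cells[-1]["atoms"].append((fields[2],) + tuple(fields[4:7]))
    return cells
```

Calling `parse_final_geometries(open('g.out'))` would give one dictionary per final geometry, which keeps each cell's parameters together with its own atoms instead of spreading them over parallel lists.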

The code also makes other assumptions about the file (for example, the number of header lines before each table and the position of each field), which should be verified against your real output.
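One way to check such assumptions at run time is to validate each atom row before using it. This is just a sketch: the helper name `parse_atom_line` is hypothetical, and it assumes that atom rows always have the seven fields seen in the sample (index, T/F flag, atomic number, symbol, and three coordinates).

```python
def parse_atom_line(line):
    """Return the parsed atom row, or None if the line is not an atom row."""
    fields = line.split()
    if len(fields) != 7 or fields[1] not in ("T", "F"):
        return None
    try:
        index = int(fields[0])
        atomic_number = int(fields[2])
        # float() accepts the Fortran-style exponent notation used here
        x, y, z = (float(value) for value in fields[4:7])
    except ValueError:
        return None
    return {
        "index": index,
        "asymmetric": fields[1] == "T",
        "atomic_number": atomic_number,
        "symbol": fields[3],
        "xyz": (x, y, z),
    }
```

Rows that do not match (blank lines, the column header, the row of asterisks, the `T = ATOM ...` marker) come back as `None`, so a change in the file layout shows up as missing atoms rather than as wrong values.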

I avoided the use of regular expressions.

This code runs only in Python 3 (tested with Python 3.6.2). Python 2.7 refuses to mix readline() with iteration over the same file and raises a ValueError, which kind of makes sense, but it's great to see that Python 3 is okay with it. We use readline() as a small hack to skip input lines that we know for certain must be skipped, without going through the whole loop again (which would require more flag variables).
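If Python 2 compatibility matters, one possible workaround is the built-in `next()`, which advances the same iterator that the `for` loop uses (the question's own script does the equivalent with `file.next()`). The sketch below assumes the two-line layout after the `CRYSTALLOGRAPHIC CELL (VOLUME=` line shown in the sample; the function name `read_cell_parameters` is hypothetical.

```python
import io


def read_cell_parameters(stream):
    """Return the six cell parameters of the first crystallographic cell."""
    for line in stream:
        if "CRYSTALLOGRAPHIC CELL (VOLUME=" in line:
            next(stream, "")                 # skip the A B C ... header line
            return next(stream, "").split()  # the parameter line itself
    return None


# In-memory stand-in for a real file such as open('g.out')
sample = io.StringIO(
    u" CRYSTALLOGRAPHIC CELL (VOLUME= 359.47009054)\n"
    u" A B C ALPHA BETA GAMMA\n"
    u" 4.97568007 4.97568007 16.76591397 90.000000 90.000000 120.000000\n"
)
print(read_cell_parameters(sample))
```

The second argument to `next()` is a default that is returned when the file ends early, so a truncated file yields an empty list instead of an exception.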

By the way, if your sole task is to parse text files, it might be interesting to check out dedicated languages, such as Lex for example. Also, Perl was designed for doing things like this, more than Python was.

Hope this helps!