Hi all,
I'm trying to parse a text file to extract individual records. The file is an export print file. The problem is that the file is space padded. Each line is lfcr terminated.
Name name name lfcr
addr1 lfcr
addr2 addr2 addr2 lfcr
town state town state town state lfcr
zip cus# zip cus# zip cus# lfcr
Each record consists of 5 lines that may contain blank fields. Spaces are used as field delimiters. Also some fields contain 2 spaces between words.
I read each line into a buffer and then parse it into 3 buffers.
xor edx,edx
mov edi, ;dest
call strend
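mov esi,buffer ; src
@next:
mov al,byte [esi] ; fetch the next source character
inc esi
cmp al,20h ; is it a space?
je spc
mov [edi],byte al ; copy the character to dest
inc edi
dec ecx
test ecx,ecx
je end_word
jmp @next
spc:
inc dword
cmp ,dword 3
je end_word
jmp @next
end_word:
dec edi
mov byte [edi], ',' ; append the field delimiter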
add edx,1
mov edi,
xor ecx,
add esi,ecx
xor eax,eax
mov ,dword 0
cmp edx,3
je end_
jmp @next
end_:
cc printf,"%s %0X %s",,,NewLine
cc printf,"%s %0X %s",,,NewLine
cc printf,"%s %0X %s",,,NewLine
The code above explains better how I intend to do it.
Any help is greatly appreciated.
Mich
mich, are you sure the fields are space-separated and not tab-separated? From your description, it looks like the rows cannot be parsed reliably. For instance, what prevents the 3rd row from being interpreted in the following manner?
town state | town state town | state lfcr
(I'm using "|" as the field delimiter)
Thanks for your reply, LocoDelAssembly.
No tabs. The file prints at a width of 80 characters. I will try to find the column boundaries, copy the characters of each column into its buffer, and append a field delimiter (','). Once the fields are delimited, I will remove the extra spaces.
town state | town state town | state lfcr
Town and state are separated by 2 spaces, and the gap between a state and the town in the next column is space padded. It looks to me as if the columns are 26 or 27 characters wide: the longest field is about that long, with a single space separating it from the first letter of the second column, and so on.
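The "remove the extra spaces" step could look roughly like the C sketch below (C rather than assembly, just to keep it short). `squeeze_spaces` is a made-up helper name, not something from the posts above. It assumes the field has already been cut out of its column and NUL-terminated, and it works in place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: trim leading and trailing spaces and collapse each
   run of inner spaces to a single space, in place.
   Returns the new length of the string. */
size_t squeeze_spaces(char *s)
{
    assert(s != NULL);
    size_t out = 0;          /* write position, never ahead of the read position */
    int pending_space = 0;   /* a run of spaces was seen after some text */

    for (size_t in = 0; s[in] != '\0'; in++) {
        if (s[in] == ' ') {
            /* remember the run, but only if text precedes it */
            pending_space = (out > 0);
        } else {
            if (pending_space) {
                s[out++] = ' ';   /* emit one space for the whole run */
                pending_space = 0;
            }
            s[out++] = s[in];
        }
    }
    s[out] = '\0';   /* trailing spaces were never written */
    return out;
}
```

Note that this also removes the leading space of a field and the double space between town and state, so it only fits if that spacing carries no meaning in the output.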
I will change the loops in my routine and give this a try.
Mich
You shouldn't be parsing by spaces. You've already said that there should be a length limit (80 chars) and that you'll be working on data in blocks of 5 lines at a time.
So the easy solution for this is to create a buffer that is 410 bytes (80 chars * 5 lines + (2 chars * 5 lines)) and read the entire block into that buffer. Then just access a static set of offsets into that buffer, since they are unlikely to change. For example, if each column were 26 characters long, we could handle this by using a pointer to the 410-byte buffer, then use (row * 82) + (col * 27), where row and col are zero-based indices into our "table" (82 because each 80-character line is followed by its 2 terminator bytes, and 27 because each column is followed by one separator space).
We could then easily start at the end of each column's text (i.e. [(row * 82) + (col * 27) + 26]) and work backwards, overwriting the trailing 0x20s (spaces) with 0x00 (null terminators), and then treat the entry as a normal C string.
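As a rough illustration of the block-and-offsets idea, here is a minimal C sketch. The numbers are assumptions taken from the hypothetical layout in this post (80 characters plus 2 terminator bytes per line, 26 text cells per column followed by one separator), and `field_at` and the constant names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

enum {
    LINES_PER_RECORD = 5,
    LINE_STRIDE      = 82,  /* assumed: 80 characters + 2 terminator bytes */
    COL_STRIDE       = 27,  /* assumed: 26 text cells + 1 separator space  */
    COL_WIDTH        = 26,
    BLOCK_SIZE       = LINES_PER_RECORD * LINE_STRIDE   /* 410 bytes */
};

/* Hypothetical helper: NUL-terminate the field at (row, col) in place by
   overwriting its trailing spaces, and return a pointer to its first cell.
   Only trailing spaces are touched; spaces between words stay as they are. */
char *field_at(char *block, int row, int col)
{
    assert(block != NULL);
    assert(row >= 0 && row < LINES_PER_RECORD);
    assert(col >= 0 && col < 3);

    char *start = block + (size_t)row * LINE_STRIDE + (size_t)col * COL_STRIDE;
    char *p = start + COL_WIDTH;   /* cell just past the field's text */

    *p = '\0';
    while (p > start && p[-1] == ' ')
        *--p = '\0';               /* walk backwards over the padding */
    return start;
}
```

After reading 410 bytes into `block`, `field_at(block, 1, 1)` would give the second column of the second line as an ordinary C string, and a blank field comes back as an empty string.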
Regards,
Bryant Keller
Thanks for your answer Synfire,
#0697667 GB LTD #0719830 GB LTD #0731990 GB LTD BDA LONA
PO BOX 3129 PO BOX 4530 SG3 SITE 11 COMP 138
STN MAIN 14311 ALASKA VENUE
BECIL CAKE NT ECIL RAKE TB FROZEN JOHN VA
V04 1G0 24860 VTC 1G0 24986 VGJ 4M7 26024
Unfortunately the formatting is screwed up and the columns are not properly aligned but I think you will see what I mean.
I have tried to remove multiple spaces, but this causes severe distortions in the case of line 3: the address2 fields are blank in columns one and two, so the field for column 3 gets shifted into column one. Furthermore, the names of numbered companies all start with a space. Overwriting spaces with 0 would create a separate string for each word, and I am not sure how to straighten that out at the end.
Further analysis of the file reveals that column 1 starts at position 1, column 2 at position 27, and column 3 at position 54. Each line is 79 characters long, with lfcr at the end, for a total line length of 81 characters.
I like your suggestion to read a block of data and then use the indices to extract the strings.
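To make the idea concrete, here is a small C sketch that turns one 5-line block into three records using the positions measured above. It takes the numbers at face value (81 bytes per line, columns starting at 1, 27 and 54, text ending at 79), and the `customer` struct and `split_block` name are invented for the example. Because each cell is located by its offset, a blank address2 cell simply becomes an empty string, and only the padding at the end of a cell is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { ROWS = 5, COLS = 3, LINE_BYTES = 81, MAX_FIELD = 27 };

/* 0-based offsets of the columns (positions 1, 27 and 54 in the post),
   plus the offset just past the 79 text characters. */
static const size_t col_start[COLS + 1] = { 0, 26, 53, 79 };

/* Hypothetical record layout: one customer per column, one field per row
   (name, addr1, addr2, town/state, zip/cus#). */
struct customer {
    char field[ROWS][MAX_FIELD + 1];
};

/* Copy every cell of a 5-line block into out[COLS]. Leading and inner
   spaces are kept; only trailing padding is removed. */
void split_block(const char *block, struct customer out[COLS])
{
    assert(block != NULL && out != NULL);
    for (int row = 0; row < ROWS; row++) {
        const char *line = block + (size_t)row * LINE_BYTES;
        for (int col = 0; col < COLS; col++) {
            const char *cell = line + col_start[col];
            size_t len = col_start[col + 1] - col_start[col];
            while (len > 0 && cell[len - 1] == ' ')
                len--;                      /* drop trailing padding */
            memcpy(out[col].field[row], cell, len);
            out[col].field[row][len] = '\0';
        }
    }
}
```

With this layout, `out[2].field[2]` would hold the address2 text of the third customer even when the first two customers have none.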
I will work out something in the next few days.
Mich
Google for 'pretty printer' - that's what you need to be doing.
mich,
You may enclose monospaced text in [?code][?/code] BBCode tags:
1                         27                         54                       79
#0697667 GB LTD           #0719830 GB LTD            #0731990 GB LTD BDA LONA
PO BOX 3129               PO BOX 4530                SG3 SITE 11 COMP 138
                                                     STN MAIN 14311 ALASKA VENUE
BECIL CAKE NT             ECIL RAKE TB               FROZEN JOHN VA
V04 1G0 24860             VTC 1G0 24986              VGJ 4M7 26024
for proper alignment.
There is something strange with this format. While it looks strikingly similar to punched cards ;-), the first column is 26 cells wide (according to your numbers for the field starting positions), the second is 27 cells wide, and the last is only 25 cells wide (your data example contains some extra characters in this column).
Three columns of 26 cells each will yield 78 cells total, plus LF/CR (is that the correct sequence?) for 80 bytes/line.
With fixed record and field sizes, parsing shouldn't pose any problems (if I got it right). As Synfire already said, replace the padding spaces at the end of each field with zeros (ASCII NUL) and treat the results as regular NUL-terminated strings. Beware, though, that if a field occupies its full width (it isn't specified whether this can occur), there is no space to replace with '\0'.
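One way around the full-width case, sketched in C under the assumption that the field's start and width are already known: measure the field instead of terminating it in place, and copy it out only when a C string is really needed. `field_len` and `field_to_cstr` are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Length of a fixed-width field once its trailing padding is ignored.
   Nothing is written, so a field that fills all `width` cells is fine. */
size_t field_len(const char *field, size_t width)
{
    assert(field != NULL);
    while (width > 0 && field[width - 1] == ' ')
        width--;
    return width;
}

/* Copy a field into a caller-supplied buffer of at least width + 1 bytes.
   Returns the number of characters written (excluding the NUL). */
int field_to_cstr(const char *field, size_t width, char *out, size_t out_size)
{
    assert(out != NULL && out_size > width);
    /* "%.*s" stops after the given number of characters, so the source
       does not have to be NUL-terminated. */
    return snprintf(out, out_size, "%.*s", (int)field_len(field, width), field);
}
```

The same `"%.*s"` trick works directly with printf, so fields can be printed straight from the line buffer without modifying it.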