Hi all,
I'm trying to parse a text file to extract individual records. The file is an export print file. The problem is that the file is space padded. Each line is lfcr terminated.

Name            name          name          lfcr
                                      addr1         lfcr
addr2            addr2          addr2          lfcr
town  state    town state    town  state  lfcr
zip  cus#      zip  cus#      zip  cus#    lfcr

Each record consists of 5 lines that may contain blank fields. Spaces are used as field delimiters. Also some fields contain 2 spaces between words.
I read each line into a buffer and then parse it into 3 buffers.

    xor edx,edx
    mov edi,    ;dest
    call strend
    mov esi,buffer              ;src
@next:
    mov al,byte
    inc esi
    cmp al,20h
    je spc
    mov ,byte al
    inc edi
    dec ecx
    test ecx,ecx
    je end_word
    jmp @next
spc:
    inc dword
    cmp ,dword 3
    je end_word
    jmp @next
end_word:
    dec edi
    mov byte, ','
    add edx,1
    mov edi,
    xor ecx,
    add esi,ecx
    xor eax,eax
    mov ,dword 0
    cmp edx,3
    je end_
    jmp @next
end_:
    cc printf,"%s %0X %s",,,NewLine
    cc printf,"%s %0X %s",,,NewLine
    cc printf,"%s %0X %s",,,NewLine


The code above explains better how I intend to do it.
Any help is greatly appreciated.

Mich
Posted on 2012-09-23 23:31:06 by mich
mich, are you sure the fields are space-separated and not tab-separated? From your description it looks like parsing the rows reliably is impossible. For instance, what prevents the 3rd row from being interpreted in the following manner?
town  state  |  town state     town | state   lfcr
(I'm using "|" as field delimiter)
Posted on 2012-09-24 08:07:36 by LocoDelAssembly
Thanks for your reply LocoDelAssembly
No tabs. The file will print at 80 characters wide. I will try to find the column boundaries, copy all characters of each column into the buffers and append a field delimiter (','). Once I have delimited fields, I will remove the extra spaces.

town  state  |  town state    town | state  lfcr

Town and state are separated by 2 spaces. The gap between a state and the next town is space padded. It looks to me as if the columns are 26 or 27 characters wide. The longest field is about that long, with a single space separating it from the first letter of the next column.
I will change the loops in my routine and give this a try.
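
As a rough illustration of that plan, here is a minimal C sketch (C rather than assembly, only to keep it short). It cuts a line at fixed column boundaries, joins the fields with ',' and squeezes the extra spaces. The names `squeeze_spaces` and `line_to_csv` are made up for this example, and the column width is a parameter because 26 or 27 is still only an estimate.

```c
#include <stddef.h>

/* Copy src[0..len) into dst, dropping leading and trailing spaces and
   collapsing each inner run of spaces to a single space.
   dst must hold at least len + 1 bytes. Returns the length written. */
size_t squeeze_spaces(char *dst, const char *src, size_t len)
{
    size_t out = 0;
    int pending_space = 0;

    for (size_t i = 0; i < len; i++) {
        if (src[i] == ' ') {
            /* Remember the gap; it is only emitted if more text follows. */
            pending_space = (out > 0);
        } else {
            if (pending_space)
                dst[out++] = ' ';
            pending_space = 0;
            dst[out++] = src[i];
        }
    }
    dst[out] = '\0';
    return out;
}

/* Split one line into ncols fields of col_width characters each and
   write them to dst as "f1,f2,f3". A blank field becomes an empty
   entry, so the columns never shift. dst must hold at least
   line_len + ncols bytes. Returns the length written. */
size_t line_to_csv(char *dst, const char *line, size_t line_len,
                   size_t col_width, size_t ncols)
{
    size_t out = 0;

    for (size_t c = 0; c < ncols; c++) {
        size_t start = c * col_width;
        if (start > line_len)
            start = line_len;           /* line shorter than expected */
        size_t len = line_len - start;
        if (len > col_width)
            len = col_width;

        if (c > 0)
            dst[out++] = ',';
        out += squeeze_spaces(dst + out, line + start, len);
    }
    dst[out] = '\0';
    return out;
}
```

Because the cut happens at column boundaries before any spaces are removed, an empty addr2 field in columns one and two stays empty instead of letting column three slide left. Note that squeezing also turns the 2-space gap between town and state into a single space, so if that gap is needed later to split the two, it should be done before this step.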
Mich
Posted on 2012-09-24 14:50:05 by mich
You shouldn't be parsing by spaces. You've already said that there should be a length limit (80 chars) and that you'll be working on data in blocks of 5 lines at a time.

So the easy solution for this is to create a buffer that is 410 bytes (80 chars * 5 lines + (2 chars * 5 lines)) and read the entire block into that buffer. Then just access a static set of offsets into that buffer since they are unlikely to change. For example, if each column was 26 characters long plus one separator space, we could handle this by using a pointer to the 410 byte buffer, then use (row * 82) + (col * 27), where both row and col are zero-based indices for our "table". The row stride is 82 rather than 80 because each line's 2 terminator bytes sit in the buffer too.

We could then easily start at the end of each column's text (i.e. [(row * 82) + (col * 27) + 26]), work backwards overwriting all the 0x20's (spaces) with 0x00 (null terminators) until we hit the first non-space character, and then treat the entry as a normal C-string.
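
A minimal C sketch of that idea follows. The constants are assumptions taken from the numbers above (80 printable characters and 2 terminator bytes per line, columns of 26 cells plus 1 separator cell) and would need adjusting to the real file. `read_block` and `field_at` are hypothetical names. The row stride is 82 here because the terminator bytes are stored in the buffer along with the text.

```c
#include <stddef.h>
#include <stdio.h>

enum {
    LINE_CHARS  = 80,                     /* printable characters per line (assumed) */
    EOL_BYTES   = 2,                      /* line terminator bytes kept in the buffer */
    ROW_STRIDE  = LINE_CHARS + EOL_BYTES, /* bytes from one row to the next */
    ROWS        = 5,                      /* lines per record block */
    COL_WIDTH   = 26,                     /* text cells per column (assumed) */
    COL_STRIDE  = COL_WIDTH + 1,          /* text cells plus one separator cell */
    BLOCK_BYTES = ROWS * ROW_STRIDE       /* 410 */
};

/* Read one 5-line block. The file should be opened in binary mode so
   the terminator bytes are not translated. Returns 1 on success. */
int read_block(FILE *fp, char block[BLOCK_BYTES])
{
    return fread(block, 1, BLOCK_BYTES, fp) == (size_t)BLOCK_BYTES;
}

/* Return field (row, col) as a C-string inside the block itself.
   The cell just after the field's text is a separator (or the first
   terminator byte for the last column), so it can always take a NUL.
   Trailing spaces are then overwritten from the right. */
char *field_at(char *block, size_t row, size_t col)
{
    char *start = block + row * ROW_STRIDE + col * COL_STRIDE;
    size_t n = COL_WIDTH;

    start[n] = '\0';
    while (n > 0 && start[n - 1] == ' ')
        start[--n] = '\0';
    return start;
}
```

Only trailing spaces are replaced, so a field such as "PO BOX 3129" stays one string rather than being broken into one string per word. A blank field comes back as an empty string.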

Regards,
Bryant Keller
Posted on 2012-09-24 20:10:44 by Synfire
Thanks for your answer Synfire,

  #0697667 GB LTD          #0719830 GB LTD          #0731990 GB LTD BDA LONA 
PO BOX 3129                PO BOX 4530                SG3 SITE 11 COMP 138       
                                                      STN MAIN 14311 ALASKA VENUE 
BECIL CAKE  NT          ECIL RAKE  TB            FROZEN  JOHN  VA         
V04 1G0            24860  VTC 1G0            24986  VGJ 4M7            26024 

Unfortunately the formatting is screwed up and the columns are not properly aligned but I think you will see what I mean.

I have tried to remove multiple spaces but this causes severe distortions in the case of line 3. The fields for address2 are blank in columns one and two, causing the field for column 3 to be shifted into column one. Furthermore, numbered companies all start with a space. Overwriting spaces with 0 would create a separate string for each word. Not sure how to get that straightened out at the end?
Further analyzing the file reveals that the first chr in column 1 starts at 1, first chr in column 2 starts at position 27, first chr in column 3 starts at 54. Line length is 79 chr with lfcr at the end for a total line length of 81 chr.
I like your suggestion to read a block of data and then using the indices to extract the strings.
I will work out something in the next few days.
Mich
Posted on 2012-09-25 00:45:00 by mich
Google for 'pretty printer' - that's what you need to be doing.
Posted on 2012-09-27 00:46:09 by Homer
mich,

You may enclose monospaced text in [code][/code] BBCode tags:

1                        27                        54                      79
#0697667 GB LTD          #0719830 GB LTD            #0731990 GB LTD BDA LONA
PO BOX 3129              PO BOX 4530                SG3 SITE 11 COMP 138       
                                                    STN MAIN 14311 ALASKA VENUE 
BECIL CAKE  NT            ECIL RAKE  TB              FROZEN  JOHN  VA         
V04 1G0          24860  VTC 1G0            24986  VGJ 4M7            26024

for proper alignment.

There is something strange with this format. While it looks strikingly similar to punched cards ;-), the first column is 26 cells wide (according to your numbers for field starting positions), the second is 27 cells wide and the last is only 25 cells wide (your data example contains some extra characters in this column).

Three columns of 26 cells each will yield 78 cells total, plus LF/CR (is that the correct sequence?) for 80 bytes/line.

With fixed record/field sizes, parsing shouldn't pose any problems (if I got it right). As Synfire has already said, replace the padding spaces at the end of each field with zeros (ASCII NUL), and treat the results as regular nul-terminated strings. Beware though that if a field occupies its full width (it isn't specified whether this can occur), there isn't any space left to replace with '\0'.
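
One way around the full-width case is to copy each field out into its own buffer that is one byte longer than the widest column, instead of terminating it in place. The C sketch below assumes mich's reported numbers (columns starting at positions 1, 27 and 54, 79 characters per line before the terminator), which are unconfirmed. If the line is really 78 characters, only `LINE_CHARS` changes. `copy_field` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

enum {
    NCOLS      = 3,
    LINE_CHARS = 79,   /* characters per line before the terminator (as reported) */
    FIELD_MAX  = 27    /* widest column under these start positions */
};

/* 0-based column starts, from the reported 1-based positions 1, 27, 54.
   The extra last entry marks the end of the final column. */
static const size_t col_start[NCOLS + 1] = { 0, 26, 53, LINE_CHARS };

/* Copy column col of line into out and strip its trailing spaces.
   out always has room for the NUL, even when the field fills its
   whole width. Leading spaces are kept. Returns the field length. */
size_t copy_field(char out[FIELD_MAX + 1], const char *line,
                  size_t line_len, size_t col)
{
    size_t start = col_start[col];
    size_t end   = col_start[col + 1];

    if (end > line_len)
        end = line_len;        /* tolerate a short line */
    if (start > end)
        start = end;

    size_t n = end - start;
    memcpy(out, line + start, n);
    while (n > 0 && out[n - 1] == ' ')
        n--;
    out[n] = '\0';
    return n;
}
```

The source line is never modified, so a field that runs right up to the next column is still returned whole, and the uneven column widths are handled by the start table rather than by a single stride.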
Posted on 2012-12-24 05:03:53 by baldr