Author Topic: Parseing  (Read 843 times)

Offline mich

  • ASM Supporter
  • *
  • Posts: 8
Parseing
« on: 2012-09-24 04:31:06 »
Hi all,
I'm trying to parse a text file to extract individual records. The file is an export print file. The problem is that the file is space-padded. Each line is LF/CR terminated.
 
Quote
Name            name          name           lfcr
                                      addr1          lfcr
addr2            addr2          addr2          lfcr
town  state    town state     town  state   lfcr
zip   cus#      zip  cus#      zip  cus#     lfcr
Each record consists of 5 lines, any of which may contain blank fields. Spaces are used as field delimiters, but some fields also contain 2 spaces between words.
I read each line into a buffer and then parse it into 3 buffers.
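Code:
        xor edx,edx                 ; edx = column index
        mov edi,[ptbl+edx*4]        ; dest: buffer for the current column
        call strend                 ; helper, not shown here
        mov esi,buffer              ; src: the line that was read
@next:
        mov al,byte[esi]            ; fetch the next character
        inc esi
        cmp al,20h                  ; is it a space?
        je spc
        mov [edi],byte al           ; copy a non-space character
        inc edi
        dec ecx                     ; one character fewer to go
        test ecx,ecx
        je end_word
        jmp @next
spc:
        inc dword[fspc]             ; count the spaces seen in this field
        cmp [fspc],dword 3          ; the third space ends the field
        je end_word
        jmp @next
end_word:
        dec edi                     ; step back one byte in dest
        mov byte[edi], ','          ; store the field delimiter there
        add edx,1                   ; next column
        mov edi,[ptbl+edx*4]        ; dest: buffer for the next column
        xor ecx,[linelen]           ; ecx = ecx xor line length
        add esi,ecx                 ; advance src by ecx bytes
        xor eax,eax
        mov [fspc],dword 0          ; reset the space counter
        cmp edx,3                   ; all three columns done?
        je end_
        jmp @next
end_:
        ; print each column buffer as a string, then its address in hex
        cc printf,"%s %0X %s",[ptbl+0],[ptbl+0],NewLine
        cc printf,"%s %0X %s",[ptbl+4],[ptbl+4],NewLine
        cc printf,"%s %0X %s",[ptbl+8],[ptbl+8],NewLine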

The code above explains better how I intend to do it.
Any help is greatly appreciated.

Mich

Offline LocoDelAssembly

  • ASM Supporter
  • *
  • Posts: 94
Re: Parseing
« Reply #1 on: 2012-09-24 13:07:36 »
mich, are you sure the fields are space-separated and not tab-separated? From your description, it looks like parsing the rows reliably is impossible. For instance, what is preventing the town/state row from being interpreted in the following manner?
Code:
town  state  |  town state     town | state   lfcr
(I'm using "|" as the field delimiter.)

Offline mich

  • ASM Supporter
  • *
  • Posts: 8
Re: Parseing
« Reply #2 on: 2012-09-24 19:50:05 »
Thanks for your reply, LocoDelAssembly.
No tabs. The file prints at an 80-character width. I will try to find the column boundaries, copy all the characters of each column into the buffers, and append a field delimiter (','). Once I have delimited fields, I will remove the extra spaces.
Quote
town  state  |  town state     town | state   lfcr
In the quoted line, "town  state" contains 2 spaces between the two words, while the gap between one column's state and the next column's town is space padding. It looks to me as if the columns are 26 or 27 characters wide. The longest field is about that long, with a single space separating it from the first letter of the next column.
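A minimal sketch in C of that plan, for illustration only. It assumes every column has the same width, passed in as `width`, which is still a guess at this point in the thread. The function name `delimit_columns` and its parameters are hypothetical. It cuts the line at fixed column boundaries, drops each field's trailing padding, and appends ',' after every field, so the double spaces inside a field survive.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the columns of `line` (line_len data characters, terminator
   excluded) to `out`, each without its trailing spaces and followed by ','.
   `width` is the assumed column width. `out` must have room for
   line_len + number of columns + 1 bytes. Returns the length written. */
size_t delimit_columns(const char *line, size_t line_len, size_t width, char *out)
{
    assert(width > 0);
    size_t n = 0;

    for (size_t start = 0; start < line_len; start += width) {
        size_t end = start + width < line_len ? start + width : line_len;

        /* Drop the padding at the end of this column only. */
        while (end > start && line[end - 1] == ' ')
            end--;

        for (size_t i = start; i < end; i++)
            out[n++] = line[i];
        out[n++] = ',';
    }
    out[n] = '\0';
    return n;
}
```

Because the cut points come from the column width rather than from counting spaces, a blank field simply produces an empty entry between two commas instead of shifting the following fields to the left.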
I will change the loops in my routine and give this a try.
Mich

Offline Synfire

  • Just Another Nobody
  • Community Staff
  • Regular Member
  • *****
  • Posts: 620
  • Country: us
  • Bryant Keller
    • About Bryant Keller
Re: Parseing
« Reply #3 on: 2012-09-25 01:10:44 »
You shouldn't be parsing by spaces. You've already said that there should be a length limit (80 chars) and that you'll be working on data in blocks of 5 lines at a time.

So the easy solution for this is to create a buffer that is 410 bytes (80 chars * 5 lines + (2 chars * 5 lines)) and read the entire block into that buffer. Then just access a static set of offsets into that buffer, since they are unlikely to change. For example, if each column was 26 characters long, we could handle this by using a pointer to the 410-byte buffer and then using (row * 82) + (col * 27), where row and col are zero-based indices into our "table". The row stride is 82 rather than 80 because each line's 2 terminator bytes are stored in the buffer as well.

We could then easily start at the end of each column's text (i.e. [(row * 82) + (col * 27) + 26]) and work backwards, overwriting the trailing 0x20's (spaces) with 0x00 (null terminators) until we reach a non-space character, and then treat the entry as a normal C-string.
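The idea can be sketched in C as follows. This is only an illustration under stated assumptions: each line occupies 80 characters plus 2 terminator bytes in the buffer (so rows are 82 bytes apart), columns start 27 bytes apart, and each holds 26 characters of text. The name `field_at` and the constants are hypothetical, and the byte directly after a column's text is assumed to be padding or a terminator byte that may be overwritten.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: 5 lines of 80 characters plus 2 terminator bytes each,
   3 columns that start 27 bytes apart and hold 26 characters of text. */
enum { ROWS = 5, COLS = 3, ROW_STRIDE = 82, COL_STRIDE = 27, COL_WIDTH = 26 };

/* Returns the field at (row, col) as a C string. The field is terminated
   in place: the byte after its text and all of its trailing spaces are
   overwritten with NUL. Spaces between words are left alone. */
char *field_at(char *block, size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    char *field = block + row * ROW_STRIDE + col * COL_STRIDE;
    char *end = field + COL_WIDTH;   /* byte just past the column's text */

    *end = '\0';
    while (end > field && end[-1] == ' ')
        *--end = '\0';
    return field;
}
```

Only the padding at the end of a field is replaced, so a field such as "PO BOX 3129" stays one string, and a blank field comes back as an empty string.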

Regards,
Bryant Keller

Offline mich

  • ASM Supporter
  • *
  • Posts: 8
Re: Parseing
« Reply #4 on: 2012-09-25 05:45:00 »
Thanks for your answer Synfire,
Quote
  #0697667 GB LTD          #0719830 GB LTD          #0731990 GB LTD BDA LONA 
PO BOX 3129                 PO BOX 4530                    SG3 SITE 11 COMP 138         
                                                               STN MAIN 14311 ALASKA VENUE   
BECIL CAKE  NT          ECIL RAKE  TB                FROZEN  JOHN  VA           
V04 1G0             24860  VTC 1G0             24986     VGJ 4M7             26024 
Unfortunately the formatting is screwed up and the columns are not properly aligned but I think you will see what I mean.

I have tried removing runs of multiple spaces, but this causes severe distortion on line 3: the address2 fields are blank in columns one and two, so the field from column three gets shifted into column one. Furthermore, the numbered companies all start with a space. Overwriting every space with 0 would create a separate string for each word, and I'm not sure how to straighten that out at the end.
Further analysis of the file reveals that the first character of column 1 is at position 1, the first character of column 2 is at position 27, and the first character of column 3 is at position 54. The line length is 79 characters, with lfcr at the end, for a total line length of 81 characters.
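A rough C sketch of extracting the fields at those positions, offered as an illustration rather than a finished routine. It assumes the positions are 1-based (so the zero-based starts are 0, 26 and 53) and that every line holds at least 79 data characters. The names `copy_field`, `col_start` and `FIELD_MAX` are hypothetical. The field is copied into a separate buffer, and only the spaces at its two ends are dropped, which deals with both the leading space of the numbered companies and the blank address2 fields.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { LINE_CHARS = 79, NUM_COLS = 3, FIELD_MAX = 32 };

/* Zero-based starts for the 1-based positions 1, 27 and 54. The last
   entry marks the end of the data characters on the line. */
static const size_t col_start[NUM_COLS + 1] = { 0, 26, 53, LINE_CHARS };

/* Copies column `col` of `line` into `out` without its leading and
   trailing spaces. Spaces between words are kept. A blank column gives
   an empty string. Returns the length of the copied text. */
size_t copy_field(const char *line, size_t col, char out[FIELD_MAX])
{
    assert(col < NUM_COLS);
    size_t begin = col_start[col];
    size_t end = col_start[col + 1];

    while (begin < end && line[begin] == ' ')
        begin++;
    while (end > begin && line[end - 1] == ' ')
        end--;

    size_t len = end - begin;
    assert(len < FIELD_MAX);
    memcpy(out, line + begin, len);
    out[len] = '\0';
    return len;
}
```

Since each column is read from its own fixed start, a blank field in columns one and two cannot pull the third column's text to the left.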
I like your suggestion to read a block of data and then use the indices to extract the strings.
I will work out something in the next few days.
Mich

Offline Homer

  • ObjAsm32 Developer
  • Community Staff
  • ASM Fanatic
  • *****
  • Posts: 4617
  • Country: au
Re: Parseing
« Reply #5 on: 2012-09-27 05:46:09 »
Google for 'pretty printer' - that's what you need to be doing.
In C++, friends have access to your private members.
In ObjAsm, your members are exposed!

Offline baldr

  • Code Warrior
  • **
  • Posts: 178
Re: Parseing
« Reply #6 on: 2012-12-24 10:03:53 »
mich,

You may enclose monospaced text in [code][/code] BBCode tags:
Code:
1                         27                         54                       79
#0697667 GB LTD           #0719830 GB LTD            #0731990 GB LTD BDA LONA
PO BOX 3129               PO BOX 4530                SG3 SITE 11 COMP 138         
                                                     STN MAIN 14311 ALASKA VENUE   
BECIL CAKE  NT            ECIL RAKE  TB              FROZEN  JOHN  VA           
V04 1G0           24860   VTC 1G0             24986  VGJ 4M7             26024
for proper alignment.

There is something strange with this format. While it looks strikingly similar to punched cards ;-), the first column is 26 cells wide (according to your numbers for the field starting positions), the second is 27 cells wide and the last is only 25 cells wide (your data example contains some extra characters in this column).

Three columns of 26 cells each will yield 78 cells total, plus LF/CR (is that the correct sequence?) for 80 bytes/line.

With a fixed record/field size, parsing shouldn't pose any problems (if I got it right). As Synfire already said, replace the padding spaces at the end of the fields with zeros (ASCII NUL), and treat the results as regular nul-terminated strings. Beware, though, that if a field occupies its full width (it isn't specified whether this can occur), there isn't any space to replace with '\0'.
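One possible way around the full-width case, sketched in C under the assumption that the field's start and width are already known: leave the source buffer untouched, measure the text without its trailing spaces, and let a precision in the format string limit how many characters are read. The names `field_length` and `format_field` are hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Length of a fixed-width field once its trailing spaces are ignored. */
size_t field_length(const char *field, size_t width)
{
    while (width > 0 && field[width - 1] == ' ')
        width--;
    return width;
}

/* Formats the field into `out` as a C string without modifying the source.
   The "%.*s" precision stops the read after the measured length, so the
   field itself needs no terminator. Returns snprintf's result. */
int format_field(char *out, size_t out_size, const char *field, size_t width)
{
    return snprintf(out, out_size, "%.*s", (int)field_length(field, width), field);
}
```

With an output buffer of at least width + 1 bytes, a field that fills all of its cells is still copied completely, because the terminator is written to the destination and not into the record.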