Hi everyone!

I need your help; I don't know how I can download a web page to a .txt file.
I mean, I want to download all the text contained on a web page into a .txt file.
The only thing that comes to my mind is URLDownloadToFile, but it seems like it just downloads files.
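For example, the most I can see it doing is something like this (a C sketch, untested; the URL and the output filename are just placeholders), which only saves the raw HTML:

#include <windows.h>
#include <urlmon.h>

/* Link with urlmon.lib. URLDownloadToFileA saves whatever the server
   returns (the raw HTML source) into the file named here. */
int main(void)
{
    HRESULT hr = URLDownloadToFileA(
        NULL,                       /* no calling ActiveX object   */
        "http://www.example.com/",  /* page to fetch (placeholder) */
        "page.txt",                 /* local file to create        */
        0, NULL);                   /* reserved, no callback       */
    return (hr == S_OK) ? 0 : 1;
}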

Is it possible to search for some text on a web page from my program? How?

Thanks in advance!
Posted on 2011-04-15 22:54:15 by GermainR27
A "web page" is a "file", no?

I know how to do it at a lower level. I guess for 'doze you'd need "WSAStartup" first. Then "socket", "connect", "send" a "get" request - has to be properly formatted - and "recv" what it sends ya. I have an example written by Nathan Baker (using the NASMX package) which I have "modified" by adding a "fakeapi.asm" file which can be linked with it to make it run on Linux (same source code). As such, it's oversimplified and a little "weird". I'll post it if you can't find anything more suitable.
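In C the whole sequence looks roughly like this (an untested sketch with minimal error handling; the host name and output file are placeholders, and the same calls apply from asm):

#include <winsock2.h>
#include <ws2tcpip.h>
#include <stdio.h>
#include <string.h>

/* Link with ws2_32.lib. */
int main(void)
{
    WSADATA wsa;
    struct addrinfo hints, *res = NULL;
    SOCKET s;
    FILE *out;
    char buf[4096];
    int n;
    const char *request =
        "GET / HTTP/1.1\r\n"
        "Host: www.example.com\r\n"
        "Connection: close\r\n"
        "\r\n";

    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0)
        return 1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family   = AF_INET;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("www.example.com", "80", &hints, &res) != 0)
        return 1;

    s = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (s == INVALID_SOCKET || connect(s, res->ai_addr, (int)res->ai_addrlen) != 0)
        return 1;

    /* Properly formatted GET request, then write everything that
       comes back (headers + HTML) straight to a .txt file. */
    send(s, request, (int)strlen(request), 0);

    out = fopen("page.txt", "wb");
    while ((n = recv(s, buf, sizeof(buf), 0)) > 0)
        fwrite(buf, 1, (size_t)n, out);
    fclose(out);

    closesocket(s);
    freeaddrinfo(res);
    WSACleanup();
    return 0;
}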

Best,
Frank

Posted on 2011-04-16 12:15:18 by fbkotler
Frank,
I think he means he wants to strip the HTML out of the document leaving only visible text (sorta like using 'links -dump $URL').

Germain
That being said, this is not a trivial task, and even people writing it in languages like PERL with advanced parsing usually resort to "hacks" to accomplish what they want. I would look into the PCRE library to help you get the results you want. If you're a little braver, I'd say go check out the source for the links2 curses-based web browser and see how it implements the -dump command. :)
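Roughly this kind of thing, as an untested C sketch against the classic libpcre API (the '<[^>]*>' pattern is deliberately naive):

#include <pcre.h>
#include <stdio.h>
#include <string.h>

/* Print everything that is NOT inside a <...> tag. Link with -lpcre. */
void dump_visible_text(const char *html)
{
    const char *error;
    int erroffset;
    int ovector[30];
    int len = (int)strlen(html);
    int start = 0;
    pcre *re = pcre_compile("<[^>]*>", 0, &error, &erroffset, NULL);

    if (re == NULL) {
        fprintf(stderr, "pcre_compile failed at %d: %s\n", erroffset, error);
        return;
    }

    /* Walk the subject: each match is a tag, so print the text that
       sits between the previous tag and this one. */
    while (start < len &&
           pcre_exec(re, NULL, html, len, start, 0, ovector, 30) > 0) {
        fwrite(html + start, 1, (size_t)(ovector[0] - start), stdout);
        start = ovector[1];          /* resume just past the tag */
    }
    fwrite(html + start, 1, (size_t)(len - start), stdout);

    pcre_free(re);
}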

Regards,
Bryant Keller
Posted on 2011-04-16 15:22:19 by Synfire
Ah! Yes, that'll take some parsing. Not so easy. How much would you lose if you just dumped everything between '<' and '>'?
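I mean something as dumb as this (untested C sketch):

#include <stdio.h>

/* Copy 'html' to stdout, dropping everything between '<' and '>'. */
void strip_tags(const char *html)
{
    int in_tag = 0;
    for (; *html; html++) {
        if (*html == '<')      in_tag = 1;     /* tag starts   */
        else if (*html == '>') in_tag = 0;     /* tag ends     */
        else if (!in_tag)      putchar(*html); /* visible text */
    }
}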

Best,
Frank

Posted on 2011-04-16 15:40:35 by fbkotler

Frank,
I think he means he wants to strip the HTML out of the document leaving only visible text (sorta like using 'links -dump $URL').

Germain
That being said, this is not a trivial task, and even people writing it in languages like PERL with advanced parsing usually resort to "hacks" to accomplish what they want. I would look into the PCRE library to help you get the results you want. If you're a little braver, I'd say go check out the source for the links2 curses-based web browser and see how it implements the -dump command. :)

Regards,
Bryant Keller


Yeah, that is exactly what I want to do!
Posted on 2011-04-16 19:57:36 by GermainR27
How much would you lose if you just dumped everything between '<' and '>'?


Yeah, that would probably work in the best cases. Truthfully, he'll probably want to completely ignore the <head></head>, <style></style>, and <script></script> sections first, then rescan, stripping out all tags. A problem with this can be when people improperly use <'s or >'s in the source instead of &lt; or &gt;, but that could be chalked up to not supporting poorly developed HTML code.
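As a rough, untested C sketch of that two-step idea (the tag-name test is just a lazy, case-insensitive prefix match, and no attempt is made to survive the really broken markup mentioned above):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Case-insensitive "does 'p' start with 'name'?" check. */
static int starts_with(const char *p, const char *name)
{
    size_t i, n = strlen(name);
    for (i = 0; i < n; i++)
        if (tolower((unsigned char)p[i]) != tolower((unsigned char)name[i]))
            return 0;
    return 1;
}

/* Skip everything up to and including the matching </name> close tag. */
static const char *skip_element(const char *p, const char *name)
{
    while (*p) {
        if (p[0] == '<' && p[1] == '/' && starts_with(p + 2, name)) {
            while (*p && *p != '>') p++;       /* eat the close tag */
            return *p ? p + 1 : p;
        }
        p++;
    }
    return p;
}

/* Drop <head>, <style> and <script> elements wholesale, then drop
   any remaining tags and keep only the text in between. */
void dump_text(const char *html, FILE *out)
{
    const char *p = html;
    while (*p) {
        if (*p == '<') {
            if (starts_with(p + 1, "script"))     p = skip_element(p + 1, "script");
            else if (starts_with(p + 1, "style")) p = skip_element(p + 1, "style");
            else if (starts_with(p + 1, "head"))  p = skip_element(p + 1, "head");
            else {
                while (*p && *p != '>') p++;      /* ordinary tag */
                if (*p) p++;
            }
        } else {
            fputc(*p++, out);
        }
    }
}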
Posted on 2011-04-18 12:34:15 by Synfire
Yeah... not really as simple as that. I wasn't thinking clearly (shocked, I tell ya!). The "download" part isn't too tough. Stripping the "html" to leave "clean text" is more complicated - especially dealing with any "bad html", as you point out!

Any progress with it, GermainR27?

Best,
Frank

Posted on 2011-04-18 14:15:21 by fbkotler
You are better off using a premade lib for this (there are a few out there, I think). For one of my apps, I had to deal with a page returned from a server... I just parsed the HTML for what I needed... it is a real PITA to parse HTML since nobody seems to follow a standard... since most browsers fix bad tags on the fly... you will have to deal with bad tags, tags that aren't closed properly... uppercase tags, lowercase tags... a bunch of stuff... good luck!
Posted on 2011-04-18 16:02:41 by Gunner
This is a start for you. It downloads a masm32 service pack page and translates it. I noticed it does not work with a dial-up modem, only high speed. It does a kindergarten translation, as it does not have word wrap.
Attachments:
Posted on 2011-04-20 13:03:38 by roaknog