Hi everyone!
I need your help: I don't know how I can download a web page to a .txt file.
I mean, I want to download all the text contained on a web page into a .txt file.
The only thing that comes to my mind is URLDownloadToFile, but it seems like it just downloads files.
Is it possible to search for some text on a web page from my program? How?
Thanks in advance!
A "web page" is a "file", no?
I know how to do it at a lower level. I guess for 'doze you'd need "WSAStartup" first. Then "socket", "connect", "send" a "get" request - has to be properly formatted - and "recv" what it sends ya. I have an example written by Nathan Baker (using the NASMX package) which I have "modified" by adding a "fakeapi.asm" file which can be linked with it to make it run on Linux (same source code). As such, it's oversimplified and a little "weird". I'll post it if you can't find anything more suitable.
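In C rather than asm, the same sequence looks roughly like this (untested sketch - placeholder host, HTTP/1.0 so the server closes the connection for us, no real error handling, and the response headers land in the file too; link with ws2_32.lib):

#include <winsock2.h>
#include <stdio.h>
#include <string.h>

/* WSAStartup -> socket -> connect -> send a GET -> recv into a file. */
int main(void)
{
    WSADATA wsa;
    SOCKET s;
    struct hostent *he;
    struct sockaddr_in sa;
    const char *req = "GET / HTTP/1.0\r\n"
                      "Host: www.example.com\r\n"
                      "\r\n";
    char buf[1024];
    FILE *out;
    int n;

    WSAStartup(MAKEWORD(2, 2), &wsa);

    he = gethostbyname("www.example.com");       /* placeholder host */
    if (he == NULL)
        return 1;

    s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port   = htons(80);
    memcpy(&sa.sin_addr, he->h_addr_list[0], he->h_length);

    connect(s, (struct sockaddr *)&sa, sizeof(sa));

    send(s, req, (int)strlen(req), 0);

    out = fopen("page.txt", "wb");
    while ((n = recv(s, buf, sizeof(buf), 0)) > 0)
        fwrite(buf, 1, (size_t)n, out);          /* headers + raw HTML */
    fclose(out);

    closesocket(s);
    WSACleanup();
    return 0;
}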
Best,
Frank
Frank,
I think he means he wants to strip the HTML out of the document leaving only visible text (sorta like using 'links -dump $URL').
Germain,
That being said, this is not a trivial task, and even people writing it in languages like Perl with advanced parsing usually resort to "hacks" to accomplish what they want. I would look into the PCRE library to help you get the results you want. If you're a little braver, I'd say go check out the source for the links2 curses-based web browser and see how it implements the -dump command. :)
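Just to sketch the PCRE route (my own rough, untested example, not a finished solution): the loop below deletes anything matching <[^>]*> and prints the rest. It uses the old libpcre (PCRE1) calls and doesn't deal with <script>/<style> contents, comments, or entities - it only shows the mechanics. Link with -lpcre.

#include <pcre.h>
#include <stdio.h>
#include <string.h>

void strip_tags(const char *html, FILE *out)
{
    const char *err;
    int erroff;
    pcre *re = pcre_compile("<[^>]*>", 0, &err, &erroff, NULL);
    if (re == NULL)
        return;

    int len = (int)strlen(html);
    int pos = 0;
    int ov[3];                            /* offsets of the whole match */

    while (pcre_exec(re, NULL, html, len, pos, 0, ov, 3) >= 0) {
        fwrite(html + pos, 1, (size_t)(ov[0] - pos), out);   /* text before the tag */
        pos = ov[1];                                          /* skip over the tag  */
    }
    fwrite(html + pos, 1, (size_t)(len - pos), out);          /* the tail */

    pcre_free(re);
}

You'd call it with the downloaded page in memory, e.g. strip_tags(buffer, stdout).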
Regards,
Bryant Keller
Ah! Yes, that'll take some parsing. Not so easy. How much would you lose if you just dumped everything between '<' and '>'?
Best,
Frank
Yeah, that is exactly what I want to do!
How much would you lose if you just dumped everything between '<' and '>'?
Yeah, that would probably work in the best cases. Truthfully, he'll probably want to completely ignore the <head></head>, <style></style>, and <script></script> sections first, then rescan, stripping out all remaining tags. A problem with this can be when people improperly use literal <'s or >'s in the source instead of &lt; or &gt;, but that could be chalked up to not supporting poorly developed HTML code.
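As a rough, untested C sketch of that idea (the names are just illustrative): throw away <script> and <style> blocks wholesale, then drop anything between '<' and '>'. Comments, entities, <head>, and malformed tags are not handled.

#include <stdio.h>
#include <ctype.h>

/* case-insensitive "does p start with word?" */
static int starts_with(const char *p, const char *word)
{
    while (*word) {
        if (tolower((unsigned char)*p) != tolower((unsigned char)*word))
            return 0;
        p++;
        word++;
    }
    return 1;
}

void dump_text(const char *html, FILE *out)
{
    const char *p = html;

    while (*p) {
        if (*p == '<') {
            /* skip the whole <script>...</script> or <style>...</style> block */
            const char *close = NULL;
            if (starts_with(p, "<script")) close = "</script";
            if (starts_with(p, "<style"))  close = "</style";

            if (close) {
                p++;
                while (*p && !starts_with(p, close))
                    p++;                  /* now at the closing tag (or end) */
            }

            while (*p && *p != '>')       /* throw away the tag itself */
                p++;
            if (*p)
                p++;                      /* step past the '>' */
        } else {
            fputc(*p++, out);             /* visible text, entities left as-is */
        }
    }
}

You'd call it with the downloaded page in memory, e.g. dump_text(buffer, outfile).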
Yeah... not really as simple as that. I wasn't thinking clearly (shocked, I tell ya!). The "download" part isn't too tough. Stripping the "html" to leave "clean text" is more complicated - especially dealing with any "bad html", as you point out!
Any progress with it, GermainR27?
Best,
Frank
You are better off using a premade lib for this (there are a few out there, I think). For one of my apps I had to deal with a page returned from a server, and I just parsed the HTML for what I needed. It is a real PITA to parse HTML, since nobody seems to follow a standard and most browsers fix bad tags on the fly; you will have to deal with bad tags, tags that aren't closed properly, uppercase tags, lowercase tags... a bunch of stuff. Good luck!
This is a start for you. It downloads a masm32 service pack page and converts it to plain text. I noticed it does not work with a dial-up modem, only high speed. It does a kindergarten-level conversion, as it does not have word wrap.
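For what it's worth, word wrap itself is only a few lines. An untested C sketch (the function name and parameters are just illustrative), which prints words and breaks the line before it would pass a given column width:

#include <stdio.h>
#include <string.h>

void wrap(const char *text, int width, FILE *out)
{
    int col = 0;
    const char *p = text;

    while (*p) {
        size_t wlen = strcspn(p, " \t\r\n");      /* length of the next word */

        if (wlen > 0) {
            if (col > 0 && col + 1 + (int)wlen > width) {
                fputc('\n', out);                 /* word would not fit */
                col = 0;
            } else if (col > 0) {
                fputc(' ', out);
                col++;
            }
            fwrite(p, 1, wlen, out);
            col += (int)wlen;
            p += wlen;
        }

        p += strspn(p, " \t\r\n");                /* swallow the whitespace */
    }
    fputc('\n', out);
}

Something like wrap(text, 76, outfile) would do for console-width output.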