Hello,

Is it possible to read only a portion of a file into memory, without starting at the beginning?

Thanks, Allasso
Posted on 2011-01-01 11:47:31 by Allasso
Yeah... open the file and "lseek" to the desired offset, then read it. "lseek" takes three parameters - the file descriptor (obtained from "open"), the offset, and an integer "whence" that says where the offset is measured from: the start of the file, the current position in the file, or the end of the file (the offset can be negative). It returns the new offset - which can be used (lseek to an offset of zero from the end of the file) to find the file length. Also look into "llseek" for larger files, and "mmap" (mmap2).
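
Something like this, in C (just a sketch - the filename, offset, and buffer size are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[64];
    int fd = open("data.bin", O_RDONLY);
    if (fd == -1)
        return 1;

    /* offset of zero from the end = file length */
    off_t len = lseek(fd, 0, SEEK_END);
    printf("file length: %ld\n", (long)len);

    /* seek to byte 100 from the start, then read from there */
    lseek(fd, 100, SEEK_SET);
    ssize_t n = read(fd, buf, sizeof buf);
    printf("read %ld bytes\n", (long)n);

    close(fd);
    return 0;
}

The "whence" values are SEEK_SET, SEEK_CUR, and SEEK_END, corresponding to start, current position, and end.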

Best,
Frank

Posted on 2011-01-01 15:45:22 by fbkotler
ah, yes, of course  :-)

thanks muchly.

Allasso
Posted on 2011-01-01 17:09:05 by Allasso
Is it slower to read a file from an offset (not from the beginning)?  I thought I read that somewhere, but I am not sure if I am interpreting it correctly.
Posted on 2011-01-01 19:32:10 by Allasso
Probably - the "lseek" must take some time. Further, we can only read (physically... from a disk file) a whole sector at a time. Experiment with it! My observation has been that the "first" read (after a reboot, say, or reading another large file) is much slower than a subsequent read from the same (or "nearby"?) file. I suspect OS buffering policy comes into it. If you're using libC, the "buffered I/O" functions "fopen()", "fread()", etc. are much faster. We can provide our own buffering, of course.

As a silly example, s'pose we wanted to find the first letter 'x' in a file. Opening the file and reading it one byte at a time (with either int 80h or calling "read()") would be very slow. Reading a buffer-full and searching the buffer would be much faster. If we wanted the first 'x' after the first 100 bytes, simply skipping the first 100 bytes of the first buffer-full would probably be okay. If we wanted the first 'x' after the first megabyte, the lseek might be worthwhile. I suspect mmap would be the biggest win.
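
A sketch of the buffered version, in C (the filename, buffer size, and the 100-byte skip are all made up):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];                /* read a buffer-full at a time */
    off_t pos = 100;               /* start after the first 100 bytes */
    int fd = open("data.txt", O_RDONLY);
    if (fd == -1)
        return 1;

    lseek(fd, pos, SEEK_SET);
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* search the buffer in memory, not the file byte by byte */
        char *p = memchr(buf, 'x', n);
        if (p) {
            printf("first 'x' at offset %ld\n", (long)(pos + (p - buf)));
            break;
        }
        pos += n;
    }
    close(fd);
    return 0;
}
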

The "best" (fastest?) way probably depends on exactly what you're doing. Do you have something in particular in mind?

Best,
Frank

Posted on 2011-01-01 21:48:38 by fbkotler
Hi, Frank,

I hadn't considered the sector thing.  Regarding what you said about subsequent reads being faster, I think you are correct about the OS buffering.  I am only starting to learn about this subject, but I think that (modern) OSes that use virtual memory and paging map what has already been read into memory, so on subsequent reads the program can work on what is already there.  Then there is caching: since the same file (or a portion of it) is likely still in the cache, that is faster yet.

The scenario I am dealing with presently is this:  I am playing around with creating an algorithm to index and search a (flat file) database.  In the search routine, the first operation is to extract from a search term the address of an entry in a table, which will provide the pointer to the appropriate bucket to search in the index, which is in another file. (didja get all that?)  The entries in the table start at the beginning of the file, and each entry is 8 bytes long.  The table can be as long as 8 kB.  It makes more sense to me to just read in the 8 bytes I am interested in, rather than 8 kB.  There is also another, similar operation that follows, and the file in that case could be one to two hundred kB, depending on the vocabulary of the database.
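
In code, what I have in mind amounts to something like this (just a sketch - "table.idx" and the entry number are made up, and I am assuming each 8-byte entry is a 64-bit value):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t entry;
    unsigned index = 42;                 /* made-up entry number */
    int fd = open("table.idx", O_RDONLY);
    if (fd == -1)
        return 1;

    /* seek straight to entry number "index" and read just 8 bytes */
    lseek(fd, (off_t)index * 8, SEEK_SET);
    if (read(fd, &entry, sizeof entry) != (ssize_t)sizeof entry)
        return 1;
    printf("entry %u = %llu\n", index, (unsigned long long)entry);
    close(fd);
    return 0;
}
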

Now it may be possible that it would be better to determine an optimum range of addresses, determine which range the desired address falls into, and load that entire segment.  This range could depend on the sector size, as you say, or the page size of the OS.  It would seem, however, that using a hardware parameter such as the sector size would be meaningless - that seems to be the territory of the OS.  But I really don't know about this.
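
If I went that route, I imagine it would look something like this (a sketch, using the OS page size as the range; the filename and target offset are made up):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);   /* typically 4096 */
    off_t want = 6000;                   /* made-up target offset */
    off_t base = want - (want % page);   /* start of the containing page */
    char *buf = malloc(page);

    int fd = open("table.idx", O_RDONLY);
    if (fd == -1 || !buf)
        return 1;

    /* load the whole page-sized segment, then index into it */
    lseek(fd, base, SEEK_SET);
    ssize_t n = read(fd, buf, page);
    if (n > want - base)
        printf("byte at %ld is %d\n", (long)want, buf[want - base]);
    close(fd);
    free(buf);
    return 0;
}
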

On the other hand... I have a hunch that the OS is already doing all of that figuring for you anyway, on a modern system using virtual memory.  So it may be that I lseek to my desired address and read my little 16 bytes, and the system loads a page in anyway, and does it all correctly.  If this is true, then even if I could figure out the optimum range, is the OS going to say, "hey, this guy knows what he's doin'.  We don't need to figure this now."?
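
That seems to be the idea behind mmap, too - map the file and index it like an array, and the OS pages in only what gets touched.  Again, just a sketch with made-up names, and assuming the file is big enough to contain the entry:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("table.idx", O_RDONLY);
    struct stat st;
    if (fd == -1 || fstat(fd, &st) == -1)
        return 1;

    /* map the whole file read-only; nothing is read from disk yet */
    uint64_t *table = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (table == MAP_FAILED)
        return 1;

    /* touching this entry faults in just the page that holds it */
    printf("entry 42 = %llu\n", (unsigned long long)table[42]);

    munmap(table, st.st_size);
    close(fd);
    return 0;
}
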

Regarding lseek taking time, I think that is insignificant compared to reading data from a disk into memory.  I have observed this to be the case with 100 MB files.

I agree with your advice to "experiment with it", however I have found in the past that I have been fooled often by results from speed tests, even when the results were very consistent during a session.  It almost seems like you need a machine totally dedicated to doing this - lightweight OS, no gui, no networking going on, as few daemons running as possible, to really get accurate results.  I could be all wet about this; I don't know how "they" do it, but that is how it seems to me.  Of course I am kinda terminally anal retentive :-)

BTW, I really appreciate this board.  It is the clearest, cleanest, fastest, most straightforward board I have ever posted on.  And it's all about assembly :-)  (which I am new to, but I really like it)  It was hard to find, but maybe that's what makes it so good.
Posted on 2011-01-02 07:31:32 by Allasso