The best way to read and process whole files



  • Discord Moderators

Scripts can read files using the fileOpen and fileRead functions, and there are three main methods I can think of to get the whole file contents: reading all the bytes directly into a string, concatenating an arbitrary number of bytes at a time into a buffer string, or appending an arbitrary number of bytes at a time to a buffer array.

Method 1: reading all the bytes directly into a string

local file = fileOpen("file.dat", true) -- Open the file read-only
local fileContents = fileRead(file, fileGetSize(file)) -- Read every byte at once
fileClose(file)

Method 2: concatenating an arbitrary number of bytes into a buffer string

local file = fileOpen("file.dat", true)
local buffer = "" -- This will end up holding the whole file
while not fileIsEOF(file) do
    buffer = buffer .. fileRead(file, 500) -- Arbitrary number of bytes to read per iteration
end
fileClose(file)

Method 3: concatenating an arbitrary number of bytes into a buffer array

local file = fileOpen("file.dat", true)
local buffer = {} -- The chunks are collected here
while not fileIsEOF(file) do
    buffer[#buffer + 1] = fileRead(file, 500) -- Arbitrary number of bytes to read per iteration
end
fileClose(file)
local fileContents = table.concat(buffer)

I decided to write a script to test the performance and memory consumption of these methods under varying conditions of file size and buffer size. You can check out the source code on Pastebin. Below are the results the script output to the server console.
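In essence, each measurement boils down to something like this (a minimal sketch, not the actual script from the Pastebin; the benchmarkRead name and its arguments are just illustrative):

local function benchmarkRead(readMethod, path)
    collectgarbage("collect") -- start each test from a clean state
    local memoryBefore = collectgarbage("count") -- Lua memory in KB
    local timeBefore = getTickCount()

    local contents = readMethod(path) -- one of the three methods above, wrapped in a function

    local timeAfter = getTickCount()
    local memoryAfter = collectgarbage("count")
    outputServerLog((memoryAfter - memoryBefore) .. " KB of memory | " .. (timeAfter - timeBefore) .. " ms")
    return contents
end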

2 KB and 5 MB files, 500-byte buffer for both

***************************************** 
** FILE READ METHODS PERFORMANCE TESTS ** 
***************************************** 
~ Information 
Small file size is 2 KB, big file size is 5120 KB 
Small file estimated loop iterations: 5 (500 bytes buffer) 
Big file estimated loop iterations: 10486 (500 bytes buffer) 
~ Test results 
Test 1: small file, read into string directly 
2.0244140625 KB of memory | 0 ms 
Test 2: big file, read into string directly 
5120.0244140625 KB of memory | 7 ms 
Test 3: small file, concatenating into buffer string 
7.076171875 KB of memory | 1 ms 
Test 4: big file, concatenating into buffer string 
97188.6328125 KB of memory | 25293 ms 
Test 5: small file, concatenating into buffer array 
2.7958984375 KB of memory | 0 ms 
Test 6: big file, concatenating into buffer array 
15570.067382813 KB of memory | 27 ms 

2 KB and 5 MB files, 5,000-byte buffer for both

***************************************** 
** FILE READ METHODS PERFORMANCE TESTS ** 
***************************************** 
~ Information 
Small file size is 2 KB, big file size is 5120 KB 
Small file loop iterations: 1 (5000 bytes buffer) 
Big file loop iterations: 1049 (5000 bytes buffer) 
~ Test results 
Test 1: small file, read into string directly 
2.0244140625 KB of memory | 0 ms 
Test 2: big file, read into string directly 
5120.0244140625 KB of memory | 7 ms 
Test 3: small file, concatenating into buffer string 
2.0244140625 KB of memory | 0 ms 
Test 4: big file, concatenating into buffer string 
75989.8046875 KB of memory | 3644 ms 
Test 5: small file, concatenating into buffer array 
2.1025390625 KB of memory | 0 ms 
Test 6: big file, concatenating into buffer array 
15836.223632813 KB of memory | 19 ms 

2 KB and 5 MB files, 5,000-byte buffer for the smaller file, 500-byte buffer for the bigger one

***************************************** 
** FILE READ METHODS PERFORMANCE TESTS ** 
***************************************** 
~ Information 
Small file size is 2 KB, big file size is 5120 KB 
Small file loop iterations: 1 (5000 bytes buffer) 
Big file loop iterations: 10486 (500 bytes buffer) 
~ Test results 
Test 1: small file, read into string directly 
2.0244140625 KB of memory | 0 ms 
Test 2: big file, read into string directly 
5120.0244140625 KB of memory | 7 ms 
Test 3: small file, concatenating into buffer string 
2.0244140625 KB of memory | 0 ms 
Test 4: big file, concatenating into buffer string 
97188.6328125 KB of memory | 26600 ms 
Test 5: small file, concatenating into buffer array 
2.1025390625 KB of memory | 0 ms 
Test 6: big file, concatenating into buffer array 
15570.067382813 KB of memory | 28 ms 

Looking at these results, the best way to read everything from a file into a string is method 1. If reading the whole file in parts is required for some reason, method 3 is the way to go. Obviously, method 2 is definitely not the best way to do it and should not be used.

However, these results raise some questions about the usefulness of while loops with fileIsEOF: why do the Wiki examples use them to output bytes to the console, when it's admittedly better not to use them if the file is small? Why do even developers encourage using them, like in this forum post, when they are slower and provide little to no memory savings?

In short: why use while loops with fileIsEOF to read and process whole files?

Link to comment
  • 3 weeks later...
  • Discord Moderators

I don't know if it's fine to reply to a post whose last response was two weeks ago for this, but I found some interesting random code on GitHub which uses the method 2 I described to compute the MD5 hash of a whole file.

That code was written by Necktrox, who by the way is someone who certainly has some insight into how MTA: SA works and knows how to program. So I want to repeat the question I came across half a month ago: why read a whole file in parts using buffers when the script only cares about the entire file? Perhaps it's because of some limitation or bug in fileRead, but even then, why not use tables when reading in parts is needed? The results of the benchmark I posted are clear too, and you can even test it for yourself.

The community and I would be very grateful if someone finally told us, as in the topic title, what the best way to read whole files is. Right now it makes no sense to me to use method 2, really, but when I see good scripters using it there should be a reason... or not?

Link to comment

His code seems to be quite different. The script he has written reads 1024 bytes, md5's (hashes) them and then stores the result. That means he wants every 1024 bytes to be hashed separately, because hashing the whole file at once ~= reading 10 bytes or some random amount of data and hashing that. (my guess, of course)
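In other words, something along these lines (a rough sketch of the idea, not Necktrox's actual code; the md5PerChunk name and the return value are illustrative):

local function md5PerChunk(path)
    local file = fileOpen(path, true)
    local chunkHashes = {}
    while not fileIsEOF(file) do
        chunkHashes[#chunkHashes + 1] = md5(fileRead(file, 1024)) -- each 1024-byte chunk gets its own hash
    end
    fileClose(file)
    return table.concat(chunkHashes) -- not the same value as the MD5 of the whole file
end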

Link to comment
Using anything other than method 1 makes no sense to me either

It actually does make sense if you don't want to buffer the data directly into a string/table and want to do some separation on it. This topic is really interesting and I will keep an eye on it. I'm sure there is a reason. I actually don't use file functions because I've never needed them, since I use SQLite, XML or MySQL.

Link to comment
  • Discord Moderators

Yes, that is what the code does, but it is a bit strange nevertheless. If the file is big, that tiny piece of code will be a CPU and memory hog, and I think that unless you are trying to hash something untrusted securely it is not necessary to hash the file in parts. And by the way, if you need more confidence against hash collisions, you can use the SHA-512 algorithm MTA provides.
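For example, something as simple as this (a sketch assuming a small enough file and MTA's generic hash function; the file name is a placeholder):

local file = fileOpen("file.dat", true)
local contents = fileRead(file, fileGetSize(file)) -- read the whole file at once (method 1)
fileClose(file)
local digest = hash("sha512", contents) -- SHA-512 of the entire contents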

Link to comment

After studying the file in that resource, I realized that he used it to get the file's MD5 hash string. Like when you download something and they show you the original file's MD5, SHA hash, etc.; I assume it is used for that purpose.

Link to comment
  • MTA Team

Well, hello there. My hash function does not follow any standards, and to improve the performance I decreased/increased the buffer size. The result was that increasing the buffer was counterproductive for performance in this case, but this statement may be false for larger files - I didn't investigate. I am open to improvements on my shitty code :)

Edit: http://stackoverflow.com/questions/10324611/how-to-calculate-the-md5-hash-of-a-large-file-in-c

This is not a standards description page, but the example code given by the authors of each post indicates that 1024 is the correct size for this.

Link to comment
  • Discord Moderators

That Stack Overflow question is an interesting read indeed. According to the answer, the best way to deal with very big files (I would say ~50 MB or more, but of course this can vary according to the situation) is splitting them into chunks. This seems logical, as reading a very big file directly into memory will be very demanding or even impossible: for example, if you have a video file of, let's say, 2 GB, you can't even read it on a machine with less than 2 GB of RAM, as there is virtually no space for it. Paging files and such can help, you may say, but I think you can understand what I mean anyway: it is a lame programming practice to read a very big file directly into memory and hope that the OS will always be happy with that. In fact, that discussion made me realize that while/repeat loops with fileIsEOF aren't as useless as I believed, because they may in fact come in handy when dealing with huge files.

However, that is how things get done when using C or similar languages, designed to be lower-level and to give applications a great degree of control, and that can't exactly be ported to Lua. The famous book Programming in Lua, in its chapter 11, explains how Lua allocates memory when concatenating strings, with some code examples of good and bad practices. I think we can draw some conclusions from it, after considering some answers to this topic too:

  • Method 2 alone, with the sole intention of reading a whole file into memory and nothing more, is definitely a MUSTN'T.
  • Method 1 is very appropriate, CPU-wise, when dealing with small or not-so-big files: memory usage does not go very high (if we are talking about very small files, of a few bytes, it is even lower than with the other methods), and the overhead introduced by using tables or buffer strings is nonexistent.
  • When memory starts being a concern (that is, the script is working with big or very big files), performance becomes less important and the script should try to lower, or at least not raise, its memory usage. We do not want it to crash on systems with little free memory, do we? And this is where I suppose loops and file read buffers have a reason to exist: it is no longer suitable to treat a file as a whole, but as chunks, and read buffers do exactly that.

As I concluded, in Lua it is a MUSTN'T to use an ever-growing buffer string which becomes bigger and bigger with each byte read. But at the same time, reading the whole huge file at once is impractical. What can we do in this situation? Although the previously linked Programming in Lua fragment hints that a more efficient approach is something a bit similar (but conceptually very different) to method 3, the whole file is still kept in memory, so that is only a partial solution: it skips most of the garbage collector overhead of string concatenation, but it still looks inefficient when a script only cares about isolated chunks. Well, I think the answer is in Programming in Lua's section 21.2.1: read the file in "reasonably large chunks" (a bigger chunk size means better performance but higher memory usage, because we reduce the number of loop iterations at the cost of memory) and do not care about what came before that chunk.
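To illustrate what I mean, here is a minimal sketch of that kind of chunked processing (the processFileInChunks name, the processChunk callback and the 1 MB chunk size are only illustrative):

local CHUNK_SIZE = 1024 * 1024 -- "reasonably large": 1 MB per read

local function processFileInChunks(path, processChunk)
    local file = fileOpen(path, true)
    if not file then return false end
    while not fileIsEOF(file) do
        processChunk(fileRead(file, CHUNK_SIZE)) -- each chunk is handled and then discarded
    end
    fileClose(file)
    return true
end

Memory usage stays roughly at the chunk size, because nothing is kept once the callback returns.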

But all this reasoning, from my point of view, doesn't answer the initial question: what is the best way to read and process whole files? There is no single way that will work just fine for every file. The ideal thing is to switch between methods at runtime, depending on what type of file you are dealing with (is it feasible to process the chunks one by one, or does the whole file need to be stored in memory?) and its size.

TL;DR: The best way to read and process a whole file that is small is by reading it directly into a string and doing the desired thing with it. However, when memory usage is a concern and/or the file is big, you may have to use other algorithms which are capable of processing a file in isolated chunks, sacrificing CPU time in exchange for reasonable memory consumption. If you need to store a whole big file in memory and process it in chunks, which I think is not very likely, the best way to go is a buffer stack which manages string concatenation efficiently, an operation that, when misused, is a major CPU and memory hog due to garbage collection and Lua itself. NEVER use something like buffer = buffer .. fileRead(file, bufferSize) without any further thought.
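By buffer stack I mean something roughly like this (a Lua 5.1 adaptation of the idea Programming in Lua describes, not code taken from the book; newBuffer and addToBuffer are illustrative names):

local function newBuffer()
    return { "" } -- the stack starts with an empty string
end

local function addToBuffer(stack, chunk)
    stack[#stack + 1] = chunk -- push the new chunk on top
    for i = #stack - 1, 1, -1 do
        if #stack[i] > #stack[i + 1] then
            break -- the string below is longer, stop merging
        end
        stack[i] = stack[i] .. table.remove(stack) -- merge the top into the string below it
    end
end

Push each fileRead result with addToBuffer and call table.concat on the stack at the end; big strings are rebuilt far less often than with a plain ever-growing buffer string.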

Of course, if someone finds my conclusions incorrect, please reply to this post and explain your reasons for thinking so :)

Talking to Necktrox now, and after writing this huge text, I think that a better approach for computing the MD5 of a whole file, whether big or small, would be to choose a file size "red line" (let's say 1 MB, for example) and change the algorithm according to it. You can see it as a "hybrid approach". If the file is smaller than the limit, reading it into a string directly is the fastest way to do it, and memory consumption is not a problem because the file will take 1 MB or less in memory. But when the file is bigger, it will take more than 1 MB, so chunk processing is the way to go. My suggestion for the buffer size? The one that keeps memory usage nice and constant, so it would be 1 MB minus the bytes that an MD5 hash occupies. What's more, you can make it fancier, and raise or lower the "red line" depending on the file size and the balance between memory consumption and performance you want. I didn't test what I'm saying, though, but to me it seems like a pretty straightforward consequence of what I said.
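A sketch of what that hybrid could look like (the hybridMD5 name, the threshold value and the way the big-file branch folds the chunks are my own illustrative choices, and note that the two branches do not produce the same digest for the same contents):

local RED_LINE = 1024 * 1024 -- 1 MB file size threshold

local function hybridMD5(path)
    local file = fileOpen(path, true)
    if not file then return false end
    local digest
    if fileGetSize(file) <= RED_LINE then
        digest = md5(fileRead(file, fileGetSize(file))) -- small file: read it whole and hash once
    else
        digest = "" -- big file: keep only the running digest plus one chunk in memory
        while not fileIsEOF(file) do
            digest = md5(digest .. fileRead(file, RED_LINE - 32)) -- 32 = length of an MD5 hex string
        end
    end
    fileClose(file)
    return digest
end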

Link to comment
  • MTA Team

It's a good idea to approach the reading of the file into memory by increasing the buffer-size depending on the size of the source file. The formula to compute the buffer-size should consider hardware speed, memory size and source file size. Hardware speed (cheap VPS with 1/2 cores vs. multicore root servers) and memory size should be hard-coded as constants (does the garbage collector give any information to calculate an optimal threshold for memory usage iykwim?). Code it and publish it as a useful function on the MTA wiki ;)
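Something along these lines, perhaps (the constants, the names and the ~1% heuristic are all made up for illustration):

local HARDWARE_FACTOR = 1.0 -- lower on a cheap 1/2-core VPS, higher on a multicore root server
local MIN_BUFFER = 4 * 1024 -- never read less than 4 KB at a time
local MAX_BUFFER = 1024 * 1024 -- never hold more than 1 MB per read

local function computeBufferSize(fileSize)
    local size = math.floor(fileSize / 100 * HARDWARE_FACTOR) -- roughly 1% of the source file
    return math.max(MIN_BUFFER, math.min(size, MAX_BUFFER))
end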

Furthermore, you might shoot yourself in the foot if you rely on 3rd-party programs calculating the MD5 file hash by hashing each 1024 bytes while your script does it with e.g. 4096 - the resulting hash will be wrong. (You might load the entire file into memory and then hash each 1024-byte part, but wouldn't that end up with Lua copying every 1024 bytes into memory again, which would kill the whole point of the file-reading optimization?)

Nevertheless, a fast function to load an entire file would be useful for other purposes if you need the file as a whole (e.g. script execution).

Link to comment
  • Discord Moderators

Making a useful function which handles all of this seems like a good idea to me. However, Lua knows almost nothing about the underlying hardware, so it can't measure CPU performance and memory availability without doing some kind of stress testing, which is largely impractical. Anyway, that is not a big problem: you can get total CPU and memory usage on the server, and memory consumption on the client, so you can estimate the most appropriate performance/memory balance. Of course, some manual tweaking may be needed under varying hardware configurations and design aspirations, because it can't account for processes outside MTA: SA, among other factors.


Unless you are planning to use the input of another script or program, or you want your code to be usable by another program with that buffer size, modifying it is not an issue. What's more, this Stack Overflow question about hashing huge files in Python shows that there is no standard buffer size for this: some prefer using a 128-byte buffer, others 8192... So there is no way to always avoid shooting yourself in the foot, unless you know what you are working with.

I will be thinking about making some kind of function which makes all this file reading and processing mess easy to do, and transparent to the newbie (or not so newbie) scripter :)

Link to comment
