Bigdata
How to Process Large Files … ?
Large is a relative term: 700 GB is large for me, while it could be a small piece for others.
Assuming you need to count the lines … even this simple task can take minutes!
Size
[user@host /tmp]$ du -sh bigfile
745G bigfile
Wordcount -> 10 min
If you need the exact number of lines, use the word count command (wc -l) and you get the precise count … but you have to wait for minutes, depending on your disk subsystem and the file size, of course.
[user@host /tmp]$ time wc -l bigfile
1265723263 bigfile
real 10m42.255s
user 1m1.684s
sys 5m6.303s
Estimate Lines Script (Linux)
If you can live with an estimate, just try this script. It measures the size of a 100 line sample from the start of the file, then extrapolates: estimated lines = filesize / samplesize * 100.
cat << 'EOF' > linestimate.sh
#!/usr/bin/env bash
# take lines 2-101 of the file as a 100 line sample
head -101 "$1" | tail -100 > "${1}_fewlines"
# apparent size in bytes of the full file and of the sample
filesize=$(du -b "$1" | cut -f 1)
linesize=$(du -b "${1}_fewlines" | cut -f 1)
rm "${1}_fewlines"
# extrapolate: estimated lines = filesize / samplesize * 100
echo $(expr "$filesize" / "$linesize" \* 100) "$1"
EOF
chmod 755 linestimate.sh
Estimate Lines Script (OpenBSD)
OpenBSD needs gdu from the coreutils package; the onboard "du" command is not able to report sizes in bytes :(
doas pkg_add coreutils
cat << 'EOF' > linestimate.sh
#!/usr/bin/env bash
# take lines 2-101 of the file as a 100 line sample
head -101 "$1" | tail -100 > "${1}_fewlines"
# apparent size in bytes of the full file and of the sample (gdu from coreutils)
filesize=$(gdu -b "$1" | cut -f 1)
linesize=$(gdu -b "${1}_fewlines" | cut -f 1)
rm "${1}_fewlines"
# extrapolate: estimated lines = filesize / samplesize * 100
echo $(expr "$filesize" / "$linesize" \* 100) "$1"
EOF
chmod 755 linestimate.sh
Run Script -> 10 ms
[user@host /tmp]$ time ./linestimate.sh bigfile
1263427700 bigfile
real 0m0.011s
user 0m0.005s
sys 0m0.010s
Deviation: 0.2 %
Depending on the type of data, you get a fairly accurate estimate with a deviation of only a few parts per thousand.
wc -l: 1265723263
script: 1263427700
Diff: +2295563
-> 2295563 / 1265723263 = 0.001813637362214 -> 0.18 percent :)
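If you want to double check the math, a quick bc one liner (just plugging in the two line counts from above) reproduces the ratio:

[user@host /tmp]$ echo "scale=15; (1265723263 - 1263427700) / 1265723263" | bc
.001813637362214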
Any Comments ?