Word Count Line Estimate

Working with huge data files (millions of lines), I often want to know roughly how many lines there are in a file. You can use the typical wc -l to count the lines exactly, but this takes a while for big files because it has to read the whole file. If all you really need is a rough estimate (to the nearest million or so), here’s a quick script I wrote that can do this.

The idea is to use the file size, along with a guess of how big a line is (based on the average length of the first 1,000 lines in the file), and then extrapolate to estimate the number of lines. Kudos to Stack Overflow answers for the idea. This is much faster than using wc -l, and as long as the first lines are representative of the rest, it's accurate enough to get an idea of what you’re dealing with. Here’s wcle (word count line estimate):

    # wcle - word count line estimate
    # Fast line-count estimate for huge files
    # By Nathan Sheffield, 2014

    file=$1
    nsample=1000
    # Bytes in the first $nsample lines
    headbytes=$(head -q -n $nsample "$file" | wc -c)
    #tailbytes=$(tail -q -n $nsample "$file" | wc -c)
    #echo $headbytes

    # File size in bytes (these ls options are GNU-specific)
    filesize=$(ls -sH --block-size=1 "$file" | cut -f1 -d" ")
    #echo $filesize

    # Multiply before dividing so integer division does not throw away precision
    estimate=$((filesize * nsample / headbytes))
    echo -n $estimate
    echo " (" $((estimate / 1000)) "K;" $((estimate / 1000000)) "M )"
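If you want to sanity-check the estimate before trusting it on a real file, here is a minimal sketch that compares it against an exact count on a throwaway file. The function name estimate_lines is made up for this example and is not part of wcle. It also makes two assumptions that differ from the script above: it takes the file size from wc -c rather than ls, and it scales by the number of lines actually sampled, so files shorter than the sample size are not overestimated.

```shell
# Sketch only: estimate_lines is a hypothetical helper, not part of wcle.
# Assumes a POSIX-style shell with head, wc, awk, and mktemp available.
estimate_lines() (
    file=$1
    nsample=${2:-1000}
    # Lines and bytes actually present in the sample (may be fewer than nsample)
    sampled=$(head -n "$nsample" "$file" | wc -l)
    headbytes=$(head -n "$nsample" "$file" | wc -c)
    # Apparent file size in bytes
    filesize=$(wc -c < "$file")
    if [ "$headbytes" -eq 0 ] || [ "$sampled" -eq 0 ]; then
        echo 0
    else
        # Multiply first so integer division keeps precision
        echo $((filesize * sampled / headbytes))
    fi
)

# Demo: build a throwaway file of identical lines, then compare both counts
tmp=$(mktemp)
awk 'BEGIN { for (i = 0; i < 200000; i++) print "some fixed line" }' > "$tmp"
echo "estimate: $(estimate_lines "$tmp")"
echo "exact:    $(wc -l < "$tmp")"
rm -f "$tmp"
```

With identical lines the two numbers should agree. On real data they will drift apart to the extent that the first lines are shorter or longer than the rest, for example when a file starts with a long header.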