Perl …

So I was recently sent about 60 GB of data on a hard drive. The contents were a single file of comma-separated values, and I was warned that the data would probably be a bit messy, though no one had actually looked at it yet. If I had some time, I was asked to look over the proposed project and give feedback on what would be needed to accomplish the deliverables and what changes to the proposal might be needed if we were to accept the project.
In many cases this is a quick task, since the dataset is smaller or the content is already on a cloud system we can access, but in this case the data was just on a hard drive. This was also a project we did not want to commit many resources to at this time, and since I was travelling, I wanted a method I could use while on the plane.
I attempted to read the contents with Pandas, and that did not work due to the file size, which was pretty much expected. I then turned to Vaex; some tasks worked fine, but more complicated tasks produced errors, again due to the large file size. I thought about turning to Polars but was a bit tired of trying python methods at this point; Polars will be a learning project in the near future. Instead, I turned to Perl and Git Bash. I had used this combination in the past and was impressed with its speed and simplicity.
I was not disappointed. I determined how many rows of data I had, took a couple of samples of about 10M rows each, and used a simple python script to get a nice overview of what the data looked like from the smaller files. The impressive part was that each sample was generated in under 30 seconds, using only a single line of commands. Compared to using python and the csv package, this was much faster. I had seen many suggestions on Stack Overflow to use csv.reader, but when your file is over 100M rows, that is a very slow option.
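Counting the rows, for example, is itself a one-liner (this may not be exactly what I ran, but it does the job: with perl's -n flag the file is read line by line, $. holds the current line number, and the END block prints the final count once the whole file has been read):
perl -ne 'END { print "$.\n" }' large_filename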
Here are two more commands that I thought were very useful:
To view the first few lines of the file:
head -3 filename
To get a random sample of roughly 5% of the data:
perl -ne 'print if (rand() < .05)' large_filename > sampled_data_filename
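One caveat with a pure rand() filter: the header row is only kept 5% of the time, like every other line. A small variant using $. (perl's line counter) always keeps line 1:
perl -ne 'print if ($. == 1 || rand() < .05)' large_filename > sampled_data_filename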
My next task will be to try Polars and see how I like it, and to read more about Perl's capabilities for manipulating data. I found a free open source book entitled Data Munging with Perl, which will be my starting point. The book is over 20 years old, but the focus of its content is in line with my goals.
With all of that being said, I really appreciate the speed and simplicity of this tool.
Side note: Although this is not a perl command, I also often use this terminal command to get hints about the data type of my document.
file --exclude encoding filename_inserted_here.csv