Monthly Archives: December 2022

Python

Python – Read CSV

One of the more important things I need to attend to is reading a CSV file and examining it. While there is a plethora of documentation on this, since this is my blog I’m documenting my most used cases.

dfOriginalCSV = pd.read_csv("csvFile.csv", sep=",", dtype=str, keep_default_na=False, encoding='utf-8')

So the file is csvFile.csv, while we don’t have to declare it the sep provides the separator character in case of those pesky pipes. By declaring the dtype of str we’re saying the whole thing is a string so it doesn’t do odd tricks with numbers. The keep default na suppresses pythons overwhelming desire to put nan into anything that doesn’t seem like a proper value and of course always account for the encoding.

PowerShell

PowerShell – Count number of commas in each row of a CSV

So sometimes your data is so bad that you can’t trust a two column CSV to be turned into a dictionary, or more aptly, the file that is supposed to be two columns won’t open with in python to convert it to a dictionary because it doesn’t see two columns. So where is the issue? Probably a column count, which you can usually figure out if there are extra in excel but what if there are less? This PS snippet will read a given CSV, count the number of columns in the header row and report on any difference in the file, assuming your header row is correct this tells you which lines aren’t helping to validate a file.

$data = Get-Content location.csv

#Count in header
$header = $data[0].Split(",").Count
#Line counter
$i = 1

#Find lines with less or more commas
$data | ForEach-Object { 
    $c = $_.Split(",").Count
    if($c -ne $header) { "Line $i - $c commas" }
    $i++
}