What if part of your reporting database is a set of CSV files?
Apart from possible problems with differing numbers of columns, data types, delimiters, encodings, etc., you have to care about performance.
In my experience, a large number of files kills productivity. It is better to first combine the CSV/TXT files into one and then use Power Query to load it.
Experiment with a reasonable number of files
In this scenario I simply read and load all the files into one table:
| Set of files | One BIG file |
| --- | --- |
| 40 files | 1 file |
| 584 MB | 598 MB |
| 1 min 15 sec | 53 sec |
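For reference, the "set of files" query follows the standard folder-import pattern. Here is a minimal M sketch; the folder path, delimiter and encoding are assumptions, so adjust them to your environment:

```
let
    // Hypothetical folder with the source files
    Source = Folder.Files("C:\Data\Csv"),
    // Keep only CSV files
    CsvFiles = Table.SelectRows(Source, each Text.EndsWith(Text.Lower([Name]), ".csv")),
    // Parse each file and promote its first row to headers
    // (comma delimiter and Windows-1252 encoding are assumptions)
    Parsed = List.Transform(
        CsvFiles[Content],
        each Table.PromoteHeaders(
            Csv.Document(_, [Delimiter = ",", Encoding = 1252]),
            [PromoteAllScalars = true]
        )
    ),
    // Append everything into one table
    Combined = Table.Combine(Parsed)
in
    Combined
```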
With transformations like “Change Data Type”, “Remove Columns” and “Replace Values”:
| Set of files | One BIG file |
| --- | --- |
| 1 min 29 sec | 1 min 4 sec |
As we can see, simple transformations only slightly increased the load time.
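Those transformations correspond to steps like the following. This is a sketch only: “Combined” is the result of the previous query, and all column names and types are assumptions:

```
let
    // "Combined" refers to the appended table from the previous sketch
    Source = Combined,
    // "Change Data Type" on assumed columns
    Typed = Table.TransformColumnTypes(Source, {{"Date", type date}, {"Amount", type number}}),
    // Remove some (hypothetical) columns
    Slim = Table.RemoveColumns(Typed, {"Comment", "InternalId"}),
    // Replace a placeholder value with null in an assumed "Region" column
    Cleaned = Table.ReplaceValue(Slim, "N/A", null, Replacer.ReplaceValue, {"Region"})
in
    Cleaned
```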
Let’s double the number of files:
| Set of files | One BIG file |
| --- | --- |
| 80 files | 1 file |
| 1.14 GB | 1.2 GB |
| 2 min 45 sec | 1 min 40 sec |
Difference: 1 min 5 sec. Significant!
It seems that it is always better to combine the files into one.
How to combine CSV files into one
Locate the folder with your CSV/TXT files in File Explorer, hold SHIFT and right-click to open a command window in that folder:
Use the command:

```
copy *.csv result.csv
```
All CSV files in the folder will be combined into one.
If each file has its own header row, filter the duplicates out later in Power Query, as shown in the sketch below.
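In M, that filter can be a single step. A minimal sketch, assuming the combined file is result.csv and its first column is called “Date”:

```
let
    // Hypothetical path to the combined file
    Source = Csv.Document(File.Contents("C:\Data\Csv\result.csv"), [Delimiter = ",", Encoding = 1252]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Drop the repeated header rows that came from the other files
    NoExtraHeaders = Table.SelectRows(Promoted, each [Date] <> "Date")
in
    NoExtraHeaders
```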
If you want to save the resulting file to another folder, I recommend reading more about the COPY command here and here.
Experiment with a large number of files
Around a year ago, I worked with ~2000 CSV files.
I don’t have those files anymore, but I remember that it took around 30 minutes to load them all into Power BI Desktop.
Then I combined them into one file, and the load time dropped to 5 minutes!
Out of curiosity, I decided to check the current version of Power BI Desktop and confirm the “big file theory”.
I generated fake files and compared the load times (read and load, without transformations):
| Set of files | One BIG file |
| --- | --- |
| 679 files | 1 file |
| 2.4 GB | 2.5 GB |
| 5 min 19 sec | 3 min |
We can see a significant difference! The Data Model receives the same information, but we free up workstation resources more than 2 minutes earlier.
And this is just a “read and load”: no transformations, no relationships in the data model.
CSV vs XLSX
You may read here why I do not consider XLSB as a source for Power Query.
Just a few days ago, Ken Puls published very interesting results from his experiment with CSV and Excel files. It is a must-read if your reporting database is a set of files.
Although XLSX is slower, it can still be useful in simple scenarios.
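For completeness, reading a sheet from an XLSX file in Power Query looks roughly like this; the file and sheet names are assumptions:

```
let
    Source = Excel.Workbook(File.Contents("C:\Data\report.xlsx"), true),
    // "Sheet1" is an assumed sheet name
    Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data]
in
    Sheet1
```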
When I plan the architecture of a reporting solution, I keep in mind the following matrix:
| | Small data volume | Large data volume |
| --- | --- | --- |
| Small number of files | XLSX, or CSV if you are sure of the data quality | CSV |
| Large number of files (>24) | CSV -> combine into one big CSV | CSV -> combine into one big CSV |
Of course, there should be a balance between the number of files and the data volume. At some point it is better to move away from XLSX.
Note that with CSV and TXT you have to take care of encoding, delimiters, possible “carriage return” symbols, quotes, etc. If you have no control over CSV creation and the files come from an untrusted source, be ready for unpleasant surprises.
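As a defensive pattern, I find it safer to set the delimiter, encoding and quote handling explicitly instead of relying on auto-detection. A sketch, with the file name and UTF-8 encoding as assumptions:

```
let
    Source = Csv.Document(
        File.Contents("C:\Data\untrusted.csv"),  // hypothetical file
        [
            Delimiter = ",",
            Encoding = 65001,           // UTF-8; verify against the actual files
            QuoteStyle = QuoteStyle.Csv // treat line breaks inside quoted fields as data
        ]
    ),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted
```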