[HN Gopher] Ask HN: How would you chunk a large Excel file?
___________________________________________________________________
Ask HN: How would you chunk a large Excel file?
Let's say you had an Excel file with 10,000 rows and you wanted to
break it up into many Excel files each with 500 records. Each new
file should have the header fields from the original. How would you
do it? I did it by writing a node script but I'm wondering if
there's an easier way. Edit: Guys this is just an example. I'm
looking for a general solution. It could be 10 million rows.
Author : codingclaws
Score : 9 points
Date : 2024-05-26 20:38 UTC (2 hours ago)
| gidorah wrote:
| It's Excel, so just write a macro?
| mattmanser wrote:
| Bet it took you more time to write that code than if you'd done
| it manually.
|
| If you actually had to do it 'properly', there are a ton of
| options:
|     - do it old school with a VBA macro
|     - use the newer JS macro stuff
|     - .xlsx files are just a zip of XML, so you could do it in
|       pretty much any language
| qup wrote:
| That's only 20 cut and paste operations? I would certainly do
| that manually before trying to code it. And I'm a proficient
| coder, and not a proficient Excel user.
|
| If it had 100k rows, I'd be out of my depth, so I'd hit google.
| binarymax wrote:
| I'm not at my machine to whip up an example but this is an ideal
| use case for pandas. You can read or stream the excel file and
| split it in probably <20 lines of python.
| klyrs wrote:
| Not knowing what the data is like, I'd save it in a comma- or
| tab-separated text format. From there, it's just a few lines of
| bash.
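
The CSV route klyrs describes can also be sketched in stdlib Python, which sidesteps quoting pitfalls a naive bash split would hit (embedded commas, quoted newlines). This is a minimal sketch; the `input.csv` / `chunk_N.csv` names are illustrative, and the header row is re-written at the top of every chunk:

```python
import csv

def write_chunk(header, rows, out_path):
    """Write one chunk file with the header repeated on top."""
    with open(out_path, "w", newline="") as out:
        w = csv.writer(out)
        w.writerow(header)
        w.writerows(rows)

def split_csv(path, rows_per_file=500):
    """Split a CSV into chunk_1.csv, chunk_2.csv, ... and return the count."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)          # first row = column names
        chunk, n = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                n += 1
                write_chunk(header, chunk, f"chunk_{n}.csv")
                chunk = []
        if chunk:                      # trailing partial chunk
            n += 1
            write_chunk(header, chunk, f"chunk_{n}.csv")
    return n
```

Because `csv.reader` streams the file row by row, this also scales to the "10 million rows" case without loading everything into memory.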
| allsummer wrote:
| This is so quaint. I love it.
|
| Not even going to ask AI. Waiting for a bash one-liner before OP
| reminds us they're on a corporate Windows machine.
| hgyjnbdet wrote:
| Then a powershell one liner would do it.
| namrog84 wrote:
| Write a script. I know there are decent api for excel and files.
|
| I dont know the api or recommended scripting language. This would
| be a good case for chatgpt or equivalent type task. Enough to get
| started.
|
| edit: I asked ChatGPT, and it recommended Python and pandas for
| interacting with Excel:
|
|     import pandas as pd
|
|     # Load the data from an Excel file; headers are taken from
|     # the first row by default
|     data = pd.read_excel('path/to/your/file.xlsx')
|
|     # Define the number of records per chunk
|     chunk_size = 500
|
|     # Split the data into chunks and write each chunk to a new
|     # Excel file
|     for i in range(0, len(data), chunk_size):
|         # Extract the chunk of data based on the current index range
|         chunk = data.iloc[i:i + chunk_size]
|         # Write the chunk to a new Excel file, including headers
|         # as column names
|         chunk.to_excel(f'output_{i // chunk_size + 1}.xlsx', index=False)
|
| I asked about the header row, and it claims pandas includes it in
| each chunk, but I haven't verified that. It's at least a place to
| start from; you'd need to iterate further with real code/tests.
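
namrog84's doubt is easy to check without touching Excel files at all: slicing a DataFrame with `.iloc` keeps the column labels on every chunk, and `to_excel` (with its default `header=True`) writes those labels as the first row of each output file. A quick sketch, assuming pandas is installed and using made-up example data:

```python
import pandas as pd

# Build a toy frame standing in for the loaded spreadsheet
df = pd.DataFrame({"id": range(1200),
                   "name": [f"r{i}" for i in range(1200)]})

# Slice it the same way the ChatGPT snippet does
chunks = [df.iloc[i:i + 500] for i in range(0, len(df), 500)]

# Every slice keeps the original column labels, which to_excel
# would write as the header row of each file by default
assert all(list(c.columns) == ["id", "name"] for c in chunks)
assert [len(c) for c in chunks] == [500, 500, 200]
```

So the header does carry over to each chunk; no extra work is needed beyond `index=False`.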
| orhmeh09 wrote:
| I would use R. Phoneposting now, but something like:
|
|     library(tidyverse)
|     library(readxl)
|     library(writexl)
|
|     read_excel("file.xlsx") %>%
|       group_by(group_id = row_number() %/% 20) %>%
|       group_walk(~ write_xlsx(.x, paste0("file_", .y, ".xlsx")))
|
| edit: updated to write xlsx instead of csv
| nutrie wrote:
| It's been well over a decade since I last dealt with Excel, but I
| remember you could actually query the data with SQL without
| opening the file, like you would with any flat-file db. If the
| size is the problem. It was poorly documented but I'd done it a
| few times and it worked really well. The best part being it was
| simple, fast and worked even with locked files. Otherwise I don't
| understand the question.
| arh68 wrote:
| The C# interop API will surely work, but I wouldn't claim it's
| user-friendly.
|
| I would strongly consider dumping rows into SQL, for more
| "natural" selection.
|
| [1] https://learn.microsoft.com/en-
| us/dotnet/api/microsoft.offic...
| bigyikes wrote:
| Not to be that guy, but this is the kind of task that LLMs are
| great at assisting with.
| MattGaiser wrote:
| Seems like a 30 second LLM prompt for a Python script.
| ttymck wrote:
| https://stackoverflow.com/questions/72110208/python-split-1-...
| tomrod wrote:
| With something in Excel that needs to chunk, Python is good, but
| also Go and Rust are good for these situations. ChatGPT is a good
| starting point to build out some boilerplate.
|
| For readable/not binary files, there are standard tools like
| split, awk, etc.
| abrichr wrote:
| Via ChatGPT:
|
| ---
|
| ```python
| import pandas as pd
|
| def split_excel_file(input_file: str, output_file_prefix: str,
|                      rows_per_file: int = 500) -> None:
|     """
|     Splits an Excel file into multiple Excel files, each
|     containing a specified number of rows.
|
|     Args:
|         input_file (str): The path to the input Excel file.
|         output_file_prefix (str): The prefix for the output
|             Excel files.
|         rows_per_file (int): The number of rows per output file.
|     """
|     # Read the input Excel file
|     df = pd.read_excel(input_file)
|
|     # Calculate the number of files needed
|     total_rows = len(df)
|     num_files = (total_rows + rows_per_file - 1) // rows_per_file
|
|     for i in range(num_files):
|         start_row = i * rows_per_file
|         end_row = start_row + rows_per_file
|         df_chunk = df.iloc[start_row:end_row]
|
|         # Generate the output file name
|         output_file = f"{output_file_prefix}_{i+1}.xlsx"
|
|         # Write the chunk to a new Excel file
|         df_chunk.to_excel(output_file, index=False)
|
| # Usage example:
| split_excel_file('input.xlsx', 'output_file', 500)
| ```
|
| This script reads the entire Excel file, splits it into chunks of
| 500 rows, and writes each chunk to a new Excel file with the
| header fields included. Adjust `input.xlsx` and `output_file`
| with your actual file names.
| hgyjnbdet wrote:
| VBA macro in the original excel file.
| tanin wrote:
| You can use my SQL spreadsheet app: https://superintendent.app
|
| I had a similar problem at work where I needed to do some formula
| on a 5 GB CSV file. Excel can't handle more than 1M rows.
| Database through command line is too clunky. Eventually I built a
| GUI wrapper on SQLite, and it grew into Superintendent.app (now
| powered by DuckDB).
| guidedlight wrote:
| This looks great. Well done.
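
The SQL approach tanin describes (and nutrie alludes to) can be approximated with stdlib Python alone: load the rows into a SQLite table and page through it with LIMIT/OFFSET. This is only a sketch of the paging pattern, not Superintendent.app's actual implementation; the table name and helper are made up for illustration:

```python
import sqlite3

def chunked_select(rows, header, rows_per_file=500):
    """Load rows into an in-memory SQLite table, then yield
    (chunk_number, rows) pages of rows_per_file rows each."""
    con = sqlite3.connect(":memory:")
    cols = ", ".join(f'"{c}"' for c in header)
    placeholders = ", ".join("?" for _ in header)
    con.execute(f"CREATE TABLE data ({cols})")
    con.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
    offset, n = 0, 0
    while True:
        # LIMIT/OFFSET walks the table one page at a time
        page = con.execute(
            "SELECT * FROM data LIMIT ? OFFSET ?",
            (rows_per_file, offset),
        ).fetchall()
        if not page:
            break
        n += 1
        yield n, page
        offset += rows_per_file
    con.close()
```

Each page could then be written out with `csv.writer` or `to_excel`. For genuinely huge inputs, an on-disk database (or DuckDB, as tanin mentions) avoids holding everything in memory.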
___________________________________________________________________
(page generated 2024-05-26 23:01 UTC)