# Reading thermometer temperatures over time from a video

Natalie has been experimenting with using a microwave as a kiln for pottery, specifically for Raku. She wanted to understand how long to run the microwave to reach the desired temperature for the Raku glazes to work. But... her thermometer can't go in the microwave while it's running. So she needs to remove the kiln component from the microwave, attach it to the thermometer and measure how it cools over time, in order to calculate the peak temperature it reached while the microwave was running.

The tedious way to do this would be to watch the thermometer for over an hour as it cools down, making notes on the temperature at different points and then plotting those on a graph to derive that original temperature.

Or... how about recording a video of the thermometer and extracting the readings via OCR instead?

We rigged up an iPhone on a stand, pointed it at the digital LCD on the thermometer, and left it running for nearly two hours.

## Splitting a video into frames every 10s

GPT-4 prompt:

> Write a zsh script to run on macOS with ffmpeg available that takes a path to a video file and then spits out a single frame from every 10s of video in a file with a filename that reflects the time within the video

It gave me back this:

```zsh
#!/bin/zsh

# Check if input file is provided
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <path_to_video>"
    exit 1
fi

# Check if ffmpeg is installed
if ! command -v ffmpeg >/dev/null 2>&1; then
    echo "ffmpeg is required but not installed. Please install ffmpeg first."
    exit 1
fi

INPUT_VIDEO="$1"
OUTPUT_DIR="frames"

# Create the output directory if it does not exist
mkdir -p "$OUTPUT_DIR"

# Get the duration of the video in seconds
DURATION=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$INPUT_VIDEO")
DURATION=$(printf "%.0f" "$DURATION")

# Extract a frame every 10 seconds
for ((i=0; i<=DURATION; i+=10)); do
    OUTPUT_FILENAME=$(printf "${OUTPUT_DIR}/frame_%02d:%02d:%02d.jpg" $((i/3600)) $(((i/60)%60)) $((i%60)))
    ffmpeg -ss "$i" -i "$INPUT_VIDEO" -vf "select=eq(pict_type\,PICT_TYPE_I)" -vframes 1 -q:v 2 "$OUTPUT_FILENAME" -y
done

echo "Frames extracted successfully in the ${OUTPUT_DIR} directory."
```

Executed like this:

```
./extract_frames.sh IMG_9684.MOV
```

This ran perfectly, without any modifications. It generated a folder full of JPG images, taken at 10s intervals from the raw video. Each image looked like this:

*A close-up of the screen of a digital thermometer. The model name is DM6801A+, then a large LCD shows 316, then a small C below it.*
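For what it's worth, ffmpeg's fps filter can do the same job in a single pass, without a separate seek for every frame. You lose the HH:MM:SS filenames - frame N lands at (N-1)×10s instead - and this sketch drives it from Python with subprocess (the output filename pattern is my own, not from the script above):

```python
import os
import subprocess

# Grab one frame every 10 seconds in a single pass over the video.
# frame_0001.jpg is at ~0s, frame_0002.jpg at ~10s, and so on.
os.makedirs("frames", exist_ok=True)
subprocess.run([
    "ffmpeg", "-i", "IMG_9684.MOV",
    "-vf", "fps=1/10",
    "-q:v", "2",
    "frames/frame_%04d.jpg",
], check=True)
```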
## Running OCR

This next step proved harder than I expected.

My first attempt used textra by Dylan Freedman. This is a fantastic little CLI tool which runs on macOS and uses the powerful VisionKit APIs on that platform to run OCR and audio extraction.

```
% textra frame_01:00:00.jpg -s
DM6801A+
```

Just one problem: it picked up the text at the top of the image, but it didn't get the output of the LCD display itself with the temperature.

Next I tried my own s3-ocr tool, which uses AWS Textract. I've had amazing results with this in the past, and while it did better here, it still wasn't nearly reliable enough at extracting the numeric readouts from this particular series of images.

I hadn't tried it myself, but I'd heard good things about Google Cloud Vision - so I gave that a go using their online demo, the Cloud Vision "Try the API" tool:

*My photo is on the left, with boxes drawn on it around the model number of the thermometer, the LCD temperature display and the little C shown below it. On the right is a Block 1 box with the model number and a Block 2 showing 834, the current temperature.*

That's exactly what I need! Next challenge: how to automate it?

## Using the Cloud Vision API

GPT-4 prompt:

> I want to use the google cloud vision API to run OCR from a python script
>
> First help me understand what access token credentials I need for this project and how to create them

Full transcript here.

Asking it about access token credentials was a bit of a stretch: I hate figuring out how to make these, and I was pretty confident that any steps it gave me would have been outdated by changes Google had made to their console interface since GPT-4's training cut-off date of September 2021.

My skepticism was misplaced! It gave me step-by-step instructions which mostly worked - and gave me enough information to figure out how to get hold of a JSON file representing credentials for a service account that could call the Cloud Vision API.

The code it gave me was useful too. I fired up a Jupyter notebook and evolved it until it looked like this:

```python
from google.cloud import vision
from google.oauth2 import service_account

def ocr_image(image_path, credentials_path):
    credentials = service_account.Credentials.from_service_account_file(credentials_path)
    client = vision.ImageAnnotatorClient(credentials=credentials)
    with open(image_path, "rb") as image_file:
        content = image_file.read()
    image = vision.Image(content=content)
    response = client.text_detection(image=image)
    texts = response.text_annotations
    return texts[0].description if texts else ""

image_path = "ocr-video-frames/frames-9679/frame_00:00:30.jpg"
credentials_path = "cloud-vision-ocr-382418-fee63d63124b.json"
result = ocr_image(image_path, credentials_path)
print(result)
```

Then I wrote my own code to run it against every image in my directory:

```python
import pathlib
import sqlite_utils

root = pathlib.Path("ocr-video-frames/")
db = sqlite_utils.Database("ocr.db")

for path in root.glob("*/*.jpg"):
    relative = str(path.relative_to(root))
    text = ocr_image(path, credentials_path)
    db["results"].insert({
        "key": relative,
        "text": text
    }, pk="key")
```

This gave me a SQLite database file containing the extracted text from every one of my images.

## Extracting the temperatures in Datasette with a regular expression

The OCR-extracted text wasn't just the temperatures I needed. For some of the frames it looked more like this:

```
DM6801A+
180
POWER
HOLD
0.1deg F
degF
TI
THERMOMETER
TYPE-K
1.0deg
degC
```

The bit I care about is the "180" - three digits with a newline before and after them.

I installed the new datasette-sqlite-regex plugin by Alex Garcia:

```
datasette install datasette-sqlite-regex
```

Then I opened up my SQLite database in Datasette and constructed the following query:

```sql
select
  regex_find('[0-9]{2}:[0-9]{2}:[0-9]{2}', key) as time,
  trim(regex_find('\n[0-9]{3}\n', text), char(10)) as temperature
from results
where key like 'frames/%'
  and temperature is not null
order by [time]
```

The resulting table looked like this:

| time | temperature |
| --- | --- |
| 00:02:00 | 830 |
| 00:02:10 | 834 |
| 00:02:20 | 836 |
| 00:02:40 | 834 |
| 00:03:20 | 820 |
| 00:03:30 | 816 |
| 00:03:40 | 812 |

I used datasette-copyable to copy the data out to a Google Sheet. Here's my first attempt at charting the data, which makes it easy to spot places where the OCR got the wrong results:

*Google Sheets table and chart representing the data. It starts at 830°C and drops down over 1hr 40min. The line chart has some very obvious spikes and troughs from outliers but is mostly smooth.*
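Those spikes and troughs are frames where the OCR misread a digit. Before doing anything quantitative with the series it's worth smoothing them out, and a rolling median is a simple way to do that: an isolated misread gets out-voted by its neighbours. A minimal sketch, assuming the query results have been exported as (time, temperature) tuples:

```python
from statistics import median

def smooth(readings, window=5):
    # Replace each temperature with the median of a small window
    # around it, which suppresses isolated OCR misreads.
    smoothed = []
    for i, (time, _) in enumerate(readings):
        lo = max(0, i - window // 2)
        hi = i + window // 2 + 1
        smoothed.append((time, median(t for _, t in readings[lo:hi])))
    return smoothed

readings = [
    ("00:02:00", 830), ("00:02:10", 834), ("00:02:20", 836),
    ("00:02:40", 834), ("00:03:20", 820), ("00:03:30", 816),
    ("00:03:40", 812),
]
print(smooth(readings))
```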
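And to close the loop on the original question: Newton's law of cooling says the kiln's temperature should decay exponentially towards ambient, T(t) = T_env + (T0 - T_env)·e^(-kt), so fitting that curve to the cleaned-up readings and evaluating it at t = 0 estimates the temperature at the moment the recording started - which is as close to the in-microwave peak as this setup can get. A sketch using scipy, fitting just the post-peak rows from the table above, with times converted to seconds:

```python
import numpy as np
from scipy.optimize import curve_fit

def cooling(t, t_env, t0, k):
    # Newton's law of cooling: exponential decay towards ambient.
    return t_env + (t0 - t_env) * np.exp(-k * t)

# Post-peak rows from the table above (the 830/834 readings before
# the 836 peak were still climbing, so they are excluded).
times = np.array([140.0, 160.0, 200.0, 210.0, 220.0])
temps = np.array([836.0, 834.0, 820.0, 816.0, 812.0])

(t_env, t0, k), _ = curve_fit(
    cooling, times, temps, p0=(20.0, 900.0, 0.001), maxfev=10_000
)
print(f"Estimated temperature at the start of the video: {cooling(0.0, t_env, t0, k):.0f}C")
```

With only five points and three parameters this particular fit is poorly constrained - the full 100-minute series would pin it down much better. The point here is the shape of the calculation, not the number it prints.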
## Related

* **aws**: Running OCR against a PDF file with AWS Textract - 2022-06-28
* **llms**: Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp - 2023-03-10
* **llms**: Training nanoGPT entirely on content from my blog - 2023-02-09
* **python**: Calculating embeddings with gtr-t5-large in Python - 2023-01-31
* **llms**: Running nanoGPT on a MacBook M2 to generate terrible Shakespeare - 2023-02-01

Created 2023-04-02T10:29:24-07:00, updated 2023-04-02T10:51:58-07:00