# About

scrapeghost is an experimental library for scraping websites using OpenAI's GPT. The library provides a means to scrape structured data from HTML without writing page-specific code.

**Important**

Before you proceed, here are at least three reasons why you should not use this library:

* It is very experimental; no guarantees are made about the stability of the API or the accuracy of the results.
* It relies on the OpenAI API, which is quite slow and can be expensive. (See costs before using this library.)
* It is currently licensed under the Hippocratic License 3.0. (See the FAQ.)

Use at your own risk.

## Quickstart

Step 1) Obtain an OpenAI API key (https://platform.openai.com) and set an environment variable:

```bash
export OPENAI_API_KEY=sk-...
```

Step 2) Install the library however you like:

```bash
pip install scrapeghost
```

or

```bash
poetry add scrapeghost
```

Step 3) Instantiate a SchemaScraper by defining the shape of the data you wish to extract:

```python
from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
```

**Note**

There's no pre-defined format for the schema; the GPT models do a good job of figuring out what you want, and you can use whatever values you like to provide hints.

Step 4) Passing a URL (or HTML) to the resulting scraper returns the scraped data:

```python
resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data
```

```json
{
  "name": "Emanuel 'Chris' Welch",
  "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
  "district": "7th",
  "party": "D",
  "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
  "offices": [
    {"name": "Springfield Office", "address": "300 Capitol Building, Springfield, IL 62706", "phone": "(217) 782-5350"},
    {"name": "District Office", "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154", "phone": "(708) 450-1000"}
  ]
}
```

That's it! Read the tutorial for a step-by-step guide to building a scraper.

### Command Line Usage Example

If you've installed the package (e.g. with pipx), you can use the scrapeghost command line tool to experiment.

```sh
#!/bin/sh
scrapeghost https://www.ncleg.gov/Members/Biography/S/436 \
  --schema "{'first_name': 'str', 'last_name': 'str', 'photo_url': 'url', 'offices': []}" \
  --css div.card | python -m json.tool
```

```json
{
    "first_name": "Gale",
    "last_name": "Adcock",
    "photo_url": "https://www.ncleg.gov/Members/MemberImage/S/436/Low",
    "offices": [
        {
            "type": "Mailing",
            "address": "16 West Jones Street, Rm. 1104, Raleigh, NC 27601"
        },
        {
            "type": "Office Phone",
            "phone": "(919) 715-3036"
        }
    ]
}
```

See the CLI docs for more details.
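The same scrape can also be written with the Python API from the quickstart. The sketch below is a rough equivalent, not a verbatim recipe: the `CSS` import and the `extra_preprocessors` parameter (standing in for the CLI's `--css div.card` pre-filter) are assumptions to check against the API reference, while the schema and call pattern mirror what is shown above.

```python
# Assumption: a CSS preprocessor helper is importable from scrapeghost;
# confirm the exact import path in the API reference.
from scrapeghost import SchemaScraper, CSS

# Same schema as the CLI example above.
scrape_member = SchemaScraper(
    schema={
        "first_name": "str",
        "last_name": "str",
        "photo_url": "url",
        "offices": [],
    },
    # Assumed Python counterpart of the CLI's `--css div.card` pre-filter;
    # verify the parameter name before relying on it.
    extra_preprocessors=[CSS("div.card")],
)

resp = scrape_member("https://www.ncleg.gov/Members/Biography/S/436")
print(resp.data)
```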
## Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT. While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use.

**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.

**Preprocessing**

* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.
* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.
* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing larger pages to be scraped.

**Postprocessing**

* **JSON validation** - Ensure that the response is valid JSON (with the option to kick it back to GPT for fixes if it's not).
* **Schema validation** - Go a step further and use a pydantic schema to validate the response (see the sketch after this list).
* **Hallucination check** - Does the data in the response truly exist on the page?

**Cost Controls**

* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
* Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed).
* Allows setting a budget and stops the scraper if the budget is exceeded.
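To make the postprocessing and cost-control features concrete, here is a minimal sketch of schema validation with pydantic plus cost tracking. It assumes a pydantic model can be passed in place of a dict schema and that running totals are exposed via a `stats()` method; both names are extrapolated from the feature list above, so verify them against the API reference before relying on them.

```python
from pydantic import BaseModel

from scrapeghost import SchemaScraper


class Office(BaseModel):
    name: str
    address: str
    phone: str


class Legislator(BaseModel):
    name: str
    url: str
    district: str
    party: str
    photo_url: str
    offices: list[Office]


# Assumption: a pydantic model can be used in place of a plain dict schema,
# enabling the schema-validation postprocessing step described above.
scrape_legislators = SchemaScraper(Legislator)

resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
print(resp.data)

# Assumption: running token/cost totals (the "Cost Controls" feature above)
# are exposed on the scraper; the exact accessor name may differ.
print(scrape_legislators.stats())
```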