# About

scrapeghost is an experimental library for scraping websites using OpenAI's GPT. The library provides a means to scrape structured data from HTML without writing page-specific code.

**Important**

Before you proceed, here are at least three reasons why you should not use this library:

* It is very experimental; no guarantees are made about the stability of the API or the accuracy of the results.
* It relies on the OpenAI API, which is quite slow and can be expensive. (See costs before using this library.)
* It is currently licensed under the Hippocratic License 3.0. (See the FAQ.)

Use at your own risk.

## Quickstart

Step 1) Obtain an OpenAI API key (https://platform.openai.com) and set an environment variable:

```bash
export OPENAI_API_KEY=sk-...
```

Step 2) Install the library however you like:

```bash
pip install scrapeghost
```

or

```bash
poetry add scrapeghost
```

Step 3) Instantiate a SchemaScraper by defining the shape of the data you wish to extract:

```python
from scrapeghost import SchemaScraper

scrape_legislators = SchemaScraper(
    schema={
        "name": "string",
        "url": "url",
        "district": "string",
        "party": "string",
        "photo_url": "url",
        "offices": [{"name": "string", "address": "string", "phone": "string"}],
    }
)
```

**Note**

There's no pre-defined format for the schema; the GPT models do a good job of figuring out what you want, and you can use whatever values you like to provide hints.

Step 4) Passing a URL (or HTML) to the resulting scraper returns the scraped data:

```python
resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
resp.data
```

```json
{
  "name": "Emanuel 'Chris' Welch",
  "url": "https://www.ilga.gov/house/Rep.asp?MemberID=3071",
  "district": "7th",
  "party": "D",
  "photo_url": "https://www.ilga.gov/images/members/{5D419B94-66B4-4F3B-86F1-BFF37B3FA55C}.jpg",
  "offices": [
    {"name": "Springfield Office", "address": "300 Capitol Building, Springfield, IL 62706", "phone": "(217) 782-5350"},
    {"name": "District Office", "address": "10055 W. Roosevelt Rd., Suite E, Westchester, IL 60154", "phone": "(708) 450-1000"}
  ]
}
```

That's it! Read the tutorial for a step-by-step guide to building a scraper.

### Command Line Usage Example

If you've installed the package (e.g. with pipx), you can use the scrapeghost command line tool to experiment.

```sh
#!/bin/sh
scrapeghost https://www.ncleg.gov/Members/Biography/S/436 \
  --schema "{'first_name': 'str', 'last_name': 'str', 'photo_url': 'url', 'offices': []}" \
  --css div.card | python -m json.tool
```

```json
{
    "first_name": "Gale",
    "last_name": "Adcock",
    "photo_url": "https://www.ncleg.gov/Members/MemberImage/S/436/Low",
    "offices": [
        {
            "type": "Mailing",
            "address": "16 West Jones Street, Rm. 1104, Raleigh, NC 27601"
        },
        {
            "type": "Office Phone",
            "phone": "(919) 715-3036"
        }
    ]
}
```

See the CLI docs for more details.
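The same scrape can also be written with the Python API from the quickstart. The sketch below is a rough equivalent, not a verbatim recipe: the `CSS` import and the `extra_preprocessors` parameter (standing in for the CLI's `--css div.card` pre-filter) are assumptions to check against the API reference, while the schema and call pattern mirror what is shown above.

```python
# Assumption: a CSS preprocessor helper is importable from scrapeghost;
# confirm the exact import path in the API reference.
from scrapeghost import SchemaScraper, CSS

# Same schema as the CLI example above.
scrape_member = SchemaScraper(
    schema={
        "first_name": "str",
        "last_name": "str",
        "photo_url": "url",
        "offices": [],
    },
    # Assumed Python counterpart of the CLI's `--css div.card` pre-filter;
    # verify the parameter name before relying on it.
    extra_preprocessors=[CSS("div.card")],
)

resp = scrape_member("https://www.ncleg.gov/Members/Biography/S/436")
print(resp.data)
```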
## Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT. While the bulk of the work is done by the GPT model, scrapeghost provides a number of features to make it easier to use.

**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or as little detail as you want.

**Preprocessing**

* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.
* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.
* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing larger pages to be scraped.

**Postprocessing**

* **JSON validation** - Ensure that the response is valid JSON (with the option to kick it back to GPT for fixes if it's not).
* **Schema validation** - Go a step further and use a pydantic schema to validate the response (see the sketch after this list).
* **Hallucination check** - Does the data in the response truly exist on the page?

**Cost Controls**

* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
* Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed).
* Allows setting a budget and stops the scraper if the budget is exceeded.
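To make the postprocessing and cost-control features concrete, here is a minimal sketch of schema validation with pydantic plus cost tracking. It assumes a pydantic model can be passed in place of a dict schema and that running totals are exposed via a `stats()` method; both names are extrapolated from the feature list above, so verify them against the API reference before relying on them.

```python
from pydantic import BaseModel

from scrapeghost import SchemaScraper


class Office(BaseModel):
    name: str
    address: str
    phone: str


class Legislator(BaseModel):
    name: str
    url: str
    district: str
    party: str
    photo_url: str
    offices: list[Office]


# Assumption: a pydantic model can be used in place of a plain dict schema,
# enabling the schema-validation postprocessing step described above.
scrape_legislators = SchemaScraper(Legislator)

resp = scrape_legislators("https://www.ilga.gov/house/rep.asp?MemberID=3071")
print(resp.data)

# Assumption: running token/cost totals (the "Cost Controls" feature above)
# are exposed on the scraper; the exact accessor name may differ.
print(scrape_legislators.stats())
```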