https://github.com/minosvasilias/godot-dodo
# godot-dodo

*Godot-Dodo logo imagined by Midjourney v5*

The godot-dodo project presents a pipeline for finetuning open-source language models on human-created, language-specific code retrieved from GitHub. In this case the targeted language is GDScript, but the same methodology can be applied to other languages.

This repository includes the following:

* Scripts to assemble the finetuning dataset
* Pre-assembled, raw datasets (up to 60k rows in size)
* Scripts to finetune a model
* Links to model weights
* A performance report comparing finetuned models

## Performance

For comprehensive results explaining the methodology used and a full list of all results, please refer to the full performance report here.

In summary, godot_dodo models achieve significantly greater consistency than gpt-4/gpt-3.5-turbo when it comes to generating accurate GDScript syntax, but are somewhat less capable of following complex instructions.

## Concept

### How?

Unlike other, similar approaches to finetuning models such as stanford-alpaca, this approach does not use existing, larger language models to produce the output values of the finetuning dataset. All code used is human-created. Language models are instead used only to label each code snippet. As such, we can assemble comment:code data pairs in the style of CodeSearchNet, making use of powerful existing models to annotate high-quality human-created code.

### Why?

Some existing language models such as gpt-4 are excellent coders. However, much of their ability is concentrated in the most popular languages, such as Python or JavaScript. Less widely used languages are underrepresented in the training data and suffer a massive performance drop-off: models routinely get syntax wrong or hallucinate language features that do not exist. This project aims to provide much more robust language-specific models that can reliably generate code that compiles on the first try.

## Dataset Generation

Because this approach relies on human-created data, we scrape GitHub repositories using the GitHub search API. Using the `language:gdscript` search term, we retrieve a list of repositories containing GDScript code. We also use `license:mit` to limit the dataset to suitably licensed repositories. Only MIT-licensed code is used for training! (A sketch of such a query is shown at the end of this section.)

We then clone each repository and apply the following logic:

* Find the `project.godot` file
* Detect whether the project is made for 3.x or 4.x Godot engine versions
* Iterate through all `.gd` files found in the repository
* Split each file into individual functions
* For each function found, ask an existing LLM (gpt-3.5-turbo) for a detailed comment describing the function's purpose
* Add the resulting instruction:response data pair to the dataset

Note that existing, human-written comments located above a code block are not used for the instruction value. We are interested in consistently detailed comments, rather than trying to preserve some potentially higher-quality human-written ones. Human comments within the code block, however, are preserved.

### Run

To assemble a dataset yourself, follow these instructions:

* Run `python data/generate_unlabeled_dataset.py`
* Run `python data/label_dataset.py`

Please note that you will need GitHub and OpenAI API keys in order to use these scripts.
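For illustration, here is a minimal sketch of the kind of repository search the pipeline performs. This is not the repository's actual script (see `data/generate_unlabeled_dataset.py` for that); it assumes a `GITHUB_TOKEN` environment variable and calls the public GitHub search endpoint directly:

```python
import os
import requests

# Minimal sketch: list MIT-licensed GDScript repositories via the GitHub search API.
# Hypothetical standalone example; the real logic lives in data/generate_unlabeled_dataset.py.
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
params = {"q": "language:gdscript license:mit", "per_page": 100}
response = requests.get(
    "https://api.github.com/search/repositories", headers=headers, params=params
)
response.raise_for_status()
for repo in response.json()["items"]:
    print(repo["full_name"], repo["clone_url"])
```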
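And here is a hypothetical example of what a single instruction:response pair might look like. The field names and the comment wording are illustrative assumptions, not taken from the actual dataset files; check the files in `data` for the real schema:

```python
# Hypothetical dataset row (field names assumed; see the data folder for the real schema).
pair = {
    # Instruction: the gpt-3.5-turbo-generated comment describing the function.
    "instruction": "Heals the player by the given amount, clamping health to max_health.",
    # Response: the human-written GDScript function scraped from GitHub.
    "response": "func heal(amount: int) -> void:\n\thealth = min(health + amount, max_health)",
}
```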
## Pre-assembled datasets

Pre-assembled datasets included in this repository:

* godot_dodo_4x_60k
  + Assembled using 4.x Godot projects - ~60k rows

Further datasets may be added in the future (particularly for 3.x data).

## Finetuning

The finetuning process closely mirrors the one introduced by stanford_alpaca. To reproduce a finetuned version of LLaMA, please follow the steps below.

### Hardware Requirements

To effectively finetune a llama-7b or llama-13b model, it is highly recommended to use at least two A100 80GB GPUs. You may otherwise encounter out-of-memory errors or extremely long training times, and will need to adjust the training parameters. For finetuning godot_dodo_4x_60k_llama_13b, eight A100 80GB GPUs were used.

Another important consideration is the protocol used for GPU communication. NVLink setups are recommended over PCIe. Should you only have access to PCIe setups, please replace `full_shard` with `shard_grad_op` in the torchrun command. This may significantly speed up your training runs at the cost of potentially higher memory usage.

### Setup

Before finetuning, make sure to install all requirements using:

```
pip install -r requirements.txt
```

### Run

For the exact commands used to finetune the models, please refer to the individual model pages:

* models/godot_dodo_4x_60k_llama_7b
* models/godot_dodo_4x_60k_llama_13b

### Inference

To test out your finetuned model, you can use the eval.py script. Simply run:

```
python finetune/eval.py --model_name_or_path PATH_TO_FINETUNED_MODEL/
```
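If you would rather load a finetuned checkpoint programmatically than go through eval.py, a minimal sketch using the transformers library is shown below. The alpaca-style prompt template is an assumption based on the finetuning process mirroring stanford_alpaca; check `finetune/eval.py` for the prompt format actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch; the prompt template is an assumption (alpaca-style),
# not necessarily the exact format used during finetuning.
model_path = "PATH_TO_FINETUNED_MODEL/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nPrint the names of all children of this node.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```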
### Publishing to Huggingface

To easily upload a finetuned model to Huggingface, you can use:

```
python finetune/push_to_hub.py --model_name_or_path PATH_TO_FINETUNED_MODEL/ --push_name HF_MODEL_NAME --auth_token HF_ACCESS_TOKEN
```

### Finetuned model weights

Links to model weights hosted on Huggingface are provided on the respective model pages:

* models/godot_dodo_4x_60k_llama_7b
* models/godot_dodo_4x_60k_llama_13b

## Cost

Below is the dollar cost of assembling each available dataset and finetuning each model.

### Datasets

* godot_dodo_4x_60k
  + $30 (gpt-3.5-turbo API costs)

### Finetuned Models

* models/godot_dodo_4x_60k_llama_7b
  + $24 (8x A100 80GB instance costs)
* models/godot_dodo_4x_60k_llama_13b
  + $84 (8x A100 80GB instance costs)

## Use with godot-copilot

Usage of finetuned models with godot-copilot for in-editor, fully local code generation may be supported in the future.

## Acknowledgments

Thank you to all MIT-licensed Godot projects! This would not have been possible without you. All projects that were scraped during assembly of the included finetuning data are listed in the respective dataset folders in `data`.

Another thank you goes to fluidstack.io for the reliable, inexpensive GPU instances used to finetune these models.

## Citation

If you wish to cite this project, please use:

```
@misc{godot-dodo,
  author = {Markus Sobkowski},
  title = {Godot-Dodo: Finetuned language models for GDScript generation},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/minosvasilias/godot-dodo}},
}
```