https://github.com/minosvasilias/godot-dodo
# godot-dodo

*Godot-Dodo logo imagined by Midjourney v5*

The godot-dodo project presents a pipeline for finetuning open-source language models on human-created, language-specific code retrieved from GitHub. In this case the targeted language is GDScript, but the same methodology can be applied to other languages.

This repository includes the following:

* Scripts to assemble the finetuning dataset
* Pre-assembled, raw datasets (up to 60k rows in size)
* Scripts to finetune a model
* Links to model weights
* A performance report comparing finetuned models

## Performance

For comprehensive results explaining the methodology used and a full list of all results, please refer to the full performance report here.

In summary, godot_dodo models achieve significantly greater consistency than gpt-4/gpt-3.5-turbo when it comes to generating accurate GDScript syntax, but are somewhat less capable of following complex instructions.

## Concept

### How?

Unlike other, similar approaches to finetuning models such as stanford-alpaca, this approach does not use existing, larger language models to produce the output values of the finetuning dataset. All code used is human-created. Language models are instead used only to label each code snippet. As such, we can assemble comment:code data pairs in the style of CodeSearchNet, making use of powerful existing models to annotate high-quality human-created code.

### Why?

Some existing language models such as gpt-4 are excellent coders. However, much of their ability is concentrated in the most popular languages, such as Python or JavaScript. Less widely used languages are underrepresented in the training data and suffer a massive performance drop-off: models routinely get syntax wrong or hallucinate language features that do not exist. This project aims to provide much more robust language-specific models that can reliably generate code that compiles on the first try.

## Dataset Generation

Because this approach relies on human-created data, we scrape GitHub repositories using the GitHub search API. Using the `language:gdscript` search term, we retrieve a list of repositories containing GDScript code. We also use `license:mit` to limit the dataset to suitably licensed repositories. Only MIT-licensed code is used for training! (A sketch of such a query is shown at the end of this section.)

We then clone each repository and apply the following logic:

* Find the `project.godot` file
* Detect whether the project is made for 3.x or 4.x Godot engine versions
* Iterate through all `.gd` files found in the repository
* Split each file into individual functions
* For each function found, ask an existing LLM (gpt-3.5-turbo) for a detailed comment describing the function's purpose
* Add the resulting instruction:response data pair to the dataset

Note that existing, human-written comments located above a code block are not used for the instruction value. We are interested in consistently detailed comments, rather than trying to preserve some potentially higher-quality human-written ones. Human comments within the code block, however, are preserved.

### Run

To assemble a dataset yourself, follow these instructions:

* Run `python data/generate_unlabeled_dataset.py`
* Run `python data/label_dataset.py`

Please note that you will need GitHub and OpenAI API keys in order to use these scripts.
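For illustration, here is a minimal sketch of the kind of repository search the pipeline performs. This is not the repository's actual script (see `data/generate_unlabeled_dataset.py` for that); it assumes a `GITHUB_TOKEN` environment variable and calls the public GitHub search endpoint directly:

```python
import os
import requests

# Minimal sketch: list MIT-licensed GDScript repositories via the GitHub search API.
# Hypothetical standalone example; the real logic lives in data/generate_unlabeled_dataset.py.
headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
params = {"q": "language:gdscript license:mit", "per_page": 100}
response = requests.get(
    "https://api.github.com/search/repositories", headers=headers, params=params
)
response.raise_for_status()
for repo in response.json()["items"]:
    print(repo["full_name"], repo["clone_url"])
```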
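And here is a hypothetical example of what a single instruction:response pair might look like. The field names and the comment wording are illustrative assumptions, not taken from the actual dataset files; check the files in `data` for the real schema:

```python
# Hypothetical dataset row (field names assumed; see the data folder for the real schema).
pair = {
    # Instruction: the gpt-3.5-turbo-generated comment describing the function.
    "instruction": "Heals the player by the given amount, clamping health to max_health.",
    # Response: the human-written GDScript function scraped from GitHub.
    "response": "func heal(amount: int) -> void:\n\thealth = min(health + amount, max_health)",
}
```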
## Pre-assembled datasets

Pre-assembled datasets included in this repository:

* godot_dodo_4x_60k
  + Assembled using 4.x Godot projects - ~60k rows

Further datasets may be added in the future (particularly for 3.x data).

## Finetuning

The finetuning process closely mirrors the one introduced by stanford_alpaca. To reproduce a finetuned version of LLaMA, please follow the steps below.

### Hardware Requirements

To effectively finetune a llama-7b or llama-13b model, it is highly recommended to use at least two A100 80GB GPUs. You may otherwise encounter out-of-memory errors or extremely long training times, and will need to adjust the training parameters. For finetuning godot_dodo_4x_60k_llama_13b, eight A100 80GB GPUs were used.

Another important consideration is the protocol used for GPU communication. NVLink setups are recommended over PCIe. Should you only have access to PCIe setups, please replace `full_shard` with `shard_grad_op` in the torchrun command. This may significantly speed up your training runs at the cost of potentially higher memory usage.

### Setup

Before finetuning, make sure to install all requirements using:

```
pip install -r requirements.txt
```

### Run

For the exact commands used to finetune the models, please refer to the individual model pages:

* models/godot_dodo_4x_60k_llama_7b
* models/godot_dodo_4x_60k_llama_13b

### Inference

To test out your finetuned model, you can use the eval.py script. Simply run:

```
python finetune/eval.py --model_name_or_path PATH_TO_FINETUNED_MODEL/
```
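If you would rather load a finetuned checkpoint programmatically than go through eval.py, a minimal sketch using the transformers library is shown below. The alpaca-style prompt template is an assumption based on the finetuning process mirroring stanford_alpaca; check `finetune/eval.py` for the prompt format actually used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch; the prompt template is an assumption (alpaca-style),
# not necessarily the exact format used during finetuning.
model_path = "PATH_TO_FINETUNED_MODEL/"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\nPrint the names of all children of this node.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```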
### Publishing to Huggingface

To easily upload a finetuned model to Huggingface, you can use:

```
python finetune/push_to_hub.py --model_name_or_path PATH_TO_FINETUNED_MODEL/ --push_name HF_MODEL_NAME --auth_token HF_ACCESS_TOKEN
```

### Finetuned model weights

Links to model weights hosted on Huggingface are provided on the respective model pages:

* models/godot_dodo_4x_60k_llama_7b
* models/godot_dodo_4x_60k_llama_13b

## Cost

Below is the dollar cost of assembling each available dataset and finetuning each model.

### Datasets

* godot_dodo_4x_60k
  + $30 (gpt-3.5-turbo API costs)

### Finetuned Models

* models/godot_dodo_4x_60k_llama_7b
  + $24 (8x A100 80GB instance costs)
* models/godot_dodo_4x_60k_llama_13b
  + $84 (8x A100 80GB instance costs)

## Use with godot-copilot

Usage of finetuned models with godot-copilot for in-editor, fully local code generation may be supported in the future.

## Acknowledgments

Thank you to all MIT-licensed Godot projects! This would not have been possible without you. All projects that were scraped during assembly of the included finetuning data are listed in the respective dataset folders in `data`.

Another thank you goes to fluidstack.io for the reliable, inexpensive GPU instances used to finetune these models.

## Citation

If you wish to cite this project, please use:

```
@misc{godot-dodo,
  author = {Markus Sobkowski},
  title = {Godot-Dodo: Finetuned language models for GDScript generation},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/minosvasilias/godot-dodo}},
}
```