https://github.com/epfLLM/meditron

epfLLM / meditron Public

Meditron is a suite of open-source medical Large Language Models
(LLMs).

huggingface.co/epfl-llm

License

Apache-2.0 license

We release Meditron-7B and Meditron-70B, which are adapted to the
medical domain from Llama-2 through continued pretraining on a
comprehensively curated medical corpus, including selected PubMed
papers and abstracts, a new dataset of internationally-recognized
medical guidelines, and a general domain corpus.

Meditron-70B, finetuned on relevant data, outperforms Llama-2-70B,
GPT-3.5 and Flan-PaLM on multiple medical reasoning tasks.

Advisory Notice

    While Meditron is designed to encode medical knowledge from
    sources of high-quality evidence, it is not yet adapted to
    deliver this knowledge appropriately, safely, or within
    professional actionable constraints. We recommend against using
    Meditron in medical applications without extensive use-case
    alignment, as well as additional testing, specifically including
    randomized controlled trials in real-world practice settings.

 Model Details

  * Developed by: EPFL LLM Team
  * Model type: Causal decoder-only transformer language model
  * Language(s): English (mainly)
  * Model License: LLAMA 2 COMMUNITY LICENSE AGREEMENT
  * Code License: APACHE 2.0 LICENSE
  * Continue-pretrained from model: Llama-2-70B
  * Context length: 4k tokens
  * Input: Text only data
  * Output: Model generates text only
  * Status: This is a static model trained on an offline dataset.
    Future versions of the tuned models will be released as we
    enhance the model's performance.
  * Knowledge Cutoff: August 2023
  * Trainer: epfLLM/Megatron-LLM
  * Paper: Meditron-70B: Scaling Medical Pretraining for Large
    Language Models

 How to use

You can load the Meditron model directly from the HuggingFace model
hub as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-70B")
model = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-70B")

Pipeline

 Medical Training Data

We release code to download and pre-process the data used to train
Meditron.

MediTron's domain-adaptive pre-training corpus GAP-Replay combines
48.1B tokens from four corpora:

  * Clinical Guidelines: a new corpus of 46K clinical practice
    guidelines from various healthcare-related sources, including
    hospitals and international organizations,
  * Paper Abstracts: 16.1M abstracts extracted from closed-access
    PubMed and PubMed Central papers,
  * Medical Papers: full-text articles extracted from 5M publicly
    available PubMed and PubMed Central papers,
  * Replay dataset: 400M tokens of general domain pretraining data
    sampled from RedPajama-v1.

 Download instructions

You can download and pre-process the entire GAP-Replay corpus by
running ./download.sh in the gap-replay folder.

You can download 36K open-access articles from our Guidelines corpus
from the HuggingFace datasets hub.

from datasets import load_dataset

dataset = load_dataset("epfl-llm/guidelines")

You can scrape and clean all 46K guidelines (including closed-access
sources) by running ./download.sh in the guidelines folder.

More details can be found in the GAP-Replay documentation.

 Training Procedure

We used the Megatron-LLM distributed training library, a derivative
of NVIDIA's Megatron-LM project, to optimize training efficiency.
The hardware consists of 16 nodes, each with 8 NVIDIA A100 (80GB) SXM
GPUs connected by NVLink and NVSwitch, a single NVIDIA ConnectX-6 DX
network card, 2 x AMD EPYC 7543 32-core processors, and 512 GB of
RAM. The nodes are connected via RDMA over Converged Ethernet.

Our three-way parallelism scheme uses the following:

  * Data Parallelism (DP -- different GPUs process different subsets
    of the batches) of 2,
  * Pipeline Parallelism (PP -- different GPUs process different
    layers) of 8,
  * Tensor Parallelism (TP -- different GPUs process different
    subtensors for matrix multiplication) of 8.
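The three parallelism degrees multiply to the number of GPUs a run occupies, so the scheme above can be checked against the hardware described earlier. A minimal sketch; the function name `world_size` is ours and is not part of Megatron-LLM:

```python
def world_size(dp: int, pp: int, tp: int) -> int:
    """Number of GPUs needed for a DP x PP x TP layout."""
    return dp * pp * tp


# Hardware described above: 16 nodes with 8 GPUs each.
available_gpus = 16 * 8

# Layout listed above: DP of 2, PP of 8, TP of 8.
required_gpus = world_size(dp=2, pp=8, tp=8)

print(required_gpus, available_gpus)  # 128 128
```

Note that the TP degree of 8 equals the number of GPUs per node. A plausible reading, not something the text states, is that this keeps tensor-parallel communication on NVLink within a node.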

 Training Hyperparameters (7B)

bf16              true
lr                3e-4
eps               1e-5
betas             [0.9, 0.95]
clip_grad         1
weight decay      0.1
DP size           16
TP size           4
PP size           1
seq length        2048
lr scheduler      cosine
min lr            1e-6
warmup iteration  2000
micro batch size  10
global batch size 1600

 Training Hyperparameters (70B)

bf16              true
lr                1.5e-4
eps               1e-5
betas             [0.9, 0.95]
clip_grad         1
weight decay      0.1
DP size           2
TP size           8
PP size           8
seq length        4096
lr scheduler      cosine
min lr            1e-6
warmup iteration  2000
micro batch size  2
global batch size 512
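The batch-size rows in the two tables are linked: in Megatron-style training, the global batch size equals the micro batch size times the DP size times the number of gradient accumulation steps. The sketch below derives the accumulation steps implied by each table. The helper name is ours, and the tokens-per-iteration figure assumes every sequence is filled to the full sequence length.

```python
def grad_accumulation_steps(global_batch: int, micro_batch: int, dp: int) -> int:
    """Micro-batches each data-parallel replica accumulates per optimizer step."""
    samples_per_pass = micro_batch * dp
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * dp")
    return global_batch // samples_per_pass


# Values copied from the two hyperparameter tables above.
configs = {
    "7B": {"global_batch": 1600, "micro_batch": 10, "dp": 16, "seq_length": 2048},
    "70B": {"global_batch": 512, "micro_batch": 2, "dp": 2, "seq_length": 4096},
}

for name, cfg in configs.items():
    steps = grad_accumulation_steps(cfg["global_batch"], cfg["micro_batch"], cfg["dp"])
    # Upper bound on tokens per optimizer step (assumes fully packed sequences).
    tokens = cfg["global_batch"] * cfg["seq_length"]
    print(f"{name}: {steps} accumulation steps, up to {tokens:,} tokens per iteration")
```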

You can see the script we used to pretrain our models through
Megatron-LLM here: finetune.sh

 Supervised Finetuning

We again used the Megatron-LLM distributed training library for
supervised finetuning (single-node and multi-node). The script
sft.py automatically handles tokenization and finetuning through
Megatron-LLM. Here is an example that starts a multi-node finetuning
process:

cd finetuning
python sft.py \
    --checkpoint=baseline \
    --size=70 \
    --run_name=cotmedqa \
    --data /pure-mlo-scratch/zechen/meditron/benchmarks/ft_preprocessed/medqa_cot_train.jsonl \
    --val /pure-mlo-scratch/zechen/meditron/benchmarks/ft_preprocessed/medqa_cot_validation.jsonl \
    --micro_batch=4 \
    --nodes=4 \
    --addr=<RANK0_HOST_NAME> \
    --save_interval=200 \
    --pp=4 \
    --seq 4096 \
    --rank=<CURRENT_RANK>

Run the above command on each of the four nodes (ranks 0, 1, 2, and
3), setting --rank to that node's rank, to start a 4-node finetuning
process.

Important!: Make sure to have the proper paths defined in sft.py and
finetune_sft.sh.
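Because the command differs between nodes only in its --rank value, it can be convenient to generate the four command lines from one place. The sketch below does that using only the flags shown in the example above. The host name `node0.example.org` is a placeholder, and the helper `sft_command` is ours, not part of the repository.

```python
import shlex

# Paths copied from the example above; adjust them to your own setup.
DATA_DIR = "/pure-mlo-scratch/zechen/meditron/benchmarks/ft_preprocessed"


def sft_command(rank: int, rank0_host: str, nodes: int = 4) -> str:
    """Build the sft.py command line for one node of a multi-node run."""
    if not 0 <= rank < nodes:
        raise ValueError(f"rank must be in [0, {nodes}), got {rank}")
    args = [
        "python", "sft.py",
        "--checkpoint=baseline",
        "--size=70",
        "--run_name=cotmedqa",
        "--data", f"{DATA_DIR}/medqa_cot_train.jsonl",
        "--val", f"{DATA_DIR}/medqa_cot_validation.jsonl",
        "--micro_batch=4",
        f"--nodes={nodes}",
        f"--addr={rank0_host}",
        "--save_interval=200",
        "--pp=4",
        "--seq", "4096",
        f"--rank={rank}",
    ]
    return shlex.join(args)


# "node0.example.org" is a placeholder for the rank-0 host name.
for rank in range(4):
    print(f"[rank {rank}] {sft_command(rank, 'node0.example.org')}")
```

Each printed line is meant to be run from the finetuning folder on the node with the matching rank.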

 Finetuning Hyperparameters

bf16         true
lr           2e-5
eps          1e-5
betas        [0.9, 0.95]
clip_grad    1
weight decay 0.1
DP size      16
TP size      4
PP size      1
seq length   2048 or 4096
lr scheduler cosine
min lr       2e-6
warmup ratio 0.1
added tokens [<|im_start|>, <|im_end|>]
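To make the lr, min lr, lr scheduler, and warmup ratio rows concrete, here is a minimal sketch of a cosine schedule with linear warmup using the values from the table. It shows the standard textbook form of such a schedule, and the exact implementation in Megatron-LLM may differ. The total step count of 1000 is a hypothetical value chosen for illustration.

```python
import math


def lr_at(step: int, total_steps: int, max_lr: float = 2e-5,
          min_lr: float = 2e-6, warmup_ratio: float = 0.1) -> float:
    """Learning rate at a step: linear warmup, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


TOTAL_STEPS = 1000  # hypothetical; depends on dataset size and batch size

for step in (0, 99, 100, 550, 1000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.3e}")
```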

 Uses

Meditron-70B is being made available for further testing and
assessment as an AI assistant to enhance clinical decision-making and
democratize access to an LLM for healthcare use. Potential use cases
may include but are not limited to:

  * Medical exam question answering
  * Supporting differential diagnosis
  * Disease information (symptoms, cause, treatment) query
  * General health information query

It is possible to use this model to generate text, which is useful
for experimentation and understanding its capabilities. It should not
be used directly for production or work that may impact people.

We do not recommend using this model for natural language generation
in a production environment, finetuned or otherwise.

 Downstream Use

Meditron-70B is a foundation model that can be finetuned,
instruction-tuned, or RLHF-tuned for specific downstream tasks and
applications. The main way we have used this model is finetuning for
downstream question-answering tasks, but we encourage using this
model for additional applications.

Prompts for our finetuned models must follow a specific format that
uses the <|im_start|> and <|im_end|> tags together with the system,
question, and answer identifiers:

"""
<|im_start|>system
{system_message}<|im_end|>
<|im_start|>question
{prompt}<|im_end|>
<|im_start|>answer
"""

Note: the above formatting is not a requirement if you use your own
formatting option for the finetuning of the model.
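The template above can be filled in programmatically. The sketch below is one way to do it. The helper name `build_prompt` and the example system message and question are illustrative. The text does not say whether the line breaks right after the opening and before the closing triple quotes are significant, so this sketch omits the leading one and keeps a single trailing newline after the answer tag.

```python
PROMPT_TEMPLATE = (
    "<|im_start|>system\n"
    "{system_message}<|im_end|>\n"
    "<|im_start|>question\n"
    "{prompt}<|im_end|>\n"
    "<|im_start|>answer\n"
)


def build_prompt(system_message: str, question: str) -> str:
    """Fill the template; the model is expected to continue after 'answer'."""
    return PROMPT_TEMPLATE.format(system_message=system_message, prompt=question)


# Illustrative inputs, not taken from the Meditron benchmarks.
example = build_prompt(
    system_message="You are a helpful medical assistant.",
    question="What are common symptoms of iron deficiency?",
)
print(example)
```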

 Medical Benchmark Inference & Evaluation

 Requirements

Before you start, please install the necessary packages:

vllm >= 0.2.1
transformers >= 4.34.0
datasets >= 2.14.6
torch >= 2.0.1
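To check an existing environment against these minimums, something like the following sketch may help. It uses only the standard library and compares the leading numeric part of each version string, which is a simplification of full version ordering (pre-release suffixes are ignored). The helper names are ours.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "vllm": "0.2.1",
    "transformers": "4.34.0",
    "datasets": "2.14.6",
    "torch": "2.0.1",
}


def version_tuple(version: str) -> tuple:
    """Leading numeric components, e.g. '2.0.1+cu118' -> (2, 0, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_requirements(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        if meets_minimum(installed, minimum):
            report[package] = "ok"
        else:
            report[package] = f"too old ({installed} < {minimum})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```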

For detailed instructions on running inference and evaluation with
medical benchmarks, see the inference & evaluation instructions.

 Model Deployment

For detailed instructions on deploying Meditron models and starting
an interactive chat session, see the Model Deployment documentation.

 Citation

If you use this software or our paper, please cite them:

@misc{chen2023meditron70b,
      title={MEDITRON-70B: Scaling Medical Pretraining for Large Language Models},
      author={Zeming Chen and Alejandro Hernandez-Cano and Angelika Romanou and Antoine Bonnet and Kyle Matoba and Francesco Salvi and Matteo Pagliardini and Simin Fan and Andreas Kopf and Amirkeivan Mohtashami and Alexandre Sallinen and Alireza Sakhaeirad and Vinitra Swamy and Igor Krawczuk and Deniz Bayazit and Axel Marmet and Syrielle Montariol and Mary-Anne Hartley and Martin Jaggi and Antoine Bosselut},
      year={2023},
      eprint={2311.16079},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@software{epfmedtrn,
  author = {Zeming Chen and Alejandro Hernandez-Cano and Angelika Romanou and Antoine Bonnet and Kyle Matoba and Francesco Salvi and Matteo Pagliardini and Simin Fan and Andreas Kopf and Amirkeivan Mohtashami and Alexandre Sallinen and Alireza Sakhaeirad and Vinitra Swamy and Igor Krawczuk and Deniz Bayazit and Axel Marmet and Syrielle Montariol and Mary-Anne Hartley and Martin Jaggi and Antoine Bosselut},
  title = {MediTron-70B: Scaling Medical Pretraining for Large Language Models},
  month = November,
  year = 2023,
  url = {https://github.com/epfLLM/meditron}
}
