# Alpaca Instruction Following Dataset

## Motivation

### For what purpose was the dataset created?
To enable more open-source research on instruction-following large language models, we generated 52K instruction-following demonstrations using OpenAI's `text-davinci-003` model.

### Who created the dataset?
- [Rohan Taori](https://www.rohantaori.com/)
- [Ishaan Gulrajani](https://ishaan.io/)
- [Tianyi Zhang](https://tiiiger.github.io/)
- [Yann Dubois](https://yanndubs.github.io/)
- [Xuechen Li](https://www.lxuechen.com/)
- [Carlos Guestrin](https://guestrin.su.domains/)
- [Percy Liang](https://cs.stanford.edu/~pliang/)
- [Tatsunori B. Hashimoto](https://thashim.github.io/)
## Composition

### What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?

The instruction-following demonstrations are bootstrapped from the [seed set](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl) released by the self-instruct project.

Given that the dataset is generated, it is difficult to pinpoint whom or what the instances represent.
### How many instances are there in total?

In total, there are 52,002 instances in the dataset.

### Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?

Not applicable; the dataset was generated rather than sampled from a larger set.
### What data does each instance consist of?

- `instruction`: `str`, describes the task the model should perform. Each of the 52K instructions is unique.
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
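The three fields above can be loaded and checked with a few lines of Python. This is a minimal sketch: the two sample records are illustrative stand-ins, and the filename `alpaca_data.json` is only mentioned as the typical way to read the released JSON array.

```python
import json

# In-memory stand-in with the same shape as the released file
# (the real file, e.g. alpaca_data.json, is a JSON array of ~52K such objects;
# load it with: records = json.load(open("alpaca_data.json"))).
sample = """[
  {"instruction": "Give three tips for staying healthy.",
   "input": "",
   "output": "1. Eat a balanced diet..."},
  {"instruction": "Summarize the following article.",
   "input": "The article text goes here...",
   "output": "A short summary."}
]"""

records = json.loads(sample)

for r in records:
    # Every instance carries exactly these three string fields;
    # `input` may be the empty string when no context is needed.
    assert set(r) == {"instruction", "input", "output"}
    assert all(isinstance(r[k], str) for k in r)

with_input = sum(1 for r in records if r["input"])
print(f"{len(records)} records, {with_input} with a non-empty input")
# → 2 records, 1 with a non-empty input
```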
### Is any information missing from individual instances?

No.

### Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)?

Not applicable.

### Is there a label or target associated with each instance?

The finetuning target is the response generated by `text-davinci-003`.
### Are there recommended data splits (e.g., training, development/validation, testing)?

The Alpaca models (both the demo model and the models that will be released) are trained on all 52K examples. There is no recommended data split for the dataset.

### Are there any errors, sources of noise, or redundancies in the dataset?

All 52K instructions are unique. However, some generated instructions may not be sensible, i.e., there may not exist any good response to the instruction.
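The uniqueness claim above is easy to verify mechanically. A sketch, using a small in-memory stand-in for the released JSON array (the filename `alpaca_data.json` and the sample records are assumptions; the field names match the Composition section):

```python
import json
from collections import Counter

# Small stand-in for the released file; with the real data you would use
# records = json.load(open("alpaca_data.json")) instead.
records = json.loads("""[
  {"instruction": "Name three primary colors.",
   "input": "", "output": "Red, blue, and yellow."},
  {"instruction": "Summarize the following article.",
   "input": "Some article text.", "output": "A short summary."}
]""")

# Count how often each instruction string occurs; any count > 1 is a duplicate.
counts = Counter(r["instruction"] for r in records)
duplicates = {text: n for text, n in counts.items() if n > 1}

print(len(duplicates), "duplicated instructions")  # → 0 duplicated instructions
```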
### Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?

The dataset is self-contained.

### Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals’ non-public communications)?

No.

### Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?

The generated data may contain a few inappropriate responses. In our preliminary testing, we have not encountered any offensive responses.
## Collection process

The [Github repository](https://github.com/tatsu-lab/stanford_alpaca) contains the code to generate the dataset.
## Uses

### Has the dataset been used for any tasks already?

The dataset was used to train the Alpaca models, both the demo model and the released models.

### Is there a repository that links to any or all papers or systems that use the dataset?

Please see https://github.com/tatsu-lab/stanford_alpaca.
### Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?

This dataset was generated using OpenAI's API. Therefore, it cannot be used for commercial purposes that compete with OpenAI.

### Are there tasks for which the dataset should not be used?

The dataset should not be used for commercial purposes that compete with OpenAI.
## Distribution

### Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

Yes; the dataset can be freely downloaded.

### How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

The dataset can be downloaded from the [Github repository](https://github.com/tatsu-lab/stanford_alpaca) as a JSON file.

### Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

This dataset is distributed under [the ODC-By license](https://opendatacommons.org/licenses/by/1-0/).
### Have any third parties imposed IP-based or other restrictions on the data associated with the instances?

No.

### Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?

No.
## Maintenance

### Who is supporting/hosting/maintaining the dataset?

The dataset is hosted on GitHub, and the repository is maintained by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.

### How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Please open an issue in the [Github repository](https://github.com/tatsu-lab/stanford_alpaca).

### Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

We do not have plans to update the dataset.