# Alpaca Instruction Following Dataset
## Motivation
### For what purpose was the dataset created?
To enable more open-source research on instruction-following large language models, we generated 52K instruction-following demonstrations using OpenAI's `text-davinci-003` model.
### Who created the dataset?
- [Rohan Taori](https://www.rohantaori.com/)
- [Ishaan Gulrajani](https://ishaan.io/)
- [Tianyi Zhang](https://tiiiger.github.io/)
- [Yann Dubois](https://yanndubs.github.io/)
- [Xuechen Li](https://www.lxuechen.com/)
- [Carlos Guestrin](https://guestrin.su.domains/)
- [Percy Liang](https://cs.stanford.edu/~pliang/)
- [Tatsunori B. Hashimoto](https://thashim.github.io/)
## Composition
### What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)?
The instruction-following demonstrations are bootstrapped from the [seed set](https://github.com/yizhongw/self-instruct/blob/main/data/seed_tasks.jsonl) released by the self-instruct project.
Because the dataset is machine-generated, it is difficult to pinpoint whom or what the instances represent.
### How many instances are there in total?
In total, there are 52,002 instances in the dataset.
### Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?
Not applicable.
### What data does each instance consist of?
- `instruction`: `str`, describes the task the model should perform. Each of the 52K instructions is unique.
- `input`: `str`, optional context or input for the task. For example, when the instruction is "Summarize the following article", the input is the article. Around 40% of the examples have an input.
- `output`: `str`, the answer to the instruction as generated by `text-davinci-003`.
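
For concreteness, here is a minimal sketch of loading and inspecting an instance with plain Python. It assumes the released JSON file is named `alpaca_data.json`, as in the GitHub repository; adjust the path if yours differs.

```python
import json

# Load the released JSON file (a list of dicts).
with open("alpaca_data.json") as f:
    data = json.load(f)

print(len(data))  # expected: 52002 instances

# Each instance has "instruction", "input", and "output" keys;
# "input" is an empty string when the task needs no extra context.
example = data[0]
for key in ("instruction", "input", "output"):
    print(f"{key}: {example[key]!r}")
```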
### Is any information missing from individual instances?
No.
### Are relationships between individual instances made explicit (e.g., users movie ratings, social network links)?
Not applicable.
### Is there a label or target associated with each instance?
The finetuning target is the response generated by `text-davinci-003`.
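
As a rough sketch of how an instance can be serialized into a (prompt, target) pair for supervised finetuning: the templates below are modeled on those in the repository, but the exact wording here is paraphrased and should be treated as an assumption.

```python
# Prompt templates modeled on the Alpaca repository
# (wording assumed, not copied verbatim from the source).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)

def to_prompt_and_target(example):
    """Serialize an instance; the target is the text-davinci-003 output."""
    template = PROMPT_WITH_INPUT if example["input"] else PROMPT_NO_INPUT
    return template.format(**example), example["output"]
```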
### Are there recommended data splits (e.g., training, development/validation, testing)?
The Alpaca models (both the demo model and the models that will be released) are trained on all 52K examples.
There is no recommended data split for the dataset.
### Are there any errors, sources of noise, or redundancies in the dataset?
All 52K instructions are unique. However, some generated instructions may not be sensible, i.e., there may be no good response to the instruction.
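
The uniqueness claim is straightforward to check directly; a short sketch, reusing `data` as loaded in the example above:

```python
# Verify that no instruction string repeats across the 52K instances.
instructions = [ex["instruction"] for ex in data]
assert len(set(instructions)) == len(instructions) == 52002
```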
### Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)?
The dataset is self-contained.
### Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals non-public communications)?
no.
### Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?
The generated data may contain a few inappropriate responses. In our preliminary testing, we have not encountered any offensive responses.
## Collection process
The [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca) contains the code to generate the dataset.
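
At a high level, the generation follows the self-instruct recipe: seed demonstrations are formatted into a prompt that asks `text-davinci-003` to produce new tasks. The sketch below is illustrative only; the function names, prompt text, and decoding parameters are stand-ins, not the repository's actual pipeline (which also batches requests and filters outputs).

```python
import json
import openai  # pre-1.0 openai-python API, matching the March 2023 vintage

def load_seed_tasks(path="seed_tasks.jsonl"):
    # Seed tasks from the self-instruct release, one JSON object per line.
    with open(path) as f:
        return [json.loads(line) for line in f]

def make_prompt(seed_tasks, n_demos=3):
    # Hypothetical prompt: show a few seed instructions, ask for a new one.
    demos = "\n".join(
        f"Instruction: {t['instruction']}" for t in seed_tasks[:n_demos]
    )
    return demos + "\nCome up with a new, diverse task.\nInstruction:"

seeds = load_seed_tasks()
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=make_prompt(seeds),
    max_tokens=512,
    temperature=1.0,
)
print(response["choices"][0]["text"])
```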
## Uses
### Has the dataset been used for any tasks already?
The dataset has been used to train the Alpaca models, both the model behind the demo and the released models.
### Is there a repository that links to any or all papers or systems that use the dataset?
Please see the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca).
### Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?
This dataset is generated using OpenAI's API. Therefore, it cannot be used for commercial purposes that compete with OpenAI.
### Are there tasks for which the dataset should not be used?
The dataset should not be used for commercial purposes that compete with OpenAI.
## Distribution
### Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
The dataset can be freely downloaded.
### How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
The dataset can be downloaded from the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca) as a JSON file.
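
A minimal download sketch, assuming the file sits at the repository root on the `main` branch (verify the path against the repository before use):

```python
import urllib.request

# Hypothetical raw URL; adjust if the file is moved or renamed.
URL = ("https://raw.githubusercontent.com/tatsu-lab/"
       "stanford_alpaca/main/alpaca_data.json")
urllib.request.urlretrieve(URL, "alpaca_data.json")
```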
### Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
This dataset is distributed under [the ODC-By license](https://opendatacommons.org/licenses/by/1-0/).
### Have any third parties imposed IP-based or other restrictions on the data associated with the instances?
No.
### Do any export controls or other regulatory restrictions apply to the dataset or to individual instances?
No.
## Maintenance
### Who is supporting/hosting/maintaining the dataset?
The dataset is hosted on GitHub, and the repository is maintained by Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, and Xuechen Li.
### How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
Please open an issue in the [GitHub repository](https://github.com/tatsu-lab/stanford_alpaca).
### Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
We do not have plans to update the dataset.