diff --git a/README.md b/README.md
new file mode 100644
index 0000000..475421a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,149 @@
+---
+---
+
+# Dataset Card for "imdb"
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks](#supported-tasks)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits Sample Size](#data-splits-sample-size)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+  - [Contributions](#contributions)
+
+## [Dataset Description](#dataset-description)
+
+- **Homepage:** [http://ai.stanford.edu/~amaas/data/sentiment/](http://ai.stanford.edu/~amaas/data/sentiment/)
+- **Repository:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Paper:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Point of Contact:** [More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+- **Size of downloaded dataset files:** 80.23 MB
+- **Size of the generated dataset:** 127.06 MB
+- **Total amount of disk used:** 207.28 MB
+
+### [Dataset Summary](#dataset-summary)
+
+Large Movie Review Dataset.
+This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well.
+
+### [Supported Tasks](#supported-tasks)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Languages](#languages)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## [Dataset Structure](#dataset-structure)
+
+We show detailed information for up to 5 configurations of the dataset.
+
+### [Data Instances](#data-instances)
+
+#### plain_text
+
+- **Size of downloaded dataset files:** 80.23 MB
+- **Size of the generated dataset:** 127.06 MB
+- **Total amount of disk used:** 207.28 MB
+
+An example of 'train' looks as follows.
+```
+{
+    "label": 0,
+    "text": "Goodbye world2\n"
+}
+```
+
+### [Data Fields](#data-fields)
+
+The data fields are the same among all splits.
+
+#### plain_text
+- `text`: a `string` feature.
+- `label`: a classification label, with possible values including `neg` (0), `pos` (1).
+
+### [Data Splits Sample Size](#data-splits-sample-size)
+
+|   name   |train|unsupervised|test |
+|----------|----:|-----------:|----:|
+|plain_text|25000|       50000|25000|
+
+## [Dataset Creation](#dataset-creation)
+
+### [Curation Rationale](#curation-rationale)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Source Data](#source-data)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Annotations](#annotations)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Personal and Sensitive Information](#personal-and-sensitive-information)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## [Considerations for Using the Data](#considerations-for-using-the-data)
+
+### [Social Impact of Dataset](#social-impact-of-dataset)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Discussion of Biases](#discussion-of-biases)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Other Known Limitations](#other-known-limitations)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+## [Additional Information](#additional-information)
+
+### [Dataset Curators](#dataset-curators)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Licensing Information](#licensing-information)
+
+[More Information Needed](https://github.com/huggingface/datasets/blob/master/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
+
+### [Citation Information](#citation-information)
+
+```
+@InProceedings{maas-EtAl:2011:ACL-HLT2011,
+  author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
+  title = {Learning Word Vectors for Sentiment Analysis},
+  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
+  month = {June},
+  year = {2011},
+  address = {Portland, Oregon, USA},
+  publisher = {Association for Computational Linguistics},
+  pages = {142--150},
+  url = {http://www.aclweb.org/anthology/P11-1015}
+}
+
+```
+
+
+### Contributions
+
+Thanks to [@ghazi-f](https://github.com/ghazi-f), [@patrickvonplaten](https://github.com/patrickvonplaten), [@lhoestq](https://github.com/lhoestq), [@thomwolf](https://github.com/thomwolf) for adding this dataset.
\ No newline at end of file
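
Note on the Data Fields and Data Splits sections of the card added in this diff: the label encoding and split sizes it lists can be sketched in a few lines of Python. This is a minimal illustration, not code from the dataset's loading script. The names `LABEL_NAMES`, `SPLIT_SIZES`, `decode_label`, and `describe` are hypothetical; only the label ids, label names, split sizes, and the sample record are taken from the card. The card does not say how records in the `unsupervised` split are labeled, so the sketch assumes any id other than 0 or 1 should be rejected.

```python
# Hypothetical helper names; the values below are copied from the dataset card.
LABEL_NAMES = ["neg", "pos"]  # list index is the integer label id

SPLIT_SIZES = {"train": 25000, "unsupervised": 50000, "test": 25000}


def decode_label(label_id: int) -> str:
    """Map an integer label id from a record to its class name."""
    if label_id not in range(len(LABEL_NAMES)):
        # Assumption: ids outside 0 and 1 are treated as errors.
        raise ValueError(f"unexpected label id: {label_id}")
    return LABEL_NAMES[label_id]


def describe(record: dict) -> str:
    """Return a one-line summary of a record with `label` and `text` fields."""
    return f"{decode_label(record['label'])}: {record['text'].strip()}"


# The example 'train' instance shown in the card.
example = {"label": 0, "text": "Goodbye world2\n"}
print(describe(example))          # prints "neg: Goodbye world2"
print(sum(SPLIT_SIZES.values()))  # prints 100000
```

Keeping the label names in a list indexed by id mirrors the `neg` (0), `pos` (1) ordering stated in the card, so the mapping stays readable in one place.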