forked from awslabs/open-data-registry
abeja-cc-ja.yaml
Name: ABEJA CC JA
Description: A large Japanese language corpus created through preprocessing Common Crawl data
Documentation: https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/about_data.md
Contact: [email protected]
ManagedBy: "[ABEJA inc.](https://www.abejainc.com/)"
UpdateFrequency: None
Tags:
  - natural language processing
  - web archive
  - internet
  - japanese
License: "This data is available for anyone to use under the [Common Crawl Terms of Use](https://commoncrawl.org/terms-of-use/)"
Resources:
  - Description: Text corpus
    ARN: arn:aws:s3:::abeja-cc-ja
    Region: ap-northeast-1
    Type: S3 Bucket
DataAtWork:
  Tutorials:
    - Title: Tutorial of ABEJA CC JA dataset
      URL: https://github.com/abeja-inc/Megatron-LM/blob/main/docs/dataset/tutorials.md
      AuthorName: Kyo Hattori
  Tools & Applications:
  Publications:
    - Title: "Building a Large-Scale Japanese Corpus from Common Crawl and Its Preprocessing"
      URL: https://tech-blog.abeja.asia/entry/abeja-nedo-project-part2-202405
      AuthorName: Kyo Hattori