Malayalam Corpus by Swathanthra Malayalam Computing

This is a collection of Malayalam text gathered from various sources, then curated and processed for general-purpose use.

Contents (as of March 4, 2019)

The text corpus contains running text from various freely licensed sources.

  • The whole content of Malayalam Wikipedia extracted on January 1, 2019
  • News and articles from various sources, with the source noted in the respective files:
  • 251 MB
  • 8,60,159 lines
  • 98,15,533 words
  • 10,11,11,885 characters
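
The line, word, and character figures above can be recomputed for any set of plain-text files. The following is a minimal sketch, not the script used to produce the published numbers: it assumes UTF-8 files, counts words as whitespace-separated tokens, and counts characters as Unicode code points including newlines. The function names `corpus_stats` and `indian_grouping` are hypothetical, and the original counting method may differ.

```python
def corpus_stats(paths):
    """Return (lines, words, characters) summed over UTF-8 text files.

    Assumptions: a word is a whitespace-delimited token, and a character
    is one Unicode code point (newlines included).
    """
    lines = words = chars = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                lines += 1
                words += len(line.split())
                chars += len(line)
    return lines, words, chars


def indian_grouping(n):
    """Format a non-negative integer with lakh/crore digit grouping."""
    digits = str(n)
    if len(digits) <= 3:
        return digits
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # After the last three digits, the remaining digits are grouped in pairs.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.insert(0, head)
    return ",".join(groups + [tail])


# Example with a hypothetical file list:
#   lines, words, chars = corpus_stats(["text/sample.txt"])
#   print(indian_grouping(lines), indian_grouping(words), indian_grouping(chars))
```

The grouping helper reproduces the number style used in this list, where digits after the last three are grouped in pairs (for example, 860159 is written 8,60,159).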

The word corpus contains

Contributing

  1. If you know of or have a text collection under a compatible license (CC BY-SA), we can add it to this collection. Just create an issue and let us know about it; we will help. We are looking for content on diverse topics.
  2. We are also collecting person names, place names, and similar word lists in Malayalam. You can see the existing words by browsing the words folder. If you would like to expand that collection, create an issue with details or open a merge request.

Make sure to respect the copyright of the content. We are trying to provide a corpus of freely licensed content.

Other sources

  1. Malayalam content from the Common Crawl dataset: https://github.com/qburst/common-crawl-malayalam

License

Creative Commons Attribution-ShareAlike https://creativecommons.org/licenses/by-sa/3.0/
