Repository with tools for converting some content types to text.
To install the command-line tool, run:
go get github.com/zbioe/grapnel
You can pass the content type with -type (written as -t in the examples below) to parse the input directly as that type:
cat pdf/testdata/valid.pdf | grapnel -t pdf
cat html/testdata/valid.html | grapnel -t html
Or you can omit the type, in which case grapnel reads all the content and tries to detect the type:
cat pdf/testdata/valid.pdf | grapnel
cat html/testdata/valid.html | grapnel
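How grapnel detects the type is not described here. As a rough illustration only, content sniffing of this kind can be sketched with the standard library's http.DetectContentType, which inspects the leading bytes of the input. The detectType function and its mapping to the short names "pdf" and "html" are assumptions made for this sketch, not grapnel's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// detectType guesses a short type name ("pdf" or "html") from the leading
// bytes of the content. It returns "" when the type is not recognised.
// The short names are an assumption for this sketch.
func detectType(data []byte) string {
	// DetectContentType considers at most the first 512 bytes.
	mime := http.DetectContentType(data)
	switch {
	case mime == "application/pdf":
		return "pdf"
	case strings.HasPrefix(mime, "text/html"):
		return "html"
	default:
		return ""
	}
}

func main() {
	samples := [][]byte{
		[]byte("%PDF-1.4\n"),
		[]byte("<!DOCTYPE html><html><body>hi</body></html>"),
		[]byte("plain text"),
	}
	for _, s := range samples {
		fmt.Printf("%q\n", detectType(s))
	}
}
```

Because only a prefix of the content is needed for sniffing, a tool built this way can buffer the first bytes, pick a converter, and then hand the full input to it.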
Receives a PDF as []byte or io.Reader and converts it to text with pdftotext.
Create a file main.go with the following content:
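package main

import (
	"fmt"
	"os"

	"github.com/zbioe/grapnel/pdf"
)

func main() {
	// Read a PDF from standard input and convert it to text.
	text, err := pdf.ToTextFromReader(os.Stdin)
	if err != nil {
		// Report the failure on stderr and exit with a non-zero status.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}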
Run it on the command line:
go run main.go < pdf/testdata/valid.pdf
curl -Ls "http://www.orimi.com/pdf-test.pdf" | go run main.go
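The example above uses the reader-based call. When the content is already in memory as a []byte, one option is to wrap it with bytes.NewReader and pass it to the same kind of function. The sketch below is self-contained, so it uses a stand-in converter; the names readerFunc, fromBytes and passthrough are hypothetical and not part of grapnel, whose own []byte entry point is not shown here.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// readerFunc has the same shape as the reader-based call used above:
// it consumes an io.Reader and returns the extracted text.
type readerFunc func(io.Reader) (string, error)

// fromBytes adapts in-memory content to a reader-based converter.
func fromBytes(convert readerFunc, data []byte) (string, error) {
	return convert(bytes.NewReader(data))
}

// passthrough is a stand-in converter that returns its input unchanged.
// In a real program the reader-based function would take its place.
func passthrough(r io.Reader) (string, error) {
	b, err := io.ReadAll(r)
	return string(b), err
}

func main() {
	text, err := fromBytes(passthrough, []byte("already in memory\n"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

The same wrapping applies to any function that accepts an io.Reader, so it works for both the PDF and HTML readers.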
Receives HTML as []byte or io.Reader and converts it to text.
Create a file main.go with the following content:
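package main

import (
	"fmt"
	"os"

	"github.com/zbioe/grapnel/html"
)

func main() {
	// Read HTML from standard input and convert it to text.
	text, err := html.ToTextFromReader(os.Stdin)
	if err != nil {
		// Report the failure on stderr and exit with a non-zero status.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}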
Run it on the command line:
go run main.go < html/testdata/valid.html
curl -Ls "https://reddit.com/" | go run main.go