Skip to content

rplessl/wk2txt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NAME
    wk2txt - A Website to Text Converter

SYNOPSIS
    wk2txt *options*

  Options:
     --input URL               convert *URL* to text
     --urlfile file            convert URLs in *file* to text
     --output directory        save output in directory *directory*
     --xml                     dump in XML format
     --timeout inverval        timeout after *interval* seconds (NOT IMPLEMENTED)
     --debug                   print diagnostic information
     --help                    print help on options
     --version                 print version number

DESCRIPTION
    wk2txt is a command line tool based on the WebKit Engine to get the text
    content of a websites as seen by current webbrowsers which includes the
    evaluation of javascript code present in many modern webpages.

    wk2txt can process a whole list of URLs, which can be built by using the
    "--input" option multiple times or by passing the filename of a list of
    URLs to the "--urlfile". This file is expected to have 1 URL per line.

    The contents of the webpages is written to one file per webpage. The
    files are stored in the directory specified by the "--output" option.
    The filename is derived from the URL, it is the SHA1 hash code for the
    URL.

    By default, wk2txt dumps the contents and the meta data of the webpage
    in plain text format. If the "--xml" option is used, content and meta
    data is dumped in XML format.

AUTHOR
    Christian Plessl <[email protected]>
    Roman Plessl <[email protected]>

About

a website to text converter using webkit

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published