NewCrawler

A Java implementation of automatic news content crawling.

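Automatically identifies the main content region of a list page and extracts the detail-page URLs.
Identifies the title, author, publication time, and body text on each detail page.
Records the recognized tags, which can be edited and are saved in JSON format. If a file with the same name already exists, the tags in that file take precedence.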
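The "file takes precedence" rule above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class `TagFileSketch`, the method `resolve`, the tag keys, and the flat string-to-string JSON layout are all assumptions, and the hand-rolled JSON handling only covers quotes and backslashes (a real JSON library would be the safer choice).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: stored tags win over freshly detected ones. */
final class TagFileSketch {
    // Matches one "key": "value" pair of a flat JSON object with string values only.
    private static final Pattern PAIR = Pattern.compile(
            "\"((?:[^\"\\\\]|\\\\.)*)\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the tags stored in the file if it exists; otherwise saves and returns the detected tags. */
    static Map<String, String> resolve(Path file, Map<String, String> detected) {
        try {
            if (Files.exists(file)) {
                // An existing (possibly hand-edited) file overrides detection.
                return parse(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
            }
            Files.write(file, toJson(detected).getBytes(StandardCharsets.UTF_8));
            return new LinkedHashMap<>(detected);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String toJson(Map<String, String> tags) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : tags.entrySet()) {
            if (!first) sb.append(",");
            first = false;
            sb.append("\n  \"").append(escape(e.getKey()))
              .append("\": \"").append(escape(e.getValue())).append("\"");
        }
        return sb.append("\n}").toString();
    }

    static Map<String, String> parse(String json) {
        Map<String, String> tags = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            tags.put(unescape(m.group(1)), unescape(m.group(2)));
        }
        return tags;
    }

    // Escapes only backslashes and double quotes (sufficient for this sketch).
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Drops each escaping backslash and keeps the character that follows it.
    private static String unescape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) c = s.charAt(++i);
            sb.append(c);
        }
        return sb.toString();
    }
}
```

On the first call for a site, `resolve` writes the detected tags to the file; on later calls it returns whatever the file contains, so manual edits to the JSON survive re-detection.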

Body text recognition uses a text-density algorithm. The original Python version: https://github.com/kingname/GeneralNewsExtractor/
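To give a feel for the text-density idea, here is a simplified sketch. It is not this project's code and not the full scoring used by GeneralNewsExtractor: it only computes (text characters minus link-text characters) divided by (tag count minus link-tag count) for a subtree, counting the subtree's root tag as well. `DensityNode` is a hypothetical stand-in for a parsed DOM element (in practice a jsoup `Element` would play that role), and `densestChild` is an illustrative helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical minimal DOM node used only for this sketch. */
final class DensityNode {
    final String tag;
    final String ownText;
    final List<DensityNode> children = new ArrayList<>();

    DensityNode(String tag, String ownText) {
        this.tag = tag;
        this.ownText = ownText;
    }

    DensityNode add(DensityNode child) {
        children.add(child);
        return this;
    }
}

final class TextDensitySketch {
    // Subtree totals: [0] text chars, [1] text chars inside links, [2] tags, [3] <a> tags.
    private static int[] count(DensityNode node, boolean insideLink) {
        boolean isLink = node.tag.equals("a");
        boolean inLink = insideLink || isLink;
        int chars = node.ownText.trim().length();
        int[] totals = { chars, inLink ? chars : 0, 1, isLink ? 1 : 0 };
        for (DensityNode child : node.children) {
            int[] c = count(child, inLink);
            for (int i = 0; i < totals.length; i++) {
                totals[i] += c[i];
            }
        }
        return totals;
    }

    /** Simplified density: non-link text characters per non-link tag. */
    static double density(DensityNode node) {
        int[] t = count(node, false);
        int tags = Math.max(1, t[2] - t[3]); // guard against division by zero
        return (double) (t[0] - t[1]) / tags;
    }

    /** Returns the direct child with the highest density, or null if there are no children. */
    static DensityNode densestChild(DensityNode parent) {
        DensityNode best = null;
        double bestScore = -1;
        for (DensityNode child : parent.children) {
            double score = density(child);
            if (score > bestScore) {
                best = child;
                bestScore = score;
            }
        }
        return best;
    }
}
```

A navigation block made up almost entirely of links scores close to zero, while a block of paragraphs scores high, which is why density is a useful signal for locating the article body.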

Third-party tools

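The program relies on three kinds of third-party jars: okhttp/htmlunit for fetching pages, plus jsoup and regular expressions for parsing. For pages that load content through ajax requests, use htmlunit, which can retrieve the JSON data. For pages without ajax requests, okhttp is sufficient.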

Running

The entry point is main in App.java. The following URLs have been tested so far:

String url = "https://news.sina.cn/gn/?from=wrap";
String url = "http://www.cjddsb.com/ym/xhy/";
String url = "https://readhub.cn/topics";
String url = "http://www.xinhuanet.com/fortune/";
String url = "https://news.163.com/world/";
String url = "http://military.people.com.cn/";
String url = "https://new.qq.com/tag/82542";
String url = "https://www.toutiao.com/";
String url = "http://gongyi.hebnews.cn/";
String url = "http://cn.chinadaily.com.cn/";
