Repository Details
A tool for extracting the main text from web pages.
Free, MIT
Stars: 484
Chinese: Yes
Language: HTML
Active: No
Contributors: 3
Issues: 2
Organization: No
Latest: None
Forks: 110
License: MIT
A tool for extracting the main text from web pages. It is a Python version of the [cx-extractor](https://github.com/chrislinan/cx-extractor/blob/master/%E5%9F%BA%E4%BA%8E%E8%A1%8C%E5%9D%97%E5%88%86%E5%B8%83%E5%87%BD%E6%95%B0%E7%9A%84%E9%80%9A%E7%94%A8%E7%BD%91%E9%A1%B5%E6%AD%A3%E6%96%87%E6%8A%BD%E5%8F%96%E7%AE%97%E6%B3%95.pdf) algorithm. It improves on the original algorithm so that it supports both Chinese and English, and it works well for extracting the body text of news pages. Example code:

```python
from crawler.cx_extractor_Python import cx_extractor_Python

cx = cx_extractor_Python()
# Fetch the raw HTML of a news page
test_html = cx.getHtml('http://news.163.com/16/0101/10/BC84MRHS00014AED.html')
# Filter out the HTML tags, then extract the main text
content = cx.filter_tags(test_html)
s = cx.getText(content)
print(s)
```
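The linked paper's title describes extraction based on a line-block distribution function. As a rough illustration of that general idea, and not the code of this repository, the sketch below strips tags, scores each line by the amount of text in a small window of consecutive lines, and returns the densest contiguous region. The names `strip_tags`, `extract_main_text`, `block_width` and `threshold` are hypothetical, and the default values are assumptions chosen only for illustration.

```python
import re


def strip_tags(html: str) -> str:
    """Drop scripts, styles, comments and tags while keeping line breaks."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    return re.sub(r"(?s)<[^>]+>", "", html)


def extract_main_text(html: str, block_width: int = 3, threshold: int = 40) -> str:
    """Simplified line-block sketch (assumed parameters, not the library's)."""
    # Normalise whitespace per line; keep single spaces so English stays readable.
    lines = [" ".join(line.split()) for line in strip_tags(html).splitlines()]
    lengths = [len(line) for line in lines]
    # Score of block i = characters in lines i .. i + block_width - 1.
    scores = [sum(lengths[i:i + block_width]) for i in range(len(lines))]

    best_span, best_size = (0, 0), 0
    start = None
    # The trailing 0 is a sentinel that closes a run reaching the last line.
    for i, score in enumerate(scores + [0]):
        if score > threshold and start is None:
            start = i
        elif score <= threshold and start is not None:
            # The last dense block starts at i - 1 and spans block_width lines.
            end = min(i - 1 + block_width, len(lines))
            size = sum(lengths[start:end])
            if size > best_size:
                best_span, best_size = (start, end), size
            start = None

    first, last = best_span
    return "\n".join(line for line in lines[first:last] if line)
```

Short navigation and footer lines yield low block scores, so they fall outside the selected region, while consecutive long paragraphs form one dense run. Real pages would likely need tuning of both parameters.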
Included in:
Vol.30
