Tag Archives: scraping

Goose – article extractor

Project Goose is an article extractor written in Java, using Maven for dependency management. It is an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, perform calculations and extract the main text of the article, as well as recommend which image on the page is likely to be the most relevant. Goose aims to be an easy-to-use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.

(Full Story: Goose – article extractor)

Overview: Extracting article text from HTML documents

In the world of web scraping, text mining and article reading utilities (such as the Readability bookmarklet), there is an ever-growing demand for utilities capable of distinguishing the parts of an HTML document that represent an article from other common website building blocks such as menus, headers, footers and ads.

(Full Story: Overview: Extracting article text from HTML documents)
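
To make the problem concrete, here is a minimal sketch of one common heuristic for separating article text from page chrome: keep text blocks that are reasonably long and contain little link text. This is not how Goose or any tool listed here works internally; the names `BlockCollector` and `extract_article_text`, the tag lists, and the thresholds `min_chars` and `max_link_density` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Tags that start or end a text block (also an assumption).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Collects text per block and counts how many characters sit inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # list of (text, link_chars) tuples
        self._text = []
        self._link_chars = 0
        self._skip_depth = 0
        self._in_link = 0

    def _flush(self):
        # Collapse whitespace and store the block if it has any text.
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text inside menus, footers, scripts, etc.
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()


def extract_article_text(html, min_chars=80, max_link_density=0.3):
    """Return blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
    return "\n\n".join(kept)
```

Calling `extract_article_text(page_html)` on a typical page should return the longer paragraphs and drop navigation menus, footers and link lists. Real extractors combine many more signals, so treat the thresholds as starting points to tune rather than recommended values.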

Maltego – data forensics application

Maltego is an open source intelligence and forensics application. It offers timely mining and gathering of information, as well as the representation of this information in an easy-to-understand format.

(Full Story: Maltego – data forensics application)

Needlebase – where data comes together

Needle is a platform for acquiring, integrating, cleansing, analyzing and publishing data on the web. It is used through a web browser, without programmers or DBAs.

(Full Story: Needlebase – where data comes together)

Resources for text extraction from HTML documents

A list of research papers, articles, web APIs, libraries and other software.

(Full Story: Resources for text extraction from HTML documents)

