Project Goose is an article extractor written in Java, using Maven for dependency management. It’s an open source project born from Gravity Labs (http://gravity.com). Its goal is to take a webpage, perform calculations on it, and extract the main text of the article, as well as recommend which image is likely the most relevant one on the page. Goose aims to be an easy-to-use, scalable extractor that can plug into any application that needs to extract structure from unstructured web pages.
(Full Story: Goose – article extractor)


May 16, 2011
