Krwkrw is a web scraper. You can read how it came into being here.
A quick rundown of the changes in 0.1.3:
- Ability to express URLs to be included or excluded using a regex pattern. For example:
Krwkrw crawler = new Krwkrw(action);
crawler.match("(\\S+)(/projects/)(\\S+)");
ensures that only content under the /projects/ path is processed, while
Krwkrw crawler = new Krwkrw(action);
crawler.skip("(\\S+)(/projects/)(\\S+)");
fetches and processes all content except that under the /projects/ path.
- Ability to have random delays between requests.
Previously it was only possible to set a fixed number of seconds to wait between requests. For example:
Krwkrw crawler = new Krwkrw(action);
crawler.setDelay(5); // waits 5 seconds between requests
With the 0.1.3 release, it is possible to have a random delay; that is, each request is delayed by a number of seconds picked randomly between a lower and an upper bound. For example:
Krwkrw crawler = new Krwkrw(action);
// waiting seconds will be any number between 5 and 20
crawler.setDelay(5, 20);
- Change in API: the doKrawl method has been replaced with crawl.
- Fixed an issue where it was possible for the crawler to crawl pages outside the origin URL.
- Some minor improvements here and there...
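To make the match/skip pattern from the first item more concrete, here is a minimal sketch of how that regex behaves against a few URLs, using only java.util.regex. The class UrlPatternDemo, the method matchesProjects and the example.com URLs are made up for illustration and are not part of Krwkrw; the sketch also assumes the whole URL has to match the pattern, which may not be exactly how Krwkrw applies it internally.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Krwkrw: shows which URLs the example pattern accepts.
class UrlPatternDemo {

    // The same pattern used with match(...) and skip(...) in the release notes.
    static final Pattern PROJECTS = Pattern.compile("(\\S+)(/projects/)(\\S+)");

    // Assumption: the entire URL must match the pattern (Matcher.matches()).
    static boolean matchesProjects(String url) {
        return PROJECTS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/projects/krwkrw",
            "http://example.com/about",
            "http://example.com/projects/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (matchesProjects(url) ? "match" : "no match"));
        }
    }
}
```

Under that assumption only the first URL matches: the pattern needs at least one non-whitespace character both before and after /projects/, so the bare /projects/ listing page itself does not match.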
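The random delay item above can be pictured with a small standalone sketch. This is not Krwkrw's implementation; RandomDelayDemo and pickDelaySeconds are hypothetical names, and the sketch assumes both bounds are inclusive, which the release notes do not state.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration, not Krwkrw's code: picks a delay between two bounds.
class RandomDelayDemo {

    // Assumption: both min and max are inclusive.
    static int pickDelaySeconds(int min, int max) {
        if (min < 0 || max < min) {
            throw new IllegalArgumentException("expected 0 <= min <= max");
        }
        // nextLong's upper bound is exclusive, hence max + 1 (as a long to avoid overflow).
        return (int) ThreadLocalRandom.current().nextLong(min, (long) max + 1);
    }

    public static void main(String[] args) {
        // Print a few sample delays for the 5 to 20 second range from the example.
        for (int i = 0; i < 3; i++) {
            System.out.println("would wait " + pickDelaySeconds(5, 20) + " seconds");
        }
    }
}
```

A crawler would then sleep for the chosen number of seconds before each request, so the interval between requests varies instead of being fixed.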
To use it with Maven, add the dependency:
<dependency>
  <groupId>com.blogspot.geekabyte.krwkrw</groupId>
  <artifactId>krwler</artifactId>
  <version>0.1.3</version>
</dependency>
If using Gradle, then:
dependencies {
  compile "com.blogspot.geekabyte.krwkrw:krwler:0.1.3"
}
The Javadoc is accessible online here.
You can also check out Krwkrw on GitHub.