geekAbyte: Krwkrw 0.1.3 Released

I just pushed the latest release (0.1.3) of Krwkrw to Maven Central.

Krwkrw is a web scraper. You can read how it came into being here.

A quick rundown of the changes in 0.1.3:

Ability to specify the URLs to be included or excluded using a regex pattern. For example:
```
Krwkrw crawler = new Krwkrw(action);
crawler.match("(\\S+)(/projects/)(\\S+)");
```
ensures that only content under the /projects/ path is processed, while
```
Krwkrw crawler = new Krwkrw(action);
crawler.skip("(\\S+)(/projects/)(\\S+)");
```
will fetch and process all content except that under the /projects/ path.
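To see which URLs the pattern above accepts, here is a minimal sketch using plain `java.util.regex`. It does not use Krwkrw itself: the class `UrlPatternSketch` and its helper `matchesProjects` are hypothetical names, the example URLs are made up, and it assumes the whole URL has to match the pattern, which may differ from how Krwkrw applies it internally.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Krwkrw: it only shows which URLs the
// pattern from the examples above accepts.
class UrlPatternSketch {

    // Same pattern as in the match/skip examples.
    static final Pattern PROJECTS = Pattern.compile("(\\S+)(/projects/)(\\S+)");

    // Assumption: the whole URL must match the pattern.
    static boolean matchesProjects(String url) {
        return PROJECTS.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/projects/krwkrw",
            "http://example.com/about",
            "http://example.com/projects/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + matchesProjects(url));
        }
    }
}
```

Under that assumption only the first URL matches. The last one does not, because the trailing `(\\S+)` requires at least one character after /projects/.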
Ability to have random delays in between requests.
Previously, it was only possible to set a fixed number of seconds to wait between requests. For example:
```
Krwkrw crawler = new Krwkrw(action);
crawler.setDelay(5); // waits 5 seconds between requests
```
With the 0.1.3 release, it is possible to have a random delay: each request is delayed by a number of seconds picked randomly between a lower and an upper bound. For example:
```
Krwkrw crawler = new Krwkrw(action);
// waits a randomly picked number of seconds between 5 and 20
crawler.setDelay(5, 20);
```
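The idea of a random delay can be sketched in plain Java as follows. This is not Krwkrw's actual implementation: `RandomDelaySketch` and `pickDelaySeconds` are hypothetical names, and treating both bounds as inclusive is an assumption.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not Krwkrw's code: picks a delay between two bounds.
class RandomDelaySketch {

    // Returns a number of seconds in [lower, upper].
    // Assumption: both bounds are inclusive.
    static int pickDelaySeconds(int lower, int upper) {
        if (lower < 0 || upper < lower) {
            throw new IllegalArgumentException("expected 0 <= lower <= upper");
        }
        // nextInt's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextInt(lower, upper + 1);
    }

    public static void main(String[] args) {
        // A crawler would sleep for the picked number of seconds before each
        // request; here the values are only printed.
        for (int i = 0; i < 3; i++) {
            System.out.println("next delay: " + pickDelaySeconds(5, 20) + "s");
        }
    }
}
```

Picking a fresh value before every request, rather than once at start-up, is what makes the gaps between requests irregular.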
API change: the doKrawl method has been replaced with crawl.
Fixed an issue where the crawler could crawl pages outside the origin URL.
Some minor improvements here and there...

If you use Maven as your build tool, you can add it to your project via:

<dependency>
    <groupId>com.blogspot.geekabyte.krwkrw</groupId>
    <artifactId>krwler</artifactId>
    <version>0.1.3</version>
</dependency>

If you use Gradle, then:

dependencies {
    compile "com.blogspot.geekabyte.krwkrw:krwler:0.1.3"
}

The Javadoc is accessible online here.

You can also check out Krwkrw on GitHub.

Saturday, September 19, 2015
