Written by: David Vlijmincx

Introduction
This article shows a simple web crawler built with virtual threads. The crawler fetches web pages and extracts new URLs from them to crawl next. I am using virtual threads because they are cheap to create, so I can run many of them simultaneously. Virtual threads also make blocking very cheap; it is not a problem, for example, when a thread has to wait for a response from a web server. If you want to learn more about virtual threads, please see this post.
Building the scraper with virtual threads
The idea for this crawler is to have one virtual thread per URL, so other threads can run while a thread is blocked waiting for a web page. In the code below, you see the entire web crawler class. In the start method, a while loop takes a URI from a deque and submits it to an ExecutorService.
Virtual threads make this crawler a bit more special than crawlers that use the older platform threads. The try-with-resources statement opens two executor services: one for the requests the HttpClient sends and one for finding URLs in the HTTP response. The order of the executor services in the try statement is important because of ordered cancellation: try-with-resources closes resources in reverse declaration order. We can't close the executorService that the HttpClient uses before closing the executorService that processes the response.
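The close order can be demonstrated with a small sketch. The NamedResource record below is hypothetical (not part of the crawler); it only prints its name when closed, showing that the resource declared last is closed first:

```java
public class CloseOrderDemo {
    // A stand-in for an ExecutorService that reports when it is closed.
    record NamedResource(String name) implements AutoCloseable {
        @Override
        public void close() {
            System.out.println("closing " + name);
        }
    }

    public static void main(String[] args) {
        // Same declaration order as in the crawler: the HttpClient's
        // executor first, the response-processing executor second.
        try (var httpClientExecutorService = new NamedResource("httpClientExecutorService");
             var executor = new NamedResource("executor")) {
            System.out.println("working");
        }
        // try-with-resources closes in reverse declaration order,
        // so "executor" is closed before "httpClientExecutorService".
    }
}
```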
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.regex.Pattern;

public class WebCrawler {

    Pattern urlRegex = Pattern.compile("[-a-zA-Z\\d@:%._+~#=]{1,256}\\.[a-zA-Z\\d()]{1,6}\\b([-a-zA-Z\\d()@:%_+.~#?&/=]*)");
    // crawl() runs on many virtual threads at once, so the set must be thread-safe
    Set<URI> foundURIs = ConcurrentHashMap.newKeySet();
    LinkedBlockingDeque<URI> deque = new LinkedBlockingDeque<>();

    public void start(URI startURI) {
        deque.add(startURI);
        // Declaration order matters: executor (declared last) is closed first,
        // so the HttpClient's executor stays available while responses are processed.
        try (ExecutorService httpClientExecutorService = Executors.newVirtualThreadPerTaskExecutor();
             ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.ALWAYS)
                    .connectTimeout(Duration.ofSeconds(1))
                    .executor(httpClientExecutorService)
                    .build();

            while (foundURIs.size() < 5) {
                try {
                    URI uri = deque.take(); // blocks until a URI is available
                    System.out.println("uri = " + uri);
                    executor.submit(() -> crawl(uri, client));
                } catch (InterruptedException e) {
                    throw new RuntimeException(e);
                }
            }
        }
        System.out.println("foundURIs = " + foundURIs);
    }

    private void crawl(URI uri, HttpClient client) {
        var request = HttpRequest.newBuilder()
                .uri(uri)
                .GET()
                .build();
        try {
            var response = client.send(request, HttpResponse.BodyHandlers.ofString());
            urlRegex.matcher(response.body())
                    .results()
                    .map(m -> m.group(0))
                    .map(s -> response.uri().resolve(s))
                    .forEach(s -> {
                        if (foundURIs.add(s)) {
                            deque.add(s);
                        }
                    });
        } catch (Exception e) {
            System.out.println("Failed to parse URI: " + uri);
        }
    }
}
To start the web crawler, you only have to create an instance of the class and call the start() method with an initial URL.
WebCrawler webCrawler = new WebCrawler();
webCrawler.start(URI.create("https://www.davidvlijmincx.com/"));
Conclusion
In this post, we looked at a simple web crawler that uses virtual threads. We went over how it works and where the threads are created and managed. We also saw a case where the order of the executors is essential because of ordered cancellation.