Apache Lucene slow

5/17/2023

A week ago, I described the nightly benchmarks we use to catch any unexpected slowdowns in Lucene's performance. Back then the graphs were rather boring (a good thing), but not anymore! Have a look at the stunning jumps in Lucene's indexing rate; the graph marks what changed on dates A, B, C and D.

Previously we were around 102 GB of plain text per hour, and now it's about 270 GB/hour. That's about 2.65 times the old rate (265% of it)! Lucene now indexes all of Wikipedia's 23.2 GB (English) export in 5 minutes and 10 seconds.

How did this happen? Concurrent flushing. That new feature, having lived on a branch for quite some time, undergoing many fun iterations, was finally merged back to trunk about a week ago.

Before concurrent flushing, whenever IndexWriter needed to flush a new segment, it would stop all indexing threads and hijack one thread to perform the rather compute-intensive flush. This was a nasty bottleneck on computers with highly concurrent hardware: flushing was inherently single-threaded. I described this problem previously. But with concurrent flushing, each thread freely flushes its own segment even while other threads continue indexing.

Note that there are two separate jumps in the graph. The first jump, the day concurrent flushing landed (labelled as B on the graph), shows the improvement while using only 6 threads and a 512 MB RAM buffer during indexing. Those settings resulted in the fastest indexing rate before concurrent flushing. The second jump (labelled as D on the graph) happened when I increased the indexing threads to 20 and dropped the RAM buffer to 350 MB, giving the fastest indexing rate after concurrent flushing.

One nice side effect of concurrent flushing is that you can now use RAM buffers well over 2.1 GB, as long as you use multiple threads. Curiously, I found that larger RAM buffers slow down the overall indexing rate. This might be because of the discontinuity when closing IndexWriter, when we must wait for all the RAM buffers to be written to disk. It would be better to measure the steady-state indexing rate, while indexing an effectively infinite content source and ignoring the startup and ending transients. I suspect that if I measured that instead, we'd see gains from larger RAM buffers, but this is just speculation at this point.

There were some very challenging changes required to make concurrent flushing work, especially around how IndexWriter handles buffered deletes. Simon Willnauer has done a great job describing these changes elsewhere.

Remember that this change only helps you if you have concurrent hardware, you use enough threads for indexing, and there's no other bottleneck (for example, in the content source that provides the documents). Also, if your IO system can't keep up, then it will bottleneck your CPU concurrency.
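
To make the difference between the two flushing strategies a bit more concrete, here is a toy model in plain Java. It is only a sketch and not Lucene code: the class `FlushModel`, its methods and the `docsPerFlush` threshold are all hypothetical names invented for illustration, and for simplicity each thread keeps a private buffer in both modes. In "stop-the-world" mode a flush takes an exclusive lock, so every indexing thread stalls until the segment is written; in "concurrent" mode each thread writes its own segment while the others keep indexing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Toy model of stop-the-world vs. concurrent flushing. Hypothetical, not Lucene's implementation. */
final class FlushModel {
    // Each flushed "segment" is recorded only as its document count.
    private final ConcurrentLinkedQueue<Integer> segments = new ConcurrentLinkedQueue<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final boolean concurrentFlush;
    private final int docsPerFlush;

    FlushModel(boolean concurrentFlush, int docsPerFlush) {
        this.concurrentFlush = concurrentFlush;
        this.docsPerFlush = docsPerFlush;
    }

    /** One indexing thread: buffer documents privately, flush when the buffer is full. */
    void indexDocs(int threadId, int numDocs) {
        List<String> buffer = new ArrayList<>();
        for (int i = 0; i < numDocs; i++) {
            lock.readLock().lock(); // indexing only needs the shared lock
            try {
                buffer.add("doc-" + threadId + "-" + i);
            } finally {
                lock.readLock().unlock();
            }
            if (buffer.size() >= docsPerFlush) {
                flush(buffer);
            }
        }
        if (!buffer.isEmpty()) {
            flush(buffer); // final partial segment, like closing the writer
        }
    }

    private void flush(List<String> buffer) {
        if (concurrentFlush) {
            writeSegment(buffer); // other threads keep indexing meanwhile
        } else {
            lock.writeLock().lock(); // exclusive: all indexing threads stall here
            try {
                writeSegment(buffer);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    /** Stand-in for the real work of writing a segment to disk. */
    private void writeSegment(List<String> buffer) {
        segments.add(buffer.size());
        buffer.clear();
    }

    public int segmentCount() {
        return segments.size();
    }

    public int totalDocs() {
        int total = 0;
        for (int size : segments) {
            total += size;
        }
        return total;
    }

    /** Runs the given number of indexing threads to completion and returns the model. */
    public static FlushModel run(boolean concurrentFlush, int threads, int docsPerThread, int docsPerFlush) {
        FlushModel model = new FlushModel(concurrentFlush, docsPerFlush);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.execute(() -> model.indexDocs(id, docsPerThread));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for indexing", e);
        }
        return model;
    }

    public static void main(String[] args) {
        for (boolean concurrent : new boolean[] {false, true}) {
            FlushModel m = run(concurrent, 4, 1000, 100);
            System.out.println((concurrent ? "concurrent" : "stop-the-world")
                + ": " + m.totalDocs() + " docs in " + m.segmentCount() + " segments");
        }
    }
}
```

Both modes end up with the same documents and segments; what differs is who has to wait during a flush. The model says nothing about real throughput, which depends on the hardware, the thread count, the RAM buffer size and the IO system, as described above.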
