Side-by-side comparison of a traditional keyword search (stemmed/unstemmed/phrase) with a blended keyword and k-NN search over Canberra Times news articles from 1994, using CLIP with the ViT-L-14::openai model (vector length 768) or OpenAI ada-002 (vector length 1536), SOLR 9.1/Lucene 9.3 (vector length patch) with HNSW vector searching, and Stanford NLP for entity extraction



Traditional keyword TF/IDF searching, with boosting for phrase and exact (non-stemmed) word forms, has been the basis of searching for decades, including in Trove. It is great when the keywords used match the words used in the sought text, and when the sought text has been perfectly OCRed or corrected. It is less great when searching for more abstract or broader concepts, or when things can be expressed in several ways, and it is very fragile when the OCRed text is less than perfect.

However, semantic dense-vector-based searching, as also used by major search engines such as Google and Bing, is now implicitly expected by the community: it aims to capture the essence or semantics behind the user-provided search, and to find text that best matches those semantics. Keyword and dense-vector searching have complementary strengths and weaknesses. How best to combine them is an open research topic, but the motivation for this experiment is to see whether a naive approach can provide better results than the traditional keyword-only approach.

See also a prototype chat interface to 1994 Canberra Times articles


This is an experiment. About 47K news articles from The Canberra Times from 1994 were processed as follows:

  1. Text cleaned up a bit with regexp to remove obvious crud and reinstate some hyphenated words.

  2. Stanford NLP package used to identify named entities of type person, location, organisation and misc.

  3. CLIP ONNX Server (from CLIP-as-a-service) using the ViT-L-14::openai model generates a classification vector of length 768 for each article. (The ViT-B-32 512-length model has also been configured - not sure which is best.) Sentences of fewer than 4 words are dropped; long sentences are truncated to 35 words. Sentence vectors are weighted by ln(word-count) and summed to produce an article vector, which is then normalised. The first-sentence vector is also normalised, and both it and the article vector are indexed.
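Step 3 can be sketched in Python as follows (a sketch only: the `embed` function stands in for a call to the CLIP ONNX server and is a placeholder, not the prototype's actual code):

```python
import numpy as np

MIN_WORDS, MAX_WORDS = 4, 35

def article_vector(sentences, embed):
    """Combine per-sentence CLIP vectors into one article vector.

    sentences: list of word-lists; embed: function returning a 1-D vector
    for a sentence string (here a placeholder for the CLIP ONNX server).
    Sentences of fewer than 4 words are dropped, long ones truncated to
    35 words; each vector is weighted by ln(word-count), summed, and the
    total normalised to unit length."""
    total = None
    for words in sentences:
        if len(words) < MIN_WORDS:
            continue                       # too short: not enough signal
        words = words[:MAX_WORDS]          # truncate long sentences
        v = np.asarray(embed(" ".join(words)), dtype=np.float32)
        v = v * np.log(len(words))         # weight by ln(word-count)
        total = v if total is None else total + v
    return total / np.linalg.norm(total)   # normalise so dot product works
```

Normalising up front means a plain dot product can serve as the similarity function at query time, as noted in the 10Sep22 entry below.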

  4. For only the 1994 Canberra Times articles, OpenAI ada-002 embeddings are generated from the first 1200 words of each article and indexed (a modified Lucene 9.3 DenseVector implementation supporting vectors up to 2048 elements long was used, as ada-002 produces vectors of length 1536).

  5. Article text, named entities and the classification vector are indexed by SOLR 9.1 using the Dense vector search recently added to SOLR/Lucene 9, specifically the Hierarchical Navigable Small Worlds (HNSW) search graph to implement an approximate K-nearest-neighbours search.

  6. A simple node.js app is used to search. For each search:

    • An initial standard search query is constructed using the supplied search words (anywhere, stemmed/unstemmed, boost for phrase).
    • Classification vectors are retrieved along with the articles from this initial search. Up to the first 10 of these are used for up to 10 vector similarity searches, ORed together and weighted by the log of their keyword TF/IDF score, to find documents similar to the top 10 keyword results. Also ORed into this second search is a vector search on an embedding generated from the query string, and finally the same keyword search used for the first search. Configurable boost factors for the three components produce the blended result shown in the 2nd column. Results in the second column are annotated if their ranking changed due to the blending (Promoted/Demoted annotations and pastel-coloured backgrounds), or if they were inserted into the results (ie, not found by the first search - these items have a dark green background).
    • A comparative link to the current Trove search (title, year, article constraints applied) is rendered as are links to the articles found in Trove.
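The construction of the blended second-pass query can be sketched as below. This is a sketch under assumptions: the `fulltext` and `article_vector` field names, the `topK` value, and the exact shape of the Solr `{!knn}` clause are illustrative, not the prototype's actual schema or syntax.

```python
import math

def blended_query(keywords, query_vec, doc_vecs_scores,
                  kw_boost=1.0, doc_sim_boost=0.5, query_sim_boost=1.0):
    """Build the blended second-pass query as one big OR (a sketch).

    doc_vecs_scores: [(classification_vector, tfidf_score), ...] for up
    to 10 documents returned by the initial keyword search."""
    def fmt(v):
        return "[" + ",".join(f"{x:.6f}" for x in v) + "]"
    # 1. the same keyword search used for the first pass
    clauses = [f"(fulltext:({keywords}))^{kw_boost}"]
    # 2. a kNN search on the embedding of the query string itself
    clauses.append(
        f"({{!knn f=article_vector topK=50}}{fmt(query_vec)})^{query_sim_boost}")
    # 3. kNN searches on each keyword hit's vector, weighted by the
    #    log of its keyword TF/IDF score
    for v, score in doc_vecs_scores[:10]:
        w = doc_sim_boost * math.log(1 + score)
        clauses.append(
            f"({{!knn f=article_vector topK=50}}{fmt(v)})^{w:.3f}")
    return " OR ".join(clauses)
```

The three boost parameters correspond to the configurable boost factors mentioned above; weighting vector-similarity clauses in a multi-clause query relies on the Lucene 9.3 change noted in the 22Dec22 entry below.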

The results are very promising when searching for general concepts, allowing people to find more relevant articles. Ada-002 performs considerably better than CLIP. But classification searches are not of much value when searching for specific terms such as a person's name: the classification vector typically has too little input and cannot provide the high relevance of a keyword search. Vector-based similarity seems much better suited to "more articles like this" than the "bag of words" approach. Some illuminating and promising searches:


TODO and open problems

  1. Ada-002 can accept ~6K words for embedding into a single output vector (although we have used a max of 1200 in this test). But CLIP cannot, and is really designed to handle just a single sentence. So for CLIP, how best to combine small (currently per-sentence) vectors to produce a document vector? For some articles, such as Letters to the editor, which cover many topics, this is probably fruitless. Should we instead be indexing sub-articles - sequences of 'coherent' sentences that have very similar vectors? That is, partitioning an article into sequences and indexing each sequence?

  2. How to blend keyword and vector search results? How to weigh keyword/phrase scores and kNN scores? Combining other 'signals' such as previous search history, location, ... (getting too creepy?)

  3. Use wikidata to resolve entities. Consider using context from this resolution to create an entity hierarchy and apply it at index time. For example, Paul Keating is a member of the entities ALP and Federal Parliament. Hence we could show a facet search on ALP members and have it include Paul Keating, Kevin Rudd... AND we could answer questions such as 'articles about ALP members and decentralisation'.

  4. Is it possible to use wikipedia articles to generate a characterisation vector for some topic or concept and to then search articles using that vector, to find articles representing that topic or concept? This is intriguing because if you are interested in, say, cigarette advertising, do you search for tobacco, or nicotine, or smoking or cigarettes, and advertising, or marketing or promotion, or ... - or do you look for articles that have the vibe (or at least the language/vocab vibe) of Wikipedia's article on nicotine marketing?

  5. Embeddings are thrown by OCR errors (as is entity recognition). Investigate effect of OCR correction.

  6. Small text strings do not contain enough context for useful vectorisation, and hence for searching based on proximity to that vectorisation. So it seems that the search terms first have to find some good text representative of the contents of the articles the searcher is hoping to find. Maybe the newspaper corpus can provide this; maybe wikipedia can too. But perhaps this approach is too crude - perhaps creating vectors from the text of relevant paragraphs of keyword-only matches is best? Also, the newspaper archive's contents cover 200 years - language use changes, and surely the best results will be cognisant of that?

  7. Encoders can be fine-tuned to improve domain-specific classification - worth it for newspapers? Or would fine-tuning need to be performed on, say, a per-decade basis?

  8. Is there any point encoding and using this approach on non-news articles (ie, family notices, lists, ads)? Maybe not, although entity identification may still be useful on these.

  9. ada-002/CLIP/BERT/word2vec/GloVe: what's best (cheapest/fastest/most useful) for this use-case? Also, what is the optimal encoding length (we're currently using 768 for CLIP, 1536 for ada-002)?

  10. What about sparse semantic representations such as SPLADE v2? They may be much cheaper and faster, and may be a reasonable compromise on a very large corpus.

  11. Stanford NLP entity identification wasn't perfect - are Apache NLP or maybe other NER tools worth trying?

  12. ...or should I augment Stanford NLP with an Australian Gazetteer and human-names list?

  13. How does this scale? HNSW is designed to scale, but... worth investigating Milvus, Haystack NLP, Weaviate... ?

  14. Optimal (for index construction? storage? search recall/relevance? search speed?) HNSW parameters for Lucene?

  15. Comparison with current Trove searches is not exact for reasons including: Trove boosts query words found in an article's title (title is not differentiated in this prototype); Trove has much better handling of apostrophes and hyphenated words (including reconstruction of words broken across lines not performed in this prototype).


8Nov23 - added Jina V2 Base embeddings, which were very slow to generate on a 4-core 2014-vintage Xeon - maybe 20 times slower than bgeBase768 - but at least they support a long context.
17Oct23 - bgeBase768
Made BAAI/bge-base-en-v1.5 the default embedding rather than ada-002. Because this embedding has a 512 token limit, articles were split into chunks of no more than 256 words based on sentence boundaries (as determined by Stanford NLP).

Chunk embedding vectors for an article were combined into an article embedding vector by summing each chunk vector multiplied by a factor and then normalising the result. The factor was empirically determined to be the inverse of the chunk sequence number (starting from 1), multiplied by an embedding weight raised to the fifth power. This empirical determination was made by trying a variety of ways of combining the sequence number and chunk-article similarity score of all of an article's chunks, to best approximate a synthetic embedding with a high similarity to the embedding of the article's summary.
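The chunk-combination formula can be sketched as follows (assuming numpy; the `chunk_weights` are the per-chunk embedding weights predicted by the neural net described in the next paragraph):

```python
import numpy as np

def combine_chunks(chunk_vecs, chunk_weights):
    """Combine chunk embeddings into one article embedding.

    Each chunk vector is multiplied by (weight ** 5) / sequence-number
    (1-based), the results are summed, and the sum is normalised."""
    total = np.zeros_like(chunk_vecs[0], dtype=np.float64)
    for seq, (v, w) in enumerate(zip(chunk_vecs, chunk_weights), start=1):
        total += np.asarray(v, dtype=np.float64) * (w ** 5) / seq
    return total / np.linalg.norm(total)
```

With equal weights the 1/sequence-number factor alone makes earlier chunks dominate the article embedding, matching the intuition that newspaper articles front-load their key content.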

The embedding weight was generated by a neural net trained as follows. Llama-2-70b was used to generate a summary for about 40,000 newspaper articles longer than 250 words. Most of the newspaper articles were from 1948, the rest were from 1994. The summary was limited to 250 words. BAAI/bge-base-en-v1.5 was used to generate an embedding for each summary, as well as for each component chunk, and the (dot product) similarity between the article summary and each of its chunks was calculated. The neural net was trained to predict that chunk-to-containing-article similarity, given the chunk's embedding vector. The thinking behind this approach is that perhaps some chunks and hence their embeddings are mostly 'throat clearing' or otherwise tangential to the article in which they appear, and that important chunks will have embeddings with some characteristics that a neural net could identify, regardless of the article of which they are part. About 115,000 chunk embeddings and similarities were used for training, with two 'held out' sets (one of 1948, one of 1994 articles) totalling about 3000 used for evaluation of the training.

The pytorch neural net was small and simple: 3 layers, ~77K weights; it accepts an embedding of dimension 768 and outputs a float:

  (0): Linear(in_features=768, out_features=96, bias=True)
  (1): ReLU()
  (2): Dropout(p=0.2, inplace=False)
  (3): Linear(in_features=96, out_features=32, bias=True)
  (4): ReLU()
  (5): Dropout(p=0.1, inplace=False)
  (6): Linear(in_features=32, out_features=1, bias=True)

The test results showed reasonable learning of a weight to be used as a proxy of similarity between the embedding of the chunk and that of its containing article. In the following graphs, ideal results would have all points along the diagonal:

The advantage of BAAI/bge-base-en-v1.5 is that it is highly rated as an 'open' and effective retrieval embedding on the HuggingFace Massive Text Embedding Benchmark (MTEB) Leaderboard as of Oct 2023, for a small model (0.44GB) producing an embedding with 768 dimensions. The disadvantage, compared to ada-002, is its small input sequence length of 512 tokens, compared to ada-002's 8191. The above approach to creating a good composite embedding seeks to address this disadvantage.

Qualitatively, the results are good: sometimes better, sometimes worse, but generally quite similar to ada-002 embeddings. Compare for yourself!

TODO: use PQ coding to reduce embeddings from 768 floats to (say) 256 byte codes and evaluate the loss of accuracy this causes.

12Feb23 - ada-002pqcode: PQ code quantisation using a stride of 3 floats, encoding each as 1 byte. This reduces vector size from 1536 floats to 512 bytes, but with some loss of fidelity (compare searches using ada-002pqcode with 'straight' ada-002).
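A minimal sketch of the PQ encode/decode step (the codebooks, one of 256 centroids per 3-float sub-vector position, would be trained offline, e.g. by k-means over a sample of vectors - that training step is omitted here, and this is illustrative code, not the prototype's):

```python
import numpy as np

def pq_encode(vec, codebooks):
    """Product-quantise a 1536-float vector into 512 bytes.

    vec is split into sub-vectors of 3 floats (stride 3); each is
    replaced by the index (one byte) of its nearest centroid in that
    position's codebook. codebooks has shape (1536/3, 256, 3)."""
    subs = np.asarray(vec, dtype=np.float32).reshape(-1, 3)
    codes = np.empty(len(subs), dtype=np.uint8)
    for i, s in enumerate(subs):
        dists = np.linalg.norm(codebooks[i] - s, axis=1)
        codes[i] = np.argmin(dists)        # nearest centroid, 0..255
    return codes

def pq_decode(codes, codebooks):
    """Reconstruct an approximate vector from the byte codes."""
    return np.concatenate([codebooks[i][c] for i, c in enumerate(codes)])
```

For 1536 floats this yields 512 one-byte codes, the 512-byte figure quoted above; the fidelity loss is the distance between each sub-vector and its chosen centroid.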

15Jan23 - ada-002quantUniform: for comparison with the near-optimal quantisation of ada-002quant (below), try a uniform quantisation of 256 buckets across the value range -1 .. +1. The average error per point is 0.000908 for this sample, and the max error is 0.00391. By comparison, ada-002quant's average error is 0.000104 for the majority 'cluster' and 0.000269 for the 'outliers', and the max errors are respectively 0.01379 and 0.01295.
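Uniform quantisation is straightforward to sketch; note that the worst-case reconstruction error is half a bucket width, (2/256)/2 = 0.00390625, which matches the reported max error of 0.00391:

```python
import numpy as np

def quant_uniform(x, lo=-1.0, hi=1.0, levels=256):
    """Uniformly quantise floats in [lo, hi] into `levels` buckets (1 byte)."""
    step = (hi - lo) / levels
    q = np.clip(((np.asarray(x) - lo) / step).astype(int), 0, levels - 1)
    return q.astype(np.uint8)

def dequant_uniform(q, lo=-1.0, hi=1.0, levels=256):
    """Map byte codes back to bucket-centre floats."""
    step = (hi - lo) / levels
    return lo + (q.astype(np.float64) + 0.5) * step
```

The much lower average error of ada-002quant comes from spending its 256 levels where the values actually cluster, instead of spreading them evenly over -1 .. +1.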

11Jan23 - OpenAI's ada-002 embedding consists of 1536 floats. Maybe there are space and performance savings in quantising the floats as single bytes. At least for our materials, the characteristics of the float values seem amenable to efficient quantisation - see curves for values of all positions and the curves for some specific positions, 5 of which (positions 194, 518, 954, 112-, 1246) are grouped as outliers and quantised with an outlier mapping, the rest with a default mapping, both to 1 byte.

Results are very similar (choose the ada-002quant embedding) but not identical. The query is not yet quantised in ada-002quant knn searches.

22Dec22 - After trialling OpenAI's babbage and ada encodings, OpenAI released ada-002: cheaper and, they claim, much better, so tried this. Slightly hacked Lucene 9.3 to allow DenseVectors of length 1536 (generated by ada-002). Lucene 9.3 also fixed a few DenseVector bugs and allowed weighting of vector similarity search clauses in a multi-clause query.
Oct22 - Tried using BERT embeddings, but results were poorer than CLIP, probably due to lack of fine-tuning.
10Sep22 - Rather than striding through text, try generating a vector for each sentence and using a weighted average (weight is ln(token-count)). Also, normalise before indexing so that dot product can be used rather than cosine similarity (which requires more computation).
28Aug22 - I thought the text characterisation model used about 500 text tokens, but I was wrong - it uses only 77. So, this update attempts to 'smear' the characterisation vector by averaging it over multiple vectors returned from passing a 'stride' of 77 tokens at a time (with a 5-token overlap). Maybe this just smears, like mixing colours... Maybe averaging is too crude and the median or some other function is better (research required...).

General background papers/reports Project Computing Pty Ltd