Traditional keyword TF/IDF searching with boosting for phrase and exact (non-stemmed) word form has been the basis for searching for decades, including in Trove. It is great when the keywords used match the words used in the sought text, and when the sought text has been perfectly OCRed or corrected. It is less great when searching for more abstract or broader concepts, or when things can be expressed in several ways, and is very fragile when the OCRed text is less than perfect.
However, semantic dense-vector-based searching, as also used by major search engines such as Google and Bing, is now implicitly expected by the community: it aims to capture the essence or semantics behind the user-provided search, and to find text that best matches those semantics. Both keyword and dense-vector searching have strengths and weaknesses, and they are complementary. How best to combine them is an open research topic, but the motivation for this experiment is to see whether a naive approach can provide results better than the traditional keyword-only approach.
See also a prototype chat interface to 1994 Canberra Times articles
This is an experiment. About 47K news articles from The Canberra Times from 1994 were processed as follows:
Text cleaned up a bit with regexp to remove obvious crud and reinstate some hyphenated words.
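A minimal sketch of this kind of cleanup (the actual regexps used aren't listed here; these patterns are purely illustrative):

```python
import re

def clean_ocr_text(text: str) -> str:
    # Rejoin words hyphenated across line breaks, e.g. "govern-\nment" -> "government"
    text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)
    # Drop short runs of punctuation "crud" floating between words
    text = re.sub(r"\s[^\w\s]{1,3}\s", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```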
Stanford NLP package used to identify named entities of type person, location, organisation and misc.
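A roughly equivalent Python sketch using Stanford's stanza library (which may not be what this prototype actually ran, and whose default English model uses more entity types than the four mentioned above):

```python
import stanza

# One-off model download: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("Paul Keating addressed Federal Parliament in Canberra.")
for ent in doc.ents:
    print(ent.text, ent.type)  # e.g. "Paul Keating PERSON", "Canberra GPE"
```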
CLIP ONNX Server (from CLIP-as-a-service) using the ViT-L-14::openai model generates a classification vector of length 768 for each article. (The ViT-B-32 model, with vectors of length 512, has also been configured - not sure which is best.) Sentences of fewer than 4 words are dropped; long sentences are truncated to 35 words. Sentence vectors are weighted by ln(word count) and then added together to produce an article vector, which is then normalised. The first sentence vector is also normalised, and both the first sentence vector and the article vector are indexed.
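A sketch of that per-article aggregation (numpy-based; assumes the short-sentence filtering and truncation have already happened before encoding):

```python
import numpy as np

def article_vector(sentence_vectors, word_counts):
    # Weight each sentence vector by ln(word count), sum, then normalise
    weighted = [np.log(wc) * np.asarray(vec)
                for vec, wc in zip(sentence_vectors, word_counts)]
    article = np.sum(weighted, axis=0)
    return article / np.linalg.norm(article)  # unit-length article vector
```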
For only the 1994 Canberra Times articles, OpenAI ada-002 embeddings are generated from the first 1200 words of each article and indexed (a modified Lucene 9.3 DenseVector implementation which supports vectors up to 2048 elements long was used, as ada-002 produces vectors of length 1536).
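For reference, generating an ada-002 embedding with the openai Python library (pre-1.0 API, current at the time of this experiment) looks roughly like this; the key handling and truncation details are illustrative:

```python
import openai  # pre-1.0 API of the openai library

openai.api_key = "..."  # your API key

def ada_embedding(text: str, max_words: int = 1200) -> list:
    # Embed the first max_words words of an article with ada-002
    truncated = " ".join(text.split()[:max_words])
    resp = openai.Embedding.create(input=truncated, model="text-embedding-ada-002")
    return resp["data"][0]["embedding"]  # 1536-dimensional vector
```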
Article text, named entities and the classification vector are indexed by SOLR 9.1 using the dense vector search recently added to SOLR/Lucene 9, specifically the Hierarchical Navigable Small World (HNSW) search graph implementing an approximate k-nearest-neighbours search.
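Querying such an index uses SOLR 9's knn query parser; a sketch (the collection and field names here are illustrative, not this prototype's actual schema):

```python
import requests

def knn_search(query_vector, field="article_vector", top_k=10):
    # SOLR 9 knn query parser syntax: {!knn f=<field> topK=<k>}[v1, v2, ...]
    vec = "[" + ",".join(str(float(v)) for v in query_vector) + "]"
    params = {"q": "{!knn f=%s topK=%d}%s" % (field, top_k, vec),
              "fl": "id,title,score"}
    resp = requests.get("http://localhost:8983/solr/articles/select", params=params)
    return resp.json()["response"]["docs"]
```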
A simple node.js app is used to search. For each search:
The results are very promising when searching for general concepts, allowing people to find more relevant articles. Ada-002 performs considerably better than CLIP. But classification searches are not of much value when searching for specific terms such as a person's name: the classification vector typically has too little input and cannot provide the high relevance of a keyword search. Vector-based similarity seems much better for a "more articles like this" search than the "bag of words" approach. Some illuminating and promising searches:
Ada-002 can accept ~6K words for embedding into a single output vector (although we have used a maximum of 1200 in this test). But CLIP cannot, and is really designed to handle just a single sentence. So for CLIP, how best to combine small (currently per-sentence) vectors to produce a document vector? For some articles, such as Letters to the editor which cover many topics, this is probably fruitless. Should we instead be indexing sub-articles - sequences of 'coherent' sentences that have very similar vectors? That is, partitioning an article into sequences and indexing each sequence?
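One naive way such partitioning might work is to greedily group consecutive sentences while their (unit-normalised) vectors stay close; the 0.8 threshold below is an arbitrary placeholder:

```python
import numpy as np

def partition_by_coherence(sentence_vectors, threshold=0.8):
    # Split sentences into runs where adjacent cosine similarity >= threshold
    segments, current = [], [0]
    for i in range(1, len(sentence_vectors)):
        if float(np.dot(sentence_vectors[i - 1], sentence_vectors[i])) >= threshold:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments  # each segment is a list of sentence indices
```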
How to blend keyword and vector search results? How to weight keyword/phrase scores against kNN scores? Combining other 'signals' such as previous search history, location, ... (getting too creepy?)
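One simple, widely used candidate for blending (not something this prototype implements) is reciprocal rank fusion, which ignores raw scores and combines ranks:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Sum 1/(k + rank) per document across the ranked lists;
    # k=60 is the constant suggested in the original RRF paper.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. reciprocal_rank_fusion([keyword_doc_ids, knn_doc_ids])
```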
Use Wikidata to resolve entities. Consider using context from this resolution to create an entity hierarchy and apply it at index time. For example, Paul Keating is a member of the entities ALP and Federal Parliament. Hence we could show a facet search on ALP members and have it include Paul Keating, Kevin Rudd... AND we could answer questions such as "articles about ALP members and decentralisation".
Is it possible to use Wikipedia articles to generate a characterisation vector for some topic or concept, and to then search articles using that vector, to find articles representing that topic or concept? This is intriguing because, if you are interested in, say, cigarette advertising, do you search for tobacco, or nicotine, or smoking, or cigarettes, and advertising, or marketing, or promotion, or ... - or do you look for articles that have the vibe (or at least the language/vocab vibe) of Wikipedia's article on nicotine marketing?
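Combining the earlier sketches, the idea might look like this (nicotine_marketing.txt standing in for a saved copy of the Wikipedia article; ada_embedding and knn_search are the illustrative functions sketched above):

```python
# Characterise a topic by embedding a Wikipedia article, then use that
# vector as the kNN query against the newspaper index.
wiki_text = open("nicotine_marketing.txt").read()
topic_vector = ada_embedding(wiki_text)
for doc in knn_search(topic_vector):
    print(doc["id"], doc.get("title"), doc["score"])
```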
Embeddings are thrown off by OCR errors (as is entity recognition). Investigate the effect of OCR correction.
Small text strings do not contain enough context for useful vectorisation, and hence for searching based on proximity to that vectorisation. So it seems that the search terms first have to find some good text representative of the contents of the articles the searcher is hoping to find. Maybe the newspaper corpus can provide these; maybe Wikipedia can too. But perhaps this approach is too crude - perhaps creating vectors from the text of relevant paragraphs of keyword-only matches is best? Also, the newspaper archive's contents cover 200 years - language use changes, and surely the best results will be cognisant of that?
Encoders can be fine-tuned to improve domain-specific classification - worth it for newspapers? Or would fine-tuning need to be performed on, say, a per-decade basis?
Is there any point encoding and using this approach on non-news articles (ie, family notices, lists, ads)? Maybe not, although entity identification may still be useful on these.
ada-002/CLIP/BERT/word2vec/GloVe: what's best (cheapest/fastest/most useful) for this use case? Also, what is the optimal encoding length? (We're currently using 768 for CLIP, 1536 for ada-002.)
What about sparse semantic representations such as SPLADE v2? They may be much cheaper and faster, and may be a reasonable compromise on a very large corpus.
Stanford NLP entity identification wasn't perfect - is Apache OpenNLP, or maybe another NER tool, worth trying?
...or should I augment Stanford NLP with an Australian Gazetteer and human-names list?
Optimal (for index construction? storage? search recall/relevance? search speed?) HNSW parameters for Lucene?
Comparison with current Trove searches is not exact for reasons including: Trove boosts query words found in an article's title (the title is not differentiated in this prototype); Trove has much better handling of apostrophes and hyphenated words (including reconstruction of words broken across lines, which is not performed in this prototype).
Results are very similar (choose the ada-002quant embedding) but not identical. The query has not yet been quantised on ada-002quant knn searches.