Traditional keyword TF/IDF searching with boosting for phrase and exact (non-stemmed) word form has been the basis for searching for decades, including in Trove. It is great when the keywords used match the words used in the sought text, and when the sought text has been perfectly OCRed or corrected. It is less great when searching for more abstract or broader concepts, or when things can be expressed in several ways, and is very fragile when the OCRed text is less than perfect.
However, semantic dense-vector-based searching, as also used by major search engines such as Google and Bing, is now implicitly expected by the community: it aims to capture the essence or semantics behind the user-provided search, and to find text that best matches those semantics. Both keyword and dense-vector searching have strengths and weaknesses, and they are complementary. How best to combine them is an open research topic, but the motivation for this experiment is to see whether a naive approach can provide results better than the traditional keyword-only approach.
See also a prototype chat interface to 1994 Canberra Times articles
This is an experiment. About 47K news articles from The Canberra Times from 1994 were processed as follows:
Text cleaned up a bit with regexp to remove obvious crud and reinstate some hyphenated words.
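A minimal sketch of this kind of cleanup (the actual regexps used aren't listed here; these patterns are purely illustrative):

```python
import re

def clean_ocr_text(text: str) -> str:
    # Rejoin words hyphenated across line breaks, e.g. "govern-\nment" -> "government"
    text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)
    # Drop short runs of punctuation "crud" floating between words
    text = re.sub(r"\s[^\w\s]{1,3}\s", " ", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```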
Stanford NLP package used to identify named entities of type person, location, organisation and misc.
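A roughly equivalent Python sketch using Stanford's stanza library (which may not be what this prototype actually ran, and whose default English model uses more entity types than the four mentioned above):

```python
import stanza

# One-off model download: stanza.download("en")
nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

doc = nlp("Paul Keating addressed Federal Parliament in Canberra.")
for ent in doc.ents:
    print(ent.text, ent.type)  # e.g. "Paul Keating PERSON", "Canberra GPE"
```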
CLIP ONNX Server (from CLIP-as-a-service) using the ViT-L-14::openai model generates a classification vector of length 768 for each article. (The ViT-B-32 model, with vectors of length 512, has also been configured - not sure which is best.) Sentences of fewer than 4 words are dropped; long sentences are truncated to 35 words. Sentence vectors are weighted by ln(word count) and then added together to produce an article vector, which is then normalised. The first sentence vector is also normalised, and both the first sentence vector and the article vector are indexed.
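A sketch of that per-article aggregation (numpy-based; assumes the short-sentence filtering and truncation have already happened before encoding):

```python
import numpy as np

def article_vector(sentence_vectors, word_counts):
    # Weight each sentence vector by ln(word count), sum, then normalise
    weighted = [np.log(wc) * np.asarray(vec)
                for vec, wc in zip(sentence_vectors, word_counts)]
    article = np.sum(weighted, axis=0)
    return article / np.linalg.norm(article)  # unit-length article vector
```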
For only the 1994 Canberra Times articles, OpenAI ada-002 embeddings are generated from the first 1200 words of each article and indexed (a modified Lucene 9.3 DenseVector implementation which supports vectors up to 2048 elements long was used, as ada-002 produces vectors of length 1536).
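For reference, generating an ada-002 embedding with the openai Python library (pre-1.0 API, current at the time of this experiment) looks roughly like this; the key handling and truncation details are illustrative:

```python
import openai  # pre-1.0 API of the openai library

openai.api_key = "..."  # your API key

def ada_embedding(text: str, max_words: int = 1200) -> list:
    # Embed the first max_words words of an article with ada-002
    truncated = " ".join(text.split()[:max_words])
    resp = openai.Embedding.create(input=truncated, model="text-embedding-ada-002")
    return resp["data"][0]["embedding"]  # 1536-dimensional vector
```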
Article text, named entities and the classification vector are indexed by SOLR 9.1 using the dense vector search recently added to SOLR/Lucene 9, specifically the Hierarchical Navigable Small World (HNSW) search graph implementing an approximate k-nearest-neighbours search.
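Querying such an index uses SOLR 9's knn query parser; a sketch (the collection and field names here are illustrative, not this prototype's actual schema):

```python
import requests

def knn_search(query_vector, field="article_vector", top_k=10):
    # SOLR 9 knn query parser syntax: {!knn f=<field> topK=<k>}[v1, v2, ...]
    vec = "[" + ",".join(str(float(v)) for v in query_vector) + "]"
    params = {"q": "{!knn f=%s topK=%d}%s" % (field, top_k, vec),
              "fl": "id,title,score"}
    resp = requests.get("http://localhost:8983/solr/articles/select", params=params)
    return resp.json()["response"]["docs"]
```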
A simple node.js app is used to search. For each search:
The results are very promising when searching for general concepts, allowing people to find more relevant articles. Ada-002 performs considerably better than CLIP. But classification searches are not of much value when searching for specific terms such as a person's name: the classification vector typically has too little input and cannot provide the high relevance of a keyword search. Vector-based similarity seems much better for a "more articles like this" search than the "bag of words" approach. Some illuminating and promising searches:
Ada-002 can accept ~6K words for embedding into a single output vector (although we have used a maximum of 1200 in this test). But CLIP cannot, and is really designed to handle just a single sentence. So for CLIP, how best to combine small (currently per-sentence) vectors to produce a document vector? For some articles, such as Letters to the editor which cover many topics, this is probably fruitless. Should we instead be indexing sub-articles - sequences of 'coherent' sentences that have very similar vectors? That is, partitioning an article into sequences and indexing each sequence?
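One naive way such partitioning might work is to greedily group consecutive sentences while their (unit-normalised) vectors stay close; the 0.8 threshold below is an arbitrary placeholder:

```python
import numpy as np

def partition_by_coherence(sentence_vectors, threshold=0.8):
    # Split sentences into runs where adjacent cosine similarity >= threshold
    segments, current = [], [0]
    for i in range(1, len(sentence_vectors)):
        if float(np.dot(sentence_vectors[i - 1], sentence_vectors[i])) >= threshold:
            current.append(i)
        else:
            segments.append(current)
            current = [i]
    segments.append(current)
    return segments  # each segment is a list of sentence indices
```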
How to blend keyword and vector search results? How to weight keyword/phrase scores against kNN scores? Combining other 'signals' such as previous search history, location, ... (getting too creepy?)
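One simple, widely used candidate for blending (not something this prototype implements) is reciprocal rank fusion, which ignores raw scores and combines ranks:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    # Sum 1/(k + rank) per document across the ranked lists;
    # k=60 is the constant suggested in the original RRF paper.
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. reciprocal_rank_fusion([keyword_doc_ids, knn_doc_ids])
```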
Use Wikidata to resolve entities. Consider using context from this resolution to create an entity hierarchy and apply it at index time. For example, Paul Keating is a member of the entities ALP and Federal Parliament. Hence we could show a facet search on ALP members and have it include Paul Keating, Kevin Rudd... AND we could answer questions such as "articles about ALP members and decentralisation".
Is it possible to use Wikipedia articles to generate a characterisation vector for some topic or concept, and to then search articles using that vector, to find articles representing that topic or concept? This is intriguing because, if you are interested in, say, cigarette advertising, do you search for tobacco, or nicotine, or smoking, or cigarettes, and advertising, or marketing, or promotion, or ... - or do you look for articles that have the vibe (or at least the language/vocab vibe) of Wikipedia's article on nicotine marketing?
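Combining the earlier sketches, the idea might look like this (nicotine_marketing.txt standing in for a saved copy of the Wikipedia article; ada_embedding and knn_search are the illustrative functions sketched above):

```python
# Characterise a topic by embedding a Wikipedia article, then use that
# vector as the kNN query against the newspaper index.
wiki_text = open("nicotine_marketing.txt").read()
topic_vector = ada_embedding(wiki_text)
for doc in knn_search(topic_vector):
    print(doc["id"], doc.get("title"), doc["score"])
```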
Embeddings are thrown off by OCR errors (as is entity recognition). Investigate the effect of OCR correction.
Small text strings do not contain enough context for useful vectorisation, and hence for searching based on proximity to that vectorisation. So it seems that the search terms first have to find some good text representative of the contents of the articles the searcher is hoping to find. Maybe the newspaper corpus can provide these; maybe Wikipedia can too. But perhaps this approach is too crude - perhaps creating vectors from the text of relevant paragraphs of keyword-only matches is best? Also, the newspaper archive's contents cover 200 years - language use changes, and surely the best results will be cognisant of that?
Encoders can be fine-tuned to improve domain-specific classification - worth it for newspapers? Or would fine-tuning need to be performed on, say, a per-decade basis?
Is there any point encoding and using this approach on non-news articles (ie, family notices, lists, ads)? Maybe not, although entity identification may still be useful on these.
ada-002/CLIP/BERT/word2vec/GloVe: what's best (cheapest/fastest/most useful) for this use case? Also, what is the optimal encoding length? (We're currently using 768 for CLIP, 1536 for ada-002.)
What about sparse semantic representations such as SPLADE v2? They may be much cheaper and faster, and may be a reasonable compromise on a very large corpus.
Stanford NLP entity identification wasn't perfect - is Apache OpenNLP, or maybe another NER tool, worth trying?
...or should I augment Stanford NLP with an Australian Gazetteer and human-names list?
Optimal (for index construction? storage? search recall/relevance? search speed?) HNSW parameters for Lucene?
Comparison with current Trove searches is not exact for reasons including: Trove boosts query words found in an article's title (the title is not differentiated in this prototype); Trove has much better handling of apostrophes and hyphenated words (including reconstruction of words broken across lines, which is not performed in this prototype).
Results are very similar (choose the ada-002quant embedding) but not identical. The query has not yet been quantised on ada-002quant knn searches.