IATI Approaches Overview

Word Embeddings

Approach

We use the Gensim implementation of Word2Vec with the Continuous Bag of Words (CBOW) training method. Word embeddings represent all the words in a corpus in a vector space, where words with similar meanings have similar vector representations.
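
A minimal sketch of the idea (not the project's actual training script), assuming gensim 4.x parameter names: train a small Word2Vec model with CBOW (sg=0) on toy tokenised sentences, then query for similar words.

    from gensim.models import Word2Vec

    # Toy corpus of tokenised, activity-description-style sentences.
    sentences = [
        ["improve", "access", "to", "clean", "water", "in", "rural", "areas"],
        ["provide", "safe", "drinking", "water", "for", "rural", "communities"],
        ["build", "primary", "schools", "and", "train", "teachers"],
        ["teacher", "training", "for", "primary", "education"],
    ]

    model = Word2Vec(
        sentences=sentences,
        vector_size=50,  # small vectors are enough for the toy example
        window=5,
        min_count=1,
        sg=0,            # 0 = CBOW, 1 = skip-gram
    )

    # Words used in similar contexts end up with similar vectors.
    print(model.wv.most_similar("water", topn=3))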

Current Status

  • Building the Word2Vec model from the IATI activity descriptions rather than using a pre-built model (a sketch follows this list).

  • Using 300-dimensional vectors, as suggested to be optimal in previous studies: https://nlp.stanford.edu/pubs/glove.pdf
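
A hedged sketch of what building the model from the activity descriptions might look like; the file name, column name, and preprocessing step are assumptions, and parameter names assume gensim 4.x.

    import re

    import pandas as pd
    from gensim.models import Word2Vec

    # Hypothetical export of IATI activity descriptions.
    df = pd.read_csv("iati_activities.csv")
    texts = df["description"].dropna().astype(str)  # hypothetical column name

    def preprocess(text):
        # Illustrative preprocessing only: lowercase and keep alphabetic tokens.
        return re.findall(r"[a-z]+", text.lower())

    corpus = [preprocess(t) for t in texts]

    model = Word2Vec(
        sentences=corpus,
        vector_size=300,  # 300-dimensional vectors, as noted above
        window=5,
        min_count=5,
        sg=0,             # CBOW
        workers=4,
    )
    model.save("iati_word2vec.model")  # hypothetical output path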

Packages

  • gensim.models (word2vec, doc2vec)

Added value

  • Should enable a more advanced search engine that matches synonyms of the user's input search terms (see the sketch below).
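
One way this could work (a sketch, not the implemented search engine): expand each user query term with its nearest neighbours in the embedding space before matching against activity descriptions. The model path is hypothetical.

    from gensim.models import Word2Vec

    model = Word2Vec.load("iati_word2vec.model")  # hypothetical path

    def expand_query(terms, topn=3):
        # Return the original terms plus similar words from the embedding space.
        expanded = set(terms)
        for term in terms:
            if term in model.wv:
                expanded.update(word for word, _ in model.wv.most_similar(term, topn=topn))
        return expanded

    print(expand_query(["water", "sanitation"]))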

Current challenges

  • Some IATI activity descriptions contain very few words, with a mean of 17 words per entry after preprocessing (a quick length check is sketched after this list).

  • Whether the data, before or after pre-processing, is suitable as a training set for a Word2Vec or Doc2Vec model.
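
A quick way to check description lengths after preprocessing (a sketch; the file name, column name, and tokenisation are assumptions):

    import re

    import pandas as pd

    df = pd.read_csv("iati_activities.csv")         # hypothetical file name
    texts = df["description"].dropna().astype(str)  # hypothetical column name

    lengths = texts.apply(lambda t: len(re.findall(r"[a-z]+", t.lower())))
    print(lengths.describe())  # mean/median words per entry after this preprocessing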

Information Searching - Saliha

See Saliha’s code base for an overview.