Preprocessing

ips_python.preprocessing.get_wordnet_pos(word): Map the word's POS (part-of-speech) tag to the first character that lemmatize() accepts
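As a rough illustration of the mapping described above, the sketch below converts a Penn Treebank tag into one of the single-character POS codes used by WordNet ("a", "n", "v", "r"). The helper name `treebank_tag_to_wordnet_pos`, the noun fallback, and the choice to take a tag rather than a word are assumptions made for this example; the actual function receives a word and presumably tags it first.

```python
# WordNet's single-character part-of-speech codes, keyed by the first
# letter of a Penn Treebank tag (for example "VBD" starts with "V").
WORDNET_POS_BY_TAG_PREFIX = {
    "J": "a",  # adjective
    "N": "n",  # noun
    "V": "v",  # verb
    "R": "r",  # adverb
}


def treebank_tag_to_wordnet_pos(tag: str, default: str = "n") -> str:
    """Hypothetical helper: map a Treebank tag to a WordNet POS character.

    Unrecognized or empty tags fall back to `default` (noun here, which is
    an assumption of this sketch).
    """
    return WORDNET_POS_BY_TAG_PREFIX.get(tag[:1].upper(), default)
```

With this sketch, a tag such as "VBD" maps to "v" and "JJ" maps to "a", which can then be passed as the POS argument of a lemmatizer.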

ips_python.preprocessing.preprocess_pipeline(df)

Default pipeline that takes the raw IATI data dump and prepares its text for vectorizing

Parameters: df – dataframe of the raw IATI data with columns including identifier, description and title
Returns: dataframe of preprocessed data with only the columns IATI_IDENTIFIER_COLUMN_NAME and DESCRIPTION_COLUMN_NAME
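The shape of this transformation (many input columns in, two columns out, text normalized) can be sketched without a dataframe library. Everything specific below is an assumption for illustration: the function name `preprocess_records`, the column name values, and the cleaning steps (lowercasing, dropping punctuation, collapsing whitespace). The real pipeline may apply different or additional steps.

```python
import re

ID_COLUMN = "iati_identifier"  # assumed value of IATI_IDENTIFIER_COLUMN_NAME
TEXT_COLUMN = "description"    # assumed value of DESCRIPTION_COLUMN_NAME


def preprocess_records(records):
    """Hypothetical stand-in for the pipeline, using a list of dicts as rows."""
    cleaned = []
    for record in records:
        text = (record.get(TEXT_COLUMN) or "").lower()
        # Replace anything that is not a letter, digit or whitespace.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
        # Collapse runs of whitespace into single spaces.
        text = " ".join(text.split())
        # Keep only the identifier and the cleaned description.
        cleaned.append({ID_COLUMN: record[ID_COLUMN], TEXT_COLUMN: text})
    return cleaned
```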

Modeling

ips_python.vectorize.create_tfidf_term_document_matrix(preprocessed_text_dataframe)

Return a vectorizer object, the TF-IDF term-document matrix and the list of words

input:: preprocessed_text_dataframe: dataframe of preprocessed text with ‘description’ column
output:: tuple: vectorizer, term_document_matrix, word_list

ips_python.vectorize.vectorize_input_text(processed_query_dataframe, vectorizer)

input:: processed_query_dataframe: dataframe of processed user text; vectorizer: TfidfVectorizer object
output:: numpy array of vectorized user input
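To make the two vectorizing steps concrete, here is a small pure-Python sketch of TF-IDF weighting: one function builds the vocabulary, IDF weights and term-document matrix from a corpus, and another projects new text onto that same vocabulary. It is not the library code, which works with a TfidfVectorizer object and dataframes. The names `build_tfidf` and `vectorize_text`, the whitespace tokenizer, and the smoothed IDF formula with L2 normalization are choices made for this example.

```python
import math
from collections import Counter


def vectorize_text(text, idf, word_list):
    """Project text onto a fixed vocabulary; unknown words are ignored."""
    counts = Counter(text.split())
    row = [counts[word] * idf[word] for word in word_list]
    norm = math.sqrt(sum(value * value for value in row))
    # L2-normalize so that rows are comparable regardless of length.
    return [value / norm for value in row] if norm else row


def build_tfidf(documents):
    """Return (idf, term_document_matrix, word_list) for a list of strings."""
    tokenized = [doc.split() for doc in documents]
    word_list = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(documents)
    # Document frequency: how many documents contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (an assumption of this sketch).
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in word_list}
    matrix = [vectorize_text(doc, idf, word_list) for doc in documents]
    return idf, matrix, word_list
```

The key design point the sketch shares with the documented functions is that the query is vectorized with the vocabulary and weights fitted on the corpus, so query vectors and matrix rows live in the same space.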

ips_python.cosine.get_cosine_similarity(processed_user_query_vector, term_document_matrix, iati_records)
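No description is given for this function. As general background only, the sketch below shows what cosine similarity between a query vector and each row of a term-document matrix computes, and how the scores could be used to rank records. The names `cosine_similarity` and `rank_by_similarity`, the plain-list inputs, and the sorted (identifier, score) return value are assumptions; the actual return format is not documented here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vector, term_document_matrix, record_ids):
    """Hypothetical helper: pair each record id with its score, best match first."""
    scores = [
        (record_id, cosine_similarity(query_vector, row))
        for record_id, row in zip(record_ids, term_document_matrix)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```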

ips_python.script.download_data(): Placeholder function indicating that a separate step must be run to obtain the data

ips_python.utils.get_data_path()

Return the absolute filepath of the data directory

Should work consistently across operating systems
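One common way to get an absolute, OS-independent directory path is to build it with pathlib rather than by joining strings with a hard-coded separator. The sketch below assumes a directory named "data" under a caller-supplied project root; both the function name `data_path` and that layout are hypothetical and not taken from the package.

```python
from pathlib import Path


def data_path(project_root: Path) -> Path:
    """Hypothetical helper: absolute path of <project_root>/data.

    The "/" operator picks the correct separator for the current OS, and
    resolve() makes the result absolute.
    """
    return (project_root / "data").resolve()
```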

ips_python.utils.get_input_path()

Return the absolute filepath of the input directory

Should work consistently across operating systems

ips_python.utils.get_timestamp_string_prefix(): Return the date and time as a string in the format YYYY_MM_DD_HH_MM_SS, for example 2019_10_14_18_14_11
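A string in this format can be produced with the standard library's strftime. The sketch below is one possible way to do it, not necessarily how the package does; the name `timestamp_prefix` and the optional `moment` argument (added so the output can be checked against a fixed time) are assumptions.

```python
from datetime import datetime

# Year, month, day, hour, minute, second, all joined by underscores.
TIMESTAMP_FORMAT = "%Y_%m_%d_%H_%M_%S"


def timestamp_prefix(moment=None):
    """Hypothetical helper: format `moment` (default: now) as a filename-safe prefix."""
    if moment is None:
        moment = datetime.now()
    return moment.strftime(TIMESTAMP_FORMAT)
```

Because the fields are zero-padded and ordered from most to least significant, prefixes built this way sort chronologically when sorted as plain strings.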