4. Custom (COICOP) classifier

This notebook demonstrates ClassificationLLM using retrieval-augmented generation (RAG) with a custom index. The demo uses the Classification of Individual Consumption According to Purpose (COICOP) index.

Code: Import methods and initialise
from sic_soc_llm import setup_logging, get_config
from sic_soc_llm.llm import ClassificationLLM
from sic_soc_llm.embedding import EmbeddingHandler

logger = setup_logging('coicop_notebook')
config = get_config()

Load COICOP or other custom index

The custom index is expected to be a text file in which each line holds one index entry in the format class_code: class_descriptive. The following code snippet loads and embeds the COICOP index. The embeddings are saved in a vector store that is used in the retrieval step of RAG-based classification in ClassificationLLM. Note that coicop_demo_llm is a placeholder and should be replaced with the LLM of your choice.

Code: Load COICOP index
index_filepath = config["lookups"]["coicop_condensed"]

# Preview the first five index entries
with open(index_filepath) as file_object:
    for _ in range(5):
        print(next(file_object))

# Embed the index into a vector store
embed = EmbeddingHandler(db_dir=None)
with open(index_filepath) as file_object:
    embed.embed_index(file_object=file_object)

# coicop_demo_llm is a placeholder: define it as the LLM of your choice before this line
coicop_llm = ClassificationLLM(embedding_handler=embed, llm=coicop_demo_llm)
CP01111: Rice

CP01112: Flours and other cereals

CP01113: Bread

CP01114: Other bakery products

CP01115: Pizza and quiche
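
To prepare a custom index of your own, a minimal sketch along the following lines may help. It writes a few entries in the class_code: class_descriptive layout described above to a temporary file and reads them back. The file name my_index.txt and the helper parse_index_line are illustrative assumptions and are not part of sic_soc_llm.

```python
import tempfile
from pathlib import Path


def parse_index_line(line: str) -> tuple[str, str]:
    """Split one 'code: description' line into (code, description)."""
    # Split on the first colon only, so descriptions may contain colons.
    code, _, description = line.partition(":")
    return code.strip(), description.strip()


# A few entries copied from the COICOP examples in this notebook.
entries = [
    ("CP01111", "Rice"),
    ("CP01113", "Bread"),
    ("CP06220", "Dental services"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this illustration.
    index_path = Path(tmp_dir) / "my_index.txt"
    index_path.write_text(
        "".join(f"{code}: {desc}\n" for code, desc in entries),
        encoding="utf-8",
    )

    # Read the file back, skipping any blank lines.
    with open(index_path, encoding="utf-8") as file_object:
        parsed = [parse_index_line(line) for line in file_object if line.strip()]

print(parsed)
```

A file written this way could then be opened and passed to embed.embed_index(file_object=...) in the same way as the COICOP index above.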

Example classification using COICOP index

The example lookup below classifies a few items using the COICOP index. The respondent data is passed as a dictionary; for other use cases, any custom survey fields can be used as keys. ClassificationLLM uses the values present in the dictionary to retrieve the relevant entries from the index and includes all the provided fields in the generative query step.
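
As a hedged illustration of passing several survey fields, the sketch below builds a respondent data dictionary from a raw record and drops empty answers. The field names (item, shop_type, notes) and the helper build_respondent_data are hypothetical, and whether empty fields should be dropped depends on your survey.

```python
def build_respondent_data(record: dict) -> dict:
    """Keep only fields with a non-empty answer, stripped of whitespace."""
    return {
        key: str(value).strip()
        for key, value in record.items()
        if value is not None and str(value).strip()
    }


# Hypothetical survey record; only "item" appears in the notebook itself.
raw_record = {
    "item": " skinny jeans ",
    "shop_type": "clothing retailer",
    "notes": "",
}

respondent_data = build_respondent_data(raw_record)
print(respondent_data)

# The resulting dictionary could then be passed on, for example:
# coicop_llm.rag_general_code(respondent_data=respondent_data)
```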

Code: Example lookup
for item in ["organic whole milk", "skinny jeans", "tooth filling"]:
    # Get response from LLM
    response, short_list = coicop_llm.rag_general_code(respondent_data={"item": item})

    # Print the output
    print("Input:")
    print(f" item:  {item}")
    print('')
    print("Response:")
    for x, y in response.__dict__.items():
        print(f' {x}: {y}')
    print(f" shortlist used in RAG: {short_list}")
    print("")
    print('===========================================')
    print("")
Input:
 item:  organic whole milk

Response:
 codable: True
 followup: None
 class_code: CP01141
 class_descriptive: Whole milk
 alt_candidates: [RagCandidate(class_code='CP01146', class_descriptive='Other milk products', likelihood=0.1), RagCandidate(class_code='CP01199', class_descriptive='Other food products n.e.c.', likelihood=0.05)]
 reasoning: The respondent's data mentions 'organic whole milk' which directly matches with the 'Whole milk' category in the classification index. Although the milk is organic, there is no separate category for organic milk in the provided subset of classification index. Therefore, the most suitable classification code is 'CP01141' for 'Whole milk'. Other possible but less likely categories could be 'Other milk products' or 'Other food products n.e.c.'.
 shortlist used in RAG: [{'distance': 0.34670400619506836, 'title': ' Whole milk\n', 'code': 'CP01141', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.5039430856704712, 'title': ' Other milk products\n', 'code': 'CP01146', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.732480525970459, 'title': ' Preserved milk\n', 'code': 'CP01143', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.7517217397689819, 'title': ' Low fat milk\n', 'code': 'CP01142', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.0075041055679321, 'title': ' Yoghurt\n', 'code': 'CP01144', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.0319886207580566, 'title': ' Other food products n.e.c.\n', 'code': 'CP01199', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1342484951019287, 'title': ' Artificial sugar substitutes\n', 'code': 'CP01186', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1362781524658203, 'title': ' Other cereal products\n', 'code': 'CP01118', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1640750169754028, 'title': ' Other bakery products\n', 'code': 'CP01114', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1741816997528076, 'title': ' Cheese and curd\n', 'code': 'CP01145', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.174346923828125, 'title': ' Confectionery products\n', 'code': 'CP01184', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1958930492401123, 'title': ' Olive oil\n', 'code': 'CP01153', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2051899433135986, 'title': ' Pharmaceutical products\n', 'code': 'CP06110', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.2462573051452637, 'title': ' Sugar\n', 'code': 'CP01181', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.24760901927948, 'title': ' Other edible oils\n', 
'code': 'CP01154', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2480995655059814, 'title': ' Soft drinks\n', 'code': 'CP01222', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2607505321502686, 'title': ' Rice\n', 'code': 'CP01111', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2730278968811035, 'title': ' Other edible animal fats\n', 'code': 'CP01155', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2755171060562134, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2833454608917236, 'title': ' Dried fruit and nuts\n', 'code': 'CP01163', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}]

===========================================

Input:
 item:  skinny jeans

Response:
 codable: False
 followup: Is the item intended for men or women?
 class_code: None
 class_descriptive: None
 alt_candidates: [RagCandidate(class_code='CP03121', class_descriptive='Garments for men', likelihood=0.5), RagCandidate(class_code='CP03122', class_descriptive='Garments for women', likelihood=0.5)]
 reasoning: The item 'skinny jeans' can be classified as either 'Garments for men' or 'Garments for women'. Without information on the intended gender for the item, a definitive classification cannot be made.
 shortlist used in RAG: [{'distance': 1.0392100811004639, 'title': ' Garments for men\n', 'code': 'CP03121', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.0935455560684204, 'title': ' Garments for women\n', 'code': 'CP03122', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.1048061847686768, 'title': ' Clothing materials\n', 'code': 'CP03110', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.1491382122039795, 'title': ' Clothing accessories\n', 'code': 'CP03132', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.183807373046875, 'title': ' Clothes washing machines\n', 'code': 'CP05312', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.2130918502807617, 'title': ' Other articles of clothing\n', 'code': 'CP03131', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.2768938541412354, 'title': ' Jams\n', 'code': 'CP01182', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2890671491622925, 'title': ' Repair and hire of clothing\n', 'code': 'CP03142', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.2973721027374268, 'title': ' Dried\n', 'code': 'CP01135', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2973721027374268, 'title': ' Dried\n', 'code': 'CP01127', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3091436624526978, 'title': ' Crisps\n', 'code': 'CP01175', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.327451467514038, 'title': ' Bed linen\n', 'code': 'CP05202', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3461003303527832, 'title': ' Bicycles\n', 'code': 'CP07130', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.3484004735946655, 'title': ' Pork\n', 'code': 'CP01122', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3745286464691162, 'title': ' Footwear for men\n', 'code': 'CP03211', 'four_digit_code': 
'CP03', 'two_digit_code': 'CP'}, {'distance': 1.3757548332214355, 'title': ' Irons\n', 'code': 'CP05323', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3820579051971436, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3839714527130127, 'title': ' Cleaning of clothing\n', 'code': 'CP03141', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.3974006175994873, 'title': ' Heaters\n', 'code': 'CP05314', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3990002870559692, 'title': ' Camper vans\n', 'code': 'CP09211', 'four_digit_code': 'CP09', 'two_digit_code': 'CP'}]

===========================================

Input:
 item:  tooth filling

Response:
 codable: True
 followup: None
 class_code: CP06220
 class_descriptive: Dental services
 alt_candidates: []
 reasoning: The respondent's data mentions 'tooth filling' which is a service provided by dentists. Therefore, the classification code 'CP06220' for 'Dental services' is the most appropriate.
 shortlist used in RAG: [{'distance': 0.9580115079879761, 'title': ' Dental services\n', 'code': 'CP06220', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.2639143466949463, 'title': ' Chocolate\n', 'code': 'CP01183', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2665995359420776, 'title': ' Sugar\n', 'code': 'CP01181', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2992041110992432, 'title': ' Tyres\n', 'code': 'CP07211', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.3504626750946045, 'title': ' Bread\n', 'code': 'CP01113', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3816959857940674, 'title': ' Crisps\n', 'code': 'CP01175', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4047653675079346, 'title': ' Cutlery\n', 'code': 'CP05402', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.413558006286621, 'title': ' Cigarettes\n', 'code': 'CP02201', 'four_digit_code': 'CP02', 'two_digit_code': 'CP'}, {'distance': 1.4301748275756836, 'title': ' Salt\n', 'code': 'CP01192', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.443652868270874, 'title': ' Hearing aids\n', 'code': 'CP06132', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.4493237733840942, 'title': ' Jewellery\n', 'code': 'CP12311', 'four_digit_code': 'CP12', 'two_digit_code': 'CP'}, {'distance': 1.4624769687652588, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4685726165771484, 'title': ' Butter\n', 'code': 'CP01151', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4742844104766846, 'title': ' Baby food\n', 'code': 'CP01193', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4815491437911987, 'title': ' Coffee\n', 'code': 'CP01211', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4819934368133545, 'title': ' 
Eggs\n', 'code': 'CP01147', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4847772121429443, 'title': ' Lubricants\n', 'code': 'CP07224', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.495471477508545, 'title': ' Whole milk\n', 'code': 'CP01141', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.5035916566848755, 'title': ' Edible offal\n', 'code': 'CP01126', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.509711503982544, 'title': ' Other cereal products\n', 'code': 'CP01118', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}]

===========================================
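
The outputs above suggest a simple way to post-process results: branch on codable, surface the followup question when one is needed, and tidy the shortlist titles, which carry a leading space and a trailing newline. The sketch below uses a stand-in DemoResponse dataclass holding only some of the fields printed above, rather than the library's own response class, and the distance cut-off of 1.1 is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoResponse:
    """Stand-in for the printed response fields (not the library's class)."""
    codable: bool
    followup: Optional[str]
    class_code: Optional[str]
    class_descriptive: Optional[str]


def summarise(response: DemoResponse) -> str:
    """Return the assigned class, or the follow-up question if uncodable."""
    if response.codable:
        return f"{response.class_code}: {response.class_descriptive}"
    return f"Follow-up needed: {response.followup}"


def tidy_shortlist(short_list: list[dict], max_distance: float = 1.1) -> list[tuple[str, str]]:
    """Keep close matches only and strip whitespace from their titles."""
    return [
        (entry["code"], entry["title"].strip())
        for entry in short_list
        if entry["distance"] <= max_distance
    ]


# Values copied from the example outputs above.
jeans = DemoResponse(False, "Is the item intended for men or women?", None, None)
filling = DemoResponse(True, None, "CP06220", "Dental services")
short_list = [
    {"distance": 0.9580115079879761, "title": " Dental services\n", "code": "CP06220"},
    {"distance": 1.2639143466949463, "title": " Chocolate\n", "code": "CP01183"},
]

print(summarise(jeans))
print(summarise(filling))
print(tidy_shortlist(short_list))
```

A distance threshold like this is only one possible design choice; an appropriate value would depend on the embedding model and on how the shortlist is used downstream.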