Code: Import methods and initialise
from sic_soc_llm import setup_logging, get_config
from sic_soc_llm.llm import ClassificationLLM
from sic_soc_llm.embedding import EmbeddingHandler
logger = setup_logging('coicop_notebook')
config = get_config()
Demonstration notebook for the ClassificationLLM using RAG with a custom index. In this demo, the Classification of Individual Consumption According to Purpose (COICOP) index is used.
The custom index is expected to be a text file in which each line contains one index entry in the format class_code : class_descriptive. The following code snippet demonstrates how to load and embed the COICOP index. The embedding is saved in a vector store that is used in the retrieval step of RAG-based classification in ClassificationLLM. Note that coicop_demo_llm should be replaced with the LLM of your choice.
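index_filepath = config["lookups"]["coicop_condensed"]

# Preview the first five entries of the index file
with open(index_filepath) as file_object:
    for _ in range(5):
        print(next(file_object))

# Embed the index entries into the vector store
embed = EmbeddingHandler(db_dir=None)
with open(index_filepath) as file_object:
    embed.embed_index(file_object=file_object)

# coicop_demo_llm is a placeholder: replace it with the LLM of your choice
coicop_llm = ClassificationLLM(embedding_handler=embed, llm=coicop_demo_llm)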
CP01111: Rice
CP01112: Flours and other cereals
CP01113: Bread
CP01114: Other bakery products
CP01115: Pizza and quiche
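The index format described above can be illustrated with a small sketch that splits each line into a code and a description. The helper parse_index_lines is hypothetical and is not part of sic_soc_llm; it only assumes that the first colon on a line separates the code from the description, and it says nothing about how EmbeddingHandler parses the file internally.

```python
from typing import Iterable, Iterator, Tuple


def parse_index_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (class_code, class_descriptive) pairs from index lines.

    Hypothetical helper for illustration only, not a sic_soc_llm function.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first colon only, so a description may itself contain colons
        code, sep, descriptive = line.partition(":")
        if not sep:
            raise ValueError(f"Missing ':' separator in index line: {line!r}")
        yield code.strip(), descriptive.strip()


# Sample lines copied from the index preview above
sample = ["CP01111: Rice\n", "CP01112: Flours and other cereals\n", "\n"]
entries = list(parse_index_lines(sample))
print(entries)
```

A check like this can be run over a custom index file before embedding it, to catch lines that lack a separator.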
The following code block demonstrates how to classify a few examples using the COICOP index. Note that the respondent data is passed as a dictionary. For different use cases, any custom survey fields can be used as keys in the dictionary. ClassificationLLM uses the values that are present in the dictionary to retrieve the relevant information from the index and includes all the provided fields in the generative query step.
for item in ["organic whole milk", "skinny jeans", "tooth filling"]:
    # Get response from LLM
    response, short_list = coicop_llm.rag_general_code(respondent_data={"item": item})

    # Print the output
    print("Input:")
    print(f" item: {item}")
    print("")
    print("Response:")
    for x, y in response.__dict__.items():
        print(f" {x}: {y}")
    print(f" shortlist used in RAG: {short_list}")
    print("")
    print("===========================================")
    print("")
Input:
item: organic whole milk
Response:
codable: True
followup: None
class_code: CP01141
class_descriptive: Whole milk
alt_candidates: [RagCandidate(class_code='CP01146', class_descriptive='Other milk products', likelihood=0.1), RagCandidate(class_code='CP01199', class_descriptive='Other food products n.e.c.', likelihood=0.05)]
reasoning: The respondent's data mentions 'organic whole milk' which directly matches with the 'Whole milk' category in the classification index. Although the milk is organic, there is no separate category for organic milk in the provided subset of classification index. Therefore, the most suitable classification code is 'CP01141' for 'Whole milk'. Other possible but less likely categories could be 'Other milk products' or 'Other food products n.e.c.'.
shortlist used in RAG: [{'distance': 0.34670400619506836, 'title': ' Whole milk\n', 'code': 'CP01141', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.5039430856704712, 'title': ' Other milk products\n', 'code': 'CP01146', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.732480525970459, 'title': ' Preserved milk\n', 'code': 'CP01143', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 0.7517217397689819, 'title': ' Low fat milk\n', 'code': 'CP01142', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.0075041055679321, 'title': ' Yoghurt\n', 'code': 'CP01144', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.0319886207580566, 'title': ' Other food products n.e.c.\n', 'code': 'CP01199', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1342484951019287, 'title': ' Artificial sugar substitutes\n', 'code': 'CP01186', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1362781524658203, 'title': ' Other cereal products\n', 'code': 'CP01118', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1640750169754028, 'title': ' Other bakery products\n', 'code': 'CP01114', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1741816997528076, 'title': ' Cheese and curd\n', 'code': 'CP01145', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.174346923828125, 'title': ' Confectionery products\n', 'code': 'CP01184', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.1958930492401123, 'title': ' Olive oil\n', 'code': 'CP01153', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2051899433135986, 'title': ' Pharmaceutical products\n', 'code': 'CP06110', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.2462573051452637, 'title': ' Sugar\n', 'code': 'CP01181', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.24760901927948, 'title': ' Other edible oils\n', 
'code': 'CP01154', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2480995655059814, 'title': ' Soft drinks\n', 'code': 'CP01222', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2607505321502686, 'title': ' Rice\n', 'code': 'CP01111', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2730278968811035, 'title': ' Other edible animal fats\n', 'code': 'CP01155', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2755171060562134, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2833454608917236, 'title': ' Dried fruit and nuts\n', 'code': 'CP01163', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}]
===========================================
Input:
item: skinny jeans
Response:
codable: False
followup: Is the item intended for men or women?
class_code: None
class_descriptive: None
alt_candidates: [RagCandidate(class_code='CP03121', class_descriptive='Garments for men', likelihood=0.5), RagCandidate(class_code='CP03122', class_descriptive='Garments for women', likelihood=0.5)]
reasoning: The item 'skinny jeans' can be classified as either 'Garments for men' or 'Garments for women'. Without information on the intended gender for the item, a definitive classification cannot be made.
shortlist used in RAG: [{'distance': 1.0392100811004639, 'title': ' Garments for men\n', 'code': 'CP03121', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.0935455560684204, 'title': ' Garments for women\n', 'code': 'CP03122', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.1048061847686768, 'title': ' Clothing materials\n', 'code': 'CP03110', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.1491382122039795, 'title': ' Clothing accessories\n', 'code': 'CP03132', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.183807373046875, 'title': ' Clothes washing machines\n', 'code': 'CP05312', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.2130918502807617, 'title': ' Other articles of clothing\n', 'code': 'CP03131', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.2768938541412354, 'title': ' Jams\n', 'code': 'CP01182', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2890671491622925, 'title': ' Repair and hire of clothing\n', 'code': 'CP03142', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.2973721027374268, 'title': ' Dried\n', 'code': 'CP01135', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2973721027374268, 'title': ' Dried\n', 'code': 'CP01127', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3091436624526978, 'title': ' Crisps\n', 'code': 'CP01175', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.327451467514038, 'title': ' Bed linen\n', 'code': 'CP05202', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3461003303527832, 'title': ' Bicycles\n', 'code': 'CP07130', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.3484004735946655, 'title': ' Pork\n', 'code': 'CP01122', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3745286464691162, 'title': ' Footwear for men\n', 'code': 'CP03211', 'four_digit_code': 
'CP03', 'two_digit_code': 'CP'}, {'distance': 1.3757548332214355, 'title': ' Irons\n', 'code': 'CP05323', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3820579051971436, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3839714527130127, 'title': ' Cleaning of clothing\n', 'code': 'CP03141', 'four_digit_code': 'CP03', 'two_digit_code': 'CP'}, {'distance': 1.3974006175994873, 'title': ' Heaters\n', 'code': 'CP05314', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.3990002870559692, 'title': ' Camper vans\n', 'code': 'CP09211', 'four_digit_code': 'CP09', 'two_digit_code': 'CP'}]
===========================================
Input:
item: tooth filling
Response:
codable: True
followup: None
class_code: CP06220
class_descriptive: Dental services
alt_candidates: []
reasoning: The respondent's data mentions 'tooth filling' which is a service provided by dentists. Therefore, the classification code 'CP06220' for 'Dental services' is the most appropriate.
shortlist used in RAG: [{'distance': 0.9580115079879761, 'title': ' Dental services\n', 'code': 'CP06220', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.2639143466949463, 'title': ' Chocolate\n', 'code': 'CP01183', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2665995359420776, 'title': ' Sugar\n', 'code': 'CP01181', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.2992041110992432, 'title': ' Tyres\n', 'code': 'CP07211', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.3504626750946045, 'title': ' Bread\n', 'code': 'CP01113', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.3816959857940674, 'title': ' Crisps\n', 'code': 'CP01175', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4047653675079346, 'title': ' Cutlery\n', 'code': 'CP05402', 'four_digit_code': 'CP05', 'two_digit_code': 'CP'}, {'distance': 1.413558006286621, 'title': ' Cigarettes\n', 'code': 'CP02201', 'four_digit_code': 'CP02', 'two_digit_code': 'CP'}, {'distance': 1.4301748275756836, 'title': ' Salt\n', 'code': 'CP01192', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.443652868270874, 'title': ' Hearing aids\n', 'code': 'CP06132', 'four_digit_code': 'CP06', 'two_digit_code': 'CP'}, {'distance': 1.4493237733840942, 'title': ' Jewellery\n', 'code': 'CP12311', 'four_digit_code': 'CP12', 'two_digit_code': 'CP'}, {'distance': 1.4624769687652588, 'title': ' Breakfast cereals\n', 'code': 'CP01117', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4685726165771484, 'title': ' Butter\n', 'code': 'CP01151', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4742844104766846, 'title': ' Baby food\n', 'code': 'CP01193', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4815491437911987, 'title': ' Coffee\n', 'code': 'CP01211', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4819934368133545, 'title': ' Eggs\n', 
'code': 'CP01147', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.4847772121429443, 'title': ' Lubricants\n', 'code': 'CP07224', 'four_digit_code': 'CP07', 'two_digit_code': 'CP'}, {'distance': 1.495471477508545, 'title': ' Whole milk\n', 'code': 'CP01141', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.5035916566848755, 'title': ' Edible offal\n', 'code': 'CP01126', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}, {'distance': 1.509711503982544, 'title': ' Other cereal products\n', 'code': 'CP01118', 'four_digit_code': 'CP01', 'two_digit_code': 'CP'}]
===========================================