API Reference

readpyne

The key functions and tools that allow the end user to use the library with no fuss

readpyne.readpyne.extract(image, classifier, context=default_context, output_folder=None, return_negatives=False, override_prediction=False)

A function that uses the core of the package to extract the required lines. By default it also classifies the lines as required or non-required.

Note

Needs refactoring.

Parameters
  • image (np.array or str) – image as loaded by cv2.imread or string path to the image on disk

  • classifier (sklearn model or str) – sklearn model for classification, or a string path to a pickled model, from which the last trained model is loaded. The current default model is loaded if nothing else is provided.

  • context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this

  • output_folder (str) – if provided will save the predicted lines

  • override_prediction (bool) – if True then it overwrites any filtering done by the model and turns this into a regular pipeline of getting just the subsets

  • expand (dict) – experimental feature. This will eventually accept a dictionary of parameters which will be trickled down into the core, making testing easier. At the moment only the vertical padding of the system can be changed.

Returns

a list of cutout lines in numpy array form if return_negatives is disabled, else a tuple containing both positive predictions and negatives (1s and 0s)

Return type

list | tuple

readpyne.readpyne.extract_from_folder(folder, **kwargs)

A folder version of the extract function.

Parameters
  • folder (str) – path to the folder from which to extract the images

  • **kwargs – any other keyword arguments to pass to the extract function. See the readpyne.readpyne.extract function documentation for more on which arguments are accepted

Returns

list containing the results of applying the extract function on each image

Return type

list

readpyne.readpyne.extras_pipe(imgs, expr={'date': '(\\d{1, 2}\\/\\d{1, 2}\\/\\d{1, 2} \\d{1, 2}:\\d{1, 2})', 'shop': '(shop1|shop2)'}, unique=True)

Pipeline function to extract the extra bits we need.

In essence provide it with all the non-item lines and a set of regular expressions. It will then go through each expression for each image and see what it can extract.

Note

Under active development. The current API format might not be pleasant to use, in which case it will be changed. Arguably this should be done by the NLP side, since it's just regular expressions.

Parameters
  • imgs (list) – list of numpy arrays representing images.

  • expr (dict[str:str]) – dictionary with a name for an extra item and its corresponding regular expression

Returns

dictionary with signature str:[str]. each item in the dictionary is going to be a list of the same length as imgs with either found outputs from the regular expression or just an empty string. each key will be the same key as the ones within expr

Return type

dict
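
As a rough illustration of the return shape described above, the sketch below applies a dictionary of expressions to lines of text that have already been OCR'd. The function name apply_expressions, the sample lines and the date pattern are assumptions made for illustration only; the real extras_pipe takes images rather than strings and its internals are not shown in this reference.

```python
import re


def apply_expressions(lines, expr):
    """Hypothetical helper: for each named expression, return one entry per line.

    An entry is the first captured group if the expression matches the line,
    otherwise an empty string, so every list has the same length as `lines`.
    """
    results = {}
    for name, pattern in expr.items():
        compiled = re.compile(pattern)
        found = []
        for line in lines:
            match = compiled.search(line)
            found.append(match.group(1) if match else "")
        results[name] = found
    return results


# Illustrative inputs only; these are not readpyne defaults.
lines = ["shop1 High Street", "12/03/19 14:05", "THANK YOU"]
expr = {
    "date": r"(\d{1,2}/\d{1,2}/\d{1,2} \d{1,2}:\d{1,2})",
    "shop": r"(shop1|shop2)",
}
extras = apply_expressions(lines, expr)
```

Each key of expr appears in the result, and each value is a list aligned with the input lines.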

readpyne.readpyne.item_pipe(imgs, cleaner=<function default_cleaner>, extractor=<function extract_re>, expr='(.+?)(-?\\d+.\\d+$)')

This function abstracts a lot of the work done by ocr_textM and other functions into one longer function that just takes in images of lines and exports a pandas dataframe.

The opinionated parts of this pipeline are the regular expression (expr) that splits the string into item and price and the default_cleaner (see docs) function that is passed to the cleaner parameter by default. Both of which you can swap out.

Note

Currently the OCR picks up minus signs as ~. This is dealt with by substituting all occurrences of ~ with - before using regular expressions to parse things. This is necessary to pick up negative values of price. This is performed by the default_cleaner function.

Parameters
  • imgs (list) – list of numpy arrays representing images.

  • cleaner (function) – a function with the type signature String -> String that performs whatever cleaning is needed prior to splitting it by the regular expr.

  • extractor (function) – a function with a signature String -> (String, String) that extracts the item and the price from a given string. by default such a function is provided (see docs for extract_re)

  • expr (str) – regular expression string that is passed to the expr parameter of the extractor

Returns

pandas dataframe containing the extracted items and prices

Return type

pandas.DataFrame
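
The cleaning and splitting steps can be sketched on plain strings as follows. This is a minimal sketch that assumes the default expr shown in the signature; clean and split_item_price are hypothetical stand-ins and are not the library's default_cleaner or extract_re, and returning None for lines without a price is an assumption.

```python
import re

# Default expression from the item_pipe signature: item text, then a price at the end.
EXPR = r"(.+?)(-?\d+.\d+$)"


def clean(text):
    """Replace the '~' that the OCR produces for minus signs and trim whitespace."""
    return text.replace("~", "-").strip()


def split_item_price(text, expr=EXPR):
    """Return (item, price) for a cleaned line, or None if no price is found."""
    match = re.search(expr, text)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2)


raw_lines = ["MILK 2L 1.99", "DISCOUNT ~0.50 ", "THANK YOU"]
rows = [split_item_price(clean(line)) for line in raw_lines]
```

Pairs like these could then be collected into a dataframe with one column for items and one for prices.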

readpyne.readpyne.shopscan(img, shops='(shop1|shop2|shop3)', header_frac=0.1, tail=False)

Takes in an image of a receipt, then based on a certain set of shop names and a header percentage will try to detect if the shopname is in the header.

Useful for detecting shop type and such.

Note

Setting header_frac to either 1 or 0 will just run the whole receipt

Parameters
  • img (numpy.ndarray) – image of the receipt

  • shops (str) – regex string similar to (shop1|shop2|shop3). In other words, the shop has to be captured in a group

  • header_frac (float) – a floating point number representing how much of the height to search

  • tail (bool) – if True the function will look at a given fraction from the bottom instead of from the top

Returns

the shop name that it found. if none are found, it returns ‘’

Return type

str
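
How header_frac and tail select the part of the receipt to search can be sketched with row indices alone. search_band is a hypothetical helper, not part of readpyne, and the rounding it uses is an assumption.

```python
def search_band(height, header_frac=0.1, tail=False):
    """Return (first_row, last_row_exclusive) of the band that would be searched.

    A fraction of 0 or 1 selects the whole receipt, as the note above states.
    """
    if header_frac <= 0 or header_frac >= 1:
        return 0, height
    rows = max(1, int(height * header_frac))  # assumed rounding: truncate, at least 1 row
    if tail:
        return height - rows, height  # band measured from the bottom
    return 0, rows  # band measured from the top


top_band = search_band(200, header_frac=0.25)
bottom_band = search_band(200, header_frac=0.25, tail=True)
whole = search_band(200, header_frac=0)
```

With an image held as a numpy array, the band would correspond to a row slice such as img[first_row:last_row_exclusive].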

readpyne.model

Model training, prediction and data making functions

readpyne.model.make_training_data(input_folder, output_folder=None, interactive=True, context={'boxes': {'min_east_confidence': 0.4}, 'expand': {'pad': (0.2, 0.3)}})

Make training data from a folder of images.

Parameters
  • input_folder (str) – Folder where the images are stored

  • output_folder (str) – Folder where the data will be saved. If not provided then the data won’t be saved and will just be returned.

  • interactive (bool (default: True)) – If True will enable an interactive labelling mode where images are shown one by one and the user is asked to label each of them 1 or 0. If output_folder is provided this will be respected.

Returns

if the data was interactively labelled, the labelled dataframe will be returned, else a boolean will be returned indicating successful completion and saving of the training data.

Return type

bool or pd.DataFrame

readpyne.model.plot_scaling(df, plot, model_type=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, splits=100)

Plot the scaling of data given a sklearn classifier and a training dataset.

Parameters
  • df (pd.DataFrame) – pandas dataframe containing the training feature columns and labels. The last column should be the label column.

  • plot (bool) – if True this will plot the data before returning

  • model_type (sklearn classifier) – this has to be a classifier following the sklearn API (uses .fit and .predict)

  • splits (int) – number of splits to do of your data

Returns

numpy array containing the size of data, accuracy score and f1 score. This is the data that is used to plot the chart internally.

Return type

np.array

readpyne.model.status(name, x, y, model)

This function is responsible for reporting the quality of the model.

Parameters
  • name (str) – A string that will be the title of report

  • x (numpy.array) – A numpy array with training features.

  • y (numpy.array) – A numpy array with labels

  • model (sklearn model) – A model to be scored

Returns

Return type

None

readpyne.model.train_model(df, report=False, save_path=None, frac_test=0.25, sk_model=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, model_params={'n_neighbors': 2}, grid_params={'n_neighbors': [2, 3, 4, 5]}, grid_cv=5, scaling_plot=False)

Given a set of data and labels, train an sklearn model.

Parameters
  • df (pd.DataFrame (or str)) –

    the dataframe containing the training features and labels. Everything up to the last column is treated as features (df.iloc[:,:-1]) and the last column as the labels (df.iloc[:,-1])

    if a string is provided, it will attempt to load the dataset from the .csv file

  • report (bool) – A boolean that tells you if you need the reporting procedure to run.

  • save_path (str) – A path to save the model to

  • frac_test (float) – A float indicating the amount of data to keep for testing.

  • sk_model (sklearn model object) – sklearn model that will be trained. An instance of it will be created and then trained.

  • model_params (dict) – A dict of parameters to be passed to the sklearn model.

  • grid_params (dict or None) – parameter dictionary for the GridSearchCV class in sklearn (see its documentation for more information). If provided, this enables a grid search for the best performing hyperparameters

  • grid_cv (int) – specifies how many times crossvalidation is to be done in GridSearchCV (ignored if grid_params is set to None)

  • scaling_plot (bool) – if True will plot the scaling of the given model with increase in data. For this to work, report has to be set to True

Returns

  • sklearn model – Trained sklearn model

  • tuple – A tuple containing the untouched test data (X_test, y_test)

readpyne.ocr

The OCR functionality of the package

readpyne.ocr.ocr_img_preprocess(img)

This is the default preprocessor that runs before the OCR. Current preprocessing (note: this list might not be kept up to date):

  1. Convert to grayscale

  2. Threshold (img, 0, 255, Binary Thresh Invert | Thresh Otsu)

  3. Median blur with size 3

  4. Dilate, kernel size (1,1), iterations = 1

  5. Bitwise not (restore normal colors)

Parameters

img (np.array) – A numpy array representing an image (3 channels)

readpyne.ocr.ocr_text(img, pyt_config='--psm 7 --oem 1', ret_preprocessed=True, preprocessor=<function ocr_img_preprocess>)

Take in an image and, after some preprocessing, pass it through tesseract.

Parameters
  • img (np.array) – A numpy array representing an image (3 channels)

  • pyt_config (str) – Config to pass to pytesseract.

  • ret_preprocessed (bool) – If True, this will return the image after preprocessing

  • preprocessor (function) –

    function with a type signature of Image -> Image that will preprocess the image before OCR

    by default the ocr_img_preprocess is used

Returns

A tuple containing the extracted text and the image (original or preprocessed)

Return type

tuple

readpyne.ocr.ocr_textM(imgs, jobs=8, **kwargs)

A mapping function across a list of images for ocr_text. Takes in a list of images and applies ocr_text.

Note

Does not return a map object. Uses a list comprehension so you get a list.

Parameters
  • imgs (list) – list of numpy arrays representing images.

  • jobs (int) – integer telling how many processes to spin up to do the extraction. If 0, none will be spun up and the work will be done without any concurrency. By default this is set to the number of cores (docs might say otherwise)

  • **kwargs – Any other keyword arguments you want to pass to ocr_text such as custom configs for tesseract and perhaps diverging preprocessing in the future.

Returns

A list of tuples [(str, np.array)]

Return type

list
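
The jobs behaviour described above can be sketched with the standard library. This sketch swaps in a thread pool from concurrent.futures instead of separate processes to keep it short, and fake_ocr is a made-up stand-in for ocr_text; neither is taken from the readpyne source.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def map_with_jobs(func, items, jobs=0, **kwargs):
    """Apply func to every item and return a list, optionally using worker threads.

    jobs=0 means no concurrency at all; any keyword arguments are forwarded to func.
    """
    call = partial(func, **kwargs)
    if jobs == 0:
        return [call(item) for item in items]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map keeps the results in the same order as the inputs
        return list(pool.map(call, items))


def fake_ocr(img, prefix=""):
    """Stand-in for ocr_text: returns a (text, image) tuple."""
    return (prefix + str(img), img)


serial = map_with_jobs(fake_ocr, [1, 2, 3], jobs=0, prefix="line-")
threaded = map_with_jobs(fake_ocr, [1, 2, 3], jobs=2, prefix="line-")
```

Both calls return a plain list of tuples in input order, which mirrors the note that a list rather than a map object is returned.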

readpyne.utils

A small selection of utility functions

readpyne.utils.build_distance_matrix(iter1, iter2, function=<function levenshtein>)

Build a distance matrix given two lists of text and a function used to compare the entries

If iter1 is of length m and iter2 is of length n then the resulting matrix is m x n

Parameters
  • iter1 (list) – iterable of any kind but developed with lists

  • iter2 (list) – iterable of any kind but developed with lists

  • function (callable) – a callable (function) that takes in 2 positional arguments and returns a value. mapped across a cartesian product of iterables to return a matrix

Returns

Return type

numpy.ndarray
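
The m x n shape can be seen in a small pure-Python sketch. It returns nested lists rather than a numpy.ndarray, and both distance_matrix and length_gap are hypothetical names used only for illustration.

```python
def distance_matrix(iter1, iter2, function):
    """Compare every entry of iter1 with every entry of iter2.

    Rows follow iter1 and columns follow iter2, so the result is m x n.
    """
    second = list(iter2)  # materialise so it can be reused for every row
    return [[function(a, b) for b in second] for a in iter1]


def length_gap(a, b):
    """Toy comparison function: absolute difference in string length."""
    return abs(len(a) - len(b))


matrix = distance_matrix(["milk", "bread", "tea"], ["milk", "eggs"], length_gap)
```

Here the first list has three entries and the second has two, so the matrix has three rows of two values each.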

readpyne.utils.fuzzy_compare_iter(iterable1, iterable2, function=<function levenshtein>, threshold=0.8, fill_val='N/A')

Used to compare two lists (or iterables). Their comparison will be based on the values output for each pair by the function given.

The idea is to first create a distance matrix using the function as the comparator, then retrieve any pairs that pass a given value threshold. If a value from the first iterable doesn’t have anything in the second iterable that passed the threshold the fill value will be returned

The method, for x = [str] of length n and y = [str] of length m:

  1. compute an n x m matrix of levenshtein distances D between each string in x and each string in y.

  2. for each row i from 1 to n of the resulting matrix D, find the maximum value in that row

  3. if the maximum value is not greater than threshold, the entry of a new variable z for that row is set to fill_val; if it is larger than threshold, the column index of the maximum is stored instead.

  4. finally any column indices generated in step (3) are turned into the corresponding values from y and the fill_val indicating missing rows will be propagated
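
The four steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions: difflib's SequenceMatcher ratio is swapped in as the similarity score in place of a levenshtein function, and fuzzy_match is a hypothetical name, not the library's implementation.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score between 0 and 1; higher means the strings are more alike."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(xs, ys, function=similarity, threshold=0.8, fill_val="N/A"):
    """For each x, return the best-scoring y if it beats threshold, else fill_val."""
    ys = list(ys)
    if not ys:
        return [fill_val for _ in xs]
    matched = []
    for x in xs:
        row = [function(x, y) for y in ys]              # step 1: one row of the matrix
        best = max(range(len(row)), key=row.__getitem__)  # step 2: index of the row maximum
        if row[best] > threshold:                       # step 3: threshold check
            matched.append(ys[best])                    # step 4: index back to a value of y
        else:
            matched.append(fill_val)
    return matched


result = fuzzy_match(
    ["MILK 1.99", "BREAD 0.89", "TOTAL"],
    ["MlLK 1.99", "BREAD 0.89"],
)
```

The returned list always has the same length as the first iterable, with fill_val marking entries that had no close enough partner.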

Example

    a = [1, 2, 3]
    b = [0, 0, 100]
    distance = lambda x, y: y - x

    fuzzy_compare_iter(a, b, function=distance, threshold=0, fill_val="MISSING")

Out:

    ["MISSING", "MISSING", 97]

Parameters
  • iterable1 (list | iterable) – something to use as a basis. the length of the return will be based on the length of this iterable

  • iterable2 (list | iterable) – essentially similar to iterable1, but will be used more as a lookup

  • function (callable) – callable function that takes 2 positional arguments and provides a score. As this is developed for text, it is set to the levenshtein distance by default

  • threshold (numeric) – value to compare the distance matrix by

  • fill_val – anything passed into this parameter will be the placeholder that is returned if a value from iterable1 has nothing close enough in iterable2

Returns

Return type

list

readpyne.utils.quality_metric(lines, gold)

Compare a list of text lines to the gold standard (ground truth) for a given receipt. This metric takes into account both the character level OCR quality and the recall of lines.

The key method here is that of the fuzzy_compare_iter function with a levenshtein (edit) distance and 2 lists of strings.

By using fuzzy_compare_iter and then filtering out values equal to the placeholder, we can check how many rows were recalled ‘close enough’ (using the default threshold value in fuzzy_compare_iter). TODO: Add parametrisation of this fn.

Once we have the lines we recalled, we then use levenshtein distance to compare them and use that as our precision metric.

Finally, the number of lines recalled ‘close enough’ divided by the number of lines in the gold standard is our recall approximation, and the precision is measured by how well the characters in each of the recalled lines match, using levenshtein distance across the whole string.

To get the final metric we use a harmonic mean of precision and recall.

Parameters
  • lines (list[str]) – list of text strings

  • gold (list[str]) – list of text strings representing ground truth

Returns

an approximation of a makeshift F1-score

Return type

float
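
The way recall and precision are combined can be sketched as follows. This is only an approximation of the description above: difflib's SequenceMatcher ratio stands in for a levenshtein-based score, the 0.8 threshold and the averaging of scores into a precision value are assumptions, and quality_sketch is a hypothetical name.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in character-level score between 0 and 1."""
    return SequenceMatcher(None, a, b).ratio()


def quality_sketch(lines, gold, threshold=0.8):
    """Harmonic mean of line recall and character-level precision."""
    scores = []
    for truth in gold:
        # best score of any extracted line against this gold line
        best = max((similarity(truth, line) for line in lines), default=0.0)
        if best > threshold:
            scores.append(best)  # this gold line counts as recalled
    if not scores:
        return 0.0
    recall = len(scores) / len(gold)
    precision = sum(scores) / len(scores)
    return 2 * precision * recall / (precision + recall)


gold = ["MILK 1.99", "BREAD 0.89"]
half = quality_sketch(["MILK 1.99"], gold)
perfect = quality_sketch(gold, gold)
nothing = quality_sketch([], gold)
```

In the first call one of two gold lines is recalled perfectly, so recall is 0.5 and precision is 1.0, giving a harmonic mean of 2/3.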

readpyne.core

the core functionality for readpyne.

readpyne.core.binit(l_arrays, n_features=100)

Takes in a list of 3 1d arrays of length n and first bins each of them into a set number of features dictated by n_features. This produces arrays of length n_features. It then stacks the three arrays into a 1d array of length 3 * n_features.

Note

This function does not check if the len of the input list of arrays is 3.

Parameters
  • l_arrays (list) – A list of arrays of length n. Usually this will be a vertically collapsed image that has been passed through cv2.split to split its channels.

  • n_features (int) – An integer telling the function how many features to produce per array. This dictates the shape of end array as the resulting array will have a length of 3 * n_features

Returns

A numpy array of length 3 * n_features

Return type

numpy.array
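
The binning and stacking can be sketched with plain lists. How readpyne actually bins is not stated here, so this sketch assumes each bin is the mean of a roughly equal slice and that n is at least n_features; bin_means and binit_sketch are hypothetical names.

```python
def bin_means(values, n_features):
    """Reduce a 1d sequence to n_features values by averaging equal slices.

    Assumes len(values) >= n_features so that no slice is empty.
    """
    n = len(values)
    binned = []
    for i in range(n_features):
        start = i * n // n_features
        end = (i + 1) * n // n_features
        chunk = values[start:end]
        binned.append(sum(chunk) / len(chunk))
    return binned


def binit_sketch(l_arrays, n_features=100):
    """Bin every array and join the results into one flat feature list."""
    stacked = []
    for values in l_arrays:
        stacked.extend(bin_means(values, n_features))
    return stacked


channels = [[0, 0, 10, 10], [1, 1, 1, 1], [2, 4, 6, 8]]
features = binit_sketch(channels, n_features=2)
```

With three input arrays and n_features=2 the result has 3 * 2 = 6 entries.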

readpyne.core.blobify(img)

Take in an image and pass it through cv2.dnn.blobFromImage

Parameters

img (np.array) – An image to be blobified by cv2. (see the cv2.dnn.blobFromImage function documentation)

Returns

Return type

np.array

readpyne.core.boxes(img, context)

Take in an image, resize it, predict boxes. Then perform expansion of the boxes to the width of the receipt and then perform non_max_supression.

Parameters
  • img (numpy.array) – A numpy array representation of an image.

  • context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this

Returns

  • np.array – The original image

  • list – Predicted subsets for the image.

readpyne.core.decode(scores, geo, min_east_confidence)

This takes the geometries and confidence scores and produces bounding box values. The inputs to this function come from the EAST model with 2 layers.

Note

THIS FUNCTION IS UNTESTED

This function borrows heavily from: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/

Parameters
  • scores (numpy.array) – numpy array of size (number of found boxes, ) indicating the assigned confidence scores for each box.

  • geo (numpy.array) – numpy array coming out from EAST that describes where the boxes are.

  • min_east_confidence (float) – confidence level at which to cut off the boxes

Returns

A numpy array of shape (n, 5) containing the confidences and box locations for the boxes that are of a certain confidence.

Return type

numpy.array

readpyne.core.expand(arr, shape, pad)

Function to expand an array of image coordinates to the full width of the image.

Parameters
  • arr (numpy.array) – A two dimensional array of coordinates which are in the form of startX, startY, endX, endY

  • shape (tuple) – A tuple containing the shape of the original receipt image. Easily accessible through the use of the .shape method on an image.

Returns

A numpy array of the same shape as the original array but with the x values in each row expanded to the width of the image.

Return type

numpy.array
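
The expansion can be sketched on plain tuples of coordinates. The pad parameter is not described above, so this sketch assumes it is a pair of fractions of the box height added above and below each box; that interpretation, the clipping to the image bounds and the name expand_boxes are all assumptions.

```python
def expand_boxes(boxes, shape, pad=(0.0, 0.0)):
    """Stretch each (startX, startY, endX, endY) box to the full image width.

    pad is assumed to be (top_fraction, bottom_fraction) of the box height.
    """
    height, width = shape[:2]  # shape is (height, width, channels)
    expanded = []
    for start_x, start_y, end_x, end_y in boxes:
        box_height = end_y - start_y
        top = max(0, int(start_y - pad[0] * box_height))
        bottom = min(height, int(end_y + pad[1] * box_height))
        # x values are replaced so the box spans the whole width
        expanded.append((0, top, width, bottom))
    return expanded


boxes = [(40, 100, 120, 120), (10, 2, 50, 12)]
wide = expand_boxes(boxes, shape=(400, 300, 3), pad=(0.5, 0.25))
```

The output keeps one row of four coordinates per input box, matching the description of the return value.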

readpyne.core.features(img, subsets)

Take an image and its subsets created from boxes and produce histogram based features for each subset.

Parameters
  • img (numpy.array) – numpy array representation of an image.

  • subsets (list) – list of numpy arrays of the subsets.

Returns

  • list – subsets originally passed into the function

  • list – list of 1d numpy arrays

readpyne.core.forward(blob, layers=['feature_fusion/Conv_7/Sigmoid', 'feature_fusion/concat_3'])

Take in a cv2 blob and pass it forward through an EAST model.

Note

The layers in the model by default are "feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"

Parameters
  • blob (cv2.blob) – image that has been blobified by cv2. (see the blobify function documentation)

  • layers (list[str]) – list of string indicating which layers of the NN to pass the image through

Returns

  • numpy.array – Scores array for each box found.

  • numpy.array – Bounding box locations from EAST

readpyne.core.get_subsets(img, boxes)

Take an image and box locations. Then cut out these boxes from the given image.

Parameters
  • img (numpy.array) – numpy array representation of an image.

  • boxes (iterable) – iterable (most likely a numpy.array) containing box coordinates.

Returns

list of subsets.

Return type

list

readpyne.core.hist(img)

Histogram creation function. This function takes an input image and then collapses it vertically by using np.mean. It does this for each channel of the image as it uses cv2.split to get each channel.

Parameters

img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is (n,m,3) where n is the height and m is the width of the image.

Returns

A list of numpy arrays. Each array will be of length m where m is the width of the input image.

Return type

list
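
The vertical collapse can be illustrated on a tiny image held as nested lists. This is a pure-Python sketch that stands in for the np.mean and cv2.split calls named above; collapse_vertically is a hypothetical name.

```python
def collapse_vertically(img):
    """Average every column of every channel over the image height.

    img is indexed as img[row][col][channel]; the result holds one list of
    length m (the width) per channel.
    """
    height = len(img)
    width = len(img[0])
    channels = len(img[0][0])
    return [
        [sum(img[row][col][ch] for row in range(height)) / height for col in range(width)]
        for ch in range(channels)
    ]


# A 2 x 2 image with 3 channels.
img = [
    [[0, 10, 20], [255, 255, 255]],
    [[2, 30, 40], [255, 255, 255]],
]
profiles = collapse_vertically(img)
```

The result is a list of three profiles, each as long as the image is wide.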

readpyne.core.process(img)

Function that performs preprocessing before histograms are created.

Parameters

img (numpy.array) – A numpy array of shape (n,m,3) containing an image. Usually this will be a subset of a larger receipt.

Returns

A numpy array representation of the input image after all the processing was applied.

Return type

numpy.array

readpyne.core.resize(img)

This function resizes an input image to the correct dimensions for the EAST model that is used for text detection.

The dimensions for EAST have to be divisible by 32. This function pads the bottom and the right side of the image by as many pixels as it needs for this to happen.

Note

The image is padded with white as the color and this is then propagated to the rest of the pipeline under normal circumstances.

Parameters

img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is (n,m,3) where n is the height and m is the width of the image.

Returns

img – The white padded image with dimensions now divisible by 32.

Return type

numpy.array
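
The amount of padding needed follows from simple modular arithmetic, sketched below. padding_for_east is a hypothetical helper that only computes the pixel counts; it does not pad an image and is not taken from the readpyne source.

```python
def padding_for_east(height, width, multiple=32):
    """Return (pad_bottom, pad_right) so both dimensions become multiples of 32."""
    pad_bottom = (-height) % multiple  # 0 when height is already a multiple
    pad_right = (-width) % multiple
    return pad_bottom, pad_right


pad_a = padding_for_east(100, 64)
pad_b = padding_for_east(33, 31)
```

For a 100 x 64 image this gives 28 extra rows and no extra columns, so the padded image is 128 x 64. With OpenCV, padding of this kind is commonly applied with cv2.copyMakeBorder using a white fill value, although this reference does not say which call the library uses.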

readpyne.core.stack(features)

Stack features. Basically take in a list containing tuples of subsets and histogram based features and stack them all up.

Parameters

features (list) – List of type [([subsets], [features]), ...]

Returns

tuple of type ([all_subsets],[all_features])

Return type

tuple
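
The input and output types can be illustrated with a short sketch. stack_sketch is a hypothetical name and the string placeholders stand in for image subsets and feature arrays.

```python
def stack_sketch(features):
    """Flatten [([subsets], [features]), ...] into ([all_subsets], [all_features])."""
    all_subsets, all_features = [], []
    for subsets, feats in features:
        all_subsets.extend(subsets)
        all_features.extend(feats)
    return all_subsets, all_features


per_image = [
    (["img1_line1", "img1_line2"], ["feat_1_1", "feat_1_2"]),
    (["img2_line1"], ["feat_2_1"]),
]
stacked_subsets, stacked_features = stack_sketch(per_image)
```

The order is preserved, so the subset at a given position still corresponds to the feature entry at the same position.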

readpyne.io

All input output functions

readpyne.io.cutout_save(path, img, subsets)

Take an image and its subsets and export it to the given path.

Parameters
  • path (str) – A path to be used to save all subsets.

  • img (numpy array) – A numpy array representing the image.

  • subsets (list) – A list of numpy arrays, one for each subset.

Returns

Return type

None

readpyne.io.export_raw_ocr(ocr_results, filename='ocr_results.txt')

Export the result of the ocr line extraction provided by ocr.ocr_textM.

Parameters
  • ocr_results (list) – list of tuples containing a str of text and a numpy.array for the image.

  • filename (str) – a string path to the file which will be used to save the ocr text. (default: "ocr_results.txt")

Returns

Return type

None

readpyne.io.get_data(folder='data/training', jobs=8, **kwargs)

Load all images in a folder.

Parameters
  • folder (str) – string path to where the images are.

  • jobs (int) – number of threads to start for the loading. By default this will be set to the number of cores, although the documentation might state otherwise. Threading is not used if this is set to 0

  • **kwargs – anything passed into the keyword arguments will be passed onto the load_validate function

Returns

list of images

Return type

list

readpyne.io.interactive_labelling(subs, features, output_folder=None)

Function with an event loop to label the subsets of training images.

Note

This function is untested.

Parameters
  • subs (list[np.array]) – list of images representing found lines in a receipt

  • features (list[np.array]) – list representing the features for line classifier

  • output_folder (str) – path to the output folder (without the final /) where the labeled data is to be saved.

Returns

if successful will return pandas dataframe with features and final column as labels

Return type

pd.DataFrame

readpyne.io.load_model(path)

Load a model from a string path.

Parameters

path (str) – A path to the model

Returns

Return type

sklearn model

readpyne.io.load_validate(filename, multiple_receipts=False)

A thin wrapper on top of cv2.imread that makes sure the image is loaded and checks whether it is sideways.

Parameters
  • filename (str) – path to the file

  • multiple_receipts (bool) – specify whether the image is a single receipt or multiple receipts on one image. defaults to False.

Returns

numpy array of the image

Return type

np.array

readpyne.io.save_images(image_list, path='outputs/training', offset=0)

Export a list of images to a given folder.

Parameters
  • image_list (list) – A list of images represented as numpy.arrays

  • path (string) – A path to the folder.

  • offset (int) – this integer dictates by how many to shift the names of the resulting images. for example if offset is 2, the images will start being numbered from 2 onwards.

Returns

Return type

None

readpyne.io.save_stack(subs, features, folder)

Get subsets and features and export them.

Parameters
  • subs (list[np.array]) – list of numpy arrays representing subsets of the image.

  • features (list[np.array]) – list of features.

Returns

  • subs – Same as input

  • features – Same as input

readpyne.io.show(img)

Use the matplotlib pyplot function to show the image.

Note

This function is untested.

Parameters

img (numpy array) – A numpy array representing the image.

Returns

Return type

None

readpyne.io.show_ocr(ocr_results)

Show OCR results from the ocr_textM function.

Parameters

ocr_results (list) – list of tuples containing a str of text and a numpy.array for the image.

Returns

Return type

None