API Reference¶

readpyne¶

The key functions and tools that allow the end user to use the library with no fuss

readpyne.readpyne.extract(image, classifier, context=default_context, output_folder=None, return_negatives=False, override_prediction=False)¶

A function that uses the core of the package to extract lines from a receipt image and, by default, classifies them into required and non-required lines.

Note

Needs refactoring.

Parameters

image (np.array or str) – image as loaded by cv2.imread or string path to the image on disk
classifier (sklearn model or str) – sklearn model for classification, or a string path to a pickled model to load. The current default model is loaded if nothing else is provided.
context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this
output_folder (str) – if provided will save the predicted lines
override_prediction (bool) – if True, overrides any filtering done by the model and turns this into a regular pipeline that just returns the subsets
expand (dict) – experimental feature. This will eventually accept a dictionary of parameters which will be trickled down into the core, making testing easier. At the moment only the vertical padding of the system can be changed

Returns

a list of cutout lines in numpy array form if return_negatives is disabled, else a tuple containing both positive predictions and negatives (1s and 0s)

Return type

list | tuple

readpyne.readpyne.extract_from_folder(folder, **kwargs)¶

A folder version of the extract function.

Parameters

folder (str) – path to the folder from which to extract the images
**kwargs – any other keyword arguments to pass to the extract function. See the readpyne.readpyne.extract function documentation for more on which arguments are accepted

Returns

list containing the results of applying the extract function on each image

Return type

list

readpyne.readpyne.extras_pipe(imgs, expr={'date': '(\\d{1, 2}\\/\\d{1, 2}\\/\\d{1, 2} \\d{1, 2}:\\d{1, 2})', 'shop': '(shop1|shop2)'}, unique=True)¶

Pipeline function to extract the extra bits we need.

In essence provide it with all the non-item lines and a set of regular expressions. It will then go through each expression for each image and see what it can extract.

Note

Under active development. It might not be nice to use in its current API format, in which case this will be changed. To be fair, this should be done by the NLP side, since it's just regular expressions.

Parameters

imgs (list) – list of numpy arrays representing images.
expr (dict[str:str]) – dictionary with a name for an extra item and its corresponding regular expression

Returns

dictionary with signature str:[str]. each item in the dictionary is going to be a list of the same length as imgs with either found outputs from the regular expression or just an empty string. each key will be the same key as the ones within expr

Return type

dict
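As a rough illustration of the idea, the sketch below applies a dictionary of regular expressions to lines of text that have already been through OCR. The helper name `extract_extras` and the date pattern are assumptions made for this example. It is not the library's implementation, and it ignores the `unique` option.

```python
import re


def extract_extras(lines, expr):
    """For each named pattern, return one entry per line:
    the first captured group if the pattern matches, otherwise ''."""
    found = {}
    for name, pattern in expr.items():
        compiled = re.compile(pattern)
        found[name] = []
        for line in lines:
            match = compiled.search(line)
            found[name].append(match.group(1) if match else "")
    return found


# Illustrative input: text lines standing in for OCR output.
lines = ["shop1 High Street", "12/03/19 14:05", "MILK 1.99"]
patterns = {
    "date": r"(\d{1,2}/\d{1,2}/\d{2} \d{1,2}:\d{2})",  # hypothetical pattern
    "shop": r"(shop1|shop2)",
}
extras = extract_extras(lines, patterns)
```

Each list in `extras` has the same length as `lines`, matching the return value described above.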

readpyne.readpyne.item_pipe(imgs, cleaner=<function default_cleaner>, extractor=<function extract_re>, expr='(.+?)(-?\\d+.\\d+$)')¶

This function abstracts a lot of the work done by ocr_textM and other functions into one longer function that just takes in images of lines and exports a pandas dataframe.

The opinionated parts of this pipeline are the regular expression (expr) that splits the string into item and price, and the default_cleaner function (see docs) that is passed to the cleaner parameter by default. You can swap out both of them.

Note

Currently the OCR picks up minus signs as ~. This is dealt with by substituting all occurrences of ~ with - before using regular expressions to parse the text, which is necessary to pick up negative prices. The substitution is performed by the default_cleaner function.

Parameters

imgs (list) – list of numpy arrays representing images.
cleaner (function) – a function with the type signature String -> String that performs whatever cleaning is needed prior to splitting it by the regular expr.
extractor (function) – a function with a signature String -> (String, String) that extracts the item and the price from a given string. by default such a function is provided (see docs for extract_re)
expr (str) – regular expression string that is passed to the expr parameter of the extractor

Returns

pandas dataframe containing the extracted items and prices

Return type

pandas.DataFrame
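To make the cleaning and splitting steps concrete, here is a minimal sketch that uses the documented default expression on plain strings. The helper names `clean_line` and `split_item_price`, the use of `re.search`, and the `('', '')` result for non-matching lines are assumptions for illustration. The library's own default_cleaner and extract_re may behave differently.

```python
import re

DEFAULT_EXPR = r"(.+?)(-?\d+.\d+$)"  # same pattern as the documented default


def clean_line(text):
    """Replace the '~' that OCR produces for minus signs and trim whitespace."""
    return text.replace("~", "-").strip()


def split_item_price(text, expr=DEFAULT_EXPR):
    """Return (item, price) as strings, or ('', '') if the pattern does not match."""
    match = re.search(expr, text)
    if match is None:
        return "", ""
    item, price = match.groups()
    return item.strip(), price


rows = [split_item_price(clean_line(line)) for line in ["MILK 2L 1.99", "REFUND ~3.50"]]
# rows == [("MILK 2L", "1.99"), ("REFUND", "-3.50")]
```

The lazy first group keeps the item name as short as possible, so the second group captures the final number on the line, including a leading minus sign.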

readpyne.readpyne.shopscan(img, shops='(shop1|shop2|shop3)', header_frac=0.1, tail=False)¶

Takes in an image of a receipt and, given a set of shop names and a header fraction, tries to detect whether one of the shop names appears in the header.

Useful for detecting shop type and such.

Note

Setting header_frac to either 1 or 0 will run the search on the whole receipt

Parameters

img (numpy.ndarray) – image of the receipt
shops (str) – regex string similar to (shop1|shop2|shop3). In other words, the shop has to be captured in a group
header_frac (float) – a floating point number representing how much of the height to search
tail (bool) – if True the function will look at a given fraction from the bottom instead of from the top

Returns

the shop name that it found. if none are found, it returns ‘’

Return type

str
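The way header_frac and tail select part of the receipt can be pictured with the following sketch. It only shows the cropping step, under the assumption that the fraction is applied to the image height and rounded to whole rows. The helper name `header_region` is hypothetical, and the OCR and regex matching that follow are not shown.

```python
import numpy as np


def header_region(img, header_frac=0.1, tail=False):
    """Return the top (or bottom, if tail=True) fraction of the image rows."""
    height = img.shape[0]
    if header_frac <= 0 or header_frac >= 1:
        return img  # documented behaviour: 0 or 1 means the whole receipt
    rows = max(1, int(round(height * header_frac)))  # rounding is an assumption
    return img[-rows:] if tail else img[:rows]


# A dummy 200-row "receipt" whose pixel values equal their row index.
receipt = np.arange(200).reshape(200, 1, 1) * np.ones((1, 4, 3))
top = header_region(receipt, header_frac=0.25)
bottom = header_region(receipt, header_frac=0.25, tail=True)
```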

readpyne.model¶

Model training, prediction and data making functions

readpyne.model.make_training_data(input_folder, output_folder=None, interactive=True, context={'boxes': {'min_east_confidence': 0.4}, 'expand': {'pad': (0.2, 0.3)}})¶

Make training data from a folder of images.

Parameters

input_folder (str) – Folder where the images are stored
output_folder (str) – Folder where the data will be saved. If not provided, the data won’t be saved and will just be returned.
interactive (bool (default: True)) – If True, enables an interactive labelling mode where images are shown one by one and the user is asked to label each of them 1 or 0. If output_folder is provided this will be respected.

Returns

if the data was interactively labelled, the labelled dataframe will be returned, else a boolean will be returned indicating successful completion and saving of the training data.

Return type

bool or pd.DataFrame

readpyne.model.plot_scaling(df, plot, model_type=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, splits=100)¶

Plot the scaling of data given a sklearn classifier and a training dataset.

Parameters

df (pd.DataFrame) – pandas dataframe containing the training feature columns and labels. The last column should be the label column.
plot (bool) – if True this will plot the data before returning
model_type (sklearn classifier) – this has to be a classifier following the sklearn API (uses .fit and .predict)
splits (int) – number of splits to do of your data

Returns

numpy array containing the size of data, accuracy score and f1 score. This is the data that is used to plot the chart internally.

Return type

np.array

readpyne.model.status(name, x, y, model)¶

This function is responsible for reporting the quality of the model.

Parameters

name (str) – A string that will be the title of report
x (numpy.array) – A numpy array with training features.
y (numpy.array) – A numpy array with labels
model (sklearn model) – A model to be scored

Returns

Return type

None

readpyne.model.train_model(df, report=False, save_path=None, frac_test=0.25, sk_model=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, model_params={'n_neighbors': 2}, grid_params={'n_neighbors': [2, 3, 4, 5]}, grid_cv=5, scaling_plot=False)¶

Given a set of data and labels, train a sklearn model.

Parameters

df (pd.DataFrame (or str)) –
the dataframe containing the training features and labels. Everything up to the last column is treated as features (df.iloc[:,:-1]) and the last column is treated as the labels (df.iloc[:,-1])

if a string is provided, it will attempt to load the dataset from the .csv file
report (bool) – A boolean that tells you if you need the reporting procedure to run.
save_path (str) – A path to save the model to
frac_test (float) – A float indicating the amount of data to keep for testing.
sk_model (sklearn model object) – sklearn model that will be trained. An instance of it will be created and then trained.
model_params (dict) – A dict of parameters to be passed to the sklearn model.
grid_params (dict or None) – parameter dictionary for the GridSearchCV class in sklearn (see its documentation) for more information. If provided this enables gridsearch to be done for best performing hyperparameters
grid_cv (int) – specifies how many times crossvalidation is to be done in GridSearchCV (ignored if grid_params is set to None)
scaling_plot (bool) – if True, plots how the given model scales as the amount of data increases. For this to work, report has to be set to True

Returns

sklearn model – Trained sklearn model
tuple – A tuple containing the untouched test data (X_test, y_test)

readpyne.ocr¶

The OCR functionality of the package

readpyne.ocr.ocr_img_preprocess(img)¶

This is the default preprocessor that runs before the OCR. Current preprocessing steps (note: this list might not always be kept up to date):

Convert to grayscale
Threshold (img, 0, 255, Binary Thresh Invert | Thresh Otsu)
Median blur with size 3
Dilate, kernel size (1,1), iterations = 1
Bitwise not (restore normal colors)

Parameters: img (np.array) – A numpy array representing an image (3 channels)

readpyne.ocr.ocr_text(img, pyt_config='--psm 7 --oem 1', ret_preprocessed=True, preprocessor=<function ocr_img_preprocess>)¶

Take in an image and, after some preprocessing, pass it through tesseract.

Parameters

img (np.array) – A numpy array representing an image (3 channels)
pyt_config (str) – Config to pass to pytesseract.
ret_preprocessed (bool) – If True, this will return the image after preprocessing
preprocessor (function) –
function with a type signature of Image -> Image that will preprocess the image before OCR

by default the ocr_img_preprocess is used

Returns

A tuple containing the extracted text and the image (original or preprocessed)

Return type

tuple

readpyne.ocr.ocr_textM(imgs, jobs=8, **kwargs)¶

A mapping function across a list of images for ocr_text. Takes in a list of images and applies ocr_text.

Note

Does not return a map object. Uses a list comprehension so you get a list.

Parameters

imgs (list) – list of numpy arrays representing images.
jobs (int) – integer telling how many processes to spin up to do the extraction. If 0, none will be spun up and the work will be done without any concurrency. By default this is set to the number of cores (the docs might say otherwise)
**kwargs – Any other keyword arguments you want to pass to ocr_text such as custom configs for tesseract and perhaps diverging preprocessing in the future.

Returns

A list of tuples [(str, np.array)]

Return type

list

readpyne.utils¶

A small selection of utility functions

readpyne.utils.build_distance_matrix(iter1, iter2, function=<function levenshtein>)¶

Build a distance matrix given two lists of text and a function used to compare the entries

If iter1 is of length m and iter2 is of length n then the resulting matrix is m x n

Parameters

iter1 (list) – iterable of any kind but developed with lists
iter2 (list) – iterable of any kind but developed with lists
function (callable) – a callable (function) that takes in 2 positional arguments and returns a value. mapped across a cartesian product of iterables to return a matrix

Returns

Return type

numpy.ndarray
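A minimal sketch of the idea, assuming the comparison function is mapped over the cartesian product of the two inputs and the results are reshaped into an m x n array. The names `edit_distance` and `distance_matrix` are hypothetical, and the small dynamic-programming edit distance only stands in for the library's levenshtein function.

```python
from itertools import product

import numpy as np


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance (illustrative stand-in)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def distance_matrix(iter1, iter2, function=edit_distance):
    """Apply function to every (row, column) pair and return an m x n array."""
    rows, cols = list(iter1), list(iter2)
    values = [function(a, b) for a, b in product(rows, cols)]
    return np.array(values).reshape(len(rows), len(cols))


matrix = distance_matrix(["milk", "bread"], ["milk", "mild", "bead"])
```

Here `matrix` has shape (2, 3): one row per entry of the first list and one column per entry of the second.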

readpyne.utils.fuzzy_compare_iter(iterable1, iterable2, function=<function levenshtein>, threshold=0.8, fill_val='N/A')¶

Used to compare two lists (or iterables). The comparison is based on the values output by the given function for each pair.

The idea is to first create a distance matrix using the function as the comparator, then retrieve any pairs that pass a given value threshold. If a value from the first iterable doesn’t have anything in the second iterable that passed the threshold the fill value will be returned

The method, where x = [str] of length n and y = [str] of length m:

1. compute an n x m matrix D of levenshtein distances between each string in x and each string in y.
2. for each row i from 1 to n of the resulting matrix D, find the maximum value in that row.
3. if the maximum value is not greater than threshold, the entry for that row is set to fill_val; if the value is larger than threshold, the column index of the maximum is returned.
4. finally, any column indices generated in step (3) are turned into the corresponding values from y, and the fill_val entries indicating missing rows are propagated.

Example

a = [1, 2, 3]
b = [0, 0, 100]
distance = lambda x, y: y - x

fuzzy_compare_iter(a, b, function=distance, threshold=0, fill_val="MISSING")

Out: ["MISSING", "MISSING", 97]

Parameters

iterable1 (list | iterable) – something to use as a basis. The length of the return value will be based on the length of this iterable
iterable2 (list | iterable) – essentially similar to iterable1, but used more as a lookup
function (callable) – callable that takes 2 positional arguments and returns a score. As this is developed for text, it is set to the levenshtein distance by default
threshold (numeric) – value to compare the distance matrix against
fill_val – placeholder that is returned when a value in iterable1 has nothing close enough in iterable2

Returns

Return type

list
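The steps of the method can be sketched as below. This follows the description as written and is not the library's implementation, so the real function may differ in details. The name `fuzzy_match` is hypothetical, and `difflib.SequenceMatcher` from the standard library is used as a stand-in scoring function that returns a similarity between 0 and 1.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; a stand-in for the library's levenshtein function."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(xs, ys, function=similarity, threshold=0.8, fill_val="N/A"):
    """For each x, return the best-scoring y if its score exceeds threshold."""
    ys = list(ys)
    results = []
    for x in xs:
        # One row of the score matrix and the column of its maximum.
        scores = [function(x, y) for y in ys]
        best_col = max(range(len(scores)), key=scores.__getitem__, default=None)
        # Keep the matching value from ys, or fall back to the fill value.
        if best_col is None or scores[best_col] <= threshold:
            results.append(fill_val)
        else:
            results.append(ys[best_col])
    return results


matched = fuzzy_match(["MILK 1.9g", "BREAD 0.89"], ["EGGS 2.50", "MILK 1.99"])
# matched == ["MILK 1.99", "N/A"]
```

The result always has the same length as the first input, which mirrors the description of iterable1 above.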

readpyne.utils.quality_metric(lines, gold)¶

Compare a list of text lines to the gold standard (ground truth) for a given receipt. This metric takes into account both the character level OCR quality and the recall of lines.

The key method here is that of the fuzzy_compare_iter function with a levenshtein (edit) distance and 2 lists of strings.

By using fuzzy_compare_iter and then filtering out the placeholder values, we can check how many rows were recalled ‘close enough’ (using the default threshold value in fuzzy_compare_iter). TODO: Add parametrisation of this fn.

Once we have the lines we recalled, we then use levenshtein distance to compare them and use that as our precision metric.

Finally, the number of lines recalled ‘close enough’ divided by the number of lines in the gold standard is our recall approximation, and the precision is measured by how well the characters in each of the recalled lines match, using levenshtein distance across the whole string.

To get the final metric we use a harmonic mean of precision and recall.

Parameters

lines (list[str]) – list of text strings
gold (list[str]) – list of text strings representing ground truth

Returns

an approximation of a makeshift F1-score

Return type

float
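The final combination step can be written down in a few lines. The sketch below assumes precision and recall have already been computed as numbers between 0 and 1. The function names and the example figures are illustrative only, and returning 0.0 when both inputs are zero is an assumption.

```python
def recall_estimate(n_recalled, n_gold):
    """Lines recalled 'close enough' divided by lines in the gold standard."""
    return n_recalled / n_gold if n_gold else 0.0


def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall (the usual F1 formula)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: 3 of 4 gold lines recalled, character-level precision of 0.9.
score = harmonic_mean(precision=0.9, recall=recall_estimate(3, 4))
```

Because the harmonic mean is dominated by the smaller of the two values, a receipt with well-read characters but many missing lines still scores poorly.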

readpyne.core¶

the core functionality for readpyne.

readpyne.core.binit(l_arrays, n_features=100)¶

Takes in a list of three 1d arrays of length n and first bins each of them into a set number of features dictated by n_features, which produces arrays of length n_features. It then stacks the three arrays into a single 1d array of length 3 * n_features

Note

This function does not check if the len of the input list of arrays is 3.

Parameters

l_arrays (list) – A list of arrays of length n. Usually this will be a vertically collapsed image that has been passed through cv2.split to split its channels.
n_features (int) – An integer telling the function how many features to produce per array. This dictates the shape of end array as the resulting array will have a length of 3 * n_features

Returns

A numpy array of length 3 * n_features

Return type

numpy.array
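The binning rule itself is not spelled out above, so the following sketch makes an assumption: each array is split into n_features roughly equal chunks and every chunk is replaced by its mean. The name `bin_and_stack` is hypothetical, and the sketch expects each input array to have at least n_features elements.

```python
import numpy as np


def bin_and_stack(l_arrays, n_features=100):
    """Bin each 1d array into n_features values, then join them end to end."""
    binned = []
    for arr in l_arrays:
        chunks = np.array_split(np.asarray(arr, dtype=float), n_features)
        # Assumption: a bin is summarised by the mean of its values.
        binned.append(np.array([chunk.mean() for chunk in chunks]))
    return np.concatenate(binned)


channels = [np.arange(200), np.ones(200), np.zeros(200)]
features = bin_and_stack(channels, n_features=100)
```

With three input arrays the result has length 3 * n_features, whatever the original length n was.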

readpyne.core.blobify(img)¶

Take in an image and pass it through cv2.dnn.blobFromImage

Parameters: img (np.array) – An image to be blobified.
Returns: An image that has been blobified by cv2. (see the cv2.dnn.blobFromImage function documentation)
Return type: np.array

readpyne.core.boxes(img, context)¶

Take in an image, resize it and predict boxes. Then expand the boxes to the width of the receipt and perform non-max suppression.

Parameters

img (numpy.array) – A numpy array representation of an image.
context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this

Returns

np.array – The original image
list – Predicted subsets for the image.

readpyne.core.decode(scores, geo, min_east_confidence)¶

This takes the geometries and confidence scores and produces bounding box values. The inputs to this function come from the EAST model with 2 layers.

Note

THIS FUNCTION IS UNTESTED

This function borrows heavily from: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/

Parameters

scores (numpy.array) – numpy array of size (number of found boxes, ) indicating the assigned confidence scores for each box.
geo (numpy.array) – numpy array coming out from EAST that describes where the boxes are.
min_east_confidence (float) – confidence level at which to cut off the boxes

Returns

A numpy array of shape (n, 5) containing the confidences and box locations for the boxes that are of a certain confidence.

Return type

numpy.array

readpyne.core.expand(arr, shape, pad)¶

Function to expand an array of image coordinates to the full width of the image.

Parameters

arr (numpy.array) – A two dimensional array of coordinates which are in the form of startX, startY, endX, endY
shape (tuple) – A tuple containing the shape of the original receipt image. Easily accessible through the .shape attribute of an image.

Returns

A numpy array of the same shape as the original array but with the x values in each row expanded to the width of the image.

Return type

numpy.array
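The horizontal expansion can be pictured with this sketch, which assumes rows of (startX, startY, endX, endY) and an image shape of (height, width, channels). The name `expand_to_width` is hypothetical, and the pad argument is left out because its exact effect is not described here.

```python
import numpy as np


def expand_to_width(arr, shape):
    """Stretch every box horizontally so it spans the full image width."""
    width = shape[1]  # shape is (height, width, channels)
    expanded = np.array(arr, copy=True)
    expanded[:, 0] = 0      # startX moves to the left edge
    expanded[:, 2] = width  # endX moves to the right edge
    return expanded


boxes = np.array([[10, 5, 50, 20], [30, 40, 60, 55]])
full_width = expand_to_width(boxes, shape=(100, 80, 3))
```

The y values are untouched, so each expanded box still covers the same text line.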

readpyne.core.features(img, subsets)¶

Take an image and its subsets created from boxes and produce histogram based features for each subset.

Parameters

img (numpy.array) – numpy array representation of an image.
subsets (list) – list of numpy arrays of the subsets.

Returns

list – subsets originally passed into the function
list – list of 1d numpy arrays

readpyne.core.forward(blob, layers=['feature_fusion/Conv_7/Sigmoid', 'feature_fusion/concat_3'])¶

Take in a cv2 blob and pass it forward through an EAST model.

Note

The layers in the model by default are "feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"

Parameters

blob (cv2.blob) – image that has been blobified by cv2. (see the blobify function documentation)
layers (list[str]) – list of strings indicating which layers of the NN to pass the image through

Returns

numpy.array – Scores array for each box found.
numpy.array – Bounding box locations from EAST

readpyne.core.get_subsets(img, boxes)¶

Take an image and box locations. Then cut out these boxes from the given image.

Parameters

img (numpy.array) – numpy array representation of an image.
boxes (iterable) – iterable (most likely a numpy.array) containing box coordinates.

Returns

list of subsets.

Return type

list
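Cutting boxes out of an image is plain numpy slicing. The sketch below assumes each box is given as (startX, startY, endX, endY) in pixel coordinates, the same order used by expand. The name `cut_subsets` is hypothetical.

```python
import numpy as np


def cut_subsets(img, boxes):
    """Return one cropped array per box; rows are indexed by y, columns by x."""
    subsets = []
    for start_x, start_y, end_x, end_y in boxes:
        subsets.append(img[int(start_y):int(end_y), int(start_x):int(end_x)])
    return subsets


image = np.zeros((100, 80, 3), dtype=np.uint8)
crops = cut_subsets(image, np.array([[0, 5, 80, 20], [0, 40, 80, 55]]))
```

Note that the slices are views on the original array, so copy them before modifying if the source image must stay unchanged.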

readpyne.core.hist(img)¶

Histogram creation function. This function takes an input image and then collapses it vertically by using np.mean. It does this for each channel of the image as it uses cv2.split to get each channel.

Parameters: img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is (n,m,3) where n is the height and m is the width of the image.
Returns: A list of numpy arrays. Each array will be of length m where m is the width of the input image.
Return type: list
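A numpy-only sketch of the vertical collapse is shown below. It slices the channels directly instead of calling cv2.split, which is an assumption made to keep the example free of OpenCV. The name `channel_profiles` is hypothetical.

```python
import numpy as np


def channel_profiles(img):
    """Collapse an (n, m, 3) image vertically: one length-m mean profile per channel."""
    return [img[:, :, channel].mean(axis=0) for channel in range(img.shape[2])]


# Dummy 4 x 6 image: channel 1 is constant, channel 2 has one bright row.
sample = np.zeros((4, 6, 3))
sample[:, :, 1] = 10
sample[0, :, 2] = 8
profiles = channel_profiles(sample)
```

Each profile has one value per image column, so its length equals the image width m.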

readpyne.core.process(img)¶

Function that performs preprocessing before histograms are created.

Parameters: img (numpy.array) – A numpy array of shape (n,m,3) containing an image. Usually this will be a subset of a larger receipt.
Returns: A numpy array representation of the input image after all the processing was applied.
Return type: numpy.array

readpyne.core.resize(img)¶

This function resizes an input image to the correct dimensions for the EAST model that is used for text detection.

The dimensions for EAST have to be divisible by 32. This function pads the bottom and the right side of the image by as many pixels as it needs for this to happen.

Note

The image is padded with white as the color, and under normal circumstances this padding is then propagated to the rest of the pipeline.

Parameters: img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is (n,m,3) where n is the height and m is the width of the image.
Returns: img – The white padded image with dimensions now divisible by 32.
Return type: numpy.array
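The padding arithmetic can be sketched with np.pad as follows. The name `pad_to_multiple` and the fill value of 255 for white are assumptions based on the description above; the library may implement the padding differently.

```python
import numpy as np


def pad_to_multiple(img, multiple=32, fill=255):
    """Pad the bottom and right edges so height and width are multiples of `multiple`."""
    height, width = img.shape[:2]
    pad_bottom = (-height) % multiple  # 0 when height is already a multiple
    pad_right = (-width) % multiple
    return np.pad(
        img,
        ((0, pad_bottom), (0, pad_right), (0, 0)),
        mode="constant",
        constant_values=fill,
    )


receipt = np.zeros((50, 70, 3), dtype=np.uint8)
padded = pad_to_multiple(receipt)  # shape becomes (64, 96, 3)
```

Because only the bottom and right edges grow, pixel coordinates in the original image remain valid in the padded one.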

readpyne.core.stack(features)¶

Stack features. Basically take in a list containing tuples of subsets and histogram based features and stack them all up.

Parameters: features (list) – List of type [([subsets], [features]), ...]
Returns: tuple of type ([all_subsets],[all_features])
Return type: tuple
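In plain Python the stacking amounts to flattening two parallel lists. The sketch below uses a hypothetical name, `stack_features`, and placeholder strings and numbers where the real pipeline would hold numpy arrays.

```python
def stack_features(features):
    """Turn [([subsets], [features]), ...] into ([all_subsets], [all_features])."""
    all_subsets, all_features = [], []
    for subsets, feats in features:
        all_subsets.extend(subsets)
        all_features.extend(feats)
    return all_subsets, all_features


per_image = [(["line_a", "line_b"], [1, 2]), (["line_c"], [3])]
stacked = stack_features(per_image)
# stacked == (["line_a", "line_b", "line_c"], [1, 2, 3])
```

The order of subsets and features is preserved, so the i-th feature still belongs to the i-th subset.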

readpyne.io¶

All input output functions

readpyne.io.cutout_save(path, img, subsets)¶

Take an image and its subsets and export it to the given path.

Parameters

path (str) – A path to be used to save all subsets.
img (numpy array) – A numpy array representing the image.
subsets (list) – A list of numpy arrays, one for each subset.

Returns

Return type

None

readpyne.io.export_raw_ocr(ocr_results, filename='ocr_results.txt')¶

Export the result of the ocr line extraction provided by ocr.ocr_textM.

Parameters

ocr_results (list) – list of tuples containing a str of text and a numpy.array for the image.
filename (str) – a string path to the file which will be used to save the ocr text. (default: "ocr_results.txt")

Returns

Return type

None

readpyne.io.get_data(folder='data/training', jobs=8, **kwargs)¶

Load all images in a folder.

Parameters

folder (str) – string path to where the images are.
jobs (int) – number of threads to start for the loading. By default this is set to the number of cores, although the documentation might state otherwise. Threading is not used if this is set to 0
**kwargs – anything passed into the keyword arguments will be passed onto the load_validate function

Returns

list of images

Return type

list
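The jobs behaviour described above, threads when jobs is positive and a plain loop when it is 0, can be sketched with the standard library. The name `load_all` and the `loader` argument are hypothetical; in the library the per-file work is done by load_validate.

```python
from concurrent.futures import ThreadPoolExecutor


def load_all(paths, loader, jobs=8):
    """Apply loader to every path, using a thread pool unless jobs is 0."""
    paths = list(paths)
    if jobs == 0:
        # No threading: load the files one after another.
        return [loader(path) for path in paths]
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map returns results in the same order as the input paths.
        return list(pool.map(loader, paths))


# str.upper stands in for an image loader so the sketch runs without files.
sequential = load_all(["a.jpg", "b.jpg", "c.jpg"], str.upper, jobs=0)
threaded = load_all(["a.jpg", "b.jpg", "c.jpg"], str.upper, jobs=2)
```

Both calls return the results in input order, so switching threading on or off does not change which image ends up at which index.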

readpyne.io.interactive_labelling(subs, features, output_folder=None)¶

Function with an event loop to label the subsets of training images.

Note

This function is untested.

Parameters

subs (list[np.array]) – list of images representing found lines in a receipt
features (list[np.array]) – list representing the features for line classifier
output_folder (str) – path to the output folder (without the final /) where the labeled data is to be saved.

Returns

if successful will return pandas dataframe with features and final column as labels

Return type

pd.DataFrame

readpyne.io.load_model(path)¶

Load a model from a string path.

Parameters: path (str) – A path to the model
Returns
Return type: sklearn model

readpyne.io.load_validate(filename, multiple_receipts=False)¶

A thin wrapper on top of cv2.imread that makes sure the image is loaded and checks whether it is sideways.

Parameters

filename (str) – path to the file
multiple_receipts (bool) – specify whether the image is a single receipt or multiple receipts on one image. defaults to False.

Returns

numpy array of the image

Return type

np.array

readpyne.io.save_images(image_list, path='outputs/training', offset=0)¶

Export a list of images to a given folder.

Parameters

image_list (list) – A list of images represented as numpy.arrays
path (string) – A path to the folder.
offset (int) – this integer dictates by how many to shift the names of the resulting images. for example if offset is 2, the images will start being numbered from 2 onwards.

Returns

Return type

None

readpyne.io.save_stack(subs, features, folder)¶

Get subsets and features and export them.

Parameters

subs (list[np.array]) – list of numpy arrays representing subsets of the image.
features (list[np.array]) – list of features.

Returns

subs – Same as input
features – Same as input

readpyne.io.show(img)¶

Use the matplotlib pyplot function to show the image.

Note

This function is untested.

Parameters: img (numpy array) – A numpy array representing the image.
Returns
Return type: None

readpyne.io.show_ocr(ocr_results)¶

Show OCR results from the ocr_textM function.

Parameters: ocr_results (list) – list of tuples containing a str of text and a numpy.array for the image.
Returns
Return type: None
