API Reference
readpyne
The key functions and tools that allow the end user to use the library with no fuss.
-
readpyne.readpyne.
extract
(image, classifier, context=default_context, output_folder=None, return_negatives=False, override_prediction=False) A function that uses the core of the package to extract the required lines and, by default, classifies lines as required or non-required.
Note
Needs refactoring.
- Parameters
image (np.array or str) – image as loaded by
cv2.imread
or string path to the image on disk
classifier (sklearn model or str) – sklearn model for classification or a string path to a pickled model (loads the last trained model). The current default model is loaded if nothing else is provided.
context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this
output_folder (str) – if provided will save the predicted lines
override_prediction (bool) – if
True
then it overrides any filtering done by the model and turns this into a regular pipeline that returns just the subsets
expand (dict) – experimental feature. This will eventually accept a dictionary of parameters which will be trickled down into the core, making testing easier. At the moment only the vertical padding of the system can be changed.
- Returns
a list of cutout lines in numpy array form if
return_negatives
is disabled, else a tuple containing both positive predictions and negatives (1s and 0s)
- Return type
list | tuple
-
readpyne.readpyne.
extract_from_folder
(folder, **kwargs) A folder version of the
extract
function.
- Parameters
folder (str) – path to the folder from which to extract the images
**kwargs – any other keyword arguments to pass to the extract function. See the
readpyne.model.extract
function documentation for more on which arguments are accepted
- Returns
list containing the results of applying the
extract
function on each image
- Return type
list
-
readpyne.readpyne.
extras_pipe
(imgs, expr={'date': '(\\d{1, 2}\\/\\d{1, 2}\\/\\d{1, 2} \\d{1, 2}:\\d{1, 2})', 'shop': '(shop1|shop2)'}, unique=True) Pipeline function to extract the extra bits we need.
In essence provide it with all the non-item lines and a set of regular expressions. It will then go through each expression for each image and see what it can extract.
Note
Under active development; it might not be nice to use in the current API format, in which case this will be changed. To be fair, this should be done by the NLP side, since it's just regular expressions.
- Parameters
imgs (list) – list of numpy arrays representing images.
expr (dict[str:str]) – dictionary with a name for an extra item and its corresponding regular expression
- Returns
dictionary with signature
str:[str]
. Each item in the dictionary is a list of the same length as imgs, holding either the outputs found by the regular expression or an empty string. Each key is the same as the corresponding key in
expr
- Return type
dict
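As a rough illustration of the return shape described above, the sketch below applies a dictionary of regular expressions to text that has already been read from each line image. It is not the library's implementation: `extras_from_text` and the sample strings are hypothetical, and the OCR step is skipped entirely. Note that Python's `re` module treats a quantifier written with a space, such as `{1, 2}`, as literal text, so the sketch writes `{1,2}`.

```python
import re


def extras_from_text(texts, expr):
    """Hypothetical helper: apply each named regex to each text.

    Returns a dict of lists, one entry per input text, holding the first
    captured group or an empty string when nothing matched.
    """
    found = {name: [] for name in expr}
    for text in texts:
        for name, pattern in expr.items():
            match = re.search(pattern, text)
            found[name].append(match.group(1) if match else "")
    return found


# Patterns modelled on the defaults in the signature, with `{1,2}` quantifiers.
expr = {
    "date": r"(\d{1,2}/\d{1,2}/\d{1,2} \d{1,2}:\d{1,2})",
    "shop": r"(shop1|shop2)",
}
texts = ["shop1 high street", "01/02/19 14:35", "thank you"]  # made-up OCR output
result = extras_from_text(texts, expr)
```

Each list in `result` has the same length as `texts`, which mirrors the contract described in the Returns section.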
-
readpyne.readpyne.
item_pipe
(imgs, cleaner=<function default_cleaner>, extractor=<function extract_re>, expr='(.+?)(-?\\d+.\\d+$)') This function abstracts a lot of the work done by
ocr_textM
and other functions into one longer function that just takes in images of lines and exports a pandas dataframe.
The opinionated parts of this pipeline are the regular expression (
expr
) that splits the string into item and price and the
default_cleaner
function (see docs) that is passed to the
cleaner
parameter by default. Both of these can be swapped out.
Note
Currently the OCR picks up minus signs as
~
. This is dealt with by substituting all occurrences of
~
with
-
before using regular expressions to parse things. This is necessary to pick up negative values of price. This is performed by the
default_cleaner
function.
- Parameters
imgs (list) – list of numpy arrays representing images.
cleaner (function) – a function with the type signature
String -> String
that performs whatever cleaning is needed prior to splitting the string by the regular expression.
extractor (function) – a function with the signature
String -> (String, String)
that extracts the item and the price from a given string. By default such a function is provided (see docs for
extract_re
)
expr (str) – the regular expression string to pass as the
expr
parameter to the extractor
- Returns
pandas dataframe containing the extracted items and prices
- Return type
pandas.DataFrame
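The cleaner and extractor contracts described above can be sketched on plain strings as follows. This is only an illustration under assumptions: `clean_sketch` and `extract_sketch` are hypothetical stand-ins, not the library's `default_cleaner` or `extract_re`, and returning a pair of empty strings when nothing matches is an assumed behaviour. The pattern is the default `expr` from the signature.

```python
import re

# Default pattern from the signature above: lazy item text, then a trailing price.
EXPR = r"(.+?)(-?\d+.\d+$)"


def clean_sketch(text):
    """String -> String: swap the OCR's '~' for '-' so negative prices parse."""
    return text.replace("~", "-").strip()


def extract_sketch(text, expr=EXPR):
    """String -> (String, String): split a cleaned line into item and price."""
    match = re.search(expr, text)
    if match is None:
        return ("", "")  # assumed fallback, not confirmed by the docs
    item, price = match.groups()
    return (item.strip(), price)


lines = ["MILK 1.99", "DISCOUNT ~0.50"]  # made-up OCR output
rows = [extract_sketch(clean_sketch(line)) for line in lines]
```

A list of `(item, price)` pairs like `rows` could then be handed to `pandas.DataFrame` to get the tabular output the function describes.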
-
readpyne.readpyne.
shopscan
(img, shops='(shop1|shop2|shop3)', header_frac=0.1, tail=False) Takes in an image of a receipt, then, based on a set of shop names and a header fraction, tries to detect whether a shop name appears in the header.
Useful for detecting shop type and such.
Note
Setting
header_frac
to either 1 or 0 will just run the whole receipt
- Parameters
img (numpy.ndarray) – image of the receipt
shops (str) – regex string similar to
(shop1|shop2|shop3)
. In other words, the shop has to be captured in a group
header_frac (float) – a floating point number representing how much of the height to search
tail (bool) – if
True
the function will look at a given fraction from the bottom instead of from the top
- Returns
the shop name that it found. If none is found, it returns an empty string ('')
- Return type
str
readpyne.model
Model training, prediction and data making functions
-
readpyne.model.
make_training_data
(input_folder, output_folder=None, interactive=True, context={'boxes': {'min_east_confidence': 0.4}, 'expand': {'pad': (0.2, 0.3)}}) Make training data from a folder of images.
- Parameters
input_folder (str) – Folder where the images are stored
output_folder (str) – Folder where the data will be saved. If not provided, the data won’t be saved and will just be returned.
interactive (bool (default: True)) – If
True
will enable an interactive labelling mode where images are shown one by one and the user is asked to label each of them 1 or 0. If
output_folder
is provided this will be respected.
- Returns
if the data was interactively labelled, the labelled dataframe will be returned, else a boolean will be returned indicating successful completion and saving of the training data.
- Return type
bool or pd.DataFrame
-
readpyne.model.
plot_scaling
(df, plot, model_type=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, splits=100) Plot the scaling of data given a sklearn classifier and a training dataset.
- Parameters
df (pd.DataFrame) – pandas dataframe containing the training feature columns and labels. The last column should be the label column.
plot (bool) – if True this will plot the data before returning
model_type (sklearn classifier) – this has to be a classifier following the sklearn API (uses
.fit
and
.predict
)
splits (int) – number of splits to do of your data
- Returns
numpy array containing the
size of data
,
accuracy
score and
f1
score. This is the data that is used to plot the chart internally.
- Return type
np.array
-
readpyne.model.
status
(name, x, y, model) This function is responsible for reporting the quality of the model.
- Parameters
name (str) – A string that will be the title of report
x (numpy.array) – A numpy array with training features.
y (numpy.array) – A numpy array with labels
model (sklearn model) – A model to be scored
- Returns
- Return type
None
-
readpyne.model.
train_model
(df, report=False, save_path=None, frac_test=0.25, sk_model=<class 'sklearn.neighbors.classification.KNeighborsClassifier'>, model_params={'n_neighbors': 2}, grid_params={'n_neighbors': [2, 3, 4, 5]}, grid_cv=5, scaling_plot=False) Given a set of data and labels, train a sklearn model.
- Parameters
df (pd.DataFrame (or str)) –
the dataframe containing the training features and labels. Everything up to the last column is treated as features (
df.iloc[:,:-1]
) and the last column as the labels (
df.iloc[:,-1]
)
if a string is provided, it will attempt to load the dataset from the
.csv
file
report (bool) – A boolean that controls whether the reporting procedure is run.
save_path (str) – A path to save the model to
frac_test (float) – A float indicating the amount of data to keep for testing.
sk_model (sklearn model object) – sklearn model that will be trained. An instance of it will be created and then trained.
model_params (dict) – A dict of parameters to be passed to the sklearn model.
grid_params (dict or None) – parameter dictionary for the GridSearchCV class in sklearn (see its documentation) for more information. If provided this enables gridsearch to be done for best performing hyperparameters
grid_cv (int) – specifies how many times crossvalidation is to be done in GridSearchCV (ignored if grid_params is set to None)
scaling_plot (bool) – if
True
will plot the scaling of the given model with increase in data. For this to work,
report
has to be set to True
- Returns
sklearn model – Trained sklearn model
tuple – A tuple containing the untouched test data
(X_test, y_test)
readpyne.ocr
The OCR functionality of the package
-
readpyne.ocr.
ocr_img_preprocess
(img) This is the default preprocessor that runs before the OCR. Current preprocessing steps (note: this list might not be kept up to date):
Convert to grayscale
Threshold (img, 0, 255, Binary Thresh Invert | Thresh Otsu)
Median blur with size 3
Dilate, kernel size (1,1), iterations = 1
Bitwise not (restore normal colors)
- Parameters
img (np.array) – A numpy array representing an image (3 channels)
-
readpyne.ocr.
ocr_text
(img, pyt_config='--psm 7 --oem 1', ret_preprocessed=True, preprocessor=<function ocr_img_preprocess>) Take in an image and, after some preprocessing, pass it through tesseract.
- Parameters
img (np.array) – A numpy array representing an image (3 channels)
pyt_config (str) – Config to pass to pytesseract.
ret_preprocessed (bool) – If True, this will return the image after preprocessing
preprocessor (function) –
function with a type signature of
Image -> Image
that will preprocess the image before
OCR
. By default
ocr_img_preprocess
is used
- Returns
A tuple containing the extracted text and the image (original or preprocessed)
- Return type
tuple
-
readpyne.ocr.
ocr_textM
(imgs, jobs=8, **kwargs) A mapping function across a list of images for
ocr_text
. Takes in a list of images and applies ocr_text to each.
Note
Does not return a
map
object. Uses a list comprehension so you get a
list
.
- Parameters
imgs (list) – list of numpy arrays representing images.
jobs (int) – integer telling how many processes to spin up to do the extraction. If
0
, none will be spun up and it will be done without any concurrency. By default this is set to the number of cores (the docs might say otherwise).
**kwargs – Any other keyword arguments you want to pass to
ocr_text
such as custom configs for tesseract and perhaps diverging preprocessing in the future.
- Returns
A list of tuples
[(str, np.array)]
- Return type
list
readpyne.utils
A small selection of utility functions
-
readpyne.utils.
build_distance_matrix
(iter1, iter2, function=<function levenshtein>) Build a distance matrix given two lists of text and a function used to compare the entries.
If iter1 is of length
m
and iter2 is of length
n
then the resulting matrix is
m x n
- Parameters
iter1 (list) – iterable of any kind but developed with lists
iter2 (list) – iterable of any kind but developed with lists
function (callable) – a callable (function) that takes in 2 positional arguments and returns a value. mapped across a cartesian product of iterables to return a matrix
- Returns
- Return type
numpy.ndarray
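The shape contract above (an `m x n` matrix built from the cartesian product of the two iterables) can be sketched in a few lines. This is an illustration only: `distance_matrix_sketch` is a hypothetical name, it returns nested lists rather than a `numpy.ndarray`, and the absolute-difference comparator stands in for the levenshtein default.

```python
def distance_matrix_sketch(iter1, iter2, function):
    """Hypothetical helper: one row per item of iter1, one column per item of iter2."""
    columns = list(iter2)  # materialise so it can be reused for every row
    return [[function(a, b) for b in columns] for a in iter1]


# 3 items against 2 items gives a 3 x 2 matrix.
matrix = distance_matrix_sketch([1, 2, 3], [10, 20], lambda a, b: abs(a - b))
```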
-
readpyne.utils.
fuzzy_compare_iter
(iterable1, iterable2, function=<function levenshtein>, threshold=0.8, fill_val='N/A') Used to compare two lists (or iterables). Their comparison is based on the values output for each pair by the
function
given.
The idea is to first create a distance matrix using the function as the comparator, then retrieve any pairs that pass a given threshold value. If a value from the first iterable has nothing in the second iterable that passes the threshold, the fill value is returned.
The method: x = [str] of length n, y = [str] of length m
1. compute an n x m matrix of levenshtein distances
D
between each string in
x
and each string in
y
.
2. for each row from
i
to
n
in the resulting matrix D, map across each row and find the maximum value
3. if the maximum value is not greater than
threshold
, the value of a new variable
z
will be set to
fill_val
; if the value is larger than
threshold
then the column index is returned.
4. finally any column indices generated in step (3) are turned into the corresponding values from
y
and the
fill_val
indicating missing rows will be propagated
Example
a = [ 1, 2, 3 ]
b = [ 0, 0, 100 ]
distance = lambda x, y: y - x
fuzzy_compare_iter(a, b, function=distance, threshold=0, fill_val="MISSING")
- Out:
["MISSING", "MISSING", 97]
- Parameters
iterable1 (list | iterable) – something to use as a basis. the length of the return will be based on the length of this iterable
iterable2 (list | iterable) – essentially similar to iterable1, but will be used more as a lookup
function (callable) – callable function that takes 2 positional arguments and provides a score. As this is developed for text, it is set to the levenshtein distance by default.
threshold (numeric) – value to compare the distance matrix by
fill_val – anything passed into this parameter will be the placeholder that is returned if a value in iterable1 has nothing close enough in iterable2
- Returns
- Return type
list
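The row-wise "best score above a threshold, else fill value" idea from the method description can be sketched as below. This is a minimal, independent illustration and not the library's code: `fuzzy_match_sketch` is a hypothetical name, and `difflib.SequenceMatcher.ratio` from the standard library is swapped in as the scoring function (a similarity in the range 0 to 1) in place of the package's `levenshtein`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in score: 1.0 for identical strings, lower for dissimilar ones."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match_sketch(xs, ys, function=similarity, threshold=0.8, fill_val="N/A"):
    """For each x, return the best-scoring y, or fill_val if no score exceeds threshold."""
    ys = list(ys)
    matches = []
    for x in xs:
        scores = [function(x, y) for y in ys]  # one row of the score matrix
        if not scores or max(scores) <= threshold:
            matches.append(fill_val)
        else:
            matches.append(ys[scores.index(max(scores))])
    return matches


ocr_lines = ["MILK 1.99", "BREAD 0.89"]  # made-up OCR output
gold_lines = ["MILX 1.99", "EGGS 2.50"]  # made-up reference lines
matched = fuzzy_match_sketch(ocr_lines, gold_lines)
```

The result always has the same length as the first iterable, with `fill_val` marking the rows that found no close enough partner.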
-
readpyne.utils.
quality_metric
(lines, gold) Compare a list of text lines to the
gold standard
(ground truth) for a given receipt. This metric takes into account both the character level OCR quality and the recall of lines.
The key method here is that of the
fuzzy_compare_iter
function with a levenshtein (edit) distance and 2 lists of strings.
By using
fuzzy_compare_iter
and then filtering values with a placeholder we can then check how many rows were recalled ‘close enough’ (using the default threshold value in
fuzzy_compare_iter
). TODO: Add parametrisation of this fn.
Once we have the lines we recalled, we then use levenshtein distance to compare them and use that as our precision metric.
Finally the number of lines recalled ‘close enough’ divided by the number of lines in the gold standard is our
recall
approximation and the precision is measured by how well the characters in each of the recalled lines match using levenshtein distance across the whole string.
To get the final metric we use a harmonic mean of precision and recall.
- Parameters
lines (list[str]) – list of text strings
gold (list[str]) – list of text strings representing ground truth
- Returns
an approximation of a makeshift F1-score
- Return type
float
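To make the recall, precision and harmonic mean combination concrete, here is a small sketch under stated assumptions. It is not the library's implementation: `quality_sketch` is a hypothetical name, `difflib.SequenceMatcher.ratio` replaces the levenshtein-based score, and precision is assumed to be the average similarity of the recalled lines.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in character-level score in the range 0 to 1."""
    return SequenceMatcher(None, a, b).ratio()


def quality_sketch(lines, gold, threshold=0.8):
    """Harmonic mean of an approximate line recall and character precision."""
    if not lines or not gold:
        return 0.0
    # Best score per gold line; a line counts as recalled if it beats the threshold.
    best_scores = [max(similarity(g, line) for line in lines) for g in gold]
    recalled = [score for score in best_scores if score > threshold]
    if not recalled:
        return 0.0
    recall = len(recalled) / len(gold)
    precision = sum(recalled) / len(recalled)  # assumed definition
    return 2 * precision * recall / (precision + recall)


perfect = quality_sketch(["MILK 1.99"], ["MILK 1.99"])
half = quality_sketch(["MILK 1.99"], ["MILK 1.99", "EGGS 2.50"])
```

With one of two gold lines recalled exactly, recall is 0.5 and precision is 1.0, so the harmonic mean comes out at two thirds.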
readpyne.core
The core functionality for readpyne.
-
readpyne.core.
binit
(l_arrays, n_features=100) Takes in a list of 3 1d arrays of length
n
and first bins each of them into a set number of features dictated by
n_features
. This produces an array of length
n_features
for each input array. Then it stacks the three arrays into a 1d array of length
3 * n_features
Note
This function does not check if the len of the input list of arrays is 3.
- Parameters
l_arrays (list) – A list of arrays of length
n
. Usually this will be a vertically collapsed image that has been passed through
cv2.split
to split its channels.
n_features (int) – An integer telling the function how many features to produce per array. This dictates the shape of the end array, as the resulting array will have a length of
3 * n_features
- Returns
A numpy array of length
3 * n_features
- Return type
numpy.array
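The shape contract (three arrays of length `n` in, one array of length `3 * n_features` out) can be sketched as follows. The documentation does not say how the binning is done, so averaging over equal-width chunks is purely an assumption, as are the names `bin_mean` and `binit_sketch`. Plain lists are used instead of numpy arrays to keep the example dependency-free.

```python
from statistics import mean


def bin_mean(values, n_features):
    """Average a 1d sequence over n_features chunks (assumes len(values) >= n_features)."""
    n = len(values)
    edges = [i * n // n_features for i in range(n_features + 1)]
    return [mean(values[lo:hi]) for lo, hi in zip(edges, edges[1:])]


def binit_sketch(l_arrays, n_features=100):
    """Bin each array, then stack the results into one flat list."""
    stacked = []
    for arr in l_arrays:
        stacked.extend(bin_mean(list(arr), n_features))
    return stacked


# Three made-up channel profiles of length 4, reduced to 2 features each.
channels = [[1, 2, 3, 4], [0, 0, 2, 2], [5, 5, 5, 5]]
features = binit_sketch(channels, n_features=2)
```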
-
readpyne.core.
blobify
(img) Take in an image and pass it through
cv2.dnn.blobFromImage
- Parameters
img (numpy.array) – An image to be
blobified
by cv2. (see the cv2.dnn.blobFromImage function documentation)
- Returns
- Return type
np.array
-
readpyne.core.
boxes
(img, context) Take in an image, resize it, predict boxes. Then perform expansion of the boxes to the width of the receipt and then perform
non_max_supression
.
- Parameters
img (numpy.array) – A numpy array representation of an image.
context (dict) – parameter dictionary which contains default settings for various functions # TODO: Write better summary of how to use this
- Returns
np.array – The original image
list – Predicted subsets for the image.
-
readpyne.core.
decode
(scores, geo, min_east_confidence) This takes the geometries and confidence scores and produces bounding box values. The inputs to this function come from the EAST model with 2 layers.
Note
THIS FUNCTION IS UNTESTED
This function borrows heavily from: https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
- Parameters
scores (numpy.array) – numpy array of size
(number of found boxes, )
indicating the assigned confidence scores for each box.
geo (numpy.array) – numpy array coming out from EAST that describes where the boxes are.
min_east_confidence (float) – confidence level at which to cut off the boxes
- Returns
A numpy array of shape
(n, 5)
containing the confidences and box locations for the boxes that are of a certain confidence.
- Return type
numpy.array
-
readpyne.core.
expand
(arr, shape, pad) Function to expand an array of image coordinates to the full width of the image.
- Parameters
arr (numpy.array) – A two dimensional array of coordinates which are in the form of
startX, startY, endX, endY
shape (tuple) – A tuple containing the shape of the original receipt image. Easily accessible through the
.shape
attribute of an image.
- Returns
A numpy array of the same shape as the original array but with the
x
values in each row expanded to the width of the image.
- Return type
numpy.array
-
readpyne.core.
features
(img, subsets) Take an image and its subsets created from
boxes
and produce histogram based features for each subset.
- Parameters
img (numpy.array) – numpy array representation of an image.
subsets (list) – list of numpy arrays of the subsets.
- Returns
list – subsets originally passed into the function
list – list of 1d numpy arrays
-
readpyne.core.
forward
(blob, layers=['feature_fusion/Conv_7/Sigmoid', 'feature_fusion/concat_3']) Take in a
cv2 blob
and pass it forward through an
EAST
model.
Note
The layers in the model by default are
"feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"
- Parameters
blob (cv2.blob) – image that has been
blobified
by cv2. (see the blobify function documentation)
layers (list[str]) – list of strings indicating which layers of the NN to pass the image through
- Returns
numpy.array – Scores array for each box found.
numpy.array – Bounding box locations from EAST
-
readpyne.core.
get_subsets
(img, boxes) Take an image and box locations. Then cut out these boxes from the given image.
- Parameters
img (numpy.array) – numpy array representation of an image.
boxes (iterable) – iterable (most likely a numpy.array) containing box coordinates.
- Returns
list of subsets.
- Return type
list
-
readpyne.core.
hist
(img) Histogram creation function. This function takes an input image and then collapses it vertically by using
np.mean
. It does this for each channel of the image as it uses
cv2.split
to get each channel.
- Parameters
img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is
(n,m,3)
where
n
is the height and
m
is the width of the image.
- Returns
A list of numpy arrays. Each array will be of length
m
where
m
is the width of the input image.
list
-
readpyne.core.
process
(img) Function that performs preprocessing before histograms are created.
- Parameters
img (numpy.array) – A numpy array of shape (n,m,3) containing an image. Usually this will be a subset of a larger receipt.
- Returns
A numpy array representation of the input image after all the processing was applied.
- Return type
numpy.array
-
readpyne.core.
resize
(img) This function resizes an input image to the correct dimensions for the
EAST
model that is used for text detection.
The dimensions for
EAST
have to be divisible by 32. This function pads the bottom and the right side of the image by as many pixels as it needs for this to happen.
Note
The image is padded with white as the color and this is then propagated to the rest of the pipeline under normal circumstances.
- Parameters
img (numpy.array) – This is a numpy array representation of an input image. Expected shape for this array is
(n,m,3)
where
n
is the height and
m
is the width of the image.
- Returns
img – The white padded image with dimensions now divisible by 32.
- Return type
numpy.array
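The amount of padding needed to reach a multiple of 32 is simple modular arithmetic, sketched below. `east_padding` is a hypothetical helper for illustration, not part of the package; it only computes the pixel counts and does not touch an image.

```python
def east_padding(height, width, multiple=32):
    """Return (pad_bottom, pad_right) so both dimensions become divisible by `multiple`."""
    pad_bottom = (-height) % multiple
    pad_right = (-width) % multiple
    return pad_bottom, pad_right


# A 100 x 64 image needs 28 extra rows (100 -> 128) and no extra columns.
pad_bottom, pad_right = east_padding(100, 64)
```

With these two numbers, the white border could then be added with something like `cv2.copyMakeBorder`, passing zero for the top and left sides.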
-
readpyne.core.
stack
(features) Stack features. Basically take in a list containing tuples of subsets and histogram based features and stack them all up.
- Parameters
features (list) – List of type
[([subsets], [features]), ...]
- Returns
tuple of type
([all_subsets],[all_features])
- Return type
tuple
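The input and output types above can be illustrated with a short sketch. `stack_sketch` is a hypothetical name and the strings are placeholders for image subsets and feature arrays; the real function may differ in details such as the container types it returns.

```python
def stack_sketch(features):
    """Flatten [([subsets], [features]), ...] into ([all_subsets], [all_features])."""
    all_subsets, all_features = [], []
    for subsets, feats in features:
        all_subsets.extend(subsets)
        all_features.extend(feats)
    return all_subsets, all_features


# Two receipts, with two and one detected lines respectively (placeholder values).
per_image = [(["s1", "s2"], ["f1", "f2"]), (["s3"], ["f3"])]
subsets, feats = stack_sketch(per_image)
```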
readpyne.io
All input output functions
-
readpyne.io.
cutout_save
(path, img, subsets) Take an image and its subsets and export it to the given path.
- Parameters
path (str) – A path to be used to save all subsets.
img (numpy array) – A numpy array representing the image.
subsets (list) – A list of numpy arrays, one for each subset.
- Returns
- Return type
None
-
readpyne.io.
export_raw_ocr
(ocr_results, filename='ocr_results.txt') Export the result of the ocr line extraction provided by
ocr.ocr_textM
.
- Parameters
ocr_results (list) – list of tuples containing a
str
of text and a
numpy.array
for the image.
filename (str) – a string path to the file which will be used to save the ocr text. (default:
"ocr_results.txt"
)
- Returns
- Return type
None
-
readpyne.io.
get_data
(folder='data/training', jobs=8, **kwargs) Load all images in a folder.
- Parameters
folder (str) – string path to where the images are.
jobs (int) – number of threads to start for the loading. By default this will be set to the number of cores, although the documentation might state otherwise. Threading will not be used if set to 0.
**kwargs – anything passed into the keyword arguments will be passed onto the
load_validate
function
- Returns
list of images
- Return type
list
-
readpyne.io.
interactive_labelling
(subs, features, output_folder=None) Function with an event loop to label the subsets of training images.
Note
This function is untested.
- Parameters
subs (list[np.array]) – list of images representing found lines in a receipt
features (list[np.array]) – list representing the features for line classifier
output_folder (str) – path to the output folder (without the final /) where the labeled data is to be saved.
- Returns
if successful will return pandas dataframe with features and final column as labels
- Return type
pd.DataFrame
-
readpyne.io.
load_model
(path) Load a model from a string path.
- Parameters
path (str) – A path to the model
- Returns
- Return type
sklearn model
-
readpyne.io.
load_validate
(filename, multiple_receipts=False) A thin wrapper on top of
cv2.imread
that makes sure the image is loaded and checks whether it is sideways.
- Parameters
filename (str) – path to the file
multiple_receipts (bool) – specify whether the image is a single receipt or multiple receipts on one image. defaults to False.
- Returns
numpy array of the image
- Return type
np.array
-
readpyne.io.
save_images
(image_list, path='outputs/training', offset=0) Exports a list of images to a given folder.
- Parameters
image_list (list) – A list of images represented as numpy.arrays
path (string) – A path to the folder.
offset (int) – this integer dictates by how many to shift the names of the resulting images. for example if
offset
is 2, the images will start being numbered from 2 onwards.
- Returns
- Return type
None
-
readpyne.io.
save_stack
(subs, features, folder) Get subsets and features and export them.
- Parameters
subs (list[np.array]) – list of numpy arrays representing subsets of the image.
features (list[np.array]) – list of features.
- Returns
subs – Same as input
features – Same as input
-
readpyne.io.
show
(img) Use the matplotlib pyplot function to show the image.
Note
This function is untested.
- Parameters
img (numpy array) – A numpy array representing the image.
- Returns
- Return type
None
-
readpyne.io.
show_ocr
(ocr_results) Show
OCR
results from the
ocr_textM
function.
- Parameters
ocr_results (list) – list of tuples containing a
str
of text and a
numpy.array
for the image.
- Returns
- Return type
None