Modules

ehrcorral.ehrcorral

Herd

class ehrcorral.ehrcorral.Herd

A collection of Record objects with methods for interacting with and linking records in the herd.

similarity_matrix

numpy.ndarray, None

A numpy array containing the similarities between Record instances, ordered by accession number on both axes. Each entry is between 0 and 1 with 1 being perfect similarity.

append_block_dict(record)

Adds the given Record’s blocking codes to the herd’s block dictionary.

The dictionary keys are block codes. Each key maps to a list of references to the Records that share that block code.

Parameters:record (Record) – An object of class Record
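
As a rough illustration of the structure described above, the block dictionary can be pictured as a mapping from block code to a list of records. This sketch is not the library's implementation: the SimpleRecord class, its blocks attribute, and the block codes themselves are hypothetical placeholders.

```python
from collections import defaultdict


class SimpleRecord:
    """Hypothetical stand-in that holds only a name and its blocking codes."""

    def __init__(self, name, blocks):
        self.name = name
        self.blocks = blocks


def append_block_dict(block_dict, record):
    """Add a reference to the record under each of its block codes."""
    for code in record.blocks:
        block_dict[code].append(record)


block_dict = defaultdict(list)

# The codes below are made up for illustration; they are not real
# double metaphone output.
smith = SimpleRecord("John Smith", ["SM0J", "XMTJ"])
smyth = SimpleRecord("Jon Smyth", ["SM0J", "XMTJ"])
jones = SimpleRecord("Amy Jones", ["JNSA"])

for rec in (smith, smyth, jones):
    append_block_dict(block_dict, rec)

# Records that share a block code become candidates for comparison.
candidates = [r.name for r in block_dict["SM0J"]]
```

Under these assumptions, only records listed under the same block code need to be compared with one another.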

append_names_freq_counters(record)

Adds the forename and surname for the given Record to the forename and surname counters.

Parameters:record (Record) – An object of class Record

corral(forename_freq_method=<function first_letter>, surname_freq_method=<function doublemetaphone>, blocking_compression=<function doublemetaphone>)

Perform record matching on all Records in the Herd.

Parameters:
  • forename_freq_method (func) – A function that compresses a forename. Forenames can be compressed differently from surnames. The compressed values are used to determine weights for certain matching scenarios. For example, if each forename is compressed to its first initial, a match on a name that begins with ‘F’ receives a weight equal to the fraction of names in the entire Herd that begin with ‘F’. The rarer names beginning with ‘F’ are, the more significant a match between two identical or similar forenames beginning with ‘F’ becomes. Defaults to the first initial of the forename.
  • surname_freq_method (func) – A function that compresses a surname. Defaults to double metaphone.
  • blocking_compression (func) – The compression method to use when blocking. Block codes are created by compressing the surname and then appending the first initial of the forename. Defaults to double metaphone, of which only the primary compression is used.
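
The frequency weighting described for forename_freq_method can be sketched as follows. This is a minimal illustration, not the library's code: the frequency_weights helper and the sample names are assumptions, and first_letter is written out here rather than imported.

```python
from collections import Counter


def first_letter(name):
    """Compress a name to its upper-cased first initial."""
    return name[0].upper()


def frequency_weights(names, compress):
    """Map each compression to the fraction of names that share it."""
    counts = Counter(compress(name) for name in names)
    total = sum(counts.values())
    return {code: count / total for code, count in counts.items()}


# Hypothetical forenames standing in for the forenames of a whole Herd.
forenames = ["Frank", "Fiona", "Maria", "Mark", "Mei", "Mona", "Omar", "Zoe"]
weights = frequency_weights(forenames, first_letter)
```

Here two of the eight names begin with ‘F’ and four begin with ‘M’, so the ‘F’ fraction is smaller and, following the description above, a match on an ‘F’ forename would count as more significant than a match on an ‘M’ forename.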

populate(records)

Sets the Herd’s sub-population.

Parameters:records (list, tuple) – A list or tuple containing multiple Record objects.

size

Returns the size of the Herd’s population.

Record

class ehrcorral.ehrcorral.Record

A Record contains identifying information about a patient, as well as generated phonemic and meta information.

gen_blocks(compression)

Generate and set the blocking codes for a given record.

Blocking codes are composed of the phonemic compressions of the profile surnames combined with the first letter of each forename. Generated blocking codes are stored in self._blocks, which contains only the unique set of blocking codes.

Parameters:compression (func) – A function that performs phonemic compression.
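
The composition of blocking codes described above can be sketched as below. This is only an illustration under stated assumptions: strip_vowels is a hypothetical stand-in for a real phonemic compression such as double metaphone, and this gen_blocks is a free function rather than the Record method.

```python
def strip_vowels(name):
    """Hypothetical compression: keep the first letter, drop later vowels."""
    upper = name.upper()
    return upper[0] + "".join(ch for ch in upper[1:] if ch not in "AEIOU")


def gen_blocks(forenames, surnames, compression):
    """Pair every compressed surname with the first letter of every forename."""
    blocks = {
        compression(surname) + forename[0].upper()
        for surname in surnames
        for forename in forenames
        if surname and forename  # skip empty name fields
    }
    # A set keeps only the unique codes; sort for a stable order.
    return sorted(blocks)


# "Smith" and "Smithe" compress to the same code, so duplicates collapse.
blocks = gen_blocks(["Mary", "Ann"], ["Smith", "Smithe", "Jones"], strip_vowels)
```

With three surnames and two forenames there are six combinations, but only four unique codes remain because two of the surnames share a compression.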

save_name_freq_refs(record_number, forename_freq_method, surname_freq_method)

Compress the forenames and surnames and save the compressions to the Record.

Parameters:
  • record_number (int) – An integer to be assigned as initial person and accession number.
  • forename_freq_method (func) – A function that performs some sort of compression on a single name.
  • surname_freq_method (func) – A function that performs some sort of compression on a single name.

Profile

class ehrcorral.ehrcorral.Profile

A selection of patient-identifying information from a single electronic health record.

All fields should be populated with an int or string and will be coerced to the proper type for that field automatically.

forename

Also known as first name.

mid_forename

Also known as middle name.

birth_surname

Last name at birth, often same as mother’s maiden name.

current_surname

Current last name. Can differ from the birth surname, most often after marriage.

suffix

Sr., Junior, II, etc.

address1

Street address, such as “100 Main Street”.

address2

Apartment or unit information, such as “Apt. 201”.

state_province

State or province.

postal_code

Postal or ZIP code.

country

Consistent formatting should be used. Do not use USA in one Record and United States of America in another.

sex

Physiological sex (M or F)

gender

The gender the patient identifies with (M or F), which may differ from physiological sex.

national_id1

For example, social security number. This should be the same type of number for all patients. Do not use a USA social security number in one Record and a Mexico passport number in another.

id2

Can be used as an additional identifying ID number, such as driver’s license number. Again, define the type of ID number this is for the entire sub-population.

mrn

Medical record number.

birth_year

In the format YYYY.

birth_month

In the format MM.

birth_day

In the format DD.

blood_type

One of A, B, AB, or O with an optional +/- denoting RhD status.
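
The blood_type format above can be checked with a short regular expression. This is only an illustrative sketch: the pattern and the is_valid_blood_type helper are assumptions and are not part of ehrcorral.

```python
import re

# ABO group (A, B, AB, or O) followed by an optional RhD sign.
BLOOD_TYPE_PATTERN = re.compile(r"^(AB|A|B|O)[+-]?$")


def is_valid_blood_type(value):
    """Return True if value looks like 'A', 'AB+', 'O-', and so on."""
    return BLOOD_TYPE_PATTERN.match(value.strip().upper()) is not None
```

The helper normalizes case and surrounding whitespace before matching, which is a design choice of this sketch rather than documented library behavior.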

gen_record()

ehrcorral.ehrcorral.gen_record(data)

Generate a Record which can be used to populate a Herd.

In addition to extracting the profile information for

Parameters:data (dict) – A dictionary containing at least one of the fields in PROFILE_FIELDS.
Returns:An object of class Record.
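
The shape of the data argument can be sketched as a plain dictionary keyed by Profile field names. In this sketch the PROFILE_FIELDS tuple is reconstructed from the Profile attributes documented above and may not match the real constant, and has_profile_field is a hypothetical helper, not a library function.

```python
# Assumed contents: the Profile attributes listed in this reference.
# The real PROFILE_FIELDS constant may differ.
PROFILE_FIELDS = (
    "forename", "mid_forename", "birth_surname", "current_surname", "suffix",
    "address1", "address2", "state_province", "postal_code", "country",
    "sex", "gender", "national_id1", "id2", "mrn",
    "birth_year", "birth_month", "birth_day", "blood_type",
)


def has_profile_field(data):
    """Return True if the dictionary contains at least one known field."""
    return any(key in PROFILE_FIELDS for key in data)


# Values may be ints or strings, as noted for Profile.
data = {"forename": "Mary", "current_surname": "Smith", "birth_year": 1985}
usable = has_profile_field(data)
```

A dictionary like data above is the kind of input gen_record(data) expects.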

compress()

ehrcorral.ehrcorral.compress(names, method)

Compresses surnames using different phonemic algorithms.

Parameters:
  • names (list) – A list of names, typically surnames
  • method (func) – A function that performs phonemic compression
Returns:

A list of the compressions.
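
The calling pattern of compress can be sketched as applying the method to each name in turn. This is an assumption about the behavior based only on the description above, not the library's source, and first_letter is written out here as a simple stand-in for a phonemic algorithm.

```python
def first_letter(name):
    """Stand-in compression: the upper-cased first initial."""
    return name[0].upper()


def compress(names, method):
    """Apply the compression function to every name in the list."""
    return [method(name) for name in names]


codes = compress(["Smith", "Jones"], first_letter)
```

Because the method is passed in as a function, the same call works with any compression that accepts a single name.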

ehrcorral.measures

record_similarity()

ehrcorral.measures.record_similarity(herd, first_record, second_record, forename_method=<function damerau_levenshtein>, surname_method=<function damerau_levenshtein>)

Determine weights for the likelihood of two records being the same.

Parameters:
  • herd (Herd) – An object of Herd which contains the two records being compared.
  • first_record (Record) – An object of Record to be compared to the other one.
  • second_record (Record) – An object of Record to be compared to the other one.
  • forename_method (func) – A function that performs some sort of comparison between strings.
  • surname_method (func) – A function that performs some sort of comparison between strings.
Returns:

A tuple of the sum of name weights and the sum of non-name weights.
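
The signature above defaults both comparison methods to damerau_levenshtein. For readers unfamiliar with that measure, the sketch below implements the restricted variant (optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters. It is not the function ehrcorral uses, and the similarity helper's scaling to the range 0 to 1 is an assumption made for illustration.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a
    for j in range(cols):
        dist[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def similarity(a, b):
    """Assumed scaling: 1.0 for identical strings, lower as edits increase."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest
```

Counting a transposition as one edit matters for names, since typing errors such as "Brain" for "Brian" are common in hand-entered records.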

get_forename_similarity()

ehrcorral.measures.get_forename_similarity(herd, records, method, name_type)

Determine weights for the likelihood of two forenames being the same.

Parameters:
  • herd (Herd) – An object of Herd which contains the two records being compared.
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the forenames.
  • name_type (unicode) – A unicode string to indicate which forename is being compared.
Returns:

The forename weight for the similarity of the forenames.

extract_forename_similarity_info()

ehrcorral.measures.extract_forename_similarity_info(herd, record, name_type)

Extract desired forename and associated frequency weight.

Parameters:
  • herd (Herd) – An object of Herd which contains the frequency dictionary used for the frequency weight.
  • record (Record) – An object of Record from which to extract the forename.
  • name_type (unicode) – A unicode string to indicate which forename is being extracted.
Returns:

The forename and associated frequency weight for the requested name.

get_surname_similarity()

ehrcorral.measures.get_surname_similarity(herd, records, method, name_type)

Determine weights for the likelihood of two surnames being the same.

Parameters:
  • herd (Herd) – An object of Herd which contains the two records being compared.
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the surnames.
  • name_type (unicode) – A unicode string to indicate which surname is being compared.
Returns:

The surname weight for the similarity of the surnames.

extract_surname_similarity_info()

ehrcorral.measures.extract_surname_similarity_info(herd, record, name_type)

Extract desired surname and associated frequency weight.

Parameters:
  • herd (Herd) – An object of Herd which contains the frequency dictionary used for the frequency weight.
  • record (Record) – An object of Record from which to extract the surname.
  • name_type (unicode) – A unicode string to indicate which surname is being extracted.
Returns:

The surname and associated frequency weight for the requested name.

get_address_similarity()

ehrcorral.measures.get_address_similarity(records, method=<function damerau_levenshtein>)

Determine weights for the likelihood of two addresses being the same.

Parameters:
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the addresses.
Returns:

The address weight for the similarity of the addresses.

clean_address()

ehrcorral.measures.clean_address(address)

Remove all punctuation from a unicode string that contains an address, and standardize all street suffixes and unit designators.

Parameters:address (unicode) – A unicode string that contains an address to be cleaned and standardized.
Returns:The cleaned unicode address string.
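
The cleaning steps described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the two lookup tables are deliberately tiny and hypothetical, and upper-casing the result is an assumption of this sketch.

```python
import string

# Hypothetical abbreviation tables; a real implementation would need far
# more entries.
STREET_SUFFIXES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}
UNIT_DESIGNATORS = {"APARTMENT": "APT", "SUITE": "STE"}


def clean_address(address):
    """Strip punctuation and standardize suffixes and unit designators."""
    no_punct = address.translate(str.maketrans("", "", string.punctuation))
    words = no_punct.upper().split()
    replacements = {**STREET_SUFFIXES, **UNIT_DESIGNATORS}
    return " ".join(replacements.get(word, word) for word in words)


long_form = clean_address("100 Main Street, Apartment #201")
short_form = clean_address("100 Main St. Apt. 201")
```

Both spellings reduce to the same string, so a later string comparison is not penalized for differences in punctuation or abbreviation.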

get_post_code_similarity()

ehrcorral.measures.get_post_code_similarity(records, method=<function damerau_levenshtein>)

Determine weights for the likelihood of two postal codes being the same.

Parameters:
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the postal codes.
Returns:

The postal code weight for the similarity of the postal codes.

get_sex_similarity()

ehrcorral.measures.get_sex_similarity(records)

Determine weights for the likelihood of two sexes being the same.

Parameters:records (List[Record]) – A list of two objects of Record to be compared to one another.
Returns:The sex weight for the similarity of the sexes.

get_dob_similarity()

ehrcorral.measures.get_dob_similarity(records, method=<function damerau_levenshtein>)

Determine weights for the likelihood of two dates of birth being the same.

Parameters:
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the dates of birth.
Returns:

The date of birth weight for the similarity of the dates of birth.

get_id_similarity()

ehrcorral.measures.get_id_similarity(records, method=<function damerau_levenshtein>)

Determine weights for the likelihood of two national IDs being the same.

Parameters:
  • records (List[Record]) – A list of two objects of Record to be compared to one another.
  • method (func) – A function to be used to compare the national IDs.
Returns:

The national ID weight for the similarity of the two national IDs.