Usage¶
To use EHRcorral in a project:
import ehrcorral
EHRCorral operates on a collection of Records, each of which represents a single electronic health record. A collection of Records is called a Herd, hence the name EHRCorral: generating a master patient index of all the records is done by “corralling” the Herd.
There is a small number of actions to perform, but potentially several setting to consider:
- Create Records
- Create a Herd
- Populate the Herd with the Records
- Corral the Herd
Records¶
Ref: ehrcorral.ehrcorral.Record
A Record is a simplified representation of a patient’s EHR which only contains information relevant to the current matching algorithm. Each Record must contain a forename and a current surname, but it can also house other identifying information. All the information in a Record is used to discover other Records that describe the same individual.
ehr_entries = [
{
'forename': 'John',
'mid_forename': '',
'current_surname': 'Doe',
'suffix': 'Sr.',
'address1': '1 Denny Way',
'city': 'Orlando',
'state_province': 'FL'
'postal_code': 32801,
'sex': 'M',
'gender': 'M',
'national_id1': '123-45-678', # Using field as social security num
'id2': 'F1234578', # Optional ID, such as driver license num
'birth_year': '1985',
'birth_month': '08',
'birth_day': '04',
'blood_type': 'B+'
},
{
'first_name': 'Jane',
'middle_name': 'Erin',
'birth_surname': 'Doe',
'current_surname': 'Fonda',
'suffix': '',
'address1': '1 Bipinbop St',
'address2': 'Apt. 100',
'city': 'Austin',
'state_province': 'TX',
'postal_code': 73301,
'sex': 'F',
'gender': 'F',
'national_id1': '876-54-321',
'birth_year': 1976,
'birth_month': '08', # Numeric fields are coerced to proper type
'birth_day': 01,
'blood_type': 'A-',
}
]
records = [ehrcorral.gen_record(entry) for entry in ehr_entries]
Above, we create two Records (an entry for John and one for Jane) using the
function ehrcorral.ehrcorral.gen_record()
. Generally, you will not
need to interact directly with the Records once they are created.
In practicality, you won’t have just two EHR entries, but thousands or millions
of them, and there might be multiple entries for John or Jane and many other
individuals in the sub-population. The Record class is designed to be extremely
light on memory usage, much more so than a dictionary or list, for example. A
collection of 10 million Records will occupy about 5—6 GB, whereas 10 million
dictionaries containing the same data will occupy about three times the memory.
Therefore, when generating Records it is advisable not to build up a large
dictionary of data to then be sent one by one to
ehrcorral.ehrcorral.gen_record()
. Instead, generate the Records in a
loop that operates only on a single EHR entry at a time so the dictionaries like
the ones above are thrown away once the Record is created:
records = []
for entries in ehr:
# Extract forenames, sex, etc. from EHR data into dict called 'entry'
# ...
# entry = {'forename': 'John', ... , 'blood_type': 'B+'}
records.append(ehrcorral.gen_record(entry))
Record Fields¶
For the full list of fields available to generate a Record, see
ehrcorral.ehrcorral.Profile
.
If additional fields are passed to gen_record()
they are ignored.
Missing fields receive a value of empty string. No transformations are applied
to these fields other than to coerce strings to integers when the algorithm
requires integers. You should perform any pre-processing that you think is
relevant for your region or data set, such as removing accents or umlauts if you
do not want to match based on such special characters, defining forename and
mid forename if names in your region are particularly long, removing prefixes
like Mr. and Mrs., and determining what to use for the national ID field.
Creating a Herd¶
Ref: ehrcorral.ehrcorral.Herd.populate()
Once the Records have been created, you can populate a Herd. A list or tuple of Records can be used.
herd = ehrcorral.Herd()
herd.populate(records)
In order to prevent race conditions during matching, the population of a Herd
cannot be updated once it is set. Calling populate()
again with additional
records will raise an error.
Matching Records¶
To performing record-linkage on the Herd, you call its corral()
method. This
method requires as input a function which performs phonemic name compression,
for Record blocking purposes. For convenience, Soundex, NYSIIS, metaphone, and
double metaphone implementations have been included. Below, double metaphone is
used. If you are not yet familiar with blocking methods, please consult
Record Blocking in the documentation.
from ehrcorral.compressions import dmetaphone
# Alternate blocking compressions:
# from ehrcorral.compressions import soundex
# from ehrcorral.compressions import nysiis
# from ehrcorral.compressions import metaphone
# from ehrcorral.compressions import first_letter
herd.corral(blocking_compression=dmetaphone)
similarities = herd.similarity_matrix
See ehrcorral.ehrcorral.Herd.corral()
for documentation of additional
function parameters.
Running corral()
on the Herd generates a similarity (i.e. probability)
matrix with dimension N _x_ N, where N is the number of records in the Herd.
This matrix provides the probabilities that each record belongs to the same
person as contained in every other record in the Herd. Each row and column
index in the similarity matrix corresponds to each Record’s record_number
property (see documentation for Record class). The user can decide how to link
records using a threshold value to determine which records belong to the same
individual. Currently there is no built-in method to automatically merge
records together since there are many different strategies for merging that
the user might want to employ. Additionally, it is likely that the user would
want to merge the original data that was used to generate each Record rather
than merging the Records themselves.