Develop and apply discovery techniques to physical data sets to build an understanding of the content, quality, and rules of a specified set of data under management.
Data profiling is a cornerstone of an effective data quality improvement effort and an important first step for many information technology initiatives. It is a discovery task conducted through automated (tool-supported or custom-query) and/or manual analysis of physical records. For a selected data set, it reveals what is stored in databases and how physical values may differ from expected, allowed, or required values listed in data store documentation or described in metadata repositories. Definite errors are referred to as “defects”; suspected errors are referred to as “anomalies.”
Data profiling differs from Data Quality Assessment in that profiling activities result in a series of conclusions about physical data sets, whereas the assessment process evaluates how well the data meets specific business quality requirements. Data profiling is typically the first step in conducting data quality assessments.
There are several levels of tests a data profiler can apply to a data set. At the most basic level, vendor data quality tools contain out-of-the-box tests that examine nulls, lengths, ranges, values, and formats. As a hypothetical example, if a profiling effort were conducted on 1,000 patient demographic records, the results of these basic tests (known as “syntax” checks) might yield defects and anomalies similar to these: 13 birth dates were entered in a MMDDYY format instead of the intended MMDDYYYY format; 27 street addresses contained fewer than 7 characters; 34 records did not contain a phone number; 15 records did not contain a ZIP code; 193 records did not contain a middle name; and 3 records contained a patient gender value of ‘O’.
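To make the mechanics of these syntax checks concrete, the following is a minimal sketch in Python of how such tests might be expressed as custom queries against extracted records. The field names and the sample record are hypothetical, not drawn from any particular system:

```python
import re

# Hypothetical patient demographic records; field names are illustrative.
patients = [
    {"birth_date": "031594", "street": "12 Zip ST", "phone": None,
     "zip": "30301", "middle_name": "", "gender": "O"},
]

# Basic "syntax" checks covering nulls, lengths, domains, and formats.
checks = {
    "birth date not in MMDDYYYY format": lambda r: not re.fullmatch(r"\d{8}", r["birth_date"] or ""),
    "street address under 7 characters": lambda r: len(r["street"] or "") < 7,
    "missing phone number":              lambda r: not r["phone"],
    "missing ZIP code":                  lambda r: not r["zip"],
    "missing middle name":               lambda r: not r["middle_name"],
    "gender outside expected domain":    lambda r: r["gender"] not in ("M", "F"),
}

# Tally results per check, mirroring the counts a profiling report would show.
for label, test in checks.items():
    count = sum(1 for record in patients if test(record))
    print(f"{label}: {count} of {len(patients)} records")
```

A vendor tool would package these tests out of the box; the point of the sketch is that each reported count traces back to a single, explicit rule.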
Another level of profiling includes tests on values that affect each other within a given record according to business rules (known as “semantic” checks); for example, checking whether a record with a patient status of “Deceased” also contains a value in the date of death column. More complex exploration can also be applied across different records (known as “context” tests); for example, determining that two patients are co-located (known as “householding”) based on the occurrence of identical addresses, or identifying duplicate patient records. This level of profiling typically requires custom queries.
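As a hedged illustration of how semantic and context checks might look as custom queries, the sketch below applies both kinds of tests in Python; the column names and sample records are assumptions for the example only:

```python
from collections import defaultdict

# Hypothetical records for illustration only.
records = [
    {"id": 1, "status": "Deceased", "date_of_death": None, "address": "12 ZIP ST"},
    {"id": 2, "status": "Active",   "date_of_death": None, "address": "12 ZIP ST"},
]

# Semantic check: a "Deceased" status requires a date of death in the same record.
semantic_defects = [r["id"] for r in records
                    if r["status"] == "Deceased" and not r["date_of_death"]]
print("Deceased without date of death:", semantic_defects)

# Context test: group records sharing an identical address ("householding").
households = defaultdict(list)
for r in records:
    households[r["address"]].append(r["id"])
co_located = {addr: ids for addr, ids in households.items() if len(ids) > 1}
print("Co-located record groups:", co_located)
```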
Profiling results are a primary input to the resolution of anomalies, typically through manual analysis and data cleansing activities. In the example above of unusually short street addresses, a manual review may find that the values are valid, e.g., Zip ST or Fen ST. In the gender value example, the reviewer may need to research the gender of the three patients. In the case of missing ZIP codes or phone numbers, the reviewer might refer the lookup to billing. Profiling results also feed into data cleansing to correct errors; for example, standardization to the MMDDYYYY format can be implemented through an automated script.
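A minimal sketch of such an automated cleansing script follows, assuming two-digit years are resolved against a century cutoff the organization would have to approve; the function name and cutoff value are illustrative:

```python
from datetime import datetime

def standardize_birth_date(value: str, century_cutoff: int = 30) -> str:
    """Convert MMDDYY values to MMDDYYYY; pass through conformant values.

    Two-digit years at or below century_cutoff are assumed to be 20xx,
    the rest 19xx -- a policy decision, not a universal rule.
    """
    if len(value) == 8:
        return value  # already MMDDYYYY
    if len(value) == 6:
        month, day, yy = value[:2], value[2:4], int(value[4:])
        century = 2000 if yy <= century_cutoff else 1900
        candidate = f"{month}{day}{century + yy}"
        datetime.strptime(candidate, "%m%d%Y")  # reject impossible dates
        return candidate
    raise ValueError(f"Unrecognized birth date format: {value!r}")

print(standardize_birth_date("031594"))  # -> 03151994
```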
An organization that implements effective profiling practices can realize a number of benefits, as illustrated by the recommendations that follow.
It is recommended that an organization that hosts at least one data store containing patient demographic data first conduct a comprehensive baseline profiling effort. This will clearly reveal the current state of the data, allow formulation of meaningful metrics, and become the foundation for tracking improvements going forward.
Once the data has undergone baseline cleansing and anomaly resolution, the organization should determine the frequency at which it will conduct periodic profiling of the patient demographic data, reusing the same tests and enhancing them with new tests if additional quality rules are adopted through data quality assessments, changes in standards, or additions to demographic data. For example, adopting and following a standard practice of basic profiling and cleansing before running advanced rule sets (i.e., matching algorithms) for identifying matching patient records will help to increase the accuracy of those algorithms. This may initially increase the number of duplicates discovered (i.e., previously undetected false negatives), but should lower the incidence of duplicate records over time.
Data profiling is an important activity in evaluating data quality for a specific purpose and can be event-driven. For example, if a healthcare organization acquires another organization, profiling the new source would be recommended prior to migrating and integrating records into the destination system. Similarly, organizations are advised to profile data prior to: finalizing the design of a master data store; performing data conversion or consolidation; implementing a new EHR, patient registration, billing, or similar system or migration; and connecting to a health information exchange.
In addition to defect fixes and resolution of anomalies, organizations may determine from profiling results that metadata descriptions should be updated, loading scripts should be enhanced with refined or new quality rules, or existing data structures should be redesigned. What is learned through data profiling often also serves as a key input to the development of metrics, content standards, business process redesign, and technology refreshes, further improving the quality of the organization's patient data.
It is recommended that an organization implement a standard process for conducting profiling efforts, which includes the reporting and publication of results to relevant stakeholders. Prioritization of candidate data stores and data sets for profiling should be based on business needs and objectives expressed in the data quality plan.
Data profiling is the technical means for revealing the magnitude and frequency of differences within and across data sets. It reveals quantitative patterns in data sets, such as the distinct number of values in a field, the frequency distribution of distinct values, null or blank value counts, and numeric ranges. The purpose is to identify defects and anomalies, such as varying formats in the same attribute within or across systems.
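A brief sketch of how these quantitative patterns might be computed for a single column follows; the sample values are hypothetical:

```python
from collections import Counter

# Hypothetical extract of one column, e.g., patient ZIP codes.
values = ["30301", "30301", None, "303", "30310", "", "30301"]

non_null = [v for v in values if v not in (None, "")]
frequencies = Counter(non_null)
lengths = Counter(len(v) for v in non_null)

print("row count:        ", len(values))
print("null/blank count: ", len(values) - len(non_null))
print("distinct values:  ", len(frequencies))
print("value frequencies:", frequencies.most_common())
print("length pattern:   ", dict(lengths))  # varying lengths flag format anomalies
```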
Even if similar data attributes are used across the continuum of care, they are usually stored in different formats. For example, a patient’s middle name or middle initial may be used, the use of hyphens may differ, previously used names or nicknames may or may not be captured, and people of different ethnic backgrounds may order their family names in patterns other than first, middle, and last name.
Example Work Products
A data profiling method is a planned approach to analyzing data sets that is not restricted to a specific technology solution. The method serves as a process guide that defines the types of analyses to be performed, their rationale, relevant scenarios, high-level activity steps, tests and rules to be applied, as well as report templates for results. The goal is to define the process steps and supporting work products so they are reusable across various data stores.
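One way to keep the method independent of any single tool is to express the tests declaratively and let a thin driver apply them to whatever data store is in scope. The sketch below illustrates the idea; the rule types and attribute names are assumptions, not a prescribed schema:

```python
import re

# A profiling method as data: the same definitions can drive different tools.
PROFILING_METHOD = {
    "data_set": "patient_demographics",
    "rationale": "Baseline syntax checks prior to matching",
    "tests": [
        {"attribute": "birth_date", "type": "format", "pattern": r"\d{8}"},
        {"attribute": "street",     "type": "min_length", "threshold": 7},
        {"attribute": "zip",        "type": "not_null"},
    ],
}

def run_method(method, records):
    """Apply each declared test and return defect counts for the report."""
    results = {}
    for test in method["tests"]:
        attr, kind = test["attribute"], test["type"]
        if kind == "format":
            fails = sum(1 for r in records
                        if not re.fullmatch(test["pattern"], r.get(attr) or ""))
        elif kind == "min_length":
            fails = sum(1 for r in records
                        if len(r.get(attr) or "") < test["threshold"])
        elif kind == "not_null":
            fails = sum(1 for r in records if not r.get(attr))
        results[f"{attr}:{kind}"] = fails
    return results
```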
One of the most common types of advanced profiling methods is aimed at the identification and resolution of duplicate records in a data set. The patient matching algorithms used across the healthcare industry are a classic example of both the target objective and the difficulty of stabilizing profiling methods that work. Most algorithms for determining duplicates require trial and error, as well as customization achieved through numerous iterations of data analysis and standardization.
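As a deliberately simplified sketch of the iterative character of such methods, the fragment below uses a blocking key built from standardized name fragments and birth date; real matching algorithms are considerably more elaborate, and the normalization and key here are illustrative assumptions:

```python
from collections import defaultdict

def normalize(name: str) -> str:
    # Standardization applied before matching; profiling informs these rules.
    return (name or "").strip().upper().replace("-", "")

def candidate_duplicates(records):
    """Group records by a simple blocking key; each iteration of analysis
    refines the key and the normalization rules."""
    blocks = defaultdict(list)
    for r in records:
        key = (normalize(r["last_name"]),
               normalize(r["first_name"])[:1],
               r["birth_date"])
        blocks[key].append(r["id"])
    return [ids for ids in blocks.values() if len(ids) > 1]

records = [
    {"id": 1, "first_name": "Maria", "last_name": "Lopez-Ruiz", "birth_date": "03151994"},
    {"id": 2, "first_name": "MARIA", "last_name": "LopezRuiz",  "birth_date": "03151994"},
]
print(candidate_duplicates(records))  # -> [[1, 2]]
```

Note that the second record matches only because hyphen and case differences were standardized first, which is why basic profiling and cleansing precede the matching run.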
Typically, an information technology resource performs data profiling and may know very little about the business meaning and criticality of the data set. Therefore, stakeholders (i.e., data stewards and owners) should be consulted to define the authoritative current state of the quality of the data set, which will serve as the baseline for tracking and quantifying improvements.
Likewise, the profiling resource is responsible only for implementing defect resolution rules that have been provided as approved data quality requirements by the appropriate stakeholders (i.e., data stewards and owners).
Example Work Products
It is important for data stewards to define and approve a set of data attributes to profile according to business needs (e.g., applying an algorithm to identify duplicate patient records) and to specify the data stores to be profiled.
Once a baseline for the data set has been established and approved, data governance should determine the frequency of periodic profiling and review the results to establish enhancements as needed (e.g., updated data quality requirements, changes in standards, or new demographic attributes).
For example, stakeholders within and across healthcare organizations agree that standardized patient identifying attributes should be required in relevant data exchange transactions to ensure interoperability.
Selecting and consistently using standard profiling tools against all patient demographic data stores within the organization will result in greater productivity, more efficient training, and more rapid evolution of overall profiling capabilities. Standard metrics and reporting templates will not only be easier to produce but will also foster more efficient and confident decision-making by business stakeholders on the improvement of data assets.
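A hedged sketch of what such a standard results template might look like in code form follows; the fields shown are assumptions about what stakeholders could want, not a prescribed schema:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ProfilingReport:
    """A reusable results template that every profiling run populates."""
    data_store: str
    run_date: date
    records_profiled: int
    defect_counts: dict = field(default_factory=dict)

    def summary(self) -> str:
        lines = [f"Profiling of {self.data_store} on {self.run_date}:",
                 f"  records profiled: {self.records_profiled}"]
        for test, count in sorted(self.defect_counts.items()):
            rate = count / self.records_profiled if self.records_profiled else 0
            lines.append(f"  {test}: {count} ({rate:.1%})")
        return "\n".join(lines)

report = ProfilingReport("patient_registration_db", date.today(), 1000,
                         {"birth date format": 13, "missing ZIP code": 15})
print(report.summary())
```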
Example Work Products
1.1 Has the organization profiled patient demographic data?
2.1 Has the organization defined an approach and method for profiling a data set?
2.2 Are defects and anomalies identified through data profiling, and are resulting recommendations for remediation reported to stakeholders?
3.1 Has data governance approved the set of patient demographic data attributes and sources for regular profiling and monitoring?
3.2 Has the organization developed standard data profiling tools, metrics, and result report templates?