Developing FHIR based Activity Libraries to Support Clinical Trial Direct Data Capture

Patrick Genyn; Andy Richardson

doi:10.47912/jscdm.423

Introduction

In clinical research, the protocol schedule of activities (SoA) specifies the progression of research participants through a study and outlines required activities at each time point. Richardson¹ previously demonstrated how SoAs can be represented as graph objects and converted to HL7® Fast Healthcare Interoperability Resources (FHIR®) PlanDefinitions² for interoperability with systems such as electronic health records (EHRs). While the scheduling elements of an SoA are study-specific, activity definitions originate from standardized data, interventions, or review requirements and are configured per protocol. This work extends the SoA graph model to (a) manage activity definitions using the same graph principles as for scheduling, (b) generate activity definitions as FHIR resources, and (c) use defined concepts from standard health care coding systems such as Logical Observation Identifiers Names and Codes (LOINC)³ and Systematized Nomenclature of Medicine (SNOMED)⁴ to create comprehensive activity libraries. These libraries enable complete FHIR-based study specifications to be configured, with a focus on direct data capture (DDC): moving clinical trial or health care data from its source (such as EHRs, clinicians, or patients) into structured, interoperable FHIR formats.

Background

The electronic case report form (eCRF) is the current de facto standard for collecting patient data in clinical studies. It serves as an electronic questionnaire that aligns with the study-specific SoA details outlined in the study protocol, thereby specifying the sponsor’s data requirements for the clinical study.⁵ There is growing interest in leveraging EHRs as a direct source of research data because of their potential to streamline data collection, reduce duplication, and enhance data accuracy. EHRs, as repositories of real-time clinical information, offer a rich, continuously updated source of data that encompasses patient demographics, clinical observations, interventions, and outcomes.

EHRs are primarily designed to document and manage health care information, not research data. However, much of the data required for clinical research is inherently captured within EHRs.⁶ Key research-relevant information, such as demographic details, vital signs, medical history, medications, allergies, immunizations, laboratory results, radiology images, and billing information, is routinely recorded during clinical care.

While EHRs contain a wealth of data relevant to clinical research, this information is not always ready for direct research use, particularly because of the way it is coded. Standard health care coding systems, such as LOINC,³ Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT),⁴ the International Classification of Diseases, Tenth Revision (ICD-10),⁷ and Digital Imaging and Communications in Medicine (DICOM),⁸ are designed to support clinical workflows rather than research-specific requirements.

Integration between EHRs and eCRFs or other clinical research systems is hindered by gaps in two kinds of interoperability: structural and semantic.⁹,¹⁰ Structural interoperability ensures that information is consistent in structure and format. Semantic interoperability enables systems to exchange information in a way that preserves its meaning, so that the recipient system can accurately interpret and apply the data within its context.

The HL7® Fast Healthcare Interoperability Resources (FHIR®) standard² offers a robust framework to address these challenges by providing the necessary interoperability components for both specification and collection of research-relevant data. The definitional FHIR resources, such as ActivityDefinition, ObservationDefinition, and SpecimenDefinition, allow the creation of structured, machine-readable formats that align clinical research requirements with EHR capabilities.¹¹ Realizing this integration requires translating clinical trial protocols into actionable FHIR resources. A critical aspect of this translation is defining the study’s SoA and its associated data requirements as FHIR resources. This process not only ensures alignment with clinical workflows but also enables automated data capture and operational implementation, addressing interoperability challenges and advancing the integration of EHRs into clinical research.

The aim of this work was to develop methods and tools for defining and operationally using the ‘activity’ part of the SoA to create standard FHIR-compliant activity libraries of the common data domains, procedures, and interventions for subsequent inclusion in study-specific implementations.

This work builds upon the foundational work of Richardson,¹ utilizing examples and illustrative figures derived from the SoA graph with six planned visits depicted in Figure 1.

Figure 1

Example study SoA directed graph. This SoA comprises six planned visits (blue, V1-V6) and one unscheduled visit (blue, U), with their associated activities (highlighted in yellow). Two types of operational nodes (colored green and red) delineate the initiation and completion of graph instantiation (IS: activity initiation, IF: activity completion) and the start and end of contiguous activities (AS: activity start, AF: activity end).

Methods

Graph-Based SoA Activity Definition

In line with the HL7 Vulcan SoA Project Implementation Guide,¹² the approach adopted here was to take the basic activity block of the SoA graph and modify, extend, or redesign it so that all the data implied or explicitly stated by the activity definition could be specified as a graph (Figure 2). An ‘activity’ in an SoA representation embodies two important concepts at the same time: first, the scheduling, sequence, or order of the activity, and second, an implied or explicit reference to what the activity actually is. Graph methods were used to develop and test different approaches to adding the specific activity details to these graphs without losing any of the scheduling and timing information already present. The main objective was to add the details of the required data (measurements, observations, procedures, etc.) to these graphs. Model development proceeded iteratively.

Figure 2

The minimal generic activity block used as a starting point for activity definition development. IS (green) instantiates the activity and is followed immediately by the timing ‘interaction’ (blue). In the clinic, this is equivalent to scheduling an appointment and the subject arriving. The activity can then be started (AS) or, if necessary, abandoned (IF). Similarly, if some reason prevents the ‘activity’ from proceeding, it can be skipped (passing immediately to AF); otherwise it goes ahead as planned.
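The block in Figure 2 can be written down directly as a small directed graph. The following is a minimal sketch, assuming the edge layout read from the Figure 2 caption; the node names T (timing interaction) and ACT (the activity itself) are illustrative labels, not identifiers taken from the study code.

```python
import networkx as nx

block = nx.DiGraph()
block.add_edges_from([
    ("IS", "T"),     # instantiation, then the timing interaction
    ("T", "AS"),     # subject arrives and the activity can start
    ("T", "IF"),     # activity abandoned
    ("AS", "ACT"),   # activity goes ahead as planned
    ("AS", "AF"),    # activity skipped
    ("ACT", "AF"),   # activity performed
    ("AF", "IF"),    # instantiation finished
])

# Every route through the block from instantiation to completion
routes = sorted(nx.all_simple_paths(block, source="IS", target="IF"))
for route in routes:
    print(" -> ".join(route))
# IS -> T -> AS -> ACT -> AF -> IF
# IS -> T -> AS -> AF -> IF
# IS -> T -> IF
```

The three routes correspond to the activity being performed, skipped, or abandoned.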

FHIR SoA Activity Definitions

The standard FHIR definitional resources were used to represent the graph activity definitions as FHIR resources. The primary resources used were ActivityDefinition, ObservationDefinition, SpecimenDefinition, and ConditionDefinition. Activity graphs, held as Python NetworkX graph objects, were transformed into JavaScript Object Notation (JSON) objects and tested for compliance with the FHIR definitional resource definitions and for information completeness.
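As an illustration of this transformation, the sketch below serializes one data node of an activity graph to a FHIR R4 ObservationDefinition and applies a simple completeness check. The node attribute names (fhir_type, system, code, display, data_type) and the check itself are assumptions made for this example. They are not the attribute set or the compliance tests used in the study, and the check does not replace validation against the FHIR specification.

```python
import json
import networkx as nx

graph = nx.DiGraph()
graph.add_node("ACT", label="Biochemistry")
graph.add_node(
    "OBS-BILI",                      # illustrative node id
    fhir_type="ObservationDefinition",
    system="http://loinc.org",
    code="14631-6",
    display="Bilirubin.total [Moles/volume] in Serum or Plasma",
    data_type="Quantity",
)
graph.add_edge("OBS-BILI", "ACT")    # data element -> activity


def to_observation_definition(node_id, attrs):
    """Build a FHIR R4 ObservationDefinition dict from node attributes."""
    return {
        "resourceType": "ObservationDefinition",
        "id": node_id,
        "code": {
            "coding": [{
                "system": attrs["system"],
                "code": attrs["code"],
                "display": attrs["display"],
            }]
        },
        "permittedDataType": [attrs["data_type"]],
    }


def missing_elements(resource, required=("resourceType", "id", "code")):
    """Return the names of required elements that are absent or empty."""
    return [name for name in required if not resource.get(name)]


resource = to_observation_definition("OBS-BILI", graph.nodes["OBS-BILI"])
print(json.dumps(resource, indent=2))
print("missing:", missing_elements(resource))   # missing: []
```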

Activity Library Development

Once a systematic graph methodology for defining any SoA activity had been established, methods for developing, managing, and revising these definitions were developed so that comprehensive FHIR-based activity definition libraries could be built. Custom Python tools were written to populate, update, and manage the resulting library collections from publicly defined data collections, such as the LOINC panels.
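How such tools might organize a library is sketched below. The class name DefinitionLibrary, its methods, and the (system, code) key are hypothetical and chosen only for illustration; the study's own tools are not described at this level of detail. The sketch keeps every revision of a definition so that earlier versions remain retrievable.

```python
class DefinitionLibrary:
    """In-memory library; each key keeps its full revision history."""

    def __init__(self):
        self._revisions = {}

    def put(self, key, resource):
        """Store a new revision and return its 1-based revision number."""
        history = self._revisions.setdefault(key, [])
        history.append(resource)
        return len(history)

    def get(self, key, revision=None):
        """Return the latest revision, or a specific one if requested."""
        history = self._revisions[key]
        return history[-1] if revision is None else history[revision - 1]

    def keys(self):
        """Return all stored keys in sorted order."""
        return sorted(self._revisions)


library = DefinitionLibrary()
key = ("http://loinc.org", "14631-6")
code = {"coding": [{"system": key[0], "code": key[1]}]}

# Populate, then revise the same definition
library.put(key, {"resourceType": "ObservationDefinition", "code": code})
latest = library.put(key, {
    "resourceType": "ObservationDefinition",
    "code": code,
    "preferredReportName": "Total bilirubin",
})
print(latest)                                    # 2
print(library.get(key)["preferredReportName"])   # Total bilirubin
```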

Synthetic Data Library Development

A collection of data generators was designed to operate with the activity library definitions and to generate, in real time, synthetic data aligned with the predefined SoA and activity specifications. For the standard FHIR value types, such as valueQuantity, valueCodeableConcept, valueString, valueDateTime, and valuePeriod, parameterized functions incorporating statistical distributions were coded to generate realistic synthetic data on demand.¹³
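A minimal sketch of such parameterized generators follows, using only the Python standard library. The function names, the choice of a truncated normal distribution, the example.org code system, and all numeric parameters are assumptions for illustration; the parameters are arbitrary and are not clinical reference ranges. Generators for valueString and valuePeriod would follow the same pattern and are omitted.

```python
import random
from datetime import datetime, timedelta, timezone


def value_quantity(rng, mean, sd, unit, low=0.0, precision=1):
    """Draw a normally distributed value, truncated below at `low`."""
    value = max(low, rng.gauss(mean, sd))
    return {"valueQuantity": {
        "value": round(value, precision),
        "unit": unit,
        "system": "http://unitsofmeasure.org",
        "code": unit,
    }}


def value_codeable_concept(rng, codings, weights):
    """Pick one coding according to the given weights."""
    coding = rng.choices(codings, weights=weights, k=1)[0]
    return {"valueCodeableConcept": {"coding": [coding]}}


def value_date_time(rng, start, max_days):
    """Pick a timestamp up to `max_days` after `start`."""
    moment = start + timedelta(days=rng.uniform(0, max_days))
    return {"valueDateTime": moment.isoformat(timespec="seconds")}


rng = random.Random(2025)            # fixed seed so that runs are repeatable
start = datetime(2025, 1, 17, 9, 0, tzinfo=timezone.utc)
codings = [                          # hypothetical codes, not a real code system
    {"system": "http://example.org/findings", "code": "normal"},
    {"system": "http://example.org/findings", "code": "abnormal"},
]

print(value_quantity(rng, mean=10.0, sd=4.0, unit="umol/L"))
print(value_codeable_concept(rng, codings, weights=[0.9, 0.1]))
print(value_date_time(rng, start, max_days=28))
```

Passing the random generator in as an argument keeps each function deterministic for a given seed, which is convenient when a synthetic data set has to be regenerated exactly.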

Tools and Software

Python scripts and notebooks¹⁴ were utilized for program development and maintenance. Visual Studio Code¹⁵ served as the primary code editor. The NetworkX Python library¹⁶ facilitated the creation and manipulation of SoAs and activity definitions represented as graphs. The Shiny for Python¹⁷ framework provided a user interface replicating the DDC pipeline. MELD Sandboxes were used to emulate EHR FHIR servers.¹⁸ The yEd Graph Editor (yWorks)¹⁹ was employed to visualize and manage the NetworkX-generated graphs. All FHIR resources were developed and tested to be compliant with FHIR version 4.0.1 (R4).²

Results

SoA Activity Definition

Using the template in Figure 2 as the starting point, various approaches to adding activity details to this model were developed and tested (not shown). Figure 3 shows an example of the final adopted model, which was found to satisfy all the clinical trial protocol requirements while remaining compliant with the HL7 Vulcan SoA Project Implementation Guide. Specifically, the required data elements were now present, and the added nodes were able to carry the full details of the data requirements. An important characteristic of these graphs is that the source-target direction for the added nodes is from the data element to the activity. Because no edge leads from the scheduling nodes to a data element, no scheduling, sequencing, or timing information is modified, and path analyses from IS (activity initiation) to IF (activity completion) always return the same routes through the graph as before the data elements were added.

Figure 3

Vital signs activity definition comprising five observations and one specimen. The left side of the figure details the descriptions of the observations and specimen, while the right side presents the corresponding codes from clinical terminologies, LOINC and SNOMED CT, ensuring full semantic alignment with the protocol’s data requirements.
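The effect of the edge direction can be checked with a short path analysis. The sketch below is illustrative only: the block layout is the one read from the Figure 2 caption, and the node names (T, ACT, OBS-1 to OBS-5, SPEC-1) are placeholders for the five observations and one specimen, not the identifiers used in Figure 3.

```python
import networkx as nx

block = nx.DiGraph([
    ("IS", "T"), ("T", "AS"), ("T", "IF"),
    ("AS", "ACT"), ("AS", "AF"), ("ACT", "AF"), ("AF", "IF"),
])
routes_before = sorted(nx.all_simple_paths(block, "IS", "IF"))

# Data elements point INTO the activity: data element -> activity
for data_node in ["OBS-1", "OBS-2", "OBS-3", "OBS-4", "OBS-5", "SPEC-1"]:
    block.add_edge(data_node, "ACT")

routes_after = sorted(nx.all_simple_paths(block, "IS", "IF"))
print(routes_before == routes_after)   # True

# Data elements have no incoming edges, so no route from IS can reach them
data_nodes = sorted(
    node for node in block.predecessors("ACT") if block.in_degree(node) == 0
)
print(data_nodes)
# ['OBS-1', 'OBS-2', 'OBS-3', 'OBS-4', 'OBS-5', 'SPEC-1']
```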

When data requirements were added, it was found necessary to add extra node attributes that reflected the type of object represented. This can be achieved in several ways, the simplest being to add a node type characteristic (eg, ‘Observation’) for the data type. In practice, this was found somewhat limiting for subsequent FHIR resource generation, and a set of parameterized FHIR-specific attributes was used instead to associate these nodes with the appropriate FHIR resources (not shown). This method also enabled implied data or practical requirements presented in footnotes of the SoA or in other sections of the protocol to be systematically modeled, as discussed by Richardson¹ in “the meaning of X”. Figure 3 shows a basic vital signs panel that, for study reasons or practical efficiency, also includes the collection of a blood sample as part of this ‘activity’. This requirement was fully specified in the example, including the type of blood draw to be taken (SNOMED CT: 119297000, blood specimen).

Modifying this graph to reflect study-specific requirements needed only simple create, read, update, and delete (CRUD) operations. New requirements could be created by adding more data elements (create), specific attributes such as codes could be modified or extended (update), and elements that were not required could be removed (delete). Thus, observations, specimens, conditions, etc. could be added to the graph as required by the study protocol’s SoA to fully implement “the meaning of X”. Simple graph editing tools such as yEd were used to modify and revise the library versions for study implementation.
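In NetworkX terms, each of these operations is a single call. The following sketch assumes a hypothetical attribute set (fhir_resource, system, code, display) and placeholder node names; the SNOMED CT codes are the blood and urine specimen codes quoted in this paper.

```python
import networkx as nx

activity = nx.DiGraph()
activity.add_node("ACT", label="Vital signs")

# Create: add a specimen requirement with FHIR-specific attributes
activity.add_node(
    "SPEC-1",
    fhir_resource="SpecimenDefinition",
    system="http://snomed.info/sct",
    code="119297000",
    display="Blood specimen",
)
activity.add_edge("SPEC-1", "ACT")     # data element -> activity

# Read: list the data requirements attached to the activity
requirements = {
    node: activity.nodes[node]["code"] for node in activity.predecessors("ACT")
}
print(requirements)                    # {'SPEC-1': '119297000'}

# Update: switch the requirement to a urine specimen
activity.nodes["SPEC-1"].update(code="122575003", display="Urine specimen")
updated = dict(activity.nodes["SPEC-1"])
print(updated["code"])                 # 122575003

# Delete: drop a requirement the study does not need
activity.remove_node("SPEC-1")
remaining = list(activity.nodes)
print(remaining)                       # ['ACT']
```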

Activity Libraries

The ability to define activities systematically as described offers little value if every activity has to be developed on a study-by-study basis. To support scalability through reuse, four libraries were developed: (a) an observation library, (b) a specimen library, (c) an activity library, and (d) a synthetic data library. These libraries served as the starting point for configuring study-specific activity definitions. Links to the corresponding ObservationDefinition and SpecimenDefinition FHIR resources were established by taking their unique identifiers from the observation and specimen libraries and adding them as attributes to the respective observation or specimen nodes within the activity graph (discussed earlier).
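One way these links could surface in the generated FHIR is sketched below. It is an assumption of this example, not a statement of the study's encoding, that observation links are written to ActivityDefinition.observationResultRequirement and specimen links to ActivityDefinition.specimenRequirement (both are standard R4 elements). The attribute names and the identifiers obsdef-0001 and specdef-0001 are hypothetical.

```python
import json
import networkx as nx

activity = nx.DiGraph()
activity.add_node("ACT", title="Vital signs")
activity.add_node("OBS-1", fhir_resource="ObservationDefinition",
                  library_id="obsdef-0001")
activity.add_node("SPEC-1", fhir_resource="SpecimenDefinition",
                  library_id="specdef-0001")
# "AS" is a scheduling node and carries no FHIR attributes
activity.add_edges_from([("AS", "ACT"), ("OBS-1", "ACT"), ("SPEC-1", "ACT")])

ELEMENT_FOR = {
    "ObservationDefinition": "observationResultRequirement",
    "SpecimenDefinition": "specimenRequirement",
}


def to_activity_definition(graph, activity_node):
    """Collect library references from the data nodes feeding an activity."""
    resource = {
        "resourceType": "ActivityDefinition",
        "status": "draft",
        "title": graph.nodes[activity_node]["title"],
    }
    for node in sorted(graph.predecessors(activity_node)):
        attrs = graph.nodes[node]
        element = ELEMENT_FOR.get(attrs.get("fhir_resource"))
        if element is None:
            continue                   # scheduling node, not a data requirement
        reference = f'{attrs["fhir_resource"]}/{attrs["library_id"]}'
        resource.setdefault(element, []).append({"reference": reference})
    return resource


activity_definition = to_activity_definition(activity, "ACT")
print(json.dumps(activity_definition, indent=2))
```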

Curated and validated LOINC panels and groups were used as the primary source for these libraries. LOINC panel definitions were retrieved from the LOINC FHIR servers as JSON objects. These were converted to SoA activity graphs and loaded into the appropriate library. Similarly, SNOMED CT specimen definitions were used to populate the specimen library.
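The conversion step can be sketched as follows, assuming that a panel arrives as a FHIR Questionnaire-shaped JSON object. The panel below is a hand-written, abbreviated stand-in rather than a server response, and the function names and node attributes are illustrative. The walk is recursive because Questionnaire items can be nested.

```python
import networkx as nx

# Hand-written stand-in for a panel delivered as a FHIR Questionnaire
panel = {
    "resourceType": "Questionnaire",
    "status": "active",
    "title": "Example vital signs panel",
    "item": [
        {"linkId": "1", "type": "decimal", "text": "Heart rate",
         "code": [{"system": "http://loinc.org", "code": "8867-4"}]},
        {"linkId": "2", "type": "group", "text": "Blood pressure",
         "item": [
             {"linkId": "2.1", "type": "decimal",
              "text": "Systolic blood pressure",
              "code": [{"system": "http://loinc.org", "code": "8480-6"}]},
             {"linkId": "2.2", "type": "decimal",
              "text": "Diastolic blood pressure",
              "code": [{"system": "http://loinc.org", "code": "8462-4"}]},
         ]},
    ],
}


def loinc_items(items):
    """Yield (code, text) for every LOINC-coded item, including nested ones."""
    for item in items:
        for coding in item.get("code", []):
            if coding.get("system") == "http://loinc.org":
                yield coding["code"], item.get("text", "")
        yield from loinc_items(item.get("item", []))


def panel_to_activity_graph(questionnaire):
    """Turn a panel into an activity node fed by one node per observation."""
    graph = nx.DiGraph()
    graph.add_node("ACT", label=questionnaire.get("title", ""))
    for code, text in loinc_items(questionnaire.get("item", [])):
        graph.add_node(code, fhir_resource="ObservationDefinition",
                       system="http://loinc.org", code=code, display=text)
        graph.add_edge(code, "ACT")    # data element -> activity
    return graph


panel_graph = panel_to_activity_graph(panel)
print(sorted(panel_graph.predecessors("ACT")))
# ['8462-4', '8480-6', '8867-4']
```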

The resulting observation library comprised approximately 3000 of the most frequently used LOINC terms.²⁰ This library included ‘ObservationDefinition’ and ‘Observation’ FHIR resources for various LOINC and FHIR versions. The ‘Observation’ FHIR resource included placeholders for patient and encounter references as well as for the actual observation results, to support synthetic DDC data generation for validation purposes.

Similarly, the resulting specimen library contained the blood specimen (SNOMED CT: 119297000), urine specimen (SNOMED CT: 122575003), and all their descendant terms. This library housed ‘SpecimenDefinition’ and ‘Specimen’ FHIR resources across multiple SNOMED CT and FHIR versions. The ‘Specimen’ FHIR resource contained placeholders for the patient reference and specimen collection dates.
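The placeholder idea can be illustrated with two small templates. The {placeholder} string syntax, the template names, and the fill function are assumptions of this sketch, not the mechanism used in the libraries; the LOINC and SNOMED CT codes are ones quoted in this paper.

```python
OBSERVATION_TEMPLATE = {
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "14631-6"}]},
    "subject": {"reference": "Patient/{patient_id}"},
    "encounter": {"reference": "Encounter/{encounter_id}"},
    "effectiveDateTime": "{effective}",
    "valueQuantity": {"value": None, "unit": "umol/L",
                      "system": "http://unitsofmeasure.org", "code": "umol/L"},
}

SPECIMEN_TEMPLATE = {
    "resourceType": "Specimen",
    "type": {"coding": [{"system": "http://snomed.info/sct",
                         "code": "119297000"}]},
    "subject": {"reference": "Patient/{patient_id}"},
    "collection": {"collectedDateTime": "{collected}"},
}


def fill(template, **values):
    """Return a copy of the template with every {placeholder} string filled."""
    def walk(node):
        if isinstance(node, dict):
            return {key: walk(child) for key, child in node.items()}
        if isinstance(node, list):
            return [walk(child) for child in node]
        if isinstance(node, str):
            return node.format(**values)
        return node
    return walk(template)


observation = fill(OBSERVATION_TEMPLATE, patient_id="p-001",
                   encounter_id="e-001", effective="2025-01-17T09:30:00Z")
observation["valueQuantity"]["value"] = 12.4   # illustrative result value
specimen = fill(SPECIMEN_TEMPLATE, patient_id="p-001",
                collected="2025-01-17T09:15:00Z")

print(observation["subject"]["reference"])     # Patient/p-001
print(specimen["collection"]["collectedDateTime"])
```

Because fill rebuilds every nested dictionary, the library templates themselves are left unchanged and can be reused for the next subject.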

Figure 4 illustrates how this approach could be used to ensure that clinical sites are aligned with the protocol’s data requirements while maintaining semantic integrity. It shows the bilirubin LOINC group LG2811-0, which provides codes for the measurement of bilirubin in different types of blood samples. The corresponding library ‘ObservationDefinition’, to be referenced by a “Biochemistry” activity in the SoA, can now carry all, some, or none of the allowed LOINC bilirubin codes to ensure semantic equivalence between the values recorded at the clinical research sites and the protocol’s data requirements. This is particularly valuable for ensuring consistency between clinical sites that use different codes for the same parameter.

Figure 4

Bilirubin LOINC group. This defines LOINC codes that measure the substance concentration (SCnc) of bilirubin at a single point in time (Pt) (Bilirubin.total [Moles/volume]) across various blood-derived systems: Serum or Plasma (14631-6); Blood (54363-7); Serum, Plasma or Blood (77137-8); Venous blood (89871-8); Arterial blood (89872-6); and Capillary blood (97770-2). The ‘activity’ can be customized to meet study requirements by deleting or retaining each method. In particular, variations in the methods used by the study sites that would not compromise study requirements can be accommodated by making several options available for sites to select from.
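The retain-or-delete customization and the resulting equivalence check can be sketched as follows. Listing the retained codes as alternative codings in ObservationDefinition.code is one possible encoding, assumed here for illustration; the function names are hypothetical, and the codes are those listed for Figure 4.

```python
BILIRUBIN_GROUP = [   # members of LOINC group LG2811-0 as listed in Figure 4
    "14631-6",        # Serum or Plasma
    "54363-7",        # Blood
    "77137-8",        # Serum, Plasma or Blood
    "89871-8",        # Venous blood
    "89872-6",        # Arterial blood
    "97770-2",        # Capillary blood
]


def observation_definition(retained):
    """ObservationDefinition carrying only the retained group members."""
    unknown = set(retained) - set(BILIRUBIN_GROUP)
    if unknown:
        raise ValueError(f"not in group: {sorted(unknown)}")
    return {
        "resourceType": "ObservationDefinition",
        "code": {
            "coding": [{"system": "http://loinc.org", "code": code}
                       for code in retained],
            "text": "Bilirubin.total [Moles/volume]",
        },
        "permittedDataType": ["Quantity"],
    }


def is_acceptable(observation, definition):
    """True if any coding on the site Observation matches an allowed coding."""
    allowed = {(c["system"], c["code"]) for c in definition["code"]["coding"]}
    reported = {(c["system"], c["code"]) for c in observation["code"]["coding"]}
    return bool(allowed & reported)


def site_observation(code):
    """Minimal Observation as a site might report it (value omitted)."""
    return {"resourceType": "Observation", "status": "final",
            "code": {"coding": [{"system": "http://loinc.org", "code": code}]}}


# The study retains the serum/plasma and venous blood measurements only
definition = observation_definition(["14631-6", "89871-8"])
print(is_acceptable(site_observation("14631-6"), definition))   # True
print(is_acceptable(site_observation("97770-2"), definition))   # False
```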

Synthetic Activity Data Generation

These libraries were also used to generate data emulating the data expected to be found in an EHR. The FHIR definitional resources (eg, ‘ObservationDefinition’) defined ‘what is required’, and their associated instantiated resources (eg, ‘Observation’) held synthetic but realistic data. In this way, various protocol visit workflows could be accurately created, including scenarios such as protocol completion, screening failure, and early withdrawal, with or without unscheduled visits. Starting from the FHIR ‘ResearchStudy’ and ‘ResearchSubject’ resources, the method associated a Patient and then used the SoA graphs to generate the subject’s Encounters, Observations, Specimens, Medications, etc. with realistic data, in full compliance with the defined route through the protocol and the expected activities.
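A simplified sketch of scenario-driven generation is given below. A plain list of visits stands in for the SoA graph, and the visit at which each scenario ends, the mapping of scenarios to ResearchSubject status codes, and all identifiers are assumptions made for illustration only.

```python
PLANNED_VISITS = ["V1", "V2", "V3", "V4", "V5", "V6"]

# Assumed for illustration: last visit attended in each scenario
LAST_VISIT = {"completion": "V6", "screening_failure": "V1",
              "early_withdrawal": "V3"}

# Assumed mapping onto FHIR R4 ResearchSubject.status codes
STATUS_FOR = {"completion": "off-study", "screening_failure": "ineligible",
              "early_withdrawal": "withdrawn"}


def visit_route(scenario, unscheduled_after=None):
    """Return the ordered visits a synthetic subject attends."""
    last = PLANNED_VISITS.index(LAST_VISIT[scenario])
    route = PLANNED_VISITS[: last + 1]
    if unscheduled_after in route:
        route.insert(route.index(unscheduled_after) + 1, "U")
    return route


def research_subject(subject_id, study_id, scenario):
    """Link a synthetic Patient to the study with a scenario-specific status."""
    return {
        "resourceType": "ResearchSubject",
        "status": STATUS_FOR[scenario],
        "study": {"reference": f"ResearchStudy/{study_id}"},
        "individual": {"reference": f"Patient/{subject_id}"},
    }


def encounters(subject_id, route):
    """Create one Encounter per attended visit, in route order."""
    return [
        {
            "resourceType": "Encounter",
            "id": f"{subject_id}-{visit}-{position}",
            "status": "finished",
            "class": {
                "system": "http://terminology.hl7.org/CodeSystem/v3-ActCode",
                "code": "AMB",
            },
            "subject": {"reference": f"Patient/{subject_id}"},
        }
        for position, visit in enumerate(route, start=1)
    ]


route = visit_route("early_withdrawal", unscheduled_after="V2")
print(route)                                   # ['V1', 'V2', 'U', 'V3']
print(research_subject("p-001", "study-1", "early_withdrawal")["status"])
print([e["id"] for e in encounters("p-001", route)])
```

Observations and Specimens for each Encounter would then be generated from the activities attached to that visit.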

Once loaded to a FHIR server, these resources would be available to clinical trial sponsors to support, for example, data pipeline testing and validation, reviews of expected data volumes, or tests of operational procedures. When they were loaded to FHIR servers already populated with synthetic clinical datasets such as MELD/Synthea, it was possible to test how to respond to important events, such as recognizing prescribed medication in order to operationally optimize the ‘ConMeds’ review activity in the SoA.
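For loading, generated resources can be wrapped in a FHIR transaction Bundle. The sketch below shows only the wrapping; the function name is illustrative, and the HTTP call that posts the Bundle to a server's base URL is omitted because the endpoint and authentication depend on the sandbox in use.

```python
import json
import uuid


def transaction_bundle(resources):
    """Wrap resources in a FHIR transaction Bundle (one POST per resource)."""
    return {
        "resourceType": "Bundle",
        "type": "transaction",
        "entry": [
            {
                "fullUrl": f"urn:uuid:{uuid.uuid4()}",
                "resource": resource,
                "request": {"method": "POST", "url": resource["resourceType"]},
            }
            for resource in resources
        ],
    }


bundle = transaction_bundle([
    {"resourceType": "Patient"},
    {"resourceType": "Observation", "status": "final",
     "code": {"coding": [{"system": "http://loinc.org", "code": "14631-6"}]}},
])
print(json.dumps(bundle, indent=2))
```

In a full pipeline, references between resources in the same Bundle (for example, from an Observation to its Patient) would use the fullUrl values so that the server can resolve them when it assigns identifiers.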

Discussion

The inherent simplicity of directed graphs, in which vertices (source and target) are connected by edges, makes them an effective tool for representing and managing the complexity of a clinical trial schedule. Graphs facilitate the precise definition of activities within the schedule, aligning these activities with the data requirements outlined in the clinical trial protocol. We demonstrated that specifying activity definitions in graph form, with their representation as FHIR resources confirmed, offers a powerful approach to using FHIR as the interoperability standard for communicating clinical research requirements by machine methods.

The methods described here have proved robust and reliable, particularly when combined with already well-defined data domains, such as those defined by LOINC and SNOMED CT. Since these coding systems are widely adopted in clinical practice, they offer an alternative to, for example, Clinical Data Interchange Standards Consortium (CDISC) Controlled Terminology²¹ for specifying study requirements. The methods have also proved simple to revise and manipulate, enabling study teams whose focus is study data accuracy to exploit the approach with minimal training.

While CDISC Controlled Terminology (CDISC CT) provides a streamlined vocabulary for data submission, working with broader terminologies like LOINC and SNOMED CT introduces both opportunities and challenges. Their granularity supports rich clinical detail but can complicate mapping to the coarser terms in CDISC CT. Mapping from LOINC or SNOMED CT to higher-level CDISC CT concepts is often necessary for semantic clarity. CDISC CT is best viewed as a subset of the broader biomedical space covered by LOINC and SNOMED CT, which are increasingly aligned through collaboration (LOINC focusing on observations and SNOMED CT on procedures) to support both clinical and regulatory uses.

Leveraging libraries of predefined observations, specimens, and activities parallels the use of eCRF libraries in electronic data capture (EDC) systems and similarly enhances consistency and scalability across clinical trials and research sites. Once these resources are established, they can be reused across multiple studies, facilitating standardization and minimizing redundancy in trial design and execution. Additionally, incorporating the capability to generate synthetic data that mirrors anticipated study data under varying scenarios allows for systematic and accurate testing of the data pipeline from clinical research sites to trial sponsors.

Several key considerations and potential limitations accompany this approach. First, adoption requires technical familiarity with HL7 FHIR and access to a working FHIR server. Second, transforming protocol-specific data into semantically accurate FHIR resources calls for custom conversion tools tailored to each study. Third, while LOINC panels offer valuable standard definitions, not all are readily available in FHIR-compatible formats, and FHIR Questionnaire encoding can be inconsistent. Fourth, aligning activity libraries with sponsor expectations—especially regarding CDISC terminology and submission formats—requires close collaboration to ensure system compatibility. Finally, capturing nuanced protocol requirements demands expertise in both technical implementation and domain knowledge, particularly in semantic modeling and graph-based design. These factors should be considered by teams aiming to replicate or expand this methodology.

A significant development supporting this approach is the US Food and Drug Administration’s (FDA’s) recent public docket (FDA-2025-N-0287), which invites feedback on the use of HL7 FHIR for submitting study data derived from real-world data sources.²² This initiative signals strong regulatory interest in evaluating FHIR as a viable format for end-to-end data submission—from data capture at the clinical site through to regulatory reporting. The framework proposed in this paper directly supports such a pipeline. Libraries of FHIR ActivityDefinition and related resources not only ensure semantic interoperability and compliance with protocol-defined requirements but also enhance traceability and operational alignment. By enabling a direct link between protocol specifications, EHR-based data capture, and structured FHIR-based submissions, this work aligns with the FDA’s exploration of more modern, interoperable pathways for regulatory data exchange.

Conclusion

The goal of this work was to be able to (a) build unambiguous study ‘activity’ specifications as defined in protocols, using (b) the health care interoperability standard (FHIR) as used by EHRs, to (c) ensure accurate common understanding of the requirements and avoid misinterpretation.

This work developed a streamlined methodology for defining and manipulating SoA activity definitions. By leveraging existing LOINC panels and adapting them to address the specific data collection needs of a clinical trial protocol, it showed how commonly used clinical terminologies, such as LOINC and SNOMED CT, can be employed to achieve semantic interoperability while simultaneously ensuring accurate study data specification.

Furthermore, the work has highlighted how LOINC groups can be used to tailor activity definitions to align with local procedures and practices at clinical research sites. This approach shows how operational differences across sites can be accommodated while ensuring semantic consistency and interoperability, without compromising the quality or reliability of the data.

Developing libraries of activities, observations, and specimens in this form and making them available supports scalability and reuse across clinical trials, at the level of both the clinical trial sponsor and the clinical research site. It offers a consistent yet specific approach to establishing and testing direct data capture pipelines and to addressing similar study implementation and logistics issues.

Leveraging FHIR and health care coding systems, rather than regulatory submission codes, ensures semantic accuracy and highlights equivalence (or lack thereof) between sponsor requirements and site practices. This approach gives sponsors confidence that protocol elements such as footnotes, appendices, or embedded instructions can be precisely represented. Predefined libraries in FHIR format also help clinical data managers ensure that all required clinical concepts are correctly and consistently interpreted.

Establishing a skillset within a pharmaceutical company to translate trial protocols into complete and semantically correct FHIR resources requires a strategic investment but has potentially significant returns. The major return is the ability to incorporate trial protocol specifications directly into the EHRs at clinical research sites. This integration will support better compliance with protocol requirements by aligning clinical workflows with study expectations, reducing deviations, and ensuring that investigators have clear, structured guidance at the point of care. It also has the potential to reduce the operational burden on site staff by embedding relevant activities into their existing systems and routines. Separately, a well-implemented FHIR-based representation of the protocol is essential for enabling DDC from EHR to EDC systems. This interoperability eliminates the need for redundant data transcription, reduces the risk of errors, and accelerates the delivery of high-quality data.

Acknowledgements

The examples used here were taken from Richardson.¹

Competing Interests

Patrick Genyn is a Director of InteropWorks LLC, a consultancy company providing clinical data management, standards, and related services.

Andrew Richardson is a Director of Zenetar Ltd., a consultancy company that provides clinical operations and related services and is an active member of the HL7 Vulcan Schedule of Activities project and other HL7 initiatives.

Author Approval

The authors have read and approved this work.

References

1. Richardson A. Representing Clinical Study Schedule of Activities as FHIR Resources: Required Characteristic Attributes. Journal of the Society for Clinical Data Management. 2024; 5(2). DOI: http://doi.org/10.47912/jscdm.266

2. HL7 FHIR. FHIR v4.0.1. Accessed January 17, 2025. https://hl7.org/fhir/R4/

3. Logical Observation Identifiers Names and Codes (LOINC). Accessed January 17, 2025. https://loinc.org

4. SNOMED International. Accessed January 17, 2025. https://www.snomed.org

5. CDISC eCRF Portal. Accessed January 17, 2025. https://www.cdisc.org/kb/ecrf

6. From electronic health records to electronic data capture systems (EHR2EDC). Accessed January 17, 2025. https://eithealth.eu/product-service/ehr2edc/

7. National Center for Health Statistics. Accessed January 17, 2025. https://www.cdc.gov/nchs/icd/icd-10/index.html

8. Digital Imaging and Communications in Medicine (DICOM). Accessed January 17, 2025. https://www.dicomstandard.org

9. Leroux H, Metke-Jimenez A, Lawley MJ. Towards achieving semantic interoperability of clinical study data with FHIR. Journal of Biomedical Semantics. 2017; 8(1):41. Accessed June 30, 2021. https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-017-0148-7

10. Genyn P. The role of FHIR Resources in ensuring Semantic Equivalence in EHR2EDC direct data capture. Accessed January 17, 2025. https://phuse.s3.eu-central-1.amazonaws.com/Archive/2023/Connect/EU/Birmingham/PAP_RE08.pdf

11. Richardson A, Low G, Ward M. Study Designs using FHIR-Schedule of Activities exchange using FHIR Resources PP10. PHUSE Research on FHIR Working Group. Presentation RW05, PHUSE EU Connect 2021. Accessed January 17, 2025. https://phuse.s3.eu-central-1.amazonaws.com/Archive/2021/Connect/EU/Virtual/PRE_RW05.pdf

12. Vulcan Clinical Study Schedule of Activities Implementation Guide. Accessed January 17, 2025. https://hl7.org/fhir/uv/vulcan-schedule/

13. Genyn P, Richardson A. The role of FHIR Resources in testing and validating an EHR to Sponsor data pipeline. 2023. Accessed January 17, 2025. https://phuse.s3.eu-central-1.amazonaws.com/Archive/2024/Connect/US/Bethesda/PRE_RE01.pdf

14. Python Software Foundation. Python. Accessed January 17, 2025. https://www.python.org/

15. Visual Studio Code editor. Accessed January 17, 2025. https://code.visualstudio.com

16. NetworkX. NetworkX — NetworkX documentation. Accessed January 02, 2025. https://networkx.org/

17. Shiny for Python. Accessed January 17, 2025. https://shiny.posit.co/py/

18. MELD Sandbox. Provided by InterOp.Community (https://interop.community/). Accessed January 17, 2025. https://meld.interop.community/

19. yWorks. yEd – Graph Editor. Accessed January 17, 2025. https://www.yworks.com/products/yed

20. Logical Observation Identifiers Names and Codes (LOINC). Most frequently used LOINC codes. Accessed January 17, 2025. https://loinc.org/usage/obs/

21. CDISC Controlled Terminology. Accessed January 17, 2025. https://www.cdisc.org/standards/terminology/controlled-terminology

22. US Food and Drug Administration. Exploration of Health Level Seven Fast Healthcare Interoperability Resources for Use in Study Data Created From Real-World Data Sources for Submission to the Food and Drug Administration. Accessed June 17, 2025. https://www.regulations.gov/docket/FDA-2025-N-0287