DailyMed NDC to Label Image Mart #326

jrlegrand · 2024-10-24T15:07:03Z

Resolves #309
Resolves #322

Explanation

I took the approach of using FTP to download all desired DailyMed SPL zip files. You can specify in the DAG which files to pull: all human rx, just 1 of the 5 human rx files, OTC, etc. By default it pulls all human rx.
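
As a rough illustration of the download step, here is a minimal sketch using Python's standard ftplib. The host, directory, and file naming pattern are assumptions and are not taken from this PR; only the count of 5 human rx files comes from the description above. The helper names are hypothetical.

```python
from ftplib import FTP
from pathlib import Path

# Assumed values, not confirmed by this PR: host, directory, file naming.
FTP_HOST = "public.nlm.nih.gov"
FTP_DIR = "nlmdata/.dailymed"
RX_PARTS = 5  # the description mentions 5 human rx files


def rx_archive_names(part=None):
    """Return the outer zip names for all human rx parts, or just one part."""
    parts = [part] if part is not None else range(1, RX_PARTS + 1)
    return [f"dm_spl_release_human_rx_part{p}.zip" for p in parts]


def download(names, dest):
    """Download each named archive into dest over anonymous FTP."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with FTP(FTP_HOST) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(FTP_DIR)
        for name in names:
            with open(dest / name, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
```

Calling `download(rx_archive_names(), "data/dailymed")` would correspond to the default "all human rx" behavior, and `rx_archive_names(part=1)` to pulling a single file.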

Extract:
The DAG unzips all outer zip files into one folder, leaving a folder (e.g. data/dailymed/prescription) full of thousands of per-SPL zip files.
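
A minimal sketch of this extract step, assuming the outer archives hold the per-SPL zips under some folder prefix. The function name and the flattening behavior are illustrative, not the DAG's actual code.

```python
import zipfile
from pathlib import Path


def extract_outer_zips(download_dir, target_dir):
    """Unpack every outer archive, placing all inner zips in target_dir."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    count = 0
    for outer in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(outer) as zf:
            for info in zf.infolist():
                # Skip directories and anything that is not an inner zip.
                if info.is_dir() or not info.filename.endswith(".zip"):
                    continue
                # Drop any folder prefix so all inner zips land in one folder.
                out_path = target / Path(info.filename).name
                out_path.write_bytes(zf.read(info))
                count += 1
    return count
```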

Load:
The DAG peeks inside each zip file and unzips just the XML document to the folder. It then parses the XML document using a custom XSLT template (dags/dailymed/template.xsl), deletes that XML file, and moves on to the next zip file. The XML produced by the template is stored in a list of dicts; when all zip files are processed, the list is converted to a Pandas dataframe and loaded into Postgres. Any changes to the initial XML parsing need to be made in the template.xsl file for now. A future optimization would be to modularize this a bit so that different parts might live in different XSL files.
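
The load loop can be sketched roughly as below. This is a simplified stand-in, not the DAG's code: it reads the XML member in memory instead of unzipping it to disk and deleting it, the `transform` argument stands in for applying the XSLT template (for instance with lxml's `etree.XSLT`), and the dict keys `spl` and `xml_content` are hypothetical column names.

```python
import zipfile
from pathlib import Path


def load_spl_rows(zip_dir, transform):
    """Build one dict per SPL zip by transforming only its XML member."""
    rows = []
    for spl_zip in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(spl_zip) as zf:
            xml_names = [n for n in zf.namelist() if n.lower().endswith(".xml")]
            if not xml_names:
                continue  # nothing to parse in this archive
            # Read only the XML; image files stay inside the zip.
            raw_xml = zf.read(xml_names[0])
        rows.append({"spl": spl_zip.stem, "xml_content": transform(raw_xml)})
    return rows
```

The resulting list of dicts is the shape that can be passed to `pandas.DataFrame(rows)` before loading into Postgres.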

NOTE: see example XML template output at the bottom of this PR.

Transform:
This is probably the most unusual part, but the part I like the best. Instead of using Python or other methods to transform the resulting smaller XML document, I use dbt data models (using PostgreSQL XML functions) to transform the XML in the data lake in a stepwise manner using staging and intermediate tables that can be checked along the way for troubleshooting purposes.

Rough transform workflow:

stg_dailymed__ndcs - get all valid NDCs for SPL (at the SPL level, not at the package label section level) - used for validation of parsed (RegEx'ed) NDCs
stg_dailymed__package_label_sections - get each package label section for each SPL
stg_dailymed__package_label_section_images - get each image from MediaList/Media
stg_dailymed__package_label_section_ndcs - use RegEx to parse all potential NDCs from the Text of the section
int_dailymed_validated_package_label_ndcs - compare parsed NDCs against valid NDCs for SPL (to filter out noise). Maintain the order the NDCs appear in the text using ranking
int_dailymed_ranked_package_label_ndcs - re-rank the filtered NDCs (i.e. if 3 NDCs were found but only 1 and 3 were valid, then the new rank would be 1, 2 instead of 1, 3).
int_dailymed_ranked_package_label_images - rank the order in which the image files appear in the package label section.
int_dailymed_image_xml_ndcs - Map validated NDCs to images based on order they appear. The assumption is that NDCs and images appear in the same order and we can map them together as such.
int_dailymed_image_name_ndcs - using the image names found in the package label sections, try to RegEx NDCs out of the name and validate the match against valid NDCs for the SPL, converting both to NDC11 first to ensure no formatting issues. Store the NDC11 version of the matched NDC from the image file name.
MART: ndcs_to_label_images - pull everything together. Basically just union together the last two intermediate models (matches from XML and matches from image file names) and then concatenate stuff together to get links to images and DailyMed SPL pages.
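
The parse, validate, re-rank, and map steps above are implemented as dbt models in this PR. The Python below is only a sketch of the same logic for illustration. The regular expression is a hypothetical pattern and not the one used in the models, and dropping repeated NDCs is an assumption.

```python
import re

# Hypothetical pattern: three hyphen-separated digit groups, e.g. 71205-215-30.
NDC_PATTERN = re.compile(r"\b\d{4,5}-\d{3,4}-\d{1,2}\b")


def map_ndcs_to_images(section_text, image_names, valid_ndcs):
    """Pair validated NDCs with images by their order of appearance."""
    # 1. Parse candidates in text order (assumption: keep first occurrence only).
    candidates = list(dict.fromkeys(NDC_PATTERN.findall(section_text)))
    # 2. Keep only NDCs that the SPL itself lists, which filters out noise.
    validated = [ndc for ndc in candidates if ndc in valid_ndcs]
    # 3. Re-rank: positions in the filtered list become ranks 1..n.
    ranked_ndcs = dict(enumerate(validated, start=1))
    ranked_images = dict(enumerate(image_names, start=1))
    # 4. Join on rank; ranks without a partner on the other side are dropped.
    return [(ranked_ndcs[r], ranked_images[r]) for r in ranked_ndcs if r in ranked_images]
```

For example, if the text yields three candidates and only the first and third are valid, they are re-ranked to 1 and 2 and paired with the first and second images in the section.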

Rationale

DailyMed SPLs have label images for many drug products, but they are not at the NDC level - they are at the SPL level. To get to NDC-level images, you need to do something along the lines of what we've done here.

NDC-level images are useful for drug purchasing or basic drug information about what an NDC looks like.

Tests

Ran DAG to completion and built marts using dbt run --select ndcs_to_label_images

This produced around 57k NDC -> label image matches at the time of writing this PR

I had run this DAG several times before and each time, I compared outputs manually to try to validate that I wasn't breaking anything that worked before and was actually adding new matches. I feel like I'm at a stable point where enough is working well that this should finally be merged to main.

Future Enhancements

NOTE: every time this DAG is run, we currently need to manually DROP/CASCADE from the sagerx_lake.dailymed table to avoid duplication. This needs to be addressed.

ALSO NOTE: I think if we expand from just all human rx to all human rx and OTC, something weird happens with the folders during extract or load. If we expand to human and OTC this needs to be fixed.

General optimizations:

OCR images and/or programmatically scan barcodes of images that couldn't be matched using parsing-based methods (see #328, "OCR / Barcode Scan to pull NDC from DailyMed Image Label")
Go after NDC9s (see #321, "NDC9 issues related to DailyMed label images")
Go after misc formatting issues (see #323, "Misc NDC formatting issues related to DailyMed label images")

Example XML template output

NOTE: the important parts for this work are everything inside <PackageLabels />.

<MediaList /> contains a list of all images found directly within or referenced from the package label section. We try to associate these images with any NDCs parsed out of the text of the section, and we also try to parse NDCs directly from the image name (e.g. sometimes images are named "12345-456-2.jpg").
<Text /> is the raw text of the section that we parse for NDCs using RegEx in a dbt data model
<ID /> is the ID of the section
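
A minimal sketch of the image-name matching, assuming hyphenated NDCs with three segments. NDC11 is formed here by zero-padding the segments to a 5-4-2 layout. The pattern and function names are hypothetical, and the real matching is done in a dbt model.

```python
import re

# Hypothetical pattern for a hyphenated NDC embedded in a file name.
NAME_PATTERN = re.compile(r"(\d{4,5})-(\d{3,4})-(\d{1,2})")


def to_ndc11(labeler, product, package):
    """Zero-pad the three NDC segments to the 5-4-2 layout and join them."""
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


def ndc11_from_image_name(image_name, valid_ndcs):
    """Return the NDC11 found in an image file name if the SPL lists it."""
    match = NAME_PATTERN.search(image_name)
    if match is None:
        return None
    candidate = to_ndc11(*match.groups())
    # Assumes every valid NDC is hyphenated into exactly three segments.
    valid11 = {to_ndc11(*ndc.split("-")) for ndc in valid_ndcs}
    return candidate if candidate in valid11 else None
```

Comparing both sides as NDC11 means a file named with one hyphenation can still match an NDC that the SPL lists with different zero-padding.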

There can be multiple package label sections. In this example, there is only one.

<NDCList /> is also relevant to this work. It contains the list of NDCs represented by the SPL overall. It is used to validate any NDCs we parse out of the text of the package label section.

<dailymed>
  <documentId>057302f7-9a50-42f4-8f96-ce23f409bc4c</documentId>
  <SetId>8aa48212-19b0-4304-9c61-dcf4db2b19ea</SetId>
  <VersionNumber>5</VersionNumber>
  <EffectiveDate>20220601</EffectiveDate>
  <MarketStatus>ANDA</MarketStatus>
  <ApplicationNumber>ANDA091240</ApplicationNumber>
  <PackageLabels>
    <PackageLabel>
      <MediaList>
        <Media>
          <ID>EB654092-98B7-485A-927B-56AB5CA4B2B8</ID>
          <Image>8c496175-figure-02.jpg</Image>
        </Media>
      </MediaList>
      <ID>2bfd3ca6-c29b-4bc6-96c1-6017e64b59d2</ID>
      <Text>
               
               
               PACKAGE LABEL.PRINCIPAL DISPLAY PANEL 
               
                  
               
               
               
                  
                     71205-215-30
                     
                        
                     
                  
               
            </Text>
    </PackageLabel>
  </PackageLabels>
  <NDCList>
    <NDC>71205-215-20</NDC>
    <NDC>71205-215-30</NDC>
    <NDC>71205-215-60</NDC>
    <NDC>71205-215-90</NDC>
  </NDCList>
  <InteractionText/>
  <Organizations>
    <establishment>
      <DUN>079196022</DUN>
      <name>Proficient Rx LP</name>
      <type>Repacker</type>
      <source_list>
        <source>31722-542</source>
      </source_list>
    </establishment>
    <establishment>
      <DUN>079196022</DUN>
      <name>Proficient Rx LP</name>
      <type>Functioner</type>
      <function>
        <name>REPACK</name>
        <item_list>
          <item>71205-215</item>
        </item_list>
      </function>
      <function>
        <name>RELABEL</name>
        <item_list>
          <item>71205-215</item>
        </item_list>
      </function>
    </establishment>
    <OrganizationsText>
                  Indomethacin Capsules, USP are available containing either 25 mg of Indomethacin, USP.
                  The 25 mg capsules are size &#8216;3&#8217; hard gelatin capsules, with opaque light green cap imprinted with &#8216;H&#8217; and opaque light green body imprinted with &#8216;103&#8217;, containing white to off-white powder.
                  Bottles of 20 capsules NDC 71205-215-20
                  Bottles of 30 capsules NDC 71205-215-30
                  Bottles of 60 capsules NDC 71205-215-60
                  Bottles of 90 capsules NDC 71205-215-90
                  
                     Store at 20&#176; to 25&#176;C (68&#176; to 77&#176;F) [see USP Controlled Room Temperature]. 
                  
                  
                     Protect from light. 
                  
                  Dispense in a tight, light-resistant container as defined in the USP using a child-resistant closure.
                  
                     PHARMACIST: Dispense a Medication Guide with each prescription.
                  Manufactured for:Camber Pharmaceuticals, Inc&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 2012729Piscataway, NJ 08854
                  By: Hetero Labs LimitedJeedimetla, Hyderabad-500 055, India.
                  Repackaged by:Proficient Rx LPThousand Oaks, CA 91320
               </OrganizationsText>
    <OrganizationsText>
                  
                     
                        INDOMETHACIN CAPSULES USP
                     
                  
                  
                     Medication Guide for Non-Steroidal Anti-Inflammatory Drugs (NSAIDs)
                  
                  
                     (See the end of this Medication Guide for a list of prescription NSAID medicines.) 
                  
                  
                     What is the most important information I should know about medicines called Non-Steroidal Anti-Inflammatory Drugs (NSAIDs)? 
                  
                  
                     NSAID medicines may increase the chance of a heart attack or stroke that can lead to death. This chance increases:
                  
                  
                     
                        &#8226;with longer use of NSAID medicines
                     
                        &#8226;in people who have heart disease
                  
                  
                     NSAID medicines should never be used right before or after a heart surgery called a "coronary artery bypass graft (CABG)." 
                  
                  
                     NSAID medicines can cause ulcers and bleeding in the stomach and intestines at any time during treatment. Ulcers and bleeding:
                  
                  
                     
                        &#8226;can happen without warning symptoms
                     
                        &#8226;may cause death
                  
                  
                     The chance of a person getting an ulcer or bleeding increases with: 
                  
                  
                     
                        &#8226;taking medicines called "corticosteroids" and "anticoagulants"
                     
                        &#8226;longer use
                     
                        &#8226;smoking
                     
                        &#8226;drinking alcohol
                     
                        &#8226;older age
                     
                        &#8226;having poor health
                  
                  
                     NSAID medicines should only be used: 
                  
                  
                     
                        &#8226;exactly as prescribed
                     
                        &#8226;at the lowest dose possible for your treatment
                     
                        &#8226;for the shortest time needed
                  
                  
                     What are Non-Steroidal Anti-Inflammatory Drugs (NSAIDs)? 
                  
                  NSAID medicines are used to treat pain and redness, swelling, and heat (inflammation) from medical conditions such as:
                  
                     
                        &#8226;different types of arthritis
                     
                        &#8226;menstrual cramps and other types of short-term pain
                  
                  
                     Who should not take a Non-Steroidal Anti-Inflammatory Drug (NSAID)? 
                  
                  
                     Do not take an NSAID medicine: 
                  
                  
                     
                        &#8226;if you had an asthma attack, hives, or other allergic reaction with aspirin or any other NSAID medicine
                     
                        &#8226;for pain right before or after heart bypass surgery
                  
                  
                     Tell your healthcare provider: 
                  
                  
                     
                        &#8226;about all of your medical conditions.
                     
                        &#8226;about all of the medicines you take. NSAIDs and some other medicines can interact with each other and cause serious side effects. Keep a list of your medicines to show to your healthcare provider and pharmacist.
                     
                     
                        &#8226;if you are pregnant. NSAID medicines should not be used by pregnant women late in their pregnancy. 
                     
                     
                        &#8226;if you are breastfeeding. Talk to your doctor. 
                     
                  
                  
                     What are the possible side effects of Non-Steroidal Anti-Inflammatory Drugs (NSAIDs)? 
                  
                  
                     
                     
                     
                        
                           
                              
                                 &#160;&#160; serious side effects include:
                              
                           
                           
                              
                                  Other side effects include:&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; 
                              
                           
                        
                        
                           
                              
                                 
                                    &#8226;heart attack
                                 
                                    &#8226;stroke
                                 
                                    &#8226;high blood pressure
                                 
                                    &#8226;heart failure from body swelling (fluid Retension)
                                 
                                    &#8226;kidney problems including kidney failure
                                 
                                    &#8226;bleeding and ulcers in the stomach and intestine
                                 
                                    &#8226;low red blood cells (anemia)
                                 
                                    &#8226;life-threatening skin reactions
                                 
                                    &#8226;life-threatening allergic reactions
                                 
                                    &#8226;liver problems including liver failure
                                 
                                    &#8226;asthma attacks in people who have asthma
                              
                           
                           
                              
                                 
                                    &#8226;stomach pain
                                 
                                    &#8226;constipation
                                 
                                    &#8226;diarrhea
                                 
                                    &#8226;gas
                                 
                                    &#8226;heartburn
                                 
                                    &#8226;nausea
                                 
                                    &#8226;vomiting
                                 
                                    &#8226;dizziness
                              
                           
                        
                     
                  
                  
                     Get emergency help right away if you have any of the following symptoms: 
                  
                  
                     
                        &#8226;shortness of breath or trouble breathing
                     
                        &#8226;chest pain
                     
                        &#8226;weakness in one part or side of your body
                     
                        &#8226;slurred speech
                     
                        &#8226;swelling of the face or throat
                  
                  
                     Stop your NSAID medicine and call your healthcare provider right away if you have any of the following symptoms: 
                  
                  
                     
                        &#8226;nausea
                     
                        &#8226;more tired or weaker than usual
                     
                        &#8226;itching
                     
                        &#8226;your skin or eyes look yellow
                     
                        &#8226;stomach pain
                     
                        &#8226;flu-like symptoms
                     
                        &#8226;vomit blood
                     
                        &#8226;there is blood in your bowel movement or it is black and sticky like tar
                     
                        &#8226;unusual weight gain
                     
                        &#8226;skin rash or blisters with fever
                     
                        &#8226;swelling of the arms and legs, hands and feet
                  
                  These are not all the side effects with NSAID medicines. Talk to your healthcare provider or pharmacist for more information about NSAID medicines. Call your doctor for medical advice about side effects. You may report side effects to FDA at 1-800-FDA-1088.
                  
                  
                     Other information about Non-Steroidal Anti-Inflammatory Drugs (NSAIDs) 
                  
                  Aspirin is an NSAID medicine but it does not increase the chance of a heart attack.Aspirin can cause bleeding in the brain, stomach, and intestines. Aspirin can also cause ulcers in the stomach and intestines.
                  Some of these NSAID medicines are sold in lower doses without a prescription (over-the-counter). Talk to your healthcare provider before using over-the-counter NSAIDs for more than 10 days.
                  
                     
                        NSAID medicines that need a prescription
                     
                  
                  
                     
                     
                     
                        
                           
                              
                                 Celecoxib
                              
                           
                           
                              
                                 &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Celebrex
                              
                           
                        
                        
                           
                              Diclofenac
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Cataflam, Voltaren, Arthrotec (combined with&#160; misoprostol)
                           
                        
                        
                           
                              Diflunisal
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Dolobid
                           
                        
                        
                           
                              Etodolac
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Lodine, Lodine XL
                           
                        
                        
                           
                              Fenoprofen
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Nalfon, Nalfon 200
                           
                        
                        
                           
                              Flurbiprofen
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Ansaid
                           
                        
                        
                           
                              Ibuprofen
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Motrin, Tab-Profen, Vicoprofen( (combined with&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; hydrocodone), Combunox (combined with oxycodone)
                           
                        
                        
                           
                              Indomethacin
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Indocin, Indocin SR, Indo-Lemmon, Indomethagan
                           
                        
                        
                           
                              Ketoprofen
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Oruvail
                           
                        
                        
                           
                              Ketorolac
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Toradol
                           
                        
                        
                           
                              MefenamicAcid
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Ponstel
                           
                        
                        
                           
                              Meloxicam
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Mobic
                           
                        
                        
                           
                              Nabumetone
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Relafen
                           
                        
                        
                           
                              Naproxen
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Naprosyn, Anaprox, Anaprox DS, EC-Naprosyn, Naprelan,&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Naprapac (copackaged with lansoprazole)
                           
                        
                        
                           
                              Oxaprozin
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Daypro
                           
                        
                        
                           
                              Piroxicam
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Feldene
                           
                        
                        
                           
                              Sulindac
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Clinoril
                           
                        
                        
                           
                              Tolmetin
                           
                           
                              &#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160; Tolectin, Tolectin DS, Tolectin 600
                           
                        
                     
                  
                  
                     *Vicoprofen contains the same dose of ibuprofen as over-the-counter (OTC) NSAIDs, and is usually used for less than 10 days to treat pain. The OTC NSAIDS label warns that long term continuous use may increase the risk of heart attack or stroke.
                  
                     This Medication Guide has been approved by the U.S. Food and Drug Administration. 
                  
                  Manufactured for:Camber Pharmaceuticals, IncPiscataway, NJ 08854
                  By: Hetero Labs LimitedJeedimetla, Hyderabad-500 055, India.
                  Repackaged by:Proficient Rx LPThousand Oaks, CA 91320
               </OrganizationsText>
  </Organizations>
</dailymed>
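
To show how the elements called out above relate to each other, here is a small Python sketch that reads one template output with the standard library's ElementTree. The PR itself does this extraction in dbt models using PostgreSQL XML functions; the function name and the returned dict keys here are hypothetical.

```python
import xml.etree.ElementTree as ET


def summarize_template_output(xml_text):
    """Pull out the fields the image mart relies on from one template output."""
    root = ET.fromstring(xml_text)
    sections = []
    for label in root.findall("./PackageLabels/PackageLabel"):
        sections.append({
            # Direct child ID of the section, not the Media ID.
            "section_id": label.findtext("ID"),
            "images": [m.findtext("Image") for m in label.findall("./MediaList/Media")],
            # Collapse the whitespace runs left over from the SPL markup.
            "text": " ".join((label.findtext("Text") or "").split()),
        })
    return {
        "set_id": root.findtext("SetId"),
        "valid_ndcs": [n.text for n in root.findall("./NDCList/NDC")],
        "sections": sections,
    }
```

Run against the example above, this would yield one section with one image, the section text containing 71205-215-30, and the four NDCs from NDCList as the validation set.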

Created an XPath for Media that looks for ObservationMedia and then grabs an image file name (if it exists - need to build a test for if it exists to reduce the noise probably) and also the entire text of the component. Next step is to build a dbt staging model to RegEx the NDC out of the <Text/> element since XPath doesn't natively support that.

Changed template to look specifically at the package label display panel section(s) in the SPL for images. Also updated the staging table to have nested XMLTABLE commands (thanks ChatGPT).

lprzychodzien

LGTM

jrlegrand added 30 commits November 27, 2023 21:57

Update full with daily files

6a0b9b3

Fix xml regex matching

32cf795

Update datasource table name

69455a1

Update remaining staging file

f23bf71

Initial Dailymed dbt work

8785ff2

Convert all staging to dbt

5d9060b

Remove extraneous DAG files

9352bab

Convert organization metrics to intermediate model

3d5e426

Remove daily version to focus on full

f11d962

Rename DAG to dailymed

45676f1

Convert intermediate table to dbt

182f560

Update dailymed name change

40bc54a

Convert DAG to Taskflow format

d02259e

Change to daily only

e2308d1

Point at package label section

cb49918

Changed template to look specifically at the package label display panel section(s) in the SPL for images. Also updated the staging table to have nested XMLTABLE commands (thanks ChatGPT).

Update dbt models

f2be387

Update full with daily files

318db6c

Fix xml regex matching

012b7bd

Update datasource table name

dc9c83b

Update remaining staging file

00db7cf

Initial Dailymed dbt work

d865f6e

Convert all staging to dbt

49183ca

Remove extraneous DAG files

6897c11

Convert organization metrics to intermediate model

f3a5c74

Remove daily version to focus on full

162b4ee

Rename DAG to dailymed

c7d4dd7

Convert intermediate table to dbt

226c547

Update dailymed name change

38fef34

Convert DAG to Taskflow format

132bb52

jrlegrand added 19 commits September 5, 2024 23:47

Change to daily only

2b2ceea

Point at package label section

6970f0a

Changed template to look specifically at the package label display panel section(s) in the SPL for images. Also updated the staging table to have nested XMLTABLE commands (thanks ChatGPT).

Update dbt models

afce6d0

Dailymed work

a587e7d

Merge with origin

b1564f8

Got sorting working

b3bc70f

Initial image name work

8588bc7

Add todo

236c3c6

Account for Rx and OTC

67dd3de

Account for loading rx and otc

70b7c63

Fix image NDC issue

f16b966

Clean up mart

ac7ecdc

Mart updates

af9feff

Fix NDC whitespace issue

fbc8f1d

Account for RegEx whitespace

0925f3d

Fix Observable bugs

cd7df2a

Update mart with urls

c195753

Account for image references in XSLT

d373931

jrlegrand requested a review from lprzychodzien October 24, 2024 15:07

jrlegrand changed the title ~~DailyMed NDC to Image Mart~~ DailyMed NDC to Label Image Mart Oct 24, 2024

lprzychodzien approved these changes Oct 30, 2024

jrlegrand merged commit 8cd1504 into main Oct 30, 2024

jrlegrand deleted the jrlegrand/dailymed branch October 30, 2024 14:23
