How to create a DataFrame from XML?

san_m_m, 2021-10-03 11:10:26

Python

I need to convert an XML file into a DataFrame and can't figure out how to do it.

There is an xml file with the following structure:

<?xml version="1.0" encoding="utf-8"?>
<licenses_list>
  <licenses>
    <name>Министерство здравоохранения Астраханской области</name>
    <activity_type>Медицинская деятельность</activity_type>
    <abbreviated_name_licensee>ООО &quot;КЛИНИКА &quot;ЛИНЛАЙФ&quot;</abbreviated_name_licensee>
    <works>
          <work>100. При оказании первичной, в том числе доврачебной, врачебной и специализированной, медико-санитарной помощи организуются и выполняются следующие работы (услуги):</work>
          <work>100.1. при оказании первичной доврачебной медико-санитарной помощи в амбулаторных условиях по:</work>
          <work>100.1.25. сестринскому делу в косметологии</work>
          <work>100.4. при оказании первичной специализированной медико-санитарной помощи в амбулаторных условиях по:</work>
          <work>100.4.7. анестезиологии и реаниматологии</work>
    </works>
  </licenses>    
</licenses_list>

I write the following:

import xml.etree.ElementTree as ET
import pandas as pd


tree = ET.parse('Рабочий.xml')
root = tree.getroot()

df_index = ['name', 'activity_type', 'abbreviated_name_licensee', 'works']

# Collect rows in a plain list; DataFrame.append was removed in pandas 2.0
rows = []
for elem in root:
    # elem[3] is <works>: this produces one row per <work> child
    for work in elem[3]:
        rows.append([elem[0].text, elem[1].text, elem[2].text, work.text])

df = pd.DataFrame(rows, columns=df_index)

I just can't figure out how to put all the values under the works tag into a single cell, separated by commas...

1 answer(s)

o5a, 2021-10-03
@san_m_m

You don't need to iterate over the 'works' elements in the same loop that adds rows to the DataFrame. Either collect them first, join them into one comma-separated string, and only then create the row in pandas, or simply replace the inner loop with join():

rows = []
for elem in root:
    # Join the text of every <work> child into one comma-separated cell
    rows.append([elem[0].text, elem[1].text, elem[2].text, ','.join(val.text for val in elem[3])])
df = pd.DataFrame(rows, columns=df_index)

Also keep in mind that positional access (elem[0].text and the like) ties the code to the order of the nodes in the XML. If the order changes, the columns will be filled with the wrong values. It is safer to look nodes up by tag name: elem.find('name').text
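
To illustrate the name-based lookup, here is a minimal sketch. It assumes the same tag names as in the question. The inline SAMPLE_XML string and the licenses_to_frame helper are made up for the example, and the sample values are placeholders. The child tags are deliberately in a different order than in the question to show that the order no longer matters.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical stand-in for the file from the question (placeholder values).
# <activity_type> comes before <name> on purpose.
SAMPLE_XML = """
<licenses_list>
  <licenses>
    <activity_type>Medical</activity_type>
    <name>Ministry A</name>
    <abbreviated_name_licensee>Clinic LLC</abbreviated_name_licensee>
    <works>
      <work>100.1. first</work>
      <work>100.4. second</work>
    </works>
  </licenses>
</licenses_list>
"""

COLUMNS = ['name', 'activity_type', 'abbreviated_name_licensee', 'works']


def licenses_to_frame(root):
    """Build one row per <licenses> node, looking children up by tag name."""
    rows = []
    for lic in root.findall('licenses'):
        # findtext returns None if the tag is missing instead of raising
        row = {tag: lic.findtext(tag) for tag in COLUMNS[:3]}
        # Collect every <work> under <works> and join into a single cell
        row['works'] = ','.join(w.text or '' for w in lic.findall('works/work'))
        rows.append(row)
    return pd.DataFrame(rows, columns=COLUMNS)


df = licenses_to_frame(ET.fromstring(SAMPLE_XML))
print(df)
```

For the real file, pass ET.parse('Рабочий.xml').getroot() instead of the parsed sample string.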