C
crmMaster, 2016-11-16 19:24:06
Java

Which library is currently the most efficient in XML parsing tasks?

In our project, the question of improving XML parsing performance has become acute.
Our main task is parsing responses from SOAP services. At the moment we use Nokogiri (Ruby), but even with its C++ optimizations its efficiency is extremely low.
For us, efficiency means the speed of completing the task given effectively unlimited resources. Nokogiri is extremely disappointing in this regard: it does not work in multi-threaded mode and carries a lot of functionality we do not need.
Again, the platform and language do not matter - as long as it can talk over a unix socket, everything else is ready :)
In theory, an Erlang implementation could be more efficient thanks to its multi-threaded architecture, but competent implementations in C++, Rust or Java have the same system capabilities at their disposal.
So I would like to gather the best libraries from all ecosystems and pit them against each other on a real task.
And which one is really the best - that is what I would like to hear from you.
P.S. To the "throw out the library, use substring search" guys: please go to www.coursera.org, take a course on data processing, and stop advising outright nonsense.
P.P.S. To the even more stubborn "rewrite everything in assembler for CUDA" guys: feel free to keep at it, but this thread is about robust solutions.

4 answers
P
Plus3x, 2016-11-16
@crmMaster

I think it is worth starting by researching the tools available in your development language: www.ohler.com/dev/xml_with_ruby/xml_with_ruby.html

S
sim3x, 2016-11-16
@sim3x

lxml
What the sockets, "parallel processing" and the rest have to do with it is completely unclear.
A spider/scraper can be written in anything - that will not affect the speed of parsing and building an XML tree.

O
Odissey Nemo, 2016-11-17
@odissey_nemo

For processing (parsing) XML, there are two ideologically different approaches:
a) DOM, where the entire XML is read into memory and a complete hierarchy of its structure is built there, and
b) SAX, where the file is traversed in a single pass, visiting every element once, sequentially.
DOM is only good for small files with dependencies between elements, where you may need to access arbitrary element data at any point in time.
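To make this concrete, here is a minimal DOM sketch in Java using the standard javax.xml.parsers API; the file name and the element being queried are made up for the example:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import java.io.File;

    public class DomSketch {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is materialized as a tree; any node can be revisited later.
            Document doc = builder.parse(new File("response.xml")); // hypothetical file name
            NodeList amounts = doc.getElementsByTagName("amount");  // hypothetical element
            for (int i = 0; i < amounts.getLength(); i++) {
                System.out.println(amounts.item(i).getTextContent());
            }
        }
    }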
SAX is as fast as it gets (1-2 orders of magnitude faster than DOM), but it may require implementing complex logic for storing the necessary data if the task also requires going back to data from previous elements.
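For comparison, a minimal SAX sketch in Java with the standard JDK API; the element it extracts and the file name are hypothetical:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.File;

    public class SaxSketch {
        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();

            // The handler keeps only the data the task needs; nothing else stays in memory.
            DefaultHandler handler = new DefaultHandler() {
                private final StringBuilder text = new StringBuilder();

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attrs) {
                    text.setLength(0); // reset the buffer for each new element
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    text.append(ch, start, length);
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    if ("amount".equals(qName)) {          // hypothetical element of interest
                        System.out.println("amount = " + text);
                    }
                }
            };

            parser.parse(new File("response.xml"), handler); // hypothetical SOAP response dump
        }
    }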
Both DOM and SAX have robust and reliable implementations for every language and operating system in the world. The choice between them depends only on the task and development environment.
There are also mixed approaches, JAXB in particular: the XML data is read via SAX and placed not into a DOM object but into plain objects of the language's own classes, on which the specific business logic is then implemented. The catch with JAXB is that it can ONLY process XML structures it already knows about, i.e. in practice this means compiling an XSD into Java/C#/etc. code. If the XSD changes, you regenerate the Java/C#/etc. code and adapt the program logic to the new data. In return you get the maximum achievable efficiency.
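Roughly what that looks like in Java; the Invoice class here is a made-up stand-in for what would normally be generated from an XSD by xjc (newer JDKs use the jakarta.xml.bind packages, older ones javax.xml.bind):

    import jakarta.xml.bind.JAXBContext;      // javax.xml.bind on older JDKs
    import jakarta.xml.bind.Unmarshaller;
    import jakarta.xml.bind.annotation.XmlAccessType;
    import jakarta.xml.bind.annotation.XmlAccessorType;
    import jakarta.xml.bind.annotation.XmlRootElement;
    import java.io.File;

    // Made-up class standing in for code normally generated from an XSD.
    @XmlRootElement(name = "invoice")
    @XmlAccessorType(XmlAccessType.FIELD)
    class Invoice {
        public String customer;
        public double amount;
    }

    public class JaxbSketch {
        public static void main(String[] args) throws Exception {
            Unmarshaller um = JAXBContext.newInstance(Invoice.class).createUnmarshaller();
            // JAXB maps the XML straight onto the annotated class; no DOM tree is kept around.
            Invoice invoice = (Invoice) um.unmarshal(new File("invoice.xml")); // hypothetical file
            System.out.println(invoice.customer + ": " + invoice.amount);
        }
    }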
I personally always choose SAX, because about 10 years ago I watched a team struggle badly with multi-hundred-megabyte XML files using DOM, even though inside there were just hundreds of thousands of small, logically independent pieces of information (telephone bills to be mailed to customers). With SAX the same problem was solved head-on, straight from the API documentation, without any tricks or problems.
What is the problem with large DOM objects? They need lots and lots of small chunks of memory, and that is the worst case of data access, both for RAM and for disk. Everyone has seen this phenomenon: writing a file can take tens of times longer than reading it. Historically, all data processing is geared toward reading a lot of data (caching!) and writing a little (write-through): update once, read hundreds of times. Processors, memory, disks and software are all designed and optimized around this logic.
As for multithreading, this is not a question of processing a single XML document, but of how to merge the results of processing individual XML documents into a common database. Either way, each individual XML document can be processed in ONLY one thread; that is how the format works. Even if you imagine some giant XML whose data structure allows parallel processing, it would still have to be traversed completely in a single thread at least once in order to be split into autonomous units of parallel work.
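In other words, parallelism is gained across documents, not inside one. A sketch of that in Java, assuming a batch of independent response files (the file names are made up); each task parses its own document with its own parser instance, since SAXParser is not thread-safe:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;
    import java.io.File;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ParallelParsing {
        public static void main(String[] args) throws Exception {
            // Hypothetical batch of SOAP responses; each one is parsed in a single thread,
            // and parallelism comes from handling many documents at once.
            List<File> responses = List.of(new File("r1.xml"), new File("r2.xml"), new File("r3.xml"));

            ExecutorService pool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            for (File f : responses) {
                pool.submit(() -> {
                    try {
                        // One parser per task: SAXParser instances must not be shared between threads.
                        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                        parser.parse(f, new DefaultHandler()); // replace DefaultHandler with real logic
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
        }
    }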
By the way, Oracle can efficiently process XML stored in its database fields, and it does it through SAX )))

A
al_gon, 2016-11-16
@al_gon

"would be able to work with unix-socket" and I would not mix XML parsing tasks.
If simple data, but immediately in large volume, then https://ru.wikipedia.org/wiki/SAX
If complex and not in large volume (fit in memory), then https://ru.wikipedia.org/wiki /Java_Architecture_fo...
If complex and don't fit in memory then a combination of SAX and JAXB .
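One common way to combine the two in Java is to stream with StAX and hand each repeating fragment to a JAXB unmarshaller. A sketch under the assumption that the large file consists of repeating <record> elements; the Record class and the file name are made up for the example:

    import jakarta.xml.bind.JAXBContext;      // javax.xml.bind on older JDKs
    import jakarta.xml.bind.Unmarshaller;
    import jakarta.xml.bind.annotation.XmlAccessType;
    import jakarta.xml.bind.annotation.XmlAccessorType;
    import jakarta.xml.bind.annotation.XmlRootElement;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    // Hypothetical repeating fragment inside a huge document.
    @XmlRootElement(name = "record")
    @XmlAccessorType(XmlAccessType.FIELD)
    class Record {
        public String id;
        public String payload;
    }

    public class StreamingJaxb {
        public static void main(String[] args) throws Exception {
            XMLStreamReader reader = XMLInputFactory.newFactory()
                    .createXMLStreamReader(new FileInputStream("huge.xml")); // hypothetical file
            Unmarshaller um = JAXBContext.newInstance(Record.class).createUnmarshaller();

            // Stream through the document and unmarshal one <record> at a time,
            // so memory use stays flat regardless of file size.
            while (reader.hasNext()) {
                if (reader.getEventType() == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    Record r = um.unmarshal(reader, Record.class).getValue();
                    System.out.println(r.id);
                    // the reader is now positioned just after </record>, so no next() here
                } else {
                    reader.next();
                }
            }
            reader.close();
        }
    }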
