Learning to parse: where to start?
Hello. My question is really about finding my bearings among programming languages. At the moment I'm more of a webmaster, with knowledge of Photoshop/HTML5/CSS (mostly working on my own sites, plus building sites for clients using WordPress).
Lately I have been asking myself more and more whether I should master a programming language in order to become a more serious specialist. That means asking why I need it and what I would do with it. My answers:
1) Having mastered PHP, for example, I could create plugins for WordPress and other related scripts (I'll say right away that I often need this, since I build sites from scratch and for different needs). Besides, interesting ideas come up regularly, and I would like to implement them myself. So there is the prospect of becoming a PHP developer within the WordPress ecosystem: it is popular now, there are more and more sites, and even e-commerce is breaking through on it, built on something like WooCommerce.
2) Parsing (the topic of this question). I like the idea of collecting certain data, processing it, and turning it into something interesting. In practice I have run into such projects more than once (for my own needs), but I always handed the work off to other programmers.
Now I have finally matured enough to learn a programming language and implement my tasks on my own; it genuinely interests me. I consider it important to understand why I need this, which is why I described points 1 and 2 specifically. Knowing what I will do with it, I can study the relevant area in more detail. I decided to ask you for advice, to understand where to start and, in general, to hear what you think. Thank you!
1) You need to understand how sites load and work. In particular, keep in mind that useful content may appear on a page only some time after the initial load.
2) You need to understand how the most common way of fetching content, cURL, works. Try copying something, working with it, representing a document as XML, and so on. At this point you will settle on the basic principle of a parser:
- the parser receives input -> given the program and the input, the parser requests certain data -> the parser processes the data for the user -> if necessary, the parser repeats the request (initiated by the user or by recursion) -> end
3) Next you will run into mechanisms that protect sites against parsing:
- limits on requests per IP, per client, and so on
- information that loads only after the initial content
- additional content-loading requests guarded by CSRF tokens and other methods
- IP blocking
This will introduce you to tools like PhantomJS, and teach you to use proxies, mimic popular browsers, and so on (a minimal fetch sketch follows this answer).
You will also reach the point where the parser needs to be multi-threaded, and start thinking about switching to C or a similar language, and about talking to the site through its API instead.
And then, as you run into new problems, you will solve them.
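To make the fetch step above concrete, here is a minimal sketch using the Python requests library; the URL, proxy address, and delay are assumptions for illustration:

import time
import requests

# Mimic a popular browser so the request looks less like a bot
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
# Route traffic through a proxy (address is hypothetical)
proxies = {"http": "http://127.0.0.1:8080"}

for page in range(1, 4):
    resp = requests.get("https://example.com/list?page=%d" % page,
                        headers=headers, proxies=proxies, timeout=10)
    print(resp.status_code, len(resp.text))
    time.sleep(2)  # stay under per-IP request limits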
For parsing I would recommend Python. It is quite flexible and easy to learn, and it fits these purposes perfectly, especially if you need to parse dynamic content (AJAX, JavaScript, post-load requests).
Java could be an alternative here, but it is too complicated for beginners.
You can use this bundle:
Python, Selenium + PhantomJS (page loading), BeautifulSoup (HTML parsing), PyMySQL (saving to the database).
If the content is static, it is even simpler: Python + BeautifulSoup.
Everything works very quickly. And most importantly, the APIs are very intuitive, so it is easy to grasp the functionality.
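A minimal sketch of that bundle, assuming Selenium 3.x (PhantomJS support was removed in later Selenium releases); the URL and CSS class are made up for illustration:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # headless browser renders the JavaScript
driver.get("https://example.com/catalog")  # hypothetical dynamic page
soup = BeautifulSoup(driver.page_source, "html.parser")
for link in soup.select("a.product"):  # selector is an assumption
    print(link.get_text(strip=True), link.get("href"))
driver.quit()

For static pages, replace the Selenium part with a plain requests.get() and feed the response text to BeautifulSoup.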
There is nothing complicated in parsing itself: you take several samples of someone else's markup (for example, several HTML pages of the same type with a product or a news item), determine what you need to pull out, then look for patterns, nesting, distinguishing marks, etc., and check whether they hold in every case. You write a pattern (or patterns applied in loops), then verify it with tests - preferably online (for example, https://regex101.com/) so that you can see the result immediately.
Another question is what to do with the "parsed" data: whether to trust it completely or not, and what to do if something went wrong.
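As a small illustration of that workflow, here is a hedged sketch: the HTML snippet and the pattern are invented, and the pattern is the kind you would first try out on regex101.com:

import re

html = '<div class="price">1 299 rub.</div>'  # sample markup (assumed)
m = re.search(r'<div class="price">\s*([\d\s]+)\s*rub', html)
if m:
    price = int(m.group(1).replace(" ", ""))  # strip thousands separators
    print(price)  # 1299
else:
    print("pattern did not match - decide how far to trust parsed data")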
An excellent library for web scraping is Grab. True, it is in Python. I had to learn Python just for the sake of this library, and I did not regret it: the language is as convenient as the library - make requests, select with XPath, and save the results:
from grab import Grab  # pip install grab

g = Grab(log_file='parse_log.html')  # log the fetched page for debugging
g.go(url)  # url is assumed to be defined earlier
# Locate the pager block
pages_block = g.doc.select('//div[contains(@class,"pager")]/div[contains(@class, "pages")]')
if pages_block:
    # All page links except the Next/Previous arrows
    pages = pages_block.select('.//li/a[not(@title="Next" or @title="Previous")]')
    page_hrefs = []
    for page in pages:
        href = page.node.attrib['href']
        page_hrefs.append(href)
        print("Page: %d" % int(page.text()))
If you want to start with something down-to-earth, without fear of getting lost, look at XPath. It is supported in almost all modern languages (including C# and Java), so it is ideal for practice. Once you have roughly figured out what it is, set yourself a task right away: for example, parse a large amount of data, load all of it into your own database (more practice there too), and then, say, build graphs from it (the easiest option).
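A tiny XPath sketch in Python using the lxml library (the markup is invented for illustration):

from lxml import html

doc = html.fromstring(
    "<ul><li><a href='/p/1'>One</a></li><li><a href='/p/2'>Two</a></li></ul>")
for a in doc.xpath("//li/a"):  # select every link inside a list item
    print(a.get("href"), a.text)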
For parsing, study the requests and responses of HTTP servers through a sniffer (for example, Charles). Master the basics of C#. Use the xNet library for C#, written by a compatriot of ours. For data storage I advise SQLite or NoSQL (depending on the task).
A huge number of projects have already been built on this stack; it works very fast, and I recommend it.
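This answer recommends C#, but the storage side is language-agnostic; sticking with Python as in the rest of this thread, here is a minimal sketch with the built-in sqlite3 module (the table layout and rows are assumptions):

import sqlite3

conn = sqlite3.connect("parsed.db")  # file is created on first use
conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT, url TEXT)")
rows = [("Item 1", "/p/1"), ("Item 2", "/p/2")]  # stand-in for parsed data
conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
conn.commit()
conn.close()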
I am in your situation, only further along the road, so I have collected more bumps. Parsing is best done with Python and XPath; pass the already-parsed data in an intermediate format to a PHP handler, if PHP is needed at all after parsing. A very good parsing library is BeautifulSoup; for recent Python versions see https://github.com/il-vladislav/BeautifulSoup4
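A hedged sketch of that hand-off, using JSON as the intermediate format (the file names and CSS class are assumptions):

import json
from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")
items = [{"title": a.get_text(strip=True), "url": a["href"]}
         for a in soup.find_all("a", class_="item")]  # class name is assumed
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False)  # the PHP side reads this file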
One of the most powerful and flexible Python scraping frameworks: Scrapy.
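For a feel of it, here is the kind of minimal spider the Scrapy tutorial uses, targeting the public practice site quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one record per quote block on the page
        for q in response.css("div.quote"):
            yield {
                "text": q.css("span.text::text").get(),
                "author": q.css("small.author::text").get(),
            }
        # Follow pagination until it runs out
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)

Run it with "scrapy runspider quotes_spider.py -o quotes.json" to collect the results into a JSON file.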
(2) Theoretically, parsing of _any_ text format, in the most general sense, can be done with the flex/bison/C++ toolchain:
1) write regular expressions for the elements of the input language (strings, numbers, tags, ...), then
2) describe the grammar of the input in bison's language (nested tags, attribute rules, nested bracket expressions, etc.);
flex/bison then generates a pair of C/C++ files that do all the dirty work of parsing the format, invoking _your_ piece of C code for each element you defined. What to do with that data afterwards (push it into a DBMS, build an AST for a compiler, just extract a few specific values, ...) you write yourself in C++.
This approach has a ratio of low-level hassle to universality tending to infinity, but as you build up a C++ code library for your own narrow tasks, every (N+1)-th task comes down to generating typical high-level objects (symbols, lists, trees, etc.) plus a couple dozen lines of code specific to that task.
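To keep this thread in one language: the same lex/yacc workflow exists in Python as the PLY library, so here is a hedged micro-example (a toy grammar that sums numbers, purely for illustration):

import ply.lex as lex
import ply.yacc as yacc

tokens = ("NUMBER", "PLUS")  # the token set, as in a flex spec

t_PLUS = r"\+"
t_ignore = " \t"

def t_NUMBER(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)  # skip unrecognized characters

# Grammar rules, as in a bison spec; each rule carries your action code
def p_expr_plus(p):
    "expr : expr PLUS NUMBER"
    p[0] = p[1] + p[3]

def p_expr_number(p):
    "expr : NUMBER"
    p[0] = p[1]

def p_error(p):
    pass

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse("1 + 2 + 3"))  # 6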
You can parse with anything. I have seen many examples in Python, but in fact any programming language will do; it all comes down to convenience and fit. As a rule, interpreted and scripting languages are the more profitable choice.
phpQuery or curl
phpQuery seems simpler to me.
I liked this tutorial: https://www.youtube.com/watch?v=IU_dAU7GV8w
Try building the parser yourself, following its instructions.
There is also a free parser that works through the browser: https://catalogloader.com/kak-sdelat-parser-sajta-... It has a convenient interface, and everything runs in the browser. For simple tasks this guide to parsing sites and online stores is quite serviceable. You can export to various formats (CSV, Excel, XML, JSON), and there is API access.