What is the best way to check for new URLs?
There is a news site. I need to write a parser that checks, say, every 5 minutes whether new news items have appeared and adds them if so. News is published at irregular intervals. How do I properly implement the check for new items? The check will run against the page that shows brief summaries of the latest news. One option was to store the old news links and compare against them, but I don't want to hit the database constantly.
If not in the database, then in memory. And if in memory, it's better to keep the links not in a slice but in a map, which gives constant-time lookup versus linear time for a slice.
package main

import "path"

// Urls is an in-memory set of already-seen news URLs, keyed by their base path.
var Urls = make(map[string]struct{}) // the most compact implementation of a set

const Limit = 10000 // maximum cache size before eviction

func IsNew(url string) bool {
	if _, ok := Urls[path.Base(url)]; ok { // no need to store the whole URL: the base is more compact and avoids duplicates caused by rewrites
		return false
	}
	/* trivial cache implementation; uncomment to also consult the database
	if PresentInDb(url) { // look the URL up in the database
		if len(Urls) > Limit { // evict one arbitrary entry to keep the cache bounded
			for k := range Urls {
				delete(Urls, k)
				break
			}
		}
		Urls[path.Base(url)] = struct{}{} // refresh the cache
		return false
	}
	*/
	Urls[path.Base(url)] = struct{}{}
	return true
}
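For context, a minimal sketch of how IsNew might be driven by the 5-minute check from the question. It lives in the same package as the snippet above, additionally imports "log" and "time", and assumes a hypothetical fetchLatestLinks helper (not part of the original answer) that scrapes the links shown on the "latest news" page:

func pollNews() {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		links, err := fetchLatestLinks() // hypothetical scraper for the "latest news" page
		if err != nil {
			log.Println("fetch failed:", err)
			continue
		}
		for _, link := range links {
			if IsNew(link) {
				// a previously unseen item: parse and store it
				log.Println("new item:", link)
			}
		}
	}
}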
First, double-check that the target site does not have an RSS feed. If it does, it's better to use it.
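As an illustration of that suggestion, a minimal sketch of reading such a feed with the standard library. The feed URL is a placeholder, and the struct covers only the standard RSS item elements:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/http"
)

// rss maps the parts of an RSS document we care about: the items inside <channel>.
type rss struct {
	Items []struct {
		Title   string `xml:"title"`
		Link    string `xml:"link"`
		PubDate string `xml:"pubDate"`
	} `xml:"channel>item"`
}

// fetchFeed downloads an RSS feed and returns the item links.
func fetchFeed(url string) ([]string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}

	var feed rss
	if err := xml.Unmarshal(data, &feed); err != nil {
		return nil, err
	}

	links := make([]string, 0, len(feed.Items))
	for _, it := range feed.Items {
		links = append(links, it.Link)
	}
	return links, nil
}

func main() {
	links, err := fetchFeed("https://example.com/rss") // placeholder feed URL
	if err != nil {
		panic(err)
	}
	fmt.Println(links)
}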
After extracting the content of a news item, take a hash of it. If something with that hash already exists, the news is old. If there is already a news item with the same heading AND the difference between the contents is small, the article's content has been updated; if the difference is large, it's a new article that happens to reuse an existing heading.
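A minimal sketch of the exact-duplicate part of that idea, using a SHA-256 hash of the content. The heading/similarity comparison would need a separate metric (for example an edit distance) and is not shown here; the isDuplicate helper and in-memory map are illustrative only:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seenHashes maps a content hash to the heading it was first seen with.
var seenHashes = map[string]string{}

// isDuplicate reports whether identical content has been seen before,
// and records the hash if it has not.
func isDuplicate(heading, content string) bool {
	sum := sha256.Sum256([]byte(content))
	key := hex.EncodeToString(sum[:])
	if _, ok := seenHashes[key]; ok {
		return true // exact same content: the news item is old
	}
	seenHashes[key] = heading
	return false
}

func main() {
	fmt.Println(isDuplicate("Title", "Some article text")) // false: first time seen
	fmt.Println(isDuplicate("Title", "Some article text")) // true: already seen
}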
I would recommend building the parser itself with PhantomJS.