How to build application logic correctly?
Please help: I'm lacking the logical (technical) thinking to close the chain of reasoning on my own.
I need a solution that guarantees the database only ever holds unique data.
So, there is a parser that scrapes a bulletin board.
1. I receive the data and write the array out as a string.
2. I parse it with JSON.parse(data) and pass the result to the bot (let's say 10 objects came in).
3. I make the request again.
And this is where my logic breaks down: how do I compare the data?
b) On the second request I get another 10 objects on top, so 10 + 10 = 20 and each object has a pair. I compare object against object and delete the pair if nothing has changed.
(I don't know whether accumulating information in the database is a good idea, so I think the pairs have to be deleted.)
c) Now suppose something has changed: there were 10 ads and 10 more came in (5 of them unique).
1-obj {a :'a'} - ------ {a:'a'}
2-obj {a:'a'} - ------ {a:'a'}
3-obj {a:' a'} - ------ {a:'a'}
4-obj {a:'a'} - ------ {a:'a'}
5-obj {a:'a' } - ------ {b:'b'}
6-obj {a:'a'} - ------ {b:'b'}
7-obj {a:'a'} - ------ {b:'b'}
8-obj {a:'a'} - ------ {b:'b'}
9-obj {a:'a'} - -- ---- {b:'b'}
10-obj {a:'a'} - ------ {b:'b'}
This is where the problems start: the first objects get overwritten, and the bottom {a:'a'} objects, although not unique, go to the bot (even though there were no new ads). And if I make the request a third time, the new values have no pairs, so they are all treated as unique.
What can I do, how do I get out of this situation?
I can't compare by publication date, because every site uses a different format (some post only the time, some the date, and some post no date at all, just "New", "Recent", etc.).
I also thought about using the current date, but again I can't compare the current date against the different update formats on the sites (see the line above), or maybe I just don't understand how to do it.
Define the criteria for non-uniqueness, that is, decide what counts as a duplicate.
Based on those fields, either build a hash and store it in a separate column of the table, or, if a single field is enough, check uniqueness on that field directly.
Put a unique index on the hash column and, when adding rows, do an insert that ignores duplicates.
That's all.
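A minimal sketch of that idea, with an in-memory Map standing in for a table with a unique index. The field names (`source`, `url`, `title`) and the helpers `duplicateKey`, `insertIgnore` and `storeBatch` are hypothetical, picked only for illustration; which fields define a duplicate depends on the sites being parsed.

```javascript
// Stand-in for a table with a UNIQUE index: a Map keyed by the duplicate key.
const seen = new Map();

// Hypothetical choice of fields that define a duplicate.
function duplicateKey(ad) {
  return JSON.stringify([ad.source, ad.url]);
}

// Mimics "insert, ignore on duplicate": returns true only for new rows.
function insertIgnore(table, ad) {
  const key = duplicateKey(ad);
  if (table.has(key)) return false;
  table.set(key, ad);
  return true;
}

// Stores a batch and returns only the ads that were not stored before.
function storeBatch(table, ads) {
  return ads.filter((ad) => insertIgnore(table, ad));
}

const firstPoll = [
  { source: 'board-a', url: '/ad/1', title: 'Bike' },
  { source: 'board-a', url: '/ad/2', title: 'Sofa' },
];
const secondPoll = [
  { source: 'board-a', url: '/ad/2', title: 'Sofa' },
  { source: 'board-a', url: '/ad/3', title: 'Lamp' },
];

const newFirst = storeBatch(seen, firstPoll);   // both ads are new
const newSecond = storeBatch(seen, secondPoll); // only /ad/3 is new
console.log(newFirst.length, newSecond.length);
```

With this approach there is no need to pair up old and new batches: whatever `storeBatch` returns is what goes to the bot, no matter how many times the request is repeated.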
If the loaded data (objects) has no unique identifier of its own, build one from the data itself, for example by taking the md5 hash of the string the data was serialized into. If there are lists inside, either make sure their order never changes or sort them: the point is to get a string that is always the same for the same data.
After that it's simple: store this identifier next to the data in the database and check uniqueness against it at write time.
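Roughly like this in Node.js, as a sketch. The helper names `canonicalize` and `fingerprint` and the sample fields are made up for the example. Object keys are sorted before serializing so that key order does not affect the result; lists are left in the order they arrived, on the assumption that the site returns them in a stable order.

```javascript
const { createHash } = require('crypto');

// Rebuilds a value with object keys in sorted order, so that the
// serialized string does not depend on the order the keys arrived in.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    for (const key of Object.keys(value).sort()) {
      sorted[key] = canonicalize(value[key]);
    }
    return sorted;
  }
  return value;
}

// md5 of the canonical JSON string, used as the record's identifier.
function fingerprint(data) {
  const serialized = JSON.stringify(canonicalize(data));
  return createHash('md5').update(serialized).digest('hex');
}

// Same data with keys in a different order, then the same ad with a new price.
const a = fingerprint({ title: 'Bike', price: 100, tags: ['used', 'red'] });
const b = fingerprint({ price: 100, tags: ['used', 'red'], title: 'Bike' });
const c = fingerprint({ title: 'Bike', price: 120, tags: ['used', 'red'] });
console.log(a === b, a === c);
```

Note that any change in the data (here, the price) produces a different identifier, so an edited ad will be treated as a new one. If that is not what you want, hash only the fields that define a duplicate.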
P.S. Be careful: a hash does not guarantee the absence of collisions, i.e. different data may produce the same hash. On the other hand, the probability of that is very small, and you won't run into it unless you scrape the entire Internet. In the extreme case, you can pick a hash function with a larger output size to reduce the probability even further.
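For illustration, switching to a wider hash in Node.js is a one-word change. SHA-256 is used here only as an example of a function with a larger output; the `digest` helper and the sample object are hypothetical.

```javascript
const { createHash } = require('crypto');

// Hex digest of a string under the named algorithm.
function digest(algorithm, text) {
  return createHash(algorithm).update(text).digest('hex');
}

const serialized = JSON.stringify({ title: 'Bike', price: 100 });
const md5Id = digest('md5', serialized);       // 128-bit digest, 32 hex characters
const sha256Id = digest('sha256', serialized); // 256-bit digest, 64 hex characters
console.log(md5Id.length, sha256Id.length);
```

The only cost is a longer identifier column in the database.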