Stanislav, 2018-10-04 17:32:09
MongoDB

How to iterate over records to eliminate duplicates?

Can someone help me? I can't come up with anything sensible =(
There are half a million documents of the form:

{
 _id: ObjectId(),
 tags: ['word', 'word 2', 'word 3'],
 href: 'link to the document'
}

The page title is built from the tags, simply tags.join(', '). Pages with duplicate titles drop out of the search index and possibly hurt the site as a whole.
As a result there is a pile of near-identical pages, about 70,000 of them, that differ only in slightly different photos. I tried editing them manually, i.e. adding and changing tags, but it's terrible; I'd be at it until retirement =(
I want to achieve the following:
1. Determine the parent of duplicates
2. Get a link to the parent (href)
3. Find the other duplicates and add a canonical field to each, holding the link to the parent
Right now I catch duplicates as follows.
For example, I fetch 500 records from the database:
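function getDocuments(request) {
    return Wallpapers.aggregate([
        // Filter first so that only matching documents are sorted
        { $match: request.query },
        { $sort: request.sorting },
        // Paging: skip earlier batches, then take one batch (e.g. 500)
        { $skip: request.skip },
        { $limit: request.limit },
        // Keep only the fields needed for duplicate detection
        {
            $project: {
                _id: 1,
                href: 1,
                tags: 1
            }
        }
    ])
}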

Then I go through the records, find the duplicates, and push them into an array so they can be shown on the site for manual tag editing:
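// Declared async so it still returns a Promise, as before
async function cleanUniqueDocument(request) {
    const seen = new Set() // tag keys already encountered
    const duplicates = []  // documents whose tags repeat an earlier document
    for (const doc of request) {
        const key = doc.tags.join(',')
        if (seen.has(key)) {
            duplicates.push(doc)
        } else {
            seen.add(key)
        }
    }
    return duplicates
}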

This works fine for manual editing on the site itself: I see a list of duplicates and edit them.
Now I'm thinking about how to automate the process so that, instead of editing tags, each duplicate immediately gets a canonical link taken from the parent's URL. In effect, I need to store the link along with the tags, i.e. build an array of objects, then look for duplicates by tags in that array and, on a match, take the parent URL from the array and update the document. My head is a mess because of all this =(
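Roughly, I imagine something like the sketch below. It is only an illustration under assumptions: the first document seen with a given tag list is treated as the parent, and the function name planCanonicalUpdates and the field name canonical are placeholders I made up. It keeps a Map from the tag list to the parent's href and returns a list of update operations instead of touching the database.

```javascript
// Sketch: the first document seen with a given tag list is treated as the parent;
// every later document with the same tags gets the parent's href as `canonical`.
// `planCanonicalUpdates` and the `canonical` field name are hypothetical.
function planCanonicalUpdates(docs) {
    const parentHrefByTags = new Map() // tag key -> href of the first document seen
    const operations = []

    for (const doc of docs) {
        // JSON.stringify keeps tags that contain commas distinct, unlike join(',')
        const key = JSON.stringify(doc.tags)

        if (!parentHrefByTags.has(key)) {
            // First occurrence of this tag list becomes the parent
            parentHrefByTags.set(key, doc.href)
        } else {
            // Later occurrence: plan an update pointing at the parent
            operations.push({
                updateOne: {
                    filter: { _id: doc._id },
                    update: { $set: { canonical: parentHrefByTags.get(key) } }
                }
            })
        }
    }
    return operations
}

// Example with plain objects standing in for database documents
const sample = [
    { _id: 1, tags: ['cat', 'grey'], href: '/wallpapers/1' },
    { _id: 2, tags: ['dog'], href: '/wallpapers/2' },
    { _id: 3, tags: ['cat', 'grey'], href: '/wallpapers/3' }
]
const operations = planCanonicalUpdates(sample)
console.log(JSON.stringify(operations, null, 2))
```

The returned array has the updateOne shape that bulkWrite accepts, so, assuming Wallpapers exposes bulkWrite (Mongoose models and driver collections both do), it could be passed there once the result looks right. Note that duplicates are only found within the array passed in, so fetching 500 records at a time would miss duplicates across batches unless the Map is kept between calls and the documents arrive in a stable order (for example sorted by _id).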
