htaccess
Sergey Shinkarev, 2020-04-23 23:26:41

How to get rid of technical duplicates?

An interesting problem, more of a research exercise than a practical one.

I was tidying up my own site after some big changes and found addresses that do not physically exist on the site. Instead of a 404 page, WordPress serves them with status 200. The URLs look like this:

https://www.example.com/page/11/?area=1&p=news&new...
https://www.example.com/page/11/?showfile=1&fid=55...
https://www.example.com/page/11/?p=user&id=229&area=1
https://www.example.com/page/11/?type=interview&ar...
https://www.example.com/page/11/?p=contact&area=1
https://www.example.com/page/11/?p=misc&do=autowor...

Everything after the final slash was once generated by the site's first engine, 15 years ago. Where these URLs come from now is a separate story and beside the point here.

I did not consider blocking them in robots.txt, because these addresses should not exist at all. So a directive in .htaccess is needed. I spent three days reading around the web and trying various options, and every one of them ended in a redirect loop. Disabling plugins and switching to the default URL settings did not help.

In the end I worked out a correct directive yesterday, but so far only for pagination pages such as /page/11/.

Next, I checked all the address variations for duplicates. Here is what I found.
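
The directive itself is not shown in the post. As a rough sketch only, and not necessarily the rule the author arrived at, a query-stripping redirect for pagination URLs could look like the following. It assumes Apache 2.4 or later (for the QSD flag), that the rules sit in the site-root .htaccess above the "# BEGIN WordPress" block, and that search pagination (?s=...) is the only query string worth keeping on these URLs.

```apache
RewriteEngine On

# Act only when the request carries a non-empty query string,
# so the redirected (clean) URL cannot match again and loop.
RewriteCond %{QUERY_STRING} .
# Assumption: keep WordPress search pagination such as /page/2/?s=term.
RewriteCond %{QUERY_STRING} !(^|&)s= [NC]
# /page/<number>/ -> same path without the query string (QSD discards it).
RewriteRule ^page/[0-9]+/?$ %{REQUEST_URI} [QSD,R=301,L]
```

If these addresses should disappear rather than redirect, ending the rule with `- [G]` in place of the substitution and flags above would make Apache answer 410 Gone instead.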

1. Post feed

https://www.example.com/kak-snegovik-v-adu/59030
A request with any number appended after the final slash returns 301, then 200 at:
https://www.example.com/kak-snegovik-v-adu/59030/

https://www.example.com/KAK-SNEGOVIK-V-ADU/
A request with UPPERCASE characters in the address returns 200 immediately.

2. Articles

https://www.example.com/stories/russian-poetry/07
- status 301 and redirect to -
https://www.example.com/stories/russian-poetry/7/

Any number ends in a 200 response, although different numbers are handled differently.

https://www.example.com/STORIES/russian-poetry/ - 200
https://www.example.com/stories/RUSSIAN-POETRY/ - 200

A request with UPPERCASE characters in any part of the address except the domain also works.

https://www.example.com/russian-poetry/stories/ - swapping the nesting order gives a 301 to
https://www.example.com/stories/

3. Post formats

https://www.example.com/type/image/ - everything is handled as it should be. Multiple slashes are collapsed correctly, too. Marvelous.

4. Attachment pages

https://example.com/wp-content/uploads/2015/11/201... - an unexpected result. Remove www and the response is 200. For all other content types, the redirect to the main address, with www, works correctly.

https://www.example.com/wp-content///////uploads//... is also odd. Multiple slashes are collapsed everywhere except in attachment addresses.
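
A likely explanation, stated as an assumption: files under wp-content/uploads exist on disk, so the stock WordPress rewrite block skips them and Apache serves them directly, which means WordPress's own canonical redirect never runs for them. Under that assumption, the host and slash normalization for these addresses would have to happen in .htaccess itself. A minimal sketch, using the placeholder host from the examples above and assuming the rules are placed above the "# BEGIN WordPress" block; the behavior is worth verifying on the actual hosting:

```apache
RewriteEngine On

# Collapse repeated slashes in the path. THE_REQUEST holds the raw request
# line, so the condition sees the slashes exactly as the client sent them.
RewriteCond %{THE_REQUEST} ^[A-Z]+\s[^?\s]*//
# In .htaccess context the path matched by RewriteRule already has repeated
# slashes merged, so $1 is the collapsed path. Redirecting straight to the
# www host fixes both problems in one hop.
RewriteRule ^(.*)$ https://www.example.com/$1 [R=301,L]

# Send any remaining bare-domain request to the www host.
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]
```

Neither rule can loop: after the first redirect the request line no longer contains a double slash, and after the second the host no longer matches the bare domain.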

Now the actual question: how can this be fixed?

The site is on shared hosting with no access to the server configuration, so only .htaccess is available. Editing the WordPress code is highly undesirable, since the changes would have to be redone after every update.

And yes, I know that some of the above can be fixed with plugins (uppercase addresses, for example), but I want to work out a comprehensive solution. Besides, I tested the same URL patterns on several sites, and the problems are similar.

1 answer(s)
Viktor Taran, 2020-04-23
@shambler81

All CMSs have such addresses, sometimes up to 5 per page ;)
Here is what can be done:
1. Put rel canonical on every page of the site, pointing to the current page without GET parameters. Ideally, check at generation time: if the page is requested at its normal human-readable URL, do not add rel canonical; if there is a GET parameter, add a rel canonical pointing to the page without it.
2. I have already described all the other redirects here, and that should be enough for you:
https://klondike-studio.ru/standards/standartnyy-h...
