NAS
Alexander Pashchenko, 2012-12-18 14:48:15

The mystery of the slow FibreChannel NAS

Good afternoon.
At our company we decided to get serious and do everything "like the grown-ups": NAS, SAN, FibreChannel and Hyper-V. We bought the equipment, assembled it, installed it, and... ran into a problem.
In short, the problem is disk storage performance: it fluctuates and drops to very low values.
For the full version, read on.
So, given:
2x IBM DS3512 NAS, each equipped with 12 SAS 15k disks of 600GB, labeled as IBM (in reality they seem to be made by Seagate specifically for this NAS model). Each NAS also has two (2) 8Gbit FibreChannel cards with 4 ports each. The unit has 2 "heads" with independent access to the disks and, accordingly, one FC card per head.
2x IBM SAN24B-5 FibreChannel Switched Fabric SAN switches, also with 8Gbit ports.
3x IBM 3550 M4 7414-F2G servers. Each server has a QLogic FibreChannel card with 2 8Gbit ports. The servers also have internal SAS disks.
Everything is branded, compatible, and assembled according to vendor recommendations and best practices.
On each NAS, a RAID5 was built from all 12 disks and presented to the SAN. A 4TB partition (GPT, NTFS) and a couple of smaller partitions were created on the RAID.
The servers currently run Windows Server 2012 (180-day trial). The drivers for all the hardware are the latest from the official IBM site.
For simplicity, we will consider one server, one switch, and one NAS. Everything else is excluded from the experiment.
Test:
We take a large file, for example 4-8GB, stored on the server's local disk. Using Windows, we copy it to the partition presented from the NAS and observe the effects.
1) For the first 1-4 seconds the copy runs at 300-800 MB/s. Then the speed drops, smoothly or sharply, to 30-60 MB/s and keeps gradually decreasing. Sometimes, however, the entire file is copied at full speed.
2) Before and after copying, the copy window may freeze and not respond to the mouse for 1-20 seconds (sometimes more). Sometimes there is no freeze.
3) While copying, the disk lights on the NAS blink intensively. While the copy window is "hanging", the disk activity lights on the NAS neither blink nor stay lit.
4) When we try to delete a file freshly copied to the NAS, the delete window freezes for 20-50 seconds and only then deletes the file.
5) We tried copying a file that is already on the NAS to a different folder on the same NAS; the problems are similar.
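The Explorer copy window gives only a rough picture of the speed. A minimal sketch of a more repeatable version of the same test is below: it writes a large file to the target volume and records throughput per interval, so the moment of the drop can be seen in numbers. The function name, chunk size and sampling interval are illustrative assumptions, not something used in the original test.

```python
import os
import time


def measure_write_throughput(path, total_bytes, chunk_bytes=4 * 1024 * 1024, interval_s=1.0):
    """Write total_bytes to path and return a list of (elapsed_s, MB/s) samples."""
    chunk = os.urandom(chunk_bytes)  # incompressible data, reused for every write
    samples = []
    written = 0
    window_bytes = 0
    start = window_start = time.perf_counter()
    # buffering=0 disables Python's own buffer; the OS cache is still in play
    with open(path, "wb", buffering=0) as f:
        while written < total_bytes:
            n = f.write(chunk[: min(chunk_bytes, total_bytes - written)])
            written += n
            window_bytes += n
            now = time.perf_counter()
            if now - window_start >= interval_s:
                rate = window_bytes / max(now - window_start, 1e-9) / 1e6
                samples.append((now - start, rate))
                window_start, window_bytes = now, 0
        os.fsync(f.fileno())  # include the time needed to drain OS buffers
    end = time.perf_counter()
    if window_bytes:
        rate = window_bytes / max(end - window_start, 1e-9) / 1e6
        samples.append((end - start, rate))
    return samples
```

Usage would look like `measure_write_throughput(r"E:\test.bin", 8 * 1024**3)`, where `E:` is a hypothetical drive letter for the volume presented from the NAS. A sharp fall after the first few samples would match symptom 1.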
Disclaimer, or "what we have already tried":
- Connecting the server and the NAS directly, bypassing the switch.
- Leaving a single server connected to the NAS via a single link.
- Doing the same with the other NAS and another server.
- Installing Windows Server 2008 R2.
- Creating a smaller partition on the storage: 500GB (GPT).
The symptoms are the same.
What could it be? Where should we look, where should we dig?
P.S. Sorry for the terminology. I may have confused the names of some hardware, but overall the picture is correct.

6 answer(s)
Mikhail Konyukhov, 2012-12-18
@piromanlynx

It looks like the moment the slowdown starts is the moment the write cache runs out: the buffer filled up and direct writing to disk began.
P.S. I don't know how it works on your hardware and software, but I had the same problem with Linux + ext4 + iSCSI, and it was the write cache running out.
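To make the "cache ran out" idea concrete, here is a toy model, assuming the cache accepts data at a burst rate while draining to disk at a slower sustained rate. All numbers in the example are hypothetical, not measured on the DS3512, and the model does not explain the freezes or a speed that keeps falling below the sustained rate.

```python
def copy_time_with_write_cache(file_mb, cache_mb, burst_mb_s, disk_mb_s):
    """Return (seconds_at_burst_speed, total_seconds) for a fill-then-drain cache model."""
    if burst_mb_s <= disk_mb_s:
        # The disks keep up with the incoming data, so the cache never fills.
        t = file_mb / burst_mb_s
        return t, t
    # While data arrives at burst speed, the cache fills at (burst - disk) MB/s.
    t_fill = cache_mb / (burst_mb_s - disk_mb_s)
    burst_data = burst_mb_s * t_fill  # MB accepted before the cache is full
    if burst_data >= file_mb:
        # The whole file fits into the burst phase: copy finishes at full speed.
        t = file_mb / burst_mb_s
        return t, t
    # After that, incoming data is limited to the sustained disk rate.
    remaining = file_mb - burst_data
    return t_fill, t_fill + remaining / disk_mb_s


# Hypothetical example: 8000 MB file, 1000 MB cache, 500 MB/s burst, 50 MB/s sustained
burst_s, total_s = copy_time_with_write_cache(8000, 1000, 500, 50)
print(f"fast for {burst_s:.1f} s, total {total_s:.1f} s")
```

With these assumed values the copy is fast for about 2 seconds and then crawls, which resembles symptom 1. A small enough file finishes entirely within the burst phase, which would also match "sometimes it copies the entire file at full speed".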

Alexander Pashchenko, 2012-12-18
@point212

Well, "not bad" is hardly the word. It should be excellent.
The storage system has to hold the disk images of virtual machines so that they can be migrated between servers.
Naturally, these images will change constantly, and of course they need acceptable performance.
An ordinary hard disk provides a write speed of about 100 MB/s. On a storage system in RAID5 the total speed should be... I don't know exactly... but obviously not less than 100 MB/s.
Of course, it is wrong to measure everything in MB/s, but unfortunately I am shaky on IOPS.
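For readers equally unsure about IOPS: the two units are linked by the I/O block size, since throughput is simply IOPS multiplied by the size of each operation. A minimal sketch of the conversion follows; the block sizes are illustrative assumptions, and 1 MB is taken as 1024 KB.

```python
def throughput_mb_s(iops, block_kb):
    """Throughput in MB/s for a given IOPS rate and block size in KB."""
    return iops * block_kb / 1024.0


def iops_for_throughput(mb_s, block_kb):
    """IOPS needed to sustain mb_s MB/s with blocks of block_kb KB."""
    return mb_s * 1024.0 / block_kb


# The same 100 MB/s means very different IOPS depending on block size:
print(iops_for_throughput(100, 64))  # 1600.0 (large sequential blocks)
print(iops_for_throughput(100, 4))   # 25600.0 (small random blocks)
```

This is why a single MB/s figure from a file copy says little about how virtual machine images, which tend to generate small random I/O, will behave.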

mark_ablov, 2012-12-18
@mark_ablov

Have you tried tuning the queue depth in Windows?
AFAIR, on QLogic adapters it is too low by default.
And in general, have you looked at *nix, at least for testing?
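Why queue depth can matter is easiest to see through Little's law: the number of I/Os completed per second cannot exceed the number of outstanding I/Os divided by the latency of each one. The sketch below is only an upper-bound estimate; the 5 ms latency is an assumed figure, and it presumes the array can actually service the queued requests in parallel.

```python
def max_iops(queue_depth, latency_ms):
    """Upper bound on IOPS from Little's law: outstanding I/Os / latency per I/O."""
    return queue_depth * 1000.0 / latency_ms


# Assumed 5 ms per I/O as seen by the host:
print(max_iops(1, 5.0))   # 200.0
print(max_iops(32, 5.0))  # 6400.0
```

If the HBA or driver limits the queue to a handful of outstanding requests, the host may never keep 12 disks busy, regardless of link speed.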

akurash, 2012-12-18
@akurash

Option 1. As far as I understand, the write cache is usually not enabled (Enable Write Caching - No) when the cache has no backup, i.e. when the controller cannot preserve the cache contents if external power fails. As far as I know, two cache backup technologies are currently in use: backup flash memory (for example, HP FBWC) and a backup power supply for the controller's cache chips (the so-called Battery Backup Unit, BBU). So I think it makes sense to check your controller and, if necessary, buy a BBU "battery" for it. In any case, enabling the write cache will be a big plus for performance.
Option 2. I recently struggled with a similar issue. The culprit was not the cache but the antivirus installed on the server (Symantec Endpoint Protection 11). Removing the antivirus (and later replacing it with a different one) solved the problem completely.

amc, 2012-12-19
@amc

"For the first 1-4 seconds the copy runs at 300-800 MB/s."
The file goes into the cache.

"Then the speed drops, smoothly or sharply, to 30-60 MB/s and keeps gradually decreasing."
The cache has run out, and we are writing at the real speed.

To test:
- set all adapters to 4Gbps, or down to 2Gbps if necessary;
- test without MPIO, with the shelf connected directly to the HBA;
- turn off the cache on the shelf;
- as a check, create a RAID-0 across all disks;
- verify: under these conditions you should get sufficient speed for both streaming and random writes.
Also, RAID-10 is two (three, four) RAID-1 arrays combined into a RAID-0. That is how it is created on IBM shelves.
And do not forget that you need at least one global spare disk in each shelf, so that you do not lose the array if a disk replacement takes too long.
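The suggestion to compare against RAID-0 makes sense in light of the usual rule-of-thumb write penalties: each random write costs roughly 1 back-end I/O on RAID-0, 2 on RAID-1/10 and 4 on RAID-5 (read data, read parity, write data, write parity). A rough sketch of the estimate follows; the penalty table is the common rule of thumb, and the 175 IOPS per 15k disk is an assumed typical figure, not a measurement from this shelf.

```python
# Rule-of-thumb back-end I/Os per random host write, without cache effects
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def random_write_iops(level, disks, iops_per_disk):
    """Estimate random-write IOPS of an array from per-disk IOPS and RAID level."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


# 12 disks, assumed 175 random IOPS each:
for level in ("raid0", "raid10", "raid5"):
    print(level, random_write_iops(level, 12, 175))
```

Under these assumptions RAID-5 gives about a quarter of the RAID-0 figure for random writes once the cache no longer absorbs them, so testing on RAID-0 helps separate the RAID-5 penalty from a fault elsewhere in the chain.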

Alexander Pashchenko, 2012-12-22
@point212

In general, none of this brought us any closer to an answer. The speed should be good even without any caches.
The official response from IBM is to update the firmware on everything we can. The only catch: we have already updated it everywhere.
