OpenCV on Android. How to speed up Haar-cascade / LBP-cascade?

I have run into very slow face detection with the OpenCV algorithms on Android (+ NDK).
Test device: Galaxy S4, Android 4.2.2.
I am building in Android Studio 2.3.3.
Overall the project builds and works, and it does detect faces, but it lags terribly.
I searched Stack Overflow, and people there had the same problem, but I still have not found a proper cure.
What I have understood / googled so far:
1. Haar is slower than LBP, by about 30-50%.
2. Reducing the size of the frame passed to the native method (to 640x480) noticeably improves performance, by roughly 150-200%.
3. If the image is "scaled", detection runs 5-10% faster.
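
Points 2 and 3 both come down to letting the detector scan fewer pixels. When detection runs on a downscaled copy, the rectangles it returns are in the small image's coordinates and have to be mapped back before drawing on the original frame. Here is a minimal sketch of that bookkeeping in plain C++, not a definitive implementation. `Box`, `toFullFrame` and `scaledPixelCount` are hypothetical names introduced for illustration; `Box` only stands in for `cv::Rect` so the sketch compiles without OpenCV.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for cv::Rect so the sketch compiles without OpenCV.
struct Box {
    int x, y, width, height;
};

// Map one box found on an image downscaled by `scale` back to
// full-frame coordinates.
Box toFullFrame(const Box& small, int scale) {
    assert(scale > 0);
    return Box{small.x * scale, small.y * scale,
               small.width * scale, small.height * scale};
}

// Same mapping for every detection returned for the small image.
std::vector<Box> toFullFrame(const std::vector<Box>& small, int scale) {
    std::vector<Box> full;
    full.reserve(small.size());
    for (const Box& b : small) {
        full.push_back(toFullFrame(b, scale));
    }
    return full;
}

// Number of pixels left to scan after downscaling, using integer
// division for both dimensions (cols / scale, rows / scale).
long scaledPixelCount(int cols, int rows, int scale) {
    assert(scale > 0);
    return static_cast<long>(cols / scale) * (rows / scale);
}
```

With a scale of 3, a 640x480 frame shrinks to 213x160, which is about nine times fewer pixels to scan. The minimum face size given to the detector would presumably have to be divided by the same factor, otherwise small faces would be missed.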

void detect( Mat& frame ); // forward declaration, defined below

JNIEXPORT void JNICALL Java_com_example_abuumaribnhattab_facedetectionlightv1_OpenCVfinder_faceDetection
( JNIEnv *, jclass, jlong addrRgba ){
      Mat& frame = *( Mat* )addrRgba;
      detect( frame );
}

void detect( Mat& frame ){

      //-- slower cascade...
      //String face_cascade_name = "/storage/emulated/0/data/haarcascade_frontalface_alt.xml";

      //--  faster cascade...
      String face_cascade_name = "/storage/emulated/0/data/lbpcascade_frontalface.xml";

      CascadeClassifier face_cascade;

      if( !face_cascade.load( face_cascade_name ) ){
            printf("--(!)Error loading\n"); return;
      }

      std::vector<Rect> faces;

      Mat frame_gray;

      cvtColor( frame, frame_gray, CV_BGR2GRAY );
      equalizeHist( frame_gray, frame_gray );

      //-- scale... wtf ?
      const int scale = 3;
      cv::Mat resized_frame_gray( cvRound( frame_gray.rows / scale ), cvRound( frame_gray.cols / scale ), CV_8UC1 );
      cv::resize( frame_gray, resized_frame_gray, resized_frame_gray.size() );

      //-- Detect faces LBP
      face_cascade.detectMultiScale( frame_gray, faces, 1.1, 2, 0, Size(30, 30) );

      //-- Detect faces haar
      //face_cascade.detectMultiScale( frame_gray, faces, 1.1, 2, 0|CV_HAAR_SCALE_IMAGE, Size(50, 50) );

      //-- Detect faces haar with options
      //face_cascade.detectMultiScale( frame_gray, faces, 1.1, 2, HaarOptions, Size(50, 50) );

      for( size_t i = 0; i < faces.size(); i++ ) {
            Point center( faces[i].x + faces[i].width*0.5, faces[i].y + faces[i].height*0.5 );
            ellipse( frame, center, Size( faces[i].width*0.5, faces[i].height*0.5), 0, 0, 360, Scalar( 255, 0, 255 ), 4, 8, 0 );
            Mat faceROI = frame_gray( faces[i] );

I don't think the rest of the code is needed. The heavy lifting happens in the native code, and that is where performance collapses (the application uses around 90% of its memory (18 MB) and 90-95% of the CPU).
If anyone has already worked through this problem, please share ways to speed it up!

Vitaly Stolyarov, 2017-07-07

You can try running the computation on the GPU using OpenCL.
OpenCV can work with it, but it seems that this does not apply to all functions.

