Scala
max_mara, 2013-07-18 16:18:06

Parsing text with a regular expression in Scala

Good evening,

Please help me write a regular expression that splits text like this into separate entries:

-2.0 RCVD_IN_RP_SAFE        RBL: Sender in ReturnPath Safe - Contact
                            [email protected]
                            [Return Path SenderScore Safe List (formerly]
                    [Habeas Safelist) - <http://www.senderscorecertified.com>]
-3.0 RCVD_IN_RP_CERTIFIED   RBL: Sender in ReturnPath Certified - Contact
                            [email protected]
                            [Return Path SenderScore Certified {formerly]
                      [Bonded Sender} - <http://www.senderscorecertified.com>]
 0.0 URIBL_BLOCKED          ADMINISTRATOR NOTICE: The query to URIBL was blocked.
                            See
                            http://wiki.apache.org/spamassassin/DnsBlocklists#dnsbl-block
                             for more information.
                            [URIs: securepaynet.net]
 0.0 HTML_IMAGE_RATIO_06    BODY: HTML has a low ratio of text to image area
 0.0 HTML_MESSAGE           BODY: HTML included in message
 1.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts


There is another screenshot here

The result should be a Seq[String]:
  • -2.0 RCVD_IN_RP_SAFE RBL: Sender in ReturnPath Safe - Contact [email protected] [Return Path SenderScore Safe List (formerly] [Habeas Safelist) - < www.senderscorecertified.com >]
  • ....
  • 0.0 HTML_IMAGE_RATIO_06 BODY: HTML has a low ratio of text to image area


Another very similar example is SMTP headers:
Received: from f365.mail.ru (f365.mail.ru. [217.69.141.7])
        by mx.google.com with ESMTPS id m8si4951597lbs.75.2013.07.18.02.42.19
        for <[email protected]>
        (version=TLSv1 cipher=RC4-SHA bits=128/128);
        Thu, 18 Jul 2013 02:42:20 -0700 (PDT)
Received-SPF: pass (google.com: domain of [email protected] designates 217.69.141.7 as permitted sender) client-ip=217.69.141.7;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of [email protected] designates 217.69.141.7 as permitted sender) [email protected];
       dkim=pass [email protected]
DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=mail.ru; s=mail2;
  h=Content-Type:Message-ID:Reply-To:Date:Mime-Version:Subject:To:From; bh=DA3SeGPFw4gtOV39cJaYJXRjKwbhtXq1/TjXi0eSlm0=;
  b=ktnUtYx5gZlvyeE6y79DKGU1Atdl6dqWj5y1LQS03fjdLsZpCml86mcAMMMeRA00bPR/mQ+1mF9ifDAKJgfWFrJAfyNtFecq7lv+MbE3Sq1KM9IxnAVcEWUI9ZGFEzD3tF4vxCuZKwz4OqtO6cIO7+Muss18YJ8csVvKkdQyGsQ=;


1 answer(s)
grender, 2013-07-20

I'm not strong in regular expressions, so I apologize if something is off.

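// One entry: a score such as "-2.0" or " 0.0", the rest of that line,
// plus any following lines that start with at least three spaces.
val firstRegExp = """([ -]?\d\.\d.*(?:[\n\r]?[\n\r]?   .*)*)""".r
// Runs of spaces and line breaks, to be collapsed into a single space.
val spaceRegexp = """[ \n\r]+""".r
val data = scala.io.Source.fromFile("c:/temp/temp.txt").mkString
val result = firstRegExp.findAllIn(data).matchData.map(_.group(1)).toSeq
// trim drops the leading space left over from entries such as " 0.0 ...".
val finalResult = result.map(ss => spaceRegexp.replaceAllIn(ss, " ").trim)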

I wrote this on Windows, after pasting your test text into Notepad. Because of that, I had to account for its line-ending format (the "[\n\r]?[\n\r]?" part).
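For illustration, the line-ending handling can also be spelled out explicitly. The following is only a sketch and not the code above: the names ReportEntries and parse are made up for the example, and it assumes that every entry starts at the beginning of a line with an optional space or minus sign followed by a score, and that continuation lines are indented by three or more spaces, as in the sample text. It takes the text as a string instead of reading a file.

```scala
object ReportEntries {
  // (?m)^      : match only at the start of a line
  // [ -]?\d+\.\d+ : optional space or minus sign, then a score such as 2.0
  // \r?\n {3,} : a Windows (CRLF) or Unix (LF) line break followed by an
  //              indented continuation line
  private val entry = """(?m)^[ -]?\d+\.\d+.*(?:\r?\n {3,}.*)*""".r

  // Any run of whitespace, including line breaks.
  private val whitespace = """\s+""".r

  // Returns one cleaned-up string per report entry.
  def parse(report: String): Seq[String] =
    entry.findAllIn(report).toList
      .map(e => whitespace.replaceAllIn(e, " ").trim)
}
```

Anchoring the score to the start of a line keeps something like an IP address in the middle of a line from being taken for a new entry.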
The second regular expression strips the extra spaces and line breaks. Most likely this can be done in a single pass, but as I said, regular expressions are not my forte. In principle, all of this can be crammed into a hellish one-liner, though without comments I would not want to have to read it.
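A different way to get there, which should also cover the SMTP header case from the question, is to split the text at every line break that is not followed by indentation. This is a rough sketch rather than a tested solution for real mail: UnfoldSketch, unfold and minIndent are hypothetical names, and the choice of minIndent is an assumption taken from the samples (entries in the report have at most one leading space, so 2 works there; folded header lines start with at least one space or tab, so 1 works for headers).

```scala
object UnfoldSketch {
  // Splits `text` into logical entries. A line indented by at least
  // `minIndent` spaces or tabs is treated as a continuation of the
  // previous line. Whitespace runs inside an entry become one space.
  def unfold(text: String, minIndent: Int): Seq[String] = {
    // A line break (CRLF or LF) NOT followed by `minIndent` or more
    // spaces/tabs marks the boundary between two entries.
    val entryBreak = s"\\r?\\n(?![ \\t]{$minIndent,})"
    text.split(entryBreak).toSeq
      .map(_.replaceAll("\\s+", " ").trim)
      .filter(_.nonEmpty) // drop blank lines
  }
}
```

With this, UnfoldSketch.unfold(report, 2) would handle the spam report and UnfoldSketch.unfold(headers, 1) the header block, without having to describe what the first line of an entry looks like.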
