EvilZone

: Regex Help
: bubzuru August 13, 2012, 03:46:14 AM
I'm trying to parse the HideMyAss proxy list, but its markup is very strange (I guess that's so people can't parse it, lol).

here is an example:
: (html)
<tr class="altshade"  rel="12113224">
         <td class="leftborder timestamp" rel="1344818823"><span class="updatets ">
32 secs</span></td>
         <td><span><style>
.IoxS{display:none}
.tCrX{display:inline}
.kQk4{display:none}
.RUhO{display:inline}
</style><span class="165">84</span><span style="display:none">220</span><div style="display:none">47</div>.41<span class="kQk4">53</span><div style="display:none">252</div>.<span class="51">108</span><span style="display:none">136</span><span class="IoxS">246</span><span></span>.74</span></td>   
         <td>
8080</td>
         
         <td rel="si"><span class="country"><img src="http://static.hidemyass.com/flags/si.png" alt="flag" /> Slovenia</span></td>
         
         <td> <div class="speedbar response_time" rel="5205">
    <div class="medium" style="width:48%"> </div>
        </div>
         </td>
             <td> <div class="speedbar connection_time" rel="3041">
    <div class="medium" style="width:39%"> </div>
             
        </div>
             </td>
     
             <td>HTTP</td>
             <td class="rightborder">Low</td>
         
         </tr>

The rendered output is something like this:
  32 secs 84.41.108.74 8080 (http://static.hidemyass.com/flags/si.png) Slovenia HTTP Low

I need to extract the info. Anyone got any ideas?
         
: Re: Regex Help
: Deque August 13, 2012, 10:25:46 AM
Regex alone is not suitable for this. Use an HTML parser library to get the contents of the table.
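
A minimal sketch of that suggestion, written in Python with only the standard library's `html.parser` (not a library named in this thread). The `CellTextParser` class is a hypothetical helper, and the row is an abridged copy of the sample above that leaves out the obfuscated IP cell, since plain text extraction would also pick up its hidden decoy numbers.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collects the text of each <td> in a table row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []       # finished cell texts, in document order
        self._current = None  # text pieces of the <td> being read, None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._current = []

    def handle_endtag(self, tag):
        if tag == "td" and self._current is not None:
            # Join the pieces and collapse runs of whitespace and newlines.
            self.cells.append(" ".join("".join(self._current).split()))
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


# Abridged from the sample row in the first post (IP cell omitted).
row = """
<tr class="altshade" rel="12113224">
  <td>
8080</td>
  <td rel="si"><span class="country"><img src="http://static.hidemyass.com/flags/si.png" alt="flag" /> Slovenia</span></td>
  <td>HTTP</td>
  <td class="rightborder">Low</td>
</tr>
"""

parser = CellTextParser()
parser.feed(row)
parser.close()
print(parser.cells)  # ['8080', 'Slovenia', 'HTTP', 'Low']
```

The parser tracks where each cell starts and ends, which a single regular expression over the raw markup cannot do reliably.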
: Re: Regex Help
: Simba August 13, 2012, 12:59:07 PM
Do you need this done automatically?
I believe it's JavaScript that populates the table.
So you would need the generated source code to use regex.
On that page, paste this into the URL bar:
:
javascript:%20var%20win%20=%20window.open();%20win.document.write('<html><head><title>Generated%20HTML%20of%20%20'%20+%20location.href%20+%20'</title></head><pre>'%20+%20document.documentElement.innerHTML.replace(/&/g,%20'&amp;').replace(/</g,%20'&lt;')%20+%20'</pre></html>');%20win.document.close();%20void%200;
and you will get the generated source code.
: Re: Regex Help
: bubzuru August 13, 2012, 03:52:00 PM
I have the generated source; I just need to parse it.
Deque's idea sounds the best, I will look into it.
: Re: Regex Help
: bubzuru August 14, 2012, 06:14:02 AM
This is just way too hard; I'm going to need to find a different way.
: Re: Regex Help
: NeX August 16, 2012, 02:27:03 AM
If you're able to extract the contents from the page (like get the data from the table), then the first thing I can come up with is:
:
^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s(\d{1,5})\s(\w+)\s(\w+)\s(\w+)$
Haven't tested it, but it should work in your case. The extractions are:
1. The time
2. The IP address
3. The port
4. Country
5. Type
6. Speed/anonymity/whatever ?

Oh, and also, I presumed that the results would ALWAYS be valid, i.e. no 999.999.999.999-type IP addresses, no ports bigger than 65535, etc.
I've heard there's a tool for regex (RegexBuddy, if I'm right) to make your life easier XD
If you have any other questions you can ask here or PM me :)




EDIT:
:
http://regexpal.com/
This website says that I forgot a few +'s... Fixed regex:
:
^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(\d{1,5})\s+(\w+)\s(\w+)\s+(\w+)$
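
A small sketch of how the fixed pattern could be applied, in Python. The pattern is copied verbatim from the post; the `parse_row` helper and its range checks are assumptions added here to cover the "999.999.999.999" and oversized-port cases the post says it does not handle. The input is assumed to be the sample row as plain text without the flag image URL.

```python
import re

# The fixed pattern from the post above, copied verbatim.
ROW_RE = re.compile(
    r"^(.+)\s(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s+(\d{1,5})\s+(\w+)\s(\w+)\s+(\w+)$"
)


def parse_row(line):
    """Return the six captured groups as a tuple, or None (hypothetical helper)."""
    match = ROW_RE.match(line)
    if match is None:
        return None
    ip, port = match.group(2), match.group(3)
    # The pattern only checks digit counts, so validate the ranges separately.
    if any(int(octet) > 255 for octet in ip.split(".")):
        return None
    if not 0 < int(port) <= 65535:
        return None
    return match.groups()


# Assumed input: the sample row as plain text, without the flag image URL.
print(parse_row("32 secs 84.41.108.74 8080 Slovenia HTTP Low"))
# ('32 secs', '84.41.108.74', '8080', 'Slovenia', 'HTTP', 'Low')
```

Two limits are worth knowing: the line exactly as quoted in the first post, with `(http://static.hidemyass.com/flags/si.png)` in it, does not match, and neither does a country name of more than one word, because the pattern expects exactly three single words after the port.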
: Re: Regex Help
: bubzuru August 16, 2012, 07:22:01 PM
+1 for the help, but I don't think regex is the way to go.
xzid and I are working on something. He is an excellent coder, btw; he should get more props than he does.
: Re: Regex Help
: bubzuru August 20, 2012, 03:13:38 AM
OK, I got this to work (thanks to xzid). Here is the code for those interested
(C#, HtmlAgilityPack).

Get the column (the <td>) containing the IP info and pass it in as an HtmlNode. It returns the IP as a string.
: (c)
        // Requires: using System.Collections.Generic; using System.Text.RegularExpressions; using HtmlAgilityPack;
        public static string DecodeIp(HtmlNode html)
        {
            List<string> displayInlineNames = new List<string>(); // Class names declared {display:inline}
            List<string> bits = new List<string>();               // Visible fragments of the IP, in document order

            // Collect the class names whose rule is {display:inline}.
            // InnerText includes the <style> text, so each piece before a '}' looks like "\n.tCrX{display:inline".
            foreach (string rule in html.InnerText.Split('}'))
            {
                int brace = rule.IndexOf('{');
                if (brace > 0 && rule.Contains("inline"))
                    displayInlineNames.Add(rule.Substring(0, brace).Trim()); // e.g. ".tCrX"
            }

            // All direct children of the outer <span>: the <style> block, the real and decoy elements, and bare text
            HtmlNodeCollection ipInfo = html.SelectNodes("span/node()");
            if (ipInfo == null) return string.Empty; // SelectNodes returns null when nothing matches

            // Loop through the nodes and keep the good IP bits
            foreach (HtmlNode node in ipInfo)
            {
                string className = node.GetAttributeValue("class", string.Empty);
                // Strip spaces so both "display:inline" and "display: inline" are recognised
                string style = node.GetAttributeValue("style", string.Empty).Replace(" ", "");

                bool isLoneText = node.NodeType == HtmlNodeType.Text;                    // bare text such as ".41"
                bool styledInline = style.Contains("display:inline");                    // inline style marks it visible
                bool numericClass = className.Length > 0 && char.IsNumber(className[0]); // class name starts with a digit
                bool inlineClass = displayInlineNames.Contains("." + className);         // class is a "good" one

                // Add each node at most once, even if it satisfies several rules
                string text = node.InnerText.Trim();
                if (text.Length > 0 && (isLoneText || styledInline || numericClass || inlineClass))
                    bits.Add(text);
            }

            // Join the bits with '.', then collapse repeated periods: "84..41...108..74" becomes "84.41.108.74"
            string ip = string.Join(".", bits.ToArray());
            return Regex.Replace(ip, @"[.]{2,}", ".");
        }
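
For readers without C#, here is a rough Python sketch of the same idea using only the standard library. It is not a port of the code above: it takes the opposite formulation and drops whatever is hidden (inline `display:none`, or a class that the `<style>` block sets to `display:none`) instead of listing what is visible. The names `VisibleTextParser`, `decode_ip` and `SAMPLE_CELL` are hypothetical, and the sample is the IP cell from the first post.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # tags with no closing tag


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside a hidden element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hidden_classes = set()  # class names the <style> block sets to display:none
        self.parts = []              # visible text fragments, in document order
        self._style_text = None      # CSS text while inside <style>, else None
        self._stack = []             # one flag per open element: is it hidden?

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._style_text = []
            return
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        style = (attrs.get("style") or "").replace(" ", "")
        classes = (attrs.get("class") or "").split()
        hidden = "display:none" in style or any(c in self.hidden_classes for c in classes)
        self._stack.append(hidden)

    def handle_endtag(self, tag):
        if tag == "style":
            css = "".join(self._style_text or [])
            self.hidden_classes.update(
                re.findall(r"\.([\w-]+)\s*\{\s*display\s*:\s*none", css)
            )
            self._style_text = None
        elif tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        if self._style_text is not None:
            self._style_text.append(data)
            return
        text = data.strip()
        if text and not any(self._stack):
            self.parts.append(text)


def decode_ip(cell_html):
    """Join the visible fragments with '.' and collapse repeated dots, as the C# does."""
    parser = VisibleTextParser()
    parser.feed(cell_html)
    parser.close()
    return re.sub(r"\.{2,}", ".", ".".join(parser.parts))


# The IP cell from the sample row in the first post.
SAMPLE_CELL = """<td><span><style>
.IoxS{display:none}
.tCrX{display:inline}
.kQk4{display:none}
.RUhO{display:inline}
</style><span class="165">84</span><span style="display:none">220</span><div style="display:none">47</div>.41<span class="kQk4">53</span><div style="display:none">252</div>.<span class="51">108</span><span style="display:none">136</span><span class="IoxS">246</span><span></span>.74</span></td>"""

print(decode_ip(SAMPLE_CELL))  # 84.41.108.74
```

The sketch assumes the `<style>` block comes before the elements it styles, as it does in the sample, and that the hiding rules are limited to the two forms shown there.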