NetTalk Central

Author Topic: How can I parse this page?  (Read 9600 times)

Alberto

  • Hero Member
  • *****
  • Posts: 1846
    • MSN Messenger - alberto-michelis@hotmail.com
    • View Profile
    • ARMi software solutions
    • Email
How can I parse this page?
« on: March 07, 2012, 01:52:45 PM »
Hi, I am reading and parsing some pages without a problem.
But not this one:
http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx
It seems the result comes from JavaScript or something similar, because when I read the page I don't see any of the data that is displayed in the browser.
Is there anything I can do?
Thanks
-----------
Regards
Alberto

Alberto

Re: How can I parse this page?
« Reply #1 on: March 26, 2012, 02:31:09 AM »
Please help!
Thanks
-----------
Regards
Alberto

Bruce

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 11186
    • View Profile
Re: How can I parse this page?
« Reply #2 on: April 27, 2012, 12:44:02 AM »
I opened the Weather example,
\Examples\NetTalk\WebClient\Web Client Weather (requires xFiles)\weather.sln
then compiled and ran it.

I clicked on the Drive button, entered the URL http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx on the first tab, and did a FETCH.

Once the page returns (it's quite slow from here) I can see the raw text of the page on the Response tab, also the cookies, and form fields on the other tabs. If you want to then set one of the fields you can do that too.

Watching the page in Firebug, I can see that when it opens it makes a number of Ajax requests to the server. Presumably these populate the data. You'll also see a timer firing an Ajax request from time to time. I recommend you inspect the Ajax requests and responses to see which one returns the data you are looking for.

To "simulate" an Ajax request, set the following as a custom header in the web client:
X-Requested-With: XMLHttpRequest
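
For anyone who wants to see the idea outside NetTalk, here is a minimal sketch in Python (standard library only, not the NetTalk API). It only builds the request with the custom header and does not send it; the function name `build_ajax_request` is made up for illustration.

```python
import urllib.request

URL = "http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx"


def build_ajax_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a GET request flagged as an Ajax call."""
    return urllib.request.Request(
        url,
        headers={"X-Requested-With": "XMLHttpRequest"},
        method="GET",
    )


req = build_ajax_request(URL)

# urllib normalises stored header names with str.capitalize(),
# so look the header up case-insensitively.
sent = {name.lower(): value for name, value in req.header_items()}
print(sent["x-requested-with"])
```

Passing `req` to `urllib.request.urlopen` would actually perform the fetch. Whether this particular server looks at the header at all is something only a capture of the real traffic can confirm.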

cheers
Bruce



Alberto

Re: How can I parse this page?
« Reply #3 on: June 15, 2012, 09:08:27 AM »
Dear Bruce,
I finally installed Fiddler to watch what the page does to read its table values.
If you run Fiddler and fetch the following URL with the browser:

http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx

then choose "Cotizaciones/AccionesLideres" from the menu,
you will see that the page sends a POST to:

POST http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx/GetDataPack HTTP/1.1
Host: nuevo.bolsar.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: es-es,es;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Content-Type: application/json; charset=utf-8
Referer: http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx
Content-Length: 405
Cookie: __utma=132722449.1494634927.1338832271.1338859632.1339777845.3; __utmz=132722449.1338859632.2.2.utmcsr=200.68.94.242|utmccn=(referral)|utmcmd=referral|utmcct=/uVistaDataEspecie; ASP.NET_SessionId=evcuyhygsm5jvdijfw0p3k45; ckLng=ESP; __utmc=132722449
Pragma: no-cache
Cache-Control: no-cache

{"aEstadoTabla":[{"TablaNombre...............................

The values returned are a JSON-formatted string like:

{"d":[{"TablaNombre":"tbAcciones","MRC":20000,"CantidadTotalFilas":15,"aTabla":[{"PrecioCompraCambiado":false,"PrecioVentaCambiado":false,"PrecioUltimoCambiado":false,"CantidadNominalCompraCambiada":false,"CantidadNominalVentaCambiada":false,"VolumenNominalCambiado":false,"VariacionCambiada":false,"TendenciaCambiada":false,"Simbolo":"ERAR","VencimientoID":4,"Estado":0,"CantidadNominalCompra":1000,"PrecioCompra":1.590,"PrecioVenta":1.600,"CantidadNominalVenta":21195,"PrecioUltimo":1.590,"PrecioCierreAnterior":1.580,"PrecioApertura":1.620,"Variacion":0.6329113924050632911392405063,"Tendencia":0,"PrecioMaximo":1.620,"PrecioMinimo":1.580,"VolumenNominal":244726,"MontoOperado":393128,"CantidadOperaciones":61,"HoraCotizacion":"13:31:30","HoraCotizacionNum":133130,"MensajeNro":15246}],"HoraUltimaCotizacion":"13:31:31","UltimaActualizacionListaEspecies":634753530041652909,"MensajeNro":15246,"HashCode":439470373},{"TablaNombre":"tbMontos","MRC":20000,"CantidadTotalFilas":0,"aTabla":[{"Descripcion":"Total negociado BCBA","Rubro":"RUBRO_TOTAL_NEGOCIADO_BCBA","Monto":343252274,"Porcentaje":100},{"Descripcion":"Total negociado acciones","Rubro":"RUBRO_TOTAL_NEGOCIADO_ACCIONES","Monto":60803385,"Porcentaje":17.71},{"Descripcion":"+++Cedears","Rubro":"RUBRO_TOTAL_MONTO_CEDEARS","Monto":1079724,"Porcentaje":0.31},{"Descripcion":"Renta fija","Rubro":"RUBRO_TOTAL_RENTA_FIJA","Monto":210897607,"Porcentaje":61.44},{"Descripcion":"Cauciones","Rubro":"RUBRO_TOTAL_MONTO_CAUCIONES","Monto":69518581,"Porcentaje":20.25},{"Descripcion":"Pase tomador","Rubro":"RUBRO_TOTAL_PASE_TOMADOR","Monto":0,"Porcentaje":-1},{"Descripcion":"Pase colocador","Rubro":"RUBRO_TOTAL_PASE_COLOCADOR","Monto":0,"Porcentaje":-1}],"HoraUltimaCotizacion":"13:31:07","UltimaActualizacionListaEspecies":0,"MensajeNro":15213,"HashCode":1},{"TablaNombre":"tbIndices","MRC":20000,"CantidadTotalFilas":0,"aTabla":[],"HoraUltimaCotizacion":"13:30:16","UltimaActualizacionListaEspecies":0,"MensajeNro":15141,"HashCode":1}]}
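
Once a response like this is in hand, pulling the rows out is the easy part. A minimal sketch in Python (not NetTalk), assuming the response keeps the shape shown above: a top-level "d" list whose entries carry "TablaNombre" and "aTabla". The sample is a trimmed copy of that response, and the function name `rows_by_table` is made up for illustration.

```python
import json

# Trimmed copy of the response shown above; only a few fields are kept.
SAMPLE = (
    '{"d":[{"TablaNombre":"tbAcciones","CantidadTotalFilas":15,"aTabla":['
    '{"Simbolo":"ERAR","PrecioUltimo":1.590,"VolumenNominal":244726,'
    '"HoraCotizacion":"13:31:30"}]},'
    '{"TablaNombre":"tbIndices","CantidadTotalFilas":0,"aTabla":[]}]}'
)


def rows_by_table(raw: str) -> dict:
    """Map each table name in the "d" list to its list of row objects."""
    payload = json.loads(raw)
    return {table["TablaNombre"]: table["aTabla"] for table in payload["d"]}


tables = rows_by_table(SAMPLE)
for row in tables["tbAcciones"]:
    print(row["Simbolo"], row["PrecioUltimo"], row["HoraCotizacion"])
```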

My problem is that when I try to do the same POST with the DEMO/WebClient/Generic Post NT app,
I enter:

Post URL: http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx/GetDataPack

Post String: (the same as Fiddler captures)

Host: nuevo.bolsar.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: es-es,es;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
DNT: 1
Connection: keep-alive
Content-Type: application/json; charset=utf-8
Referer: http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx
Content-Length: 405
Cookie: __utma=132722449.1494634927.1338832271.1338859632.1339777845.3; __utmz=132722449.1338859632.2.2.utmcsr=200.68.94.242|utmccn=(referral)|utmcmd=referral|utmcct=/uVistaDataEspecie; ASP.NET_SessionId=evcuyhygsm5jvdijfw0p3k45; ckLng=ESP; __utmb=132722449.2.10.1339777845; __utmc=132722449
Pragma: no-cache
Cache-Control: no-cache

{"aEstadoTabla":[{"TablaNombre":"tbAcciones","FiltroVto":"72","FiltroEspecies":"","FilasxPagina":-1,"MensajeNro":13579,"HashCode":439470373},{"TablaNombre":"tbMontos","FiltroVto":"","FiltroEspecies":"","PagActualNro":"1","FilasxPagina":-1,"MensajeNro":13555,"HashCode":1},{"TablaNombre":"tbIndices","FiltroVto":"","FiltroEspecies":"","PagActualNro":"1","FilasxPagina":-1,"MensajeNro":13554,"HashCode":1}]}

and when I post it I receive a totally different response, something like an aspx page.

What am I doing wrong?

Thanks
Alberto




-----------
Regards
Alberto

Bruce

Re: How can I parse this page?
« Reply #4 on: June 20, 2012, 11:40:55 PM »
So many things might be affecting it, but the obvious starting point is to mimic the calls _exactly_ as the browser makes them. That means using Fiddler to examine each request and response, and making sure they are identical between your client and the browser.
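
To make that comparison concrete, here is a rough sketch in Python (standard library only, not NetTalk). It builds the POST with the headers and the JSON body kept separate, then reports any captured header the client is not sending with the same value. In the Fiddler capture the header lines sit apart from the JSON body, whereas the post string quoted in the question lists them together; whether the Generic Post demo sends them that way is an assumption worth checking, not a confirmed diagnosis. The names `build_post` and `header_diff` are made up, and only a subset of the captured headers and one table entry of the body are used.

```python
import json
import urllib.request

URL = "http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx/GetDataPack"

# Subset of the headers captured by Fiddler. These travel as request
# headers, separately from the post body.
CAPTURED_HEADERS = {
    "Content-Type": "application/json; charset=utf-8",
    "Referer": "http://nuevo.bolsar.com/VistasDL/PaginaLideres.aspx",
    "Accept-Language": "es-es,es;q=0.8,en-us;q=0.5,en;q=0.3",
}

# One table entry from the captured JSON body; this is the whole post body.
BODY = {
    "aEstadoTabla": [
        {
            "TablaNombre": "tbAcciones",
            "FiltroVto": "72",
            "FiltroEspecies": "",
            "FilasxPagina": -1,
            "MensajeNro": 13579,
            "HashCode": 439470373,
        }
    ]
}


def build_post(url: str, headers: dict, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST with headers and JSON body kept apart."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


def header_diff(captured: dict, request: urllib.request.Request) -> dict:
    """Return captured headers that the request lacks or sends differently."""
    sent = {name.lower(): value for name, value in request.header_items()}
    return {
        name: (value, sent.get(name.lower()))
        for name, value in captured.items()
        if sent.get(name.lower()) != value
    }


req = build_post(URL, CAPTURED_HEADERS, BODY)
print(header_diff(CAPTURED_HEADERS, req))  # empty dict when nothing differs
```

The same check applies to the cookies and to the body itself. The session cookie in particular is issued per visit, so a value copied from an old capture may no longer be accepted.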

However, you may want to step back here a bit. You are seeing this site as a "web site" and not as a "web application". It is more difficult, and ultimately may become pragmatically impossible, to "scrape" or "drive" sites that are using an "application" approach, rather than a "static site" approach.

In most cases, if the authors of this site want you to "drive" their site, they will make an API available for you to use. You should investigate this possibility - perhaps even discuss this possibility with them before going too far down this "pretend to be a browser" road.

Apart from anything else, sites built with this level of dynamism are clearly subject to change at any time, and almost any change would break your client. So this quickly becomes a losing battle.

cheers
Bruce