Get content between HTML or XML tags from an https URL

MESSIAH

New Member
I need to read an XML or HTML file from a website and get the content between tags such as url or div.
The URL starts with https.
I tried some XML parsers, but without success: they are outdated and don't support https/SSL.
Bash:
xmlstarlet sel --net -t -c "*" https://xmlstar.sourceforge.net/xmlstarlet-xsa.xml
failed to load external entity "xmlstar.sourceforge.net/xmlstarlet-xsa.xml"
 


JasKinasis

Well-Known Member
If xmlstarlet doesn't support ssl - then perhaps use wget or curl to download the xml file locally and then use xmlstarlet on the local copy of the file.
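A minimal sketch of that approach, assuming curl and xmlstarlet are both installed. The helper name fetch_and_select is made up for illustration, and the XPath in the usage line is only a guess at the document's structure:

```shell
# fetch_and_select URL XPATH  (hypothetical helper)
# Downloads URL to a temporary file with curl, then prints the text of every
# node matching XPATH from that local copy with xmlstarlet.
fetch_and_select() {
    local url=$1 xpath=$2 tmp status
    tmp=$(mktemp) || return 1
    # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
    curl -fsSL -o "$tmp" "$url" &&
        xmlstarlet sel -t -m "$xpath" -v . -n "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (the XPath is an assumption about the file's layout):
#   fetch_and_select "https://xmlstar.sourceforge.net/xmlstarlet-xsa.xml" "//name"
```

The download and the parsing stay separate steps, so curl deals with SSL and xmlstarlet only ever sees a local file.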
 

JulienCC

Active Member
I need to read an XML or HTML file from a website and get the content between tags such as url or div.
Most web documents are malformed, and HTML is very forgiving about markup. If the idea is to find a piece of text in a precise context, I would advise you to use regex. XML is more reliable, but you will need a parser that handles DTDs correctly, which could be a problem in some scenarios.
Using regex will probably let you use a single workflow for both XML and HTML documents.
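A rough sketch of the regex route, using only grep and sed. The helper name extract_tag is invented for this example, and it assumes each element opens and closes on a single line and contains no nested tags, which will not hold for every page:

```shell
# extract_tag TAG  (hypothetical helper)
# Reads XML or HTML on stdin and prints the text between <TAG ...> and </TAG>
# for every occurrence that sits on one line and has no child elements.
extract_tag() {
    local tag=$1
    # grep -Eo prints each matching element on its own line;
    # sed then strips the opening and the closing tag.
    grep -Eo "<${tag}( [^>]*)?>[^<]*</${tag}>" |
        sed -e 's/^<[^>]*>//' -e 's/<[^>]*>$//'
}

# Usage (placeholder URL):
#   curl -fsSL "https://mywebsite.com/" | extract_tag div
```

Anything with nested or multi-line elements is beyond what this pattern can handle, so treat it as a quick filter rather than a parser.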
 

MESSIAH

New Member
Very strange. Even PHP supports https. So why doesn't Linux bash with C++ support this protocol? C++ is more powerful than PHP and has more libraries.
 

JulienCC

Active Member
So why doesn't Linux bash with C++ support this protocol?
First of all, xmlstarlet is not bash. It is a binary program that uses stdin/stdout as its user interface. Bash is a shell that lets you send data to the stdin of programs and read from their stdout. It also does much more, but bash is just a shell that sits on a virtual terminal.

Why doesn't it support https? Because of KISS: keep it simple, stupid. This is a common principle in the Unix world. xmlstarlet is meant to process XML documents, not to handle HTTP connections; curl and wget are meant to handle HTTP. Since you have nice shells like bash with a lot of functionality, you can easily use curl to handle HTTP and then pass its output to xmlstarlet.

In the UNIX world, programs are made for one task and one task only. Then you have shells and their scripting languages to make everything work together.

For example:
Bash:
$ curl -s https://mywebsite.com/ | my_parsing_program -option1 -option2
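With xmlstarlet as the parsing program, the pipe could look roughly like this. select_text is a made-up helper name, and the URL and XPath in the usage line are placeholders:

```shell
# select_text XPATH  (hypothetical helper)
# Reads an XML document on stdin and prints the text of every node that
# matches XPATH, one per line.
select_text() {
    # With no file argument, xmlstarlet sel reads the document from stdin.
    xmlstarlet sel -t -m "$1" -v . -n
}

# Usage (placeholder URL and XPath):
#   curl -fsSL "https://mywebsite.com/data.xml" | select_text "//url"
```

If the document declares a default namespace, the XPath will likely need a prefix bound with xmlstarlet's -N option before it matches anything.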
 
