English Amiga Board


Old 08 April 2014, 12:14   #1
Gundam
Users Awaiting Email Confirmation
 
Join Date: Mar 2014
Location: my town
Posts: 12
Parsing an HTML file in ARexx

I want to extract information from <td> elements, e.g.:

<td>
A

</td>
<td>
B

</td>

...

How do I achieve that in ARexx?
Thanks in advance.
Old 08 April 2014, 13:02   #2
thomas
Registered User
 
 
Join Date: Jan 2002
Location: Germany
Posts: 6,985
How much do you know about these HTML files? For example, are <td>, the value, and </td> always on separate lines? Are <td> and </td> always written in lower case?

To write a generic HTML parser might be very difficult. But if you know that the files you read are always built up in the same way, it will get easier.
Old 08 April 2014, 13:55   #3
Gundam
Users Awaiting Email Confirmation
 
Join Date: Mar 2014
Location: my town
Posts: 12
Quote:
Originally Posted by thomas
How much do you know about these HTML files? For example, are <td>, the value, and </td> always on separate lines? Are <td> and </td> always written in lower case?

To write a generic HTML parser might be very difficult. But if you know that the files you read are always built up in the same way, it will get easier.

The <td> elements (on separate lines) are within a <table class="tableclass">; the "class" attribute "tableclass" is a unique identifier.
Old 08 April 2014, 15:03   #4
thomas
Registered User
 
 
Join Date: Jan 2002
Location: Germany
Posts: 6,985
This wasn't what I asked for.

Let's look at an example. I saved the source code of this page as board.htm. The ARexx program board.rexx extracts the posting time and user name from the table of posts. It does this in quite a simple way, which means that it only works on HTML files that look exactly like this one.

Code:
3> rx board.rexx
Today, 12:14 Gundam
Today, 13:02 thomas
Today, 13:55 Gundam
3>
Attached Files
File Type: zip board.zip (15.2 KB, 122 views)
Old 08 April 2014, 15:26   #5
Gundam
Users Awaiting Email Confirmation
 
Join Date: Mar 2014
Location: my town
Posts: 12
Quote:
Originally Posted by thomas
This wasn't what I asked for.

Let's look at an example. I saved the source code of this page as board.htm. The ARexx program board.rexx extracts the posting time and user name from the table of posts. It does this in quite a simple way, which means that it only works on HTML files that look exactly like this one.
...

That's OK!
I don't need a generic parser; I just want to extract the contents of <td> elements which are in the <table> with THAT specific identifier.
The value is on a separate line, like this:

<td>
value
</td>

The <td> tags don't have any "id" attribute, so they can't be easily identified; that's why I asked for help.
Old 08 April 2014, 17:23   #6
daxb
Registered User
 
Join Date: Oct 2009
Location: Germany
Posts: 3,303
That's easy. Use thomas's script as a template and change or reduce it to fit your needs. You just have to readln() each line and check whether the string is <td>. If it is, keep reading the following lines until </td> is reached.
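
As an illustration of that loop, here is a minimal sketch written in Python rather than ARexx, so it shows only the logic and not the ARexx calls. The function name extract_td_values and the sample data are made up for this example. It assumes that <td> and </td> each sit alone on their own line, in lower case, exactly as in the first post; anything else is ignored.

```python
def extract_td_values(lines):
    """Collect the text between a line that is exactly <td> and the next </td> line."""
    values = []
    current = None  # None means we are currently outside a <td> block
    for raw in lines:
        line = raw.strip()
        if current is None:
            if line == "<td>":
                current = []  # start collecting the cell content
        elif line == "</td>":
            values.append(" ".join(current))
            current = None
        elif line:  # skip blank lines inside the cell
            current.append(line)
    return values


# Sample input in the layout shown in the first post (assumed, not a real file).
sample = """<td>
A

</td>
<td>
B

</td>
""".splitlines()

print(extract_td_values(sample))
```

In ARexx the same steps would be a loop that calls readln() until the end of the file is reached, with the same two checks on each stripped line.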
Old 08 April 2014, 18:20   #7
Gundam
Users Awaiting Email Confirmation
 
Join Date: Mar 2014
Location: my town
Posts: 12
Quote:
Originally Posted by daxb
That's easy. Use thomas's script as a template and change or reduce it to fit your needs. You just have to readln() each line and check whether the string is <td>. If it is, keep reading the following lines until </td> is reached.

Thanks!
Old 09 April 2014, 00:16   #8
Lonewolf10
AMOS Extensions Developer
 
 
Join Date: Jun 2007
Location: near Cambridge, UK
Age: 44
Posts: 1,924
Quote:
Originally Posted by Gundam
That's OK!
I don't need a generic parser; I just want to extract the contents of <td> elements which are in the <table> with THAT specific identifier.
The value is on a separate line, like this:

<td>
value
</td>

The <td> tags don't have any "id" attribute, so they can't be easily identified; that's why I asked for help.
Just be aware that not all HTML code and data are on separate lines. Code generated by programs usually is, but I personally write the HTML for my website by hand, as I find it more fun and easier to read that way. What I'm trying to say is that some data may be stored thusly:

Code:
<TABLE>
<TR><TH>Some data</TH></TR>
<TR><TD>More data</TD></TR>
<TR><TD>Even more data</TD></TR>
<!-- and the rest of the table follows... -->
</TABLE>
<TR> is the table row start and </TR> marks its end
<TH> is the table (column) header and </TH> marks its end

The above example is a one-column table; it is entirely possible to have multiple entries of each tag on the same row.
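
To show what coping with that layout could look like, here is a rough Python sketch (again an illustration of the idea, not ARexx). It uses a case-insensitive regular expression so that <TD> and <td> are treated alike and several cells can share one line. The names TD_PATTERN and extract_td_anywhere are invented for this example, and the approach assumes the cells contain no nested tables or other <td> tags.

```python
import re

# Matches <td ...>content</td> in any letter case, even across line breaks.
TD_PATTERN = re.compile(r"<td\b[^>]*>(.*?)</td\s*>", re.IGNORECASE | re.DOTALL)


def extract_td_anywhere(html):
    """Return the whitespace-normalised content of every <td> cell in the text."""
    return [" ".join(cell.split()) for cell in TD_PATTERN.findall(html)]


html = """<TABLE>
<TR><TH>Some data</TH></TR>
<TR><TD>More data</TD></TR>
<TR><TD>Even more data</TD></TR>
</TABLE>"""

print(extract_td_anywhere(html))  # ['More data', 'Even more data']
```

The <TH> header cell is deliberately left out, because the pattern only looks for <td>. The whole file has to be read into one string first, which is a different trade-off from reading it line by line.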
 

