This is the mail archive of the cygwin-talk mailing list for the cygwin project.



cygwin uses for public document retrieval


Hi,
(This was originally rejected from the main list; I thought it might be marginally relevant here.)
(I searched the archives and this hasn't come up before. The question is
at the bottom; sorry for the long intro. I posted this on cygwin because
I run my scripts on cygwin, and cygwin illustrates the relationship between
graphically oriented things like windoze and information oriented systems
like linux.)
I've been using scripts now to access and organize searches from various
sources. Given the proliferation of documents and document types,
I think everyone recognizes the need for more structured documents and
the ability to easily do ad hoc searches and extractions. Scripts make that
possible, and I have some examples to show that may be of
interest beyond specialized communities.
The federal government is one entity that collects structured documents
of public interest from a variety of sources. However, the
various agencies support automated access (scripts) in highly
variable ways.
My favorite example of an information-friendly site is still the
ncbi api:
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html
"Entrez Programming Utilities are tools that provide access to Entrez
data outside of the regular web query interface and may be helpful for
retrieving search results for future use in another environment."
(You would think the IEEE would be leading the charge in automated document
access, but so far all I've seen is requests for money when I try to search
their journal databases.)


As one example:
http://bioinformatics.org/pipermail/bio_bulletin_board/2006-May/003249.html
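
To make the eutils approach concrete, here is a minimal sketch in Python of the kind of script access described above. It assumes the ESearch utility at the endpoint shown, with its db, term and retmax parameters, and an XML reply that lists record IDs under IdList. The function names and the sample reply are made up for illustration; this is not the script from the post linked above.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed endpoint: the ESearch utility of the NCBI E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an ESearch query URL for the given search term."""
    params = {"db": db, "term": term, "retmax": str(retmax)}
    return ESEARCH_URL + "?" + urllib.parse.urlencode(params)


def parse_id_list(xml_text):
    """Extract the record IDs from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Hand-written sample in the shape ESearch returns (illustrative, not real data).
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""

if __name__ == "__main__":
    # Print the URL a script would fetch, then parse the sample reply.
    print(build_esearch_url("protein folding", retmax=5))
    print(parse_id_list(SAMPLE_RESPONSE))
```

A real script would fetch the URL (for example with urllib.request or wget) and hand the reply to parse_id_list; the point is that no html parsing is needed at any step.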


Most other sites seem to just accept that interactive access via a web interface is how "normal people" will use the site. At most sites this is neither practical nor in the public interest.

Consider searching US patent documents. Afaik you have to parse the
html document hits from their search engine; there is no way
to get documents returned in some simple to use format:

http://www.uspto.gov/main/search.html
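
For comparison, this is roughly what parsing html hits involves. The sketch below is in Python and uses only the standard library html.parser module. The sample page is invented for illustration and does not reflect the actual markup of the USPTO site; the class and function names are likewise hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_text):
    """Return every (href, text) pair found in the page."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Invented result page; a real site's markup will differ and may change.
SAMPLE_PAGE = """<table>
<tr><td><a href="/doc?id=1">First hit</a></td></tr>
<tr><td><a href="/doc?id=2">Second <b>hit</b></a></td></tr>
</table>"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE_PAGE):
        print(href, text)
```

Everything here depends on the page layout staying put, which is the weakness of this approach compared with a supported API.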

You do have a choice of tiff images, but these of course offer nothing
unless you also have local OCR software. Try a search on
reasonable criteria and you will get lots of hits. It is difficult to do
keyword searches without downloading a bunch of documents and
finding confounding words. I've got scripts to do much of this,
but it would be easier if there were a stable API supported at uspto.
This site's "API" changes every time they regenerate their site,
since they seem to use generated code:
http://portal.uspto.gov/external/portal/pair
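
The keyword filtering step mentioned above (download first, then weed out hits with confounding words) might look something like this in Python. It is only a sketch under stated assumptions: the documents are already downloaded as plain text, matching is on whole lowercase words, and the function names and sample texts are hypothetical.

```python
import re


def tokenize(text):
    """Return the set of lowercase alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def filter_documents(docs, required, confounding):
    """Return names of documents that contain every required word
    and none of the confounding words.

    docs maps a document name to its plain text.
    """
    required = {w.lower() for w in required}
    confounding = {w.lower() for w in confounding}
    hits = []
    for name, text in docs.items():
        words = tokenize(text)
        if required <= words and not (confounding & words):
            hits.append(name)
    return sorted(hits)


# Invented sample texts standing in for downloaded documents.
SAMPLE_DOCS = {
    "doc1": "A laser diode with improved cooling.",
    "doc2": "A laser printer toner cartridge.",
    "doc3": "Cooling fins for a diode array.",
}

if __name__ == "__main__":
    # Wants laser hits but not the printer ones.
    print(filter_documents(SAMPLE_DOCS, ["laser"], ["printer"]))
```

With a real API the filtering could be pushed to the server; here it has to happen locally after the download.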



Or, consider the SEC website (their webmaster has been very interested in this,
but apparently an API is not currently a priority):


http://www.sec.gov/edgar/quickedgar.htm

Public companies in the US submit lots of info of interest to the general public
but it is difficult to find and sort. Scripts offer a great solution for even casual
investors who happen to know a little programming. However, the SEC currently
forces you to either use the web interface or parse some cumbersome html.
Further, their full text search is being implemented with html that is even more difficult
to parse, but it offers incredible benefits to those seeking to sort out potential
financial disasters (for example, look at the option ARM situation):


http://www.investorshub.com/boards/read_msg.asp?message_id=13071715

( I was told that "yahoo" and "finance" attract the spam filter)
http://messages.f-----e.y----o.com/Business_%26_Finance/Investments/Sectors/Healthcare/Biotechnology_and_Drugs/threadview?bn=5990&tid=866620&mid=866620


Even FDA filings concerning the drugs we take are available but difficult to access,
because the FDA presents a web interface rather than a programming
interface. In this case they have all the important documents, but
the search facilities are limited because the data is presented as
scanned pdf files (scripting is very difficult).
(If you wanted to find all approved drugs with certain incidental properties,
this could be a great database except for the above issues):
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm


I could go on and on about the government sources of info.
NOAA, CDC, FTC, various courts, etc. all provide great information of importance to the
public, but access is artificially constrained for any serious uses. If it were available to
programmers, it could be repackaged at low cost and presented in a range
of web formats (for even larger audiences). Of course, local governments
have even more information types (ranging from traffic cameras and abduction alerts to
court and property records)
that have audiences that could be more easily targeted by making the information
available for innovative programmers to re-distribute.



So, my question is, are there other people who have used cygwin for these purposes and what sites have you accessed or attempted to access in some script based way? Has anyone approached govt sites at any level requesting computer friendly interaction mechanisms? What responses have you gotten?

Many private sites make their money from things predicated on
interaction (advertising, for most sites; academic journals have a number
of revenue sources, and I find it difficult to believe that they
would have problems with free, automated online access).
Does anyone have examples or thoughts
on free-to-user private entities that are still compatible with
automated access? 10kwizard had a nice service but they took
even their simple features into their subscription rather than free area.

Thanks.


