This section is a hybrid between a glossary and a FAQ, intended to explain some of the terms and their meanings as used in building this ranking.
Database size. The number of records in a search engine's database that is publicly accessible from external sources. Not all robots crawl the Web at the same time or with identical procedures, and post-crawling processes and other commercial requirements result in substantially different databases among the larger engines. The current size, composition and evolution of these figures are a relevant point in webometric analysis.
Delimited search. A key characteristic of search engines that allows cybermetric analysis. A delimiter operator has a specific syntax and meaning that can differ among engines. It provides the number of records (web pages) that satisfy a certain condition, filtering the results according to strings in the address (URL) or other characteristics (language, format) of the page. Of special relevance is the link delimiter, which can be used in combination with site or other similar operators to calculate inlinks.
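As a minimal sketch, the functions below show how such delimiter queries are typically composed. The site:/linkdomain: operator names follow the syntax offered by some engines of that period and are an assumption here; "example.edu" is a hypothetical domain:

```python
# Sketch of delimiter-based queries for webometric counts.
# The site:/linkdomain: operator names are assumptions based on the
# syntax some engines offered; "example.edu" is a hypothetical domain.

def size_query(domain: str) -> str:
    """Query whose result count approximates the pages indexed under a domain."""
    return f"site:{domain}"

def inlink_query(domain: str) -> str:
    """Query for external inlinks: links pointing to the domain,
    excluding the domain's own (internal) pages."""
    return f"linkdomain:{domain} -site:{domain}"

print(size_query("example.edu"))    # site:example.edu
print(inlink_query("example.edu"))  # linkdomain:example.edu -site:example.edu
```

The result count reported by the engine for each query string is the figure used in the analysis; the queries themselves carry no computation.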
Discipline differences. The ranking does not provide any kind of thematic assignment to the units, so a formal thematic analysis is not possible at the moment. However, there are important differences in academic focus within our universities database that should be taken into account: research-focused universities are mixed with teaching institutions, and a group of discipline-oriented organizations (mainly pedagogy, medicine and theology) is also present.
Formal characteristics. As there is neither universal document control nor formal guidelines for web page building, there is a huge diversity of formal aspects in the Web space, including obvious malpractices. Some authors have focused on these to provide new indicators such as link density, link quality (expressed as ratios of non-working links), missing tags (including ones as relevant as title or metadata), or updating frequency. None of these characteristics is taken into account in our rankings, but they should be taken into consideration.
Geographical biases. The use of several search engines in our ranking is due to the geographical bias observed in some of them. We do not know whether this is due to topological or traffic problems in the network (some eastern Asian countries are usually poorly covered), to the crawlers' behaviour, or whether the biases are stable over time. Alexa's biases prevent us from adding popularity data to our rankings.
Institutional domains. The basic unit of our analysis is the common URL domain shared by all the websites of an institution. Unfortunately, some organizations maintain two or more equivalent domains without marking a preferred one. Also of concern is the fact that some second-level departments maintain completely different domains. We usually keep two entries for institutions with two equivalent top-level domains. We intend to merge the results of smaller domains with those of the main one in the near future, but it is a difficult task.
Invocation. The presence of the name of an institution or a researcher in a Web page. The global presence is the number of times the name appears on the Web, and it can easily be calculated using quotation marks around the name in the search engines. Sometimes this figure is referred to as the number of times the name is cited on the Web. Some authors call this Web visibility, although we prefer to reserve that term for link visibility. This indicator usually favours large, well-known, old institutions, independently of their real effort to achieve a relevant Web presence.
No invocation measure was used in our ranking, mainly because it is not possible to assign a unique, unambiguous, universal name to every institution.
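The quotation-mark approach described above can be sketched as a one-line query builder; the institution name used here is hypothetical:

```python
def invocation_query(name: str) -> str:
    """Exact-phrase query whose result count estimates the global
    presence (invocation) of an institution's or researcher's name."""
    return f'"{name}"'

print(invocation_query("University of Example"))  # "University of Example"
```

Note that the ambiguity problem mentioned above remains: name variants, translations and acronyms each produce a different query and a different count.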
Invisible Web. Traditionally refers to the information available through gateways or search interfaces that is not accessible to the search engines' robots. It is a huge part of the Internet's content, including library catalogues, bibliographic and alphanumeric databases, and even some document repositories. In recent years some engines, especially Google, have made a great effort to index these records, and in fact several databases are more or less covered in their systems (e.g. PubMed is partially indexed by Google). Our ranking does not consider the Invisible or Deep Web, and we encourage transforming it into crawler-friendly information.
Language. English is the "lingua franca" of scientific communication and it is also the language of a significant fraction of Internet users. Institutions publishing only in their mother tongue achieve lower visibility than those with multilingual websites.
Link motivation. A major concern in link analysis is the motivation behind a link's creation. Previous studies suggest that "sitations", the hypertextual equivalent of bibliographic citations, are still rare. We expect this situation to improve as more papers become available on the Web, but we consider other reasons to link as useful for describing scholarly communication. Informal linking is a powerful source of information about the intellectual, economic and political connections of academic and scientific activities.
- Link to a paper or document: generally in pdf/ps/doc format
- Link to course materials: mainly html pages but also pdf, doc or ppt
- Research project sites
- Conference, seminar or meeting pages: including media files if applicable
- Pre- or post-prints, but also unpublished material
- Team or colleagues' pages
- Third parties (non-research) and related ones
Link popularity. Another term for link visibility that has been used extensively. We prefer to reserve popularity for the measure of the number of visits. Although not yet implemented in the Ranking, we intend to consider the number of visits, or popularity, as a relevant factor for our rankings in the future.
Open access. The movement to distribute openly the scientific production of, at least, publicly funded researchers is facing tougher opposition than expected. A strong commitment to open access initiatives will be clearly reflected in our rankings.
Personal pages. A frequently heard statement about the quality of web contents concerns the information provided by the personal pages of students or staff members. A lot of the free space hosted by university web servers is used for personal purposes, and the general assumption is that this information is of low quality or not academically related. Data suggest that a large number of small websites are crowding the institutional domains, but most of them are interesting enough to merit consideration. Some "personal" pages are in fact research group sites, while others are institutional (scientific societies, electronic bulletins, conference sites). True personal pages cover both extremes of the content range, from people offering only CVs to others providing very large collections of information on their academic or research topics, with links to personal repositories of documents. A striking pattern is the absence of links to other colleagues' websites or institutions.
Quality. We advise against using these rankings as a global or partial indicator of quality. Impact or visibility better describes our aims, in the particular context of promoting open and universal access to scientific activities and results through the Web.
Ranking. As their main objective is purely commercial, current search engines do not offer stable, reliable or trustworthy results for webometric purposes. The situation has improved in recent years, but there are still important biases and a worrisome instability. This is the reason we use relative positions rather than absolute values in our analysis.
Rich files. A general term comprising a rather heterogeneous group of file types, mainly those devoted to representing unitary enriched documents, such as MS Word doc, Adobe Acrobat pdf or PostScript ps. In our analysis we also included MS PowerPoint ppt, and excluded xls, latex and tex. Rich files are relevant because they are used for scholarly communication, as authors usually distribute their papers and presentations in these formats. Certainly some of these types are used extensively for bureaucratic purposes (forms, administrative documents, internal reports), but these can explain only a small percentage of the large numbers observed in domains with extensive academic content.
There are several other file types that could be considered rich files, and even raw formats like txt or text are being used for distributing academic content, but their individual contribution is too low to be considered for practical reasons.
Rounding. Google and Yahoo offer rounded results, ending in '00 or '000, which implies an error on the order of 2 to 5%. Moreover, the number provided by Yahoo on the first results page is about another 4-5% higher than the one shown on the following pages, which trend towards the "correct" number.
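The order of magnitude of the rounding error can be checked with a worst-case calculation; this is an illustrative sketch, not the engines' actual rounding procedure:

```python
def max_relative_error(true_count: int, rounding_unit: int) -> float:
    """Worst-case relative error when a count is rounded to the
    nearest multiple of rounding_unit (error is at most unit/2)."""
    return (rounding_unit / 2) / true_count

# A count near 12,000 rounded to the nearest 1,000 can be off by up to ~4.2%,
# consistent with the 2-5% range cited above.
print(round(max_relative_error(12_000, 1_000), 3))  # 0.042
```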
Search Engine. The software that searches an index and returns matches. "Search engine" is often used synonymously with spider and index, although these are separate components that work with the engine. There are only six engines useful for quantitative analysis purposes, as they have a large, independent, self-crawled database and their retrieval system allows filtering of results according to URL-related delimiters:
Yahoo Search search.yahoo.com
MSN Search search.msn.com
Self-archiving. Basically an alternative means of providing open access to scholarly journal articles by publishing the full-text files on the author's personal page. This practice involves both pre- and post-prints, and it is more common among the most prolific authors and in certain disciplines. Globally, however, only a minority of authors support this option. As many of these papers are published as rich files (pdf, ps or doc), this practice notably increases the performance of an institution in our rankings.
Size. The size of an institutional domain is the combined number of pages of all the websites under that domain, including html and assimilable non-html formats. From a practical point of view, size refers to the number provided by a search engine when a search like site:domain is performed. This indicator is central to our rankings, and it is also used as the denominator in Web Impact Factor calculations by other authors. However, pages vary widely according to different criteria, including content size measured in bytes. For example, one pdf document can be a monograph of several hundred pages totalling several megabytes of text and images, while another page consists only of the phrase "page under construction". Global size could be an interesting indicator, and we expect to provide it for selected websites.
Stability. From the early days, the instability of search results in general, and of the reported result counts in particular, has been a subject of special concern. Certainly the Web is a highly dynamic system, growing at an incredible pace, but the crawlers also change their specifications and schedules unexpectedly. A world crawling round can last from 15 to 45 days, and in the meantime the Web has already changed.
Visibility. In the context of this ranking, the term refers to link visibility: the number of external inlinks received by an institutional domain. The most common request syntax in search engines combines the link delimiter with the exclusion of the domain's own pages.
Web cost. Maintaining a very large presence on the Web can be extremely costly, requiring specific funding and human resources, but the total cost is far below that of any other publication method, and the potential audience is truly global. One way to undertake large projects is through distributed effort, with individual graduate students, professors, researchers, scientific teams and administrative units each having an autonomous web presence. A rich content page should include a large diversity of objects, including images and other media files, a certain amount of navigational links and a selected group of external outlinks. This can require a huge effort, which can only be faced if these tasks are evaluated like other academic and scientific activities.
Web Impact Factor. The most cited cybermetric indicator, although its usage is not universal due to several shortcomings. It is defined as the ratio between the external inlinks received by a website and the number of webpages comprising that website. Some authors have suggested modifications to the denominator, using alternative measures of the size of the institution based on non-Internet data such as the number of potential authors (staff, professors, graduate students), economic wealth (funding, projects) or bibliometric data (papers in journals).
Our ranking is derived from the WIF, in which a 1:1 ratio is established between visibility and size. We reinforce the visibility factor by proposing a new ratio of 4:3.
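A minimal sketch of the two ratios discussed above. The classic WIF follows the definition given in this entry; the weighted 4:3 combination, and the normalisation by the sum of weights, are an illustrative assumption, not the ranking's exact formula:

```python
def web_impact_factor(external_inlinks: int, pages: int) -> float:
    """Classic WIF: external inlinks divided by the website's page count."""
    return external_inlinks / pages

def weighted_score(visibility: float, size: float) -> float:
    """Hypothetical 4:3 visibility-to-size combination of indicators
    already normalised to a common scale (an assumption for illustration)."""
    return (4 * visibility + 3 * size) / 7

# A site with 500 external inlinks and 1,000 pages has a WIF of 0.5.
print(web_impact_factor(500, 1_000))  # 0.5
print(weighted_score(0.8, 0.6))
```

Under this sketch, two institutions with equal WIF but different visibility would rank differently, which is the point of moving from the 1:1 to the 4:3 ratio.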