UvT Expert Collection documentation

ILK Research Group Technical Report Series No. 07-06


Last updated: Tue Aug 07 2007

Toine Bogers
ILK, Tilburg University
P.O. Box 90153, NL-5000 LE Tilburg
A.M.Bogers@uvt.nl

Krisztian Balog
ISLA, University of Amsterdam
Kruislaan 403, 1098 SJ Amsterdam
kbalog@science.uva.nl


Contents

1. Introduction
2. Collection contents
2.1 Collection structure
2.2 Expert profiles
2.2.1 Language-related issues
2.2.2 Personal information
2.2.3 Publications
2.2.4 Courses
2.2.5 Research
2.2.6 Expertise areas
2.3 Course descriptions
2.4 Full text publications
2.5 Homepages
2.6 Topic hierarchy
2.7 University hierarchy
2.8 Lists
2.9 TREC format
3. Collection statistics
4. References
5. Disclaimer


1.  Introduction

The UvT Expert collection is based on the Webwijs ("Webwise") system developed at Tilburg University (UvT) in the Netherlands. Webwijs is a publicly accessible database of UvT employees involved in research or teaching. Currently, Webwijs contains information about 1168 experts, each of whom has a page with contact information and, if made available by the expert, a research description and a list of publications. In addition, each expert can select expertise areas from a list of 1491 topics and can suggest new topics, which need to be approved by the Webwijs editor. Each topic has a separate page that shows all experts associated with that topic and, if available, a list of related topics. Most of the collection was crawled in October 2006; each section below states the exact crawl date.

Webwijs is available in Dutch and English, and this bilinguality has been preserved in the collection. Every Dutch Webwijs page has an English translation. Not all Dutch topics have an English translation, but the reverse does hold: all 981 English topics have a Dutch equivalent.

If you use these resources, please let us know. By using the resources you agree to the disclaimer. If you publish results obtained using the resources made available here, please include the following citation:


2.  Collection contents

2.1 Collection structure

The UvT expert collection is divided into a Dutch and an English part. These parts are not symmetrical, i.e. some Dutch pages have no English equivalent and vice versa.

Webwijs was designed to be bilingual from the start. Unless stated otherwise, files with dutch in their filename were extracted from, or contain information from, the Dutch version of Webwijs; files with english come from the English version. The collection files are described in the rest of this documentation file and are arranged in the following directory structure:

|- documentation
|
|- dutch
|   |- courses
|   |- profiles
|   |   |- all
|   |   |- dutch
|
|- english
|   |- courses
|   |- profiles
|   |   |- all
|   |   |- english
|
|- extra
|   |- qrels
|   |- topic-hierarchy
|   |- trectext
|   |   |- course-descriptions
|   |   |- homepages
|   |   |- publications
|   |   |- research-descriptions
|   |- university-hierarchy
|
|- lists


2.2 Expert profiles

Crawled: 2006.10.30
Directory: dutch/profiles and english/profiles

An expert profile was constructed for each of the experts listed in Webwijs. The profiles were populated with as much information as was available on Webwijs and the course websites for each expert. This means that profiles differ in the amount of information they contain. For instance, a profile will not contain a <courses> section if the expert does not teach courses at Tilburg University. A complete expert profile contains the following information:

<person>

  <anr>...</anr>
  <name>...</name>
  <job>...</job>
  <institute>...</institute>
  <faculty>...</faculty>
  <room>...</room>
  <address>...</address>
  <tel>...</tel>
  <fax>...</fax>
  <homepage>...</homepage>
  <email>...</email>

  <publications>
    <publications_url>...</publications_url>
    <publication>
      <author num="1">...</author>
      <author num="2">...</author>
      ...
      <author num="n">...</author>
      <year>...</year>
      <title>...</title>
      <editors>...</editors>
      <intitle>...</intitle>
      <misc>...</misc>
      <original>...</original>
    </publication>
    <publication>
      ...
    </publication>
  </publications>

  <research>
    <research_url>...</research_url>
    <description>...</description>
  </research>
  
  <courses>
    <courses_url>...</courses_url>
    <course id="x">course_name</course>
    <course id="y">course_name</course>
  </courses>

  <expertise>
    <description>...</description>
    <topic id="x">topic_name</topic>
    <topic id="y">topic_name</topic>
  </expertise>

</person>
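
As a rough illustration of how such a profile might be read, the following Python sketch parses a miniature profile with the standard library. The sample profile (ANR 123456, the name "Jane Doe", and its course and topic entries) is entirely hypothetical, and the sketch assumes the files on disk are well-formed XML with the element names shown above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature profile following the structure above.
# None of these values come from the actual collection.
SAMPLE_PROFILE = """
<person>
  <anr>123456</anr>
  <name>Prof. dr. Jane Doe</name>
  <name>J. Doe</name>
  <courses>
    <courses_url>http://example.org/courses</courses_url>
    <course id="000001">Example course</course>
  </courses>
  <expertise>
    <topic id="1">example_topic</topic>
    <topic id="2">another_topic</topic>
  </expertise>
</person>
"""


def read_profile(xml_text):
    """Extract the main fields of one expert profile.

    Optional sections (<courses>, <research>, <expertise>) may be absent,
    in which case the corresponding entries are simply empty.
    """
    person = ET.fromstring(xml_text)
    return {
        "anr": person.findtext("anr"),
        # A profile may carry several <name> variants; keep them all.
        "names": [name.text for name in person.findall("name")],
        "courses": {c.get("id"): c.text for c in person.iter("course")},
        "topics": {t.get("id"): t.text for t in person.iter("topic")},
        "has_research": person.find("research") is not None,
    }


profile = read_profile(SAMPLE_PROFILE)
print(profile["anr"], profile["names"], sorted(profile["topics"]))
```

Because missing sections yield empty containers rather than errors, the same function can be applied to sparse and complete profiles alike.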

2.2.1 Language-related issues

Expert profiles are available for both Dutch and English. In addition, each of these profiles is available in two different versions, resulting in 4 profiles for each expert. These two different versions are due to the problem of reliably identifying the language of the publications. For some records (see the sections 'Publications' and 'Full text publications' for more details) the language was already identified; if so, this is stored in the <language> tag. Where necessary and possible we used textcat [1] to identify the language of the other publications. The publications that textcat positively identified as Dutch were added to the dutch profile of an expert that incorporates information from the Dutch version of Webwijs. The english profiles have English publications and Webwijs data.

We also created profile versions that included all publications, whether they were Dutch, English or unknown: dutch.all and english.all. The language in the filename here distinguishes the Dutch and the English versions of Webwijs the information was taken from; both files would contain the same publications. For instance, for expert 710326 this resulted in the following 4 profiles:

2.2.2 Personal information

We extracted all of the available personal and contact information from the experts' Webwijs profiles. Each expert can be uniquely identified by their 6-digit ANR ('administratienummer'). The ANRs are used throughout the collection to connect all information to an expert. There were 318 experts with a working link to their academic homepage; if available, these links were included in the <homepage> tag. Furthermore, many profiles have more than one instance of the <name> tag. For instance, expert 186554 has three different name variations:

<name>Prof. dr. Sandra G. L. Schruijer</name>
<name>Sandra Schruijer</name>
<name>S. G. L. Schruijer</name>

We kept all of the different variations of the expert names for future experiments where determining the relation between documents and experts is not as trivial as in the regular Webwijs data.

2.2.3 Publications

Each Webwijs expert has an automatically generated list of publications if their publications were entered in the METIS system, which is currently used at several Dutch universities. The pages do not offer access to the full text of the publications. The publication lists show only the full citation information, from which we extracted the <author>, <year>, <title>, <editors>, and book information (<intitle> and <misc>). We also included the <original> strings describing the publications.

In addition, we were able to get 3761 full text versions of the publications in different formats (PDF, PD, DOC, HTML) from the ARNO institutional repository used by the Tilburg University library. We were able to successfully convert 1880 of them to plain text; see the section on 'Full text publications' for more information.

ARNO also contained extensive metadata for each of the 3761 publications. We have merged these publications with the METIS ones, replacing the METIS information with the ARNO information wherever possible since the METIS information was a subset of the ARNO information. The ARNO records can be identified by the presence of an <arno-id> tag. There was no unique METIS ID available for the METIS records, so these publication records have no unique ID unfortunately. The citations were also converted to the standard TREC format and included in the collection; see section 2.9 for more information about this.

2.2.4 Courses

Each course ID and course name (co-)taught by the expert is included as well as the URL of that expert's page in the Studyguide (in <courses_url>). The course descriptions themselves were also extracted and can be found in the dutch/courses and english/courses directories. See the course descriptions section for more information.

2.2.5 Research

Research descriptions can be added to a separate Webwijs research page. These research description pages are automatically generated, but users have to add their own description. Only 329 Dutch and 313 English research descriptions were added by the 1168 experts in Webwijs. The original URL is included in the <research_url> tag. If there was no research description available, then the <research> section was not included in the expert profile.

2.2.6 Expertise areas

The <expertise> section contains all the expertise areas (or topics) that an expert selected for themselves. If the expert included a textual description of their expertise, this was also included in the <description> tag. Each topic has a unique ID in Webwijs and there were 1491 Dutch topics in Webwijs in October 2006. Not every Dutch topic had an English translation, so only 981 English topics were available.

Of the 1168 experts, 425 selected no Dutch topics at all and 441 selected no English topics. Active participation in Webwijs is voluntary, so this means 425 experts have not used or updated their auto-generated Webwijs profile. The difference between 441 and 425 is due to the fact that not every Dutch topic has an English translation.


2.3 Course descriptions

Crawled: 2006.10.06
Directory: dutch/courses and english/courses

Many Webwijs experts teach courses at the university and are listed in a separate system called the 'Studiegids' (http://studiegids.uvt.nl) that contains all the course descriptions. The Studiegids is also bilingual so many course descriptions are available in Dutch or English.

We crawled all 840 course descriptions that were associated with the experts in Webwijs. Each course has a separate 6-character course ID (e.g. 880404), and each course description was stored in a separate XML file for each language (e.g. course.NL.880404.xml and course.EN.880404.xml). The files have the following structure:

<course id="880404">
  <name>
      Communicatietechnologie
  </name>
  <description>
      De student weet wat [...] twee praktische opdrachten.
  </description>
  <experts>
    <expert anr="797847">
    <expert anr="946583">
  </experts>
</course>

If no description was available, then the <description> tag was left empty (but still included!). Sometimes an English course description was entered in both the English and the Dutch version of the course catalog. This has been filtered out as much as possible, but certain English descriptions remain in the Dutch overview (and vice versa). We also included a separate list of the course IDs associated with each expert in experts+courses.list, which is located in the lists directory. Finally, the course descriptions were also converted to the standard TREC format and included for ease of use; see section 2.9 for more information about this.
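
A minimal Python sketch of reading one course description file follows. The structure shown above lists <expert> elements without a closing slash, and whether the files on disk are strictly well-formed is not stated here, so this sketch deliberately uses regular expressions instead of an XML parser. The function name and the returned dictionary layout are illustrative assumptions.

```python
import re

# Sample text modelled on the structure shown above, with an empty
# <description> to illustrate the "left empty but still included" case.
SAMPLE_COURSE = """<course id="880404">
  <name>
      Communicatietechnologie
  </name>
  <description>
  </description>
  <experts>
    <expert anr="797847">
    <expert anr="946583">
  </experts>
</course>"""


def read_course(text):
    """Tolerantly extract the ID, name, description, and expert ANRs."""
    course_id = re.search(r'<course id="([^"]+)"', text)
    name = re.search(r"<name>(.*?)</name>", text, re.DOTALL)
    description = re.search(r"<description>(.*?)</description>", text, re.DOTALL)
    return {
        "id": course_id.group(1) if course_id else None,
        "name": name.group(1).strip() if name else "",
        # An empty <description> yields an empty string, not None.
        "description": description.group(1).strip() if description else "",
        "experts": re.findall(r'<expert anr="(\d+)"', text),
    }


course = read_course(SAMPLE_COURSE)
print(course["id"], course["name"], course["experts"])
```

If the files turn out to be well-formed, a regular XML parser would be the more robust choice.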


2.4 Full text publications

Crawled: 2006.11.28
Directory: extra/full-text-publications

We obtained 3761 full text publications from the ARNO institutional repository of Tilburg University. Of these 3761 publications 1880 were successfully converted to plain text format (many PDFs contained only scanned images which could not be converted). If a publication is available as full text, then its filename is specified in the <filename> tag of the <publication> record. Only ARNO publication records contain this <filename> tag and these records can be uniquely identified by their <arno-id> tag.

Some ARNO records came with the publication's language pre-specified; if so, this was added in the <language> tag. For the other (METIS) records, we performed language identification using textcat on the citation information alone and included the language in the <guessed-language> tag. A small number of publications were written in Spanish, Portuguese, or German, but the majority are in Dutch or English.
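
The way the two language tags could drive the assignment of a publication to the profile versions from section 2.2.1 can be sketched in Python as follows. Publication records are modelled as plain dictionaries, and the language values "dutch" and "english" are assumptions; the actual strings stored in the <language> and <guessed-language> tags are not specified in this documentation.

```python
def resolve_language(record):
    """Return (language, source_tag) for a publication record (a dict).

    A pre-specified <language> takes precedence over a textcat guess.
    """
    if record.get("language"):
        return record["language"], "language"
    if record.get("guessed-language"):
        return record["guessed-language"], "guessed-language"
    return "unknown", None


def profile_versions(record):
    """List the profile versions a publication would be added to.

    The 'all' versions receive every publication; the language-specific
    versions only receive publications identified as that language.
    """
    language, _ = resolve_language(record)
    versions = ["dutch.all", "english.all"]
    if language in ("dutch", "english"):  # hypothetical tag values
        versions.append(language)
    return versions


print(profile_versions({"language": "dutch"}))
print(profile_versions({"guessed-language": "german"}))
```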

The publications were also converted to the standard TREC format and included for ease of use; see section 2.9 for more information about this.


2.5 Homepages

Crawled: 2006.10.11
Directory: extra/homepages

We collected the academic homepages of the UvT experts who specified a URL in their expert profile (in the <homepage> tag). These URLs were crawled using wget v1.10.2 and all website files were saved in a separate directory for each expert (using the options -nd -r -np). We removed all images, movies, audio files and all other file formats that do not represent text. The resulting plain text versions were converted to the standard TREC format and included; see section 2.9 for more information about this. Considerable effort was expended to automatically extract as much usable text as possible from the homepages; however, researchers who want access to the raw data are welcome to contact us.
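
For illustration, the crawl command for one expert could be assembled as in the Python sketch below. Only the options -nd, -r and -np are documented above; the use of wget's -P (directory prefix) option to obtain one directory per expert, the root directory name, and the example ANR and URL are assumptions. The sketch only builds the command and does not run it.

```python
from pathlib import PurePosixPath


def wget_command(anr, url, root="homepages"):
    """Build a wget command line that crawls one homepage.

    -nd : do not recreate the remote directory hierarchy
    -r  : recursive retrieval
    -np : never ascend to the parent directory
    -P  : save files under the given directory (assumed here, one per ANR)
    """
    target = PurePosixPath(root) / anr
    return ["wget", "-nd", "-r", "-np", "-P", str(target), url]


# Hypothetical expert and URL, for illustration only.
print(" ".join(wget_command("123456", "http://example.org/~jdoe/")))
```

The resulting list can be handed to subprocess.run if an actual crawl is intended.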


2.6 Topic hierarchy

Extracted: 2007.04.26
Directory: extra/topic-hierarchy

We obtained the thesaurus used in the Webwijs system from its developers and it can be found in XML form in thesaurus.xml. Below is a short example of typical thesaurus entries:

<topic id="1369" name_NL="computergebruik" name_EN="computer_use" />
<topic id="5801" name_NL="computerlinguistiek" name_EN="computer_linguistics">
  <NT id="3185" name_NL="spraaktechnologie" name_EN="speech_technology" />
  <NT id="1796" name_NL="taaltechnologie" name_EN="language_technology" />
  <BT id="5803" name_NL="toegepaste_taalwetenschap" name_EN="applied_linguistics" />
  <UF id="1790" name_NL="taal_en_kunstmatige_intelligentie" name_EN="language_and_ai" />
</topic>
<topic id="3790" name_NL="constitutionele_vraagstukken" name_EN="constitutional_issues">
  <USE id="1376" name_NL="constitutioneel_recht" name_EN="constitutional_law" />
</topic>
<topic id="2074" name_NL="consument_en_euro" name_EN="the_consumer_and_the_euro">
  <BT id="1378" name_NL="consumentengedrag" name_EN="consumer_behaviour" />
  <RT id="1976" name_NL="geld" name_EN="money" />
</topic>

Each topic has an ID, a Dutch name, and (if available) an English name. Relations between topics are represented by listing topics as daughter nodes of another topic. Topics can be related to each other in one of five ways:

NT The topic is a Narrower Term of the encompassing topic in the thesaurus. For example, speech_technology (3185) is a daughter node of computer_linguistics (5801) in the topic hierarchy.
BT The topic is a Broader Term of the encompassing topic in the thesaurus. So in the above thesaurus example, applied_linguistics (5803) is a mother node of computer_linguistics (5801) in the topic hierarchy.
RT The topic is a Related Term of the encompassing topic in the thesaurus. For instance, money (1976) and the_consumer_and_the_euro (2074) are related topics according to the thesaurus.
USE Each topic can have multiple synonyms. A topic that is marked with USE is the preferred term to USE instead of the encompassing topic in the thesaurus. So in the above example, constitutional_law (1376) is preferred over constitutional_issues (3790).
UF Each topic can have multiple synonyms. A topic that is marked with UF is a synonym, i.e. a term that can be Used For the encompassing topic in the thesaurus. So in the above example, language_and_ai (1790) may be used instead of computer_linguistics (5801), but it is not the preferred term.
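
The five relation types can be collected from thesaurus entries along the following lines. This is a sketch only: the enclosing <thesaurus> root element is an assumption (the root of thesaurus.xml is not described here), and the sample reuses three of the entries from the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RELATION_TAGS = {"NT", "BT", "RT", "USE", "UF"}

# Entries copied from the example above; the <thesaurus> wrapper is assumed.
SAMPLE_THESAURUS = """<thesaurus>
<topic id="5801" name_NL="computerlinguistiek" name_EN="computer_linguistics">
  <NT id="3185" name_NL="spraaktechnologie" name_EN="speech_technology" />
  <NT id="1796" name_NL="taaltechnologie" name_EN="language_technology" />
  <BT id="5803" name_NL="toegepaste_taalwetenschap" name_EN="applied_linguistics" />
  <UF id="1790" name_NL="taal_en_kunstmatige_intelligentie" name_EN="language_and_ai" />
</topic>
<topic id="3790" name_NL="constitutionele_vraagstukken" name_EN="constitutional_issues">
  <USE id="1376" name_NL="constitutioneel_recht" name_EN="constitutional_law" />
</topic>
<topic id="2074" name_NL="consument_en_euro" name_EN="the_consumer_and_the_euro">
  <BT id="1378" name_NL="consumentengedrag" name_EN="consumer_behaviour" />
  <RT id="1976" name_NL="geld" name_EN="money" />
</topic>
</thesaurus>"""


def read_relations(xml_text):
    """Map (topic ID, relation type) to the list of related topic IDs."""
    relations = defaultdict(list)
    for topic in ET.fromstring(xml_text).iter("topic"):
        for child in topic:
            if child.tag in RELATION_TAGS:
                relations[(topic.get("id"), child.tag)].append(child.get("id"))
    return dict(relations)


relations = read_relations(SAMPLE_THESAURUS)
print(relations[("5801", "NT")])   # narrower terms of computer_linguistics
print(relations[("3790", "USE")])  # preferred term for constitutional_issues
```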

In addition to the thesaurus, we also included the topic hierarchy from the thesaurus as a list of mother-child pairs (the BT relations) in topic-hierarchy.list. The file topic-hierarchy.top-nodes.list contains a list of all topics that occur solely as top nodes. Also in this directory are topics.EN.list and topics.NL.list, two lists of all topics in TREC format. See section 2.9 for more information.

The thesaurus contains a grand total of 1491 topics, of which only 132 are top nodes in the topic hierarchy (NT/BT). This hierarchy has an average topic chain length of 2.65 and a maximum length of 7 topics. There are 97 different cliques of connected topics in the Webwijs thesaurus. There are 446 different pairs of related topics (RT) with 96 different cliques. On average, a topic is related to 1.77 other topics. As for synonymous topics: there are 102 different groups of synonymous topics. In those groups a topic has an average of 1.27 synonymous topics (not counting the difference between USE and UF). The average number of synonyms for any Dutch topic in the entire thesaurus is 0.16 and for any English topic 0.11.
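
To make the notions of top node and chain length concrete, the Python sketch below derives both from a list of mother-child pairs. The pairs are taken from the thesaurus example in this section and are held in memory, since the on-disk format of topic-hierarchy.list is not described here. Counting chain length in topics, from a top node down to the deepest descendant, is our reading of the figures above and may differ from how they were originally computed.

```python
def top_nodes(pairs):
    """Topics that occur as a mother but never as a child."""
    mothers = {mother for mother, _ in pairs}
    children = {child for _, child in pairs}
    return mothers - children


def longest_chain(pairs):
    """Length, in topics, of the longest mother-to-descendant chain."""
    children_of = {}
    for mother, child in pairs:
        children_of.setdefault(mother, []).append(child)

    def depth(node, seen):
        if node in seen:  # guard against accidental cycles in the data
            return 0
        below = [depth(c, seen | {node}) for c in children_of.get(node, [])]
        return 1 + max(below, default=0)

    return max((depth(t, frozenset()) for t in top_nodes(pairs)), default=0)


# (mother, child) pairs derived from the BT/NT relations in the example.
PAIRS = [
    ("5803", "5801"),  # applied_linguistics > computer_linguistics
    ("5801", "3185"),  # computer_linguistics > speech_technology
    ("5801", "1796"),  # computer_linguistics > language_technology
    ("1378", "2074"),  # consumer_behaviour > the_consumer_and_the_euro
]

print(sorted(top_nodes(PAIRS)), longest_chain(PAIRS))
```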


2.7 University hierarchy

Extracted: 2007.02.28
Directory: extra/university-hierarchy

A four-level hierarchy of Tilburg University was extracted from the Dutch and English expert profiles. The top level node is Tilburg University itself. The next level contains the faculties (and organizational units on the same level such as the University Office), extracted from the <faculty> tag. The third level contains all the departments (and institutes) within each faculty (extracted from the <institute> tag) and the fourth and final level contains all the experts in each department.

Each organizational unit was assigned a separate three-digit ID. The top node (Tilburg University) was assigned 000, and the ID 001 was assigned whenever an expert has no listed organizational unit.

The Dutch hierarchy has 99 different organizational units of which 22 are faculty-level units and 78 departmental units. The English hierarchy has 90 organizational units of which 22 are faculty-level units and 69 departmental units. The difference is due to some Dutch units not having an English translation. The Dutch IDs match the English IDs where they are equivalent units. The hierarchy is available in XML format (university-hierarchy.EN.xml and university-hierarchy.NL.xml) and in a plain text format that lists the unit chain for each expert separately (university-hierarchy.EN.list and university-hierarchy.NL.list).
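
The construction of the per-expert unit chains can be sketched in Python as follows. The IDs 000 (top node) and 001 (missing unit) follow the description above; numbering the remaining units sequentially in order of first appearance is purely an assumption, as are the example experts, faculties and departments.

```python
def build_hierarchy(experts):
    """Assign unit IDs and build the four-level chain for each expert.

    experts: iterable of (anr, faculty, institute) tuples, where a missing
    faculty or institute is given as None.
    """
    unit_ids = {"Tilburg University": "000", None: "001"}

    def unit_id(name):
        # Hypothetical numbering: next free three-digit ID on first sight.
        if name not in unit_ids:
            unit_ids[name] = f"{len(unit_ids):03d}"
        return unit_ids[name]

    chains = {}
    for anr, faculty, institute in experts:
        chains[anr] = ["000", unit_id(faculty), unit_id(institute), anr]
    return unit_ids, chains


# Hypothetical experts, for illustration only.
EXPERTS = [
    ("111111", "Faculty A", "Department X"),
    ("222222", "Faculty A", None),
]
unit_ids, chains = build_hierarchy(EXPERTS)
print(chains["111111"], chains["222222"])
```

Note that this sketch keys units by name alone, so a faculty and a department with the same name would share an ID.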


2.8 Lists

Directory: lists

We created a number of useful plain text lists that were used in creating the UvT Expert Collection and the expert finding and profiling experiments. All of these lists can be derived from the profiles in the collection, but are included for the sake of convenience. We included the following lists in the collection:

experts.list The list of all the expert ANRs.
experts-with-topics.EN.list A list of experts (by ANR) who selected at least one English topic.
experts-with-topics.NL.list A list of experts (by ANR) who selected at least one Dutch topic.
topics+names.list A list of all topic IDs and the Dutch (second column) and English (third column) names. A - is used if there was no English translation of the topic.
experts+topics.EN.list A list of expert ANRs and the English topics they selected (by topic ID).
experts+topics.NL.list A list of expert ANRs and the Dutch topics they selected (by topic ID).
experts+courses.list A list of ANRs and the associated course IDs taught by the experts.
topics+experts.EN.list A list of English topic IDs and the experts (by ANR) that selected them; this is an inverted version of experts+topics.EN.list.
topics+experts.NL.list A list of Dutch topic IDs and the experts (by ANR) that selected them; this is an inverted version of experts+topics.NL.list.
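
The relationship between the experts+topics and topics+experts lists amounts to inverting a mapping, as the Python sketch below shows. The line layout assumed here (an ANR followed by whitespace-separated topic IDs) is a guess, since the exact format of the list files is not documented in this section, and the sample ANRs are hypothetical.

```python
from collections import defaultdict


def parse_list(lines):
    """Parse lines of the assumed form 'key value value ...' into a dict."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if fields:
            mapping.setdefault(fields[0], []).extend(fields[1:])
    return mapping


def invert(mapping):
    """Turn a key -> values mapping into a value -> keys mapping."""
    inverted = defaultdict(list)
    for key, values in mapping.items():
        for value in values:
            inverted[value].append(key)
    return dict(inverted)


# Hypothetical ANRs paired with topic IDs from the thesaurus example.
SAMPLE_LINES = ["111111 1976 2074", "222222 1976"]
experts_topics = parse_list(SAMPLE_LINES)
topics_experts = invert(experts_topics)
print(topics_experts)
```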

2.9 TREC format

Directory: extra/trectext and extra/qrels

For our experiments we converted the information from the profiles, course descriptions, homepages and publications to the format used in the TREC tracks. The .trectext files contain all the plain text sources in one concatenated file; the .assoc files contain the associations between experts and the sources in the .trectext files. The following sources are available in TREC format:

course-descriptions The course descriptions in English and Dutch. See sections 2.2.4 and 2.3 for more information about the course descriptions.
homepages The file formats that in some way represent text (like PDF, DOC, HTML, etc.) were automatically converted to plain text and then to the TREC format. See section 2.5 for more information about how the homepages were collected.
publications The publications are available in 3 versions: all the English publications in pub.EN.trectext, all the Dutch publications in pub.NL.trectext, and all the publications together (Dutch, English, and unknown) in pub.all.trectext. For citations, the <title> and <intitle> fields were used; for the full text publications we used these fields, the abstracts (when available) and the full text. See sections 2.2.1 and 2.4 for more information about the publications.
research-descriptions The research descriptions for all experts in English and Dutch. This information is also available in raw XML format from the expert profiles; see section 2.2.5 for more information about these research descriptions.

We also included the topic files in TREC format (topics.EN.list and topics.NL.list) in the extra/topic-hierarchy directory. We also included the files that contain the query relevance assessments for both expert finding and expert profiling tasks (see [2] and [3] for clear definitions of both tasks). These files can be found in the extra/qrels directory.
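
As a rough sketch of what such a conversion involves, the Python code below wraps plain-text sources in the commonly used TREC text layout (<DOC>, <DOCNO>, <TEXT>) and produces matching expert-document association lines. The exact tags, the document numbering scheme, and the 'ANR DOCNO' layout of the association lines are assumptions; the files in the collection may differ in these details.

```python
def to_trectext(docno, text):
    """Wrap one plain-text source in a minimal TREC-style document."""
    return (
        "<DOC>\n"
        f"<DOCNO>{docno}</DOCNO>\n"
        "<TEXT>\n"
        f"{text.strip()}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


def build_collection(sources):
    """Concatenate documents and list expert-document associations.

    sources: iterable of (docno, anr, text) tuples.
    Returns (trectext content, association content) as two strings.
    """
    documents, associations = [], []
    for docno, anr, text in sources:
        documents.append(to_trectext(docno, text))
        associations.append(f"{anr} {docno}")  # assumed line layout
    return "".join(documents), "\n".join(associations)


# Course 880404 and one of its experts, as listed in section 2.3; the
# document number and the use of the course name as text are illustrative.
trectext, assoc = build_collection(
    [("course.NL.880404", "797847", "Communicatietechnologie")]
)
print(trectext)
print(assoc)
```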


3.  Collection statistics

Table 1 below lists some descriptive statistics of the collection:

                                                            Dutch     English

no. of experts                                              1168      1168
no. of experts with ≥ 1 topic                               743       727
no. of topics                                               1491      981
no. of expert-topic pairs                                   4318      3251

avg. no. of topics/expert                                   5.8       5.9
max. no. of topics/expert (no. of experts)                  60 (1)    35 (1)
min. no. of topics/expert (no. of experts)                  1 (74)    1 (106)
avg. no. of experts/topic                                   2.9       3.3
max. no. of experts/topic (no. of topics)                   30 (1)    30 (1)
min. no. of experts/topic (no. of topics)                   1 (615)   1 (346)

no. of experts with an academic homepage                    318       318
no. of experts who teach ≥ 1 course                         318       318
avg. no. of course descriptions per teaching expert         3.5       3.5
no. of experts with a research description                  329       313
no. of experts with publications                            734       734
avg. no. of publications (citation + full text) per expert  27.0      27.0
avg. no. of citations per expert                            25.2      25.2
avg. no. of full-text publications per expert               1.8       1.8

Table 1: Descriptive statistics of the UvT Expert Collection
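
The topic-related rows of Table 1 can be derived from the expert-topic pairs alone. For instance, the 4318 Dutch pairs spread over the 743 experts with at least one topic give 4318 / 743 ≈ 5.8 topics per expert. The Python sketch below computes the same kind of summary on a toy set of pairs; the ANRs are hypothetical, and the averaging over experts (or topics) that occur in at least one pair is our assumption about how the table was computed.

```python
from collections import Counter


def summarize(counts):
    """Average, maximum and minimum of a Counter's values.

    max and min are returned as (value, number of keys with that value),
    mirroring the 'value (count)' notation of Table 1.
    """
    values = list(counts.values())
    if not values:
        return {"avg": 0.0, "max": (0, 0), "min": (0, 0)}
    highest, lowest = max(values), min(values)
    return {
        "avg": round(sum(values) / len(values), 1),
        "max": (highest, values.count(highest)),
        "min": (lowest, values.count(lowest)),
    }


def pair_statistics(pairs):
    """Summarize topics per expert and experts per topic."""
    per_expert = Counter(expert for expert, _ in pairs)
    per_topic = Counter(topic for _, topic in pairs)
    return {
        "topics_per_expert": summarize(per_expert),
        "experts_per_topic": summarize(per_topic),
    }


# Toy data: hypothetical ANRs, topic IDs from the thesaurus example.
TOY_PAIRS = [("111111", "1976"), ("111111", "2074"), ("222222", "1976")]
print(pair_statistics(TOY_PAIRS))
```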

Figures 1-4 below show the distribution of experts per topic and topics per expert for both languages separately:


Figure 1: Topic count per expert for English

 

Figure 2: Topic count per expert for Dutch


Figure 3: Expert count per topic for English

 

Figure 4: Expert count per topic for Dutch


4.  References

[1] TextCat Language Guesser, G. van Noord. URL: http://www.let.rug.nl/~vannoord/TextCat
[2] Determining Expert Profiles (With an Application to Expert Finding), K. Balog and M. de Rijke. In: IJCAI '07: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence, pp. 2657-2662, 2007.
[3] Formal Models for Expert Finding in Enterprise Corpora, K. Balog, L. Azzopardi, and M. de Rijke. In: S. Dumais, E.N. Efthimiadis, D. Hawking, and K. Järvelin, editors, 29th Annual International ACM SIGIR Conference on Research & Development on Information Retrieval, pp. 43-50, 2006.
[4] Broad Expertise Retrieval in Sparse Data Environments, K. Balog, T. Bogers, L. Azzopardi, M. de Rijke, and A. van den Bosch. In: SIGIR '07: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 551-558, 2007.

5.  Disclaimer

By using these resources you agree to the disclaimer.