EBI Search indexing

The EBI provides a global search service across most of the data sources available at the institute.

This section's content and merit are in need of review.

Glossary of terms used in the search guidelines

  • Document (Lucene term): A virtual document consisting of a set of fields. A document can have several fields with the same name.
  • Field (Lucene term): Part of a document (see above). A field is a <name, content> pair. The name provides metadata, e.g. a row name in a database or the different parts of a web page or email (header, body, …). The content contains the actual data. Both parts of a field are indexed, but the name is only available as structural information, i.e. one can search within a specific field name, but a field name will usually not appear in the search results.
  • Domain (EB-eye term): A data source, most of the time a database. For example: UniProtKB, PDBe, …
  • Domain tree, hierarchy (EB-eye term): All domains in the EB-eye are organised in a tree. Nodes of the tree are, for example, Protein Sequences or Small Molecules; leaves are, for example, UniProtKB (parent: Protein Sequences) or ChEBI (parent: Small Molecules).
  • Data provider (EB-eye term): A group or person who provides the data for a domain.
  • SERP: search engine results page
  • SEO: search engine optimisation

How do I make my data available for searching on the EBI website?

What is the EB-eye and how can it help me?

The Lucene-based EBI search engine (also known as EB-eye) aims to provide unified summary results for global searches over the majority of the EBI databases. The engine indexes a meaningful subset of data from the databases and returns summary information containing links to the original databases.

The engine has been built to accommodate the vast variety of data available across all databases at the EBI. Most EBI databases already have flat-file or XML dumps, which the search engine uses. Some databases use the XML format defined specifically for EB-eye to dump their data and make them available through the search engine.

An overview of how the EB-eye works

Automatic Updates

To make maintenance easier, and to guarantee the most up-to-date data, EB-eye has a mechanism for updating data automatically. At least once a day, all data sources are checked and analysed to identify possible updates, and the system automatically re-indexes them.

After each update, a footprint is generated; it serves as a signature for the data source. Before each new update a new footprint is generated and compared to the previous one. If they are equal, i.e. the data has not changed, no update is needed; if they differ, the data is updated.
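The footprint check can be sketched as follows. This is an illustrative Python sketch, not EB-eye's actual implementation; it assumes a footprint is simply a hash over file names and contents, and the function names are hypothetical:

```python
import hashlib

def footprint(files):
    """Compute a signature for a data source.

    files: iterable of (name, content_bytes) pairs. Sorting makes the
    signature independent of the order in which files are listed.
    """
    h = hashlib.sha256()
    for name, data in sorted(files):
        h.update(name.encode())
        h.update(hashlib.sha256(data).digest())
    return h.hexdigest()

def needs_update(previous_footprint, files):
    """Re-index only when the newly computed footprint differs."""
    return footprint(files) != previous_footprint
```

Any change to a file's content (or the set of files) produces a different footprint, which triggers re-indexing; identical data produces an identical footprint and is skipped.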

Data Indexing

In order to index data properly, the EB-eye indexer needs the following information:

  • Analyser: Extracts information out of documents and transforms it into tokens, which can then be indexed. For example, dates can be written in different formats; a date analyser tries to detect the format used, transforms it into an internal representation and returns this as a token. EB-eye provides several analysers. Most of them are derived from Lucene's analysers, but some were developed in-house, e.g. for chemical notation. If no analyser is available for a field, the standard analyser is used, which means a user can only find something by querying the exact term.
  • Store: Whether the data should be stored. Possible values are YES, NO and COMPRESSED. At first glance NO might be confusing, but EB-eye can index data without storing it. This can be useful, for example, for keywords: an entry can have several keywords, and these can be indexed with the entry so that the entry is found when one of the keywords is searched for. However, the keywords will not appear on the result page because they were not stored.
  • Boost: A value for the "importance" of a field. EB-eye can give fields a boost factor; the higher the boost factor, the higher the entry will be ranked (in reality it is a little more complicated). Boost factors should be used, if at all, only for very important fields, like IDs. For more information about boost factors, please refer to the Lucene documentation.
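To illustrate why the choice of analyser matters, here is a minimal Python sketch. Both analyser functions are simplified stand-ins, not Lucene's or EB-eye's actual analysers: with no real analysis, a query matches only the exact stored value.

```python
import re

def standard_analyser(text):
    """Roughly like a basic analyser: lower-case and split into word tokens."""
    return [t for t in re.split(r"\W+", text.lower()) if t]

def keyword_analyser(text):
    """No analysis: the whole value becomes a single token."""
    return [text]

def matches(analyser, field_value, query):
    """A field matches when every analysed query token appears in the field."""
    return set(analyser(query)) <= set(analyser(field_value))
```

With the standard analyser, searching for "kinase" finds a field containing "Cyclin-dependent Kinase"; with no analysis, only the exact string matches.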

There are various input-indexer implementations available in EB-eye, but two are used for most domains. Both are based on parsers that describe the structure of the source files and index them.

By default, the source data are indexed in a distributed environment. The files are split into several chunks (sets of entries), or grouped into sets of files, and indexed in parallel on several machines. This allows us, if necessary, to index all domains in one go in less than half a day.

Parsing and Grammar

The previous section explained what information is needed in order to index data. The input indexer usually relies on parsers, which in turn rely on a grammar to extract information from the input files.

The grammar is a lexical representation of the data, associated with a set of actions to be executed. It captures the structure of an entry and its various fields in the data source. Actions can be attached to this structure, so that an action is executed for each entry and for each of its fields (e.g. dates, cross-references, authors), typically extracting the information and indexing it. A set of predefined actions using the information from the configuration files is available to ease the indexing. This is why it is important to have a detailed description of the data file format, to make sure the parsers properly match the corresponding data structure.

Data Searching

After the data has been indexed, it is available for searching through the global search engine. Several types of searches are possible. Note that the following subsections describe the internal search mechanisms; for the user interface, skip to the section "Web Interface".

Basic Searches

The simplest search is the global search in all the fields indexed for a particular domain.

A more specific search is the field-specific search, where a query term is only searched in a particular field. This type of search is typically what an advanced search offers. Every field indexed for a domain will be available for a field-specific search, including the cross-reference fields.

Cross References

An important feature of the search engine is the ability to use cross-references to navigate between different domains by jumping from entry to entry.

During the indexing of the data, the cross-reference information is extracted from the source files and stored as cross-reference fields in the index. Lucene imposes some restrictions that had to be bypassed; as a result, only the name of the referenced database and the ID of the referenced entry are stored.

When a cross-reference search is launched, the system tries to find the cross-references by looking for exact matches on the stored IDs. This means that if the cross-references indexed for a particular domain use not IDs but accession numbers or another kind of identifier, the system will not be able to retrieve the cross-reference. In this case the name for the cross-reference needs to be specified in the configuration file.

Web Interface

The previous sections explained how the EB-eye search engine works internally, from the automatic updates and the indexing of the data to the handling of several types of queries and the retrieval of search results.

Another important element of the EB-eye is its web interface. Its design should be kept as simple as possible. The following text provides some basic guidelines.

Search Form

A basic search form should be present on all EBI pages and always at the same position. The syntax of the basic search form should be:

  • Terms in a query are logically AND combined.
    A term is the atomic entity for a search field. A term is either a sequence of characters separated by spaces, or words enclosed in quotes: term1 "this is one term because it is between quotes".
    Note that because everything between quotes is considered one term, only entries containing exactly the string between quotes will match.
  • The pattern id:term will match only fields with the name id and the content term.
    This restricts the search to a particular field. Although not recommended, it would be possible to allow something like "go:12345" not to be treated as id:term, because users in some domains are used to searching with names like this. It should be stressed that this is not recommended, as it will confuse all other users.
  • Backslash is the escape character.
    Escape characters make it possible to search for characters used by the search syntax, for example "\:" to search for something containing a ":".
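The three syntax rules above can be sketched as a small query tokenizer. This Python sketch is illustrative only and does not reproduce the EB-eye parser:

```python
def parse_query(query):
    """Split a query into (field, term) pairs following the rules above:
    whitespace separates AND-combined terms, quotes group a phrase into one
    term, 'field:term' restricts the search to a field, and a backslash
    escapes the next character. field is None for an all-fields search.
    """
    terms, buf, field = [], [], None
    in_quotes = escaped = False
    for ch in query:
        if escaped:
            buf.append(ch)          # escaped character taken literally
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == ":" and not in_quotes and field is None and buf:
            field = "".join(buf)    # text before ':' is the field name
            buf = []
        elif ch.isspace() and not in_quotes:
            if buf:
                terms.append((field, "".join(buf)))
            buf, field = [], None
        else:
            buf.append(ch)
    if buf:
        terms.append((field, "".join(buf)))
    return terms
```

For example, `parse_query('id:P12345 "heat shock"')` yields one field-restricted term and one quoted phrase term.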
Result Page

Presenting search results is part of the focus group exercise and is therefore likely to change. The following text therefore gives only very rough guidelines for result pages.

When searching for information, the way the results are displayed is essential for the user experience. Too little detail is annoying when browsing through the results; too much information, on the other hand, might discourage users from using the service.

  • Results should be displayed in a list.
    In some areas it might be useful to display entries in another layout. However, before doing so, please try to fit the results into a list somehow.
  • The first line of every entry should be a link to the original data.
  • Additional text for every entry should not span more than 8 lines.
  • Links should always be underlined.
  • Different fonts should be used sparingly and consistently! E.g. only links should be underlined, and only searched terms should be bold, if at all.

How to get the EB-eye search engine to index your data

EB-eye can parse and index data files in different formats, but it also defines its own XML format (XML4dbDumps). This can be used for databases that currently have no flat-file or XML dump and where there is no requirement to dump the whole database in a specific format.

As a rule of thumb:

An existing file format is preferable if:

  • The format is broadly known/used by the users (e.g. PDB, taxonomy, medline).
  • The file contains entries of the same kind, each entry identified by an ID (an RDF format is NOT of this kind!)
  • The file is an XML file defined by a schema/DTD/RELAX NG, or a flat file 'easily' parsed by a lexical parser, meaning that ideally you can write a Backus–Naur form of the file format.

XML4dbDumps format is preferable if:

  • There is no existing dump of the database.
  • You want to have control over what is indexed and presented by only dumping relevant information.
  • You want to be able to add information easily without the EB-eye team having to create a new parser each time.
  • The file is not easily parseable by a lexical parser.

Note: whichever file format is used, the entries can be present in one or several files. There is no restriction on the number of entries per file.

To get data indexed by the EB-eye search engine, two things are needed:

  • Data source
  • Data syntax and semantics
Data Source

In order to ease the maintenance of the EB-eye and to guarantee the most up-to-date data, an automatic data update mechanism has been implemented. If updates are available, the new data is downloaded, uncompressed if necessary, re-indexed and redeployed to be visible to users. Additionally, metadata (release, release date, number of entries, ...) are generated from the data or from a release note for verification and information purposes.

The following information is needed for this step:

  • Root source URI
    This is the root of the source files. The URI can define a path on the file system, an ftp URL or an http URL. How to choose the URI:
    • File system: best solution as it's faster and can avoid unnecessarily copying data over. Please contact eb-eye@ebi.ac.uk to be sure the file system is mounted on our servers.
    • ftp: if the data is already available via FTP or the file system is not visible from the EB-eye servers.
    • http: if the data is only available via http. This method should only be used as a last resort, as it has several drawbacks when retrieving multiple files.
  • File pattern
    This is a regular expression matching the files to download. The files can be compressed (supported formats: zip, jar, tar, tgz, tbz2, gz, bz2).
  • Excluded sub directories
    By default the files are retrieved from all the sub directories of the root source. You can exclude sub directories by defining a regular expression matching these directories' names.
  • Metadata file
    The metadata file must at least contain the release number, the release date and the number of entries. This information can be retrieved from an existing release note (or from the data file itself if it appears at the beginning of the file). If such a file does not exist, a file with the following simple format has to be created:
    # Comment
    release=[release number or release date if no release defined]
    release_date=[DD-MMM-YYYY]
    entries=[number of entries]
    

    You don't need to create a metadata file if you use the XML4dbDumps format as it already contains the information.
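A parser for this simple metadata format can be sketched in a few lines. This is illustrative Python; the function name is hypothetical:

```python
def parse_metadata(text):
    """Parse the key=value metadata format described above.

    Blank lines and lines starting with '#' (comments) are skipped;
    everything else is split on the first '='.
    """
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        meta[key.strip()] = value.strip()
    return meta
```

The resulting dictionary would then be checked for the mandatory keys release, release_date and entries.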

e.g. for UniProt:
Root source URI: "/ebi/ftp/private/uniprot/4EBIES/knowledgebase/"
File pattern: "uniprot_.*\.dat\.gz"
Excluded sub dirs: ".*"
Metadata file: "/ebi/ftp/private/uniprot/4EBIES/knowledgebase/relnotes.txt"
e.g. for MSD:
Root source URI: "ftp://ftp.ebi.ac.uk/ebeye_msd"
File pattern: "MSDCHEM\.xml"
Excluded sub dirs: ".*"
Metadata file: -no need as it's a XML4dbDumps format-
e.g. for GO:
Root source URI: "http://archive.geneontology.org/latest-termdb"
File pattern: "go_daily-termdb\.rdf-xml\.gz"
Excluded sub dirs: ".*"
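Putting the root source URI, file pattern and excluded sub-directories together, file selection can be sketched as follows. This is illustrative Python, assuming paths are given relative to the root source URI; it is not EB-eye's actual code:

```python
import re

def select_files(paths, file_pattern, excluded_dirs):
    """Keep the files whose name matches file_pattern and which do not sit
    under a sub-directory matching excluded_dirs (both regular expressions,
    as in the examples above).
    """
    fp, ex = re.compile(file_pattern), re.compile(excluded_dirs)
    selected = []
    for path in paths:
        *dirs, name = path.split("/")
        if any(ex.fullmatch(d) for d in dirs):
            continue                    # file lives in an excluded directory
        if fp.fullmatch(name):
            selected.append(path)
    return selected
```

With the UniProt example above (pattern `uniprot_.*\.dat\.gz`, excluded sub-dirs `.*`), only matching files directly under the root are retained.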

Data syntax and semantics

In order to index data, the EB-eye search engine needs to know the format of the data (syntax) and how to index it (semantics). The format is needed to develop a parser, and the semantics defines which fields should be stored under which names.

If a data provider decides not to use EB-eye's XML4dbDumps format, they need to provide sufficient information about their data format so that a parser can be written for it. It is important to define these fields, and how to index them, well, as this has a huge impact on the quality and relevance of the results.

A data provider needs to provide 3 pieces of information for each field to be indexed:

  1. STORED/NOT_STORED: STORED means that the value of the field will be displayable; NOT_STORED means that the value won't be displayable but will still be searchable if the field is INDEXED. The obvious downside of STORED is that the value is saved in the index, which can grow dramatically depending on the number of entries and the size of the fields to store.
  2. INDEXED/NOT_INDEXED: INDEXED means that the value of the field will be searchable. NOT_INDEXED means it won't be searchable (but it can be STORED to display it in a summary for example).
  3. The type of the field's value. Depending on this type, the content may be analysed and indexed differently. For example, a field 'description' that contains English text will be indexed so that stop words ('I', 'a', 'the', 'of', ...) are skipped, as they are not relevant in this context. Other types could be a list of authors, keywords, a chemical reaction, ... It is also useful to specify whether the values belong to a finite set (e.g. species).

If you use the XML4dbDumps format, some fields are already defined (id, authors, keywords, date, ...). Additional fields can be defined in:
<additional_fields>
<field name="namefield1">value1</field>
<field name="namefield2">value2</field>
...
</additional_fields>
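For instance, the additional_fields fragment above could be generated with a few lines of Python. This is an illustrative sketch using the standard library; the helper name is hypothetical:

```python
import xml.etree.ElementTree as ET

def additional_fields(fields):
    """Build an <additional_fields> element, matching the XML4dbDumps
    fragment shown above, from a {name: value} mapping."""
    root = ET.Element("additional_fields")
    for name, value in fields.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")
```

The same module can be used in reverse to read the fields back out of a dump during validation.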


Another type of indexed fields are cross-references to other databases.
If the XML4dbDumps format is used they are defined as:
<cross_references>
<ref dbname="db2" dbkey="abc123"/>
<ref dbname="db3" dbkey="abcdef"/>
...
</cross_references>
These cross-references can point to either internal databases that are indexed by the EB-eye (domains) or to external resources.

Note: the external xrefs are not displayed at the moment, but they will be in the future.

The internal xrefs defined in the data can use database names different from the ones EB-eye uses, and can also use a specific field for the identifier. E.g. a database contains xrefs with dbname="swiss-prot", dbkey="Q62594"; this xref actually needs to point to the domain 'UniProtKB' and use the accession number 'Q62594'.

Note: you can add a suffix to the database name to add some 'semantics' to the cross-reference. For example, if you have xrefs to Ensembl which are actually xrefs to either transcripts or genes, you can name the fields ENSEMBL_TRANSCRIPT or ENSEMBL_GENE so that users can tell the difference between the two. Both will internally point to the Ensembl domain. Data providers need to go through their xrefs and establish which database and field they point to.
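The alias mapping described above can be sketched like this. The alias-table entries are illustrative examples taken from the text, not EB-eye's actual configuration:

```python
# Maps database names as they appear in source files to the EB-eye domain
# they actually refer to; unknown names pass through unchanged.
XREF_ALIASES = {
    "swiss-prot": "UniProtKB",
    "ENSEMBL_TRANSCRIPT": "Ensembl",
    "ENSEMBL_GENE": "Ensembl",
}

def resolve_xref(dbname, dbkey):
    """Map a raw cross-reference to an (EB-eye domain, identifier) pair,
    applying the alias table so that e.g. a 'swiss-prot' xref points at
    the UniProtKB domain."""
    return XREF_ALIASES.get(dbname, dbname), dbkey
```

Suffixed names such as ENSEMBL_GENE keep their semantics for display while resolving to the same underlying domain.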

Here is an example of the information the EB-eye team needs for the different fields:

Field name in data | Brief field description | (NOT_)INDEXED / (NOT_)STORED | Type of value (regular expression, format, semantics, ...)
[field name] | [description] | [(NOT_)INDEXED / (NOT_)STORED] | [type of the value, specific format, list of values, ...]
id | id of the entry | INDEXED, STORED | [A-Z][0-9]{4}
name | name of the entry | INDEXED, STORED | English text
last_update | last update | INDEXED, STORED | date

Information needed for cross-references to other resources:

Cross-reference | Brief xref description | Domain name / external resource referenced | Field referenced (for domains) / URL (for external db) | Comment
swiss-prot | xref to UniProtKB | UniProtKB (domain) | AC |
AFCS | xref to AFCS | AFCS (external db) | id | http://www.signaling-gateway.org/data/Y2H/cgi-bin/y2h_int.cgi?id=%{id}

Please check for every xref whether the referenced resource is an EB-eye domain: http://www.ebi.ac.uk/ebisearch/statistics.ebi

How to improve the quality of the indexing

The previous sections described how the EB-eye search engine works and explained the relationship between the configuration files, the indexing process and the web interface. The following paragraphs describe what can be done to improve the quality of the results and the user experience.

Availability of the data

The EB-eye tries to offer the most up-to-date data to its users. For this reason, an automatic update mechanism has been developed and runs every day to make sure the indexes are updated. However, the system relies on the data providers for these data and needs to know where the latest versions can be found. It is therefore important to define a static location where the EB-eye data for a domain are stored.

Another important stage of the update is the verification of the data. A clearly defined format for the source files is a good start. Most of the time a parser will be used to go through the data and index them. Unfortunately, the format of these data is sometimes not available or not up to date and as a result, writing the parsers becomes difficult and takes time. Providing a detailed description of the data structure, be it a description document, a DTD or an XML schema, will greatly help not only to write the parsers, but also to verify the source files.

Some data providers include release notes, which can be used by the automatic update to verify the data that have been indexed (the number of entries is one of the details that are really useful for verifying whether the data have been indexed correctly). Unfortunately, most of the indexed data cannot be verified because this information is missing or incorrect. Making sure that such information is available and accurate helps to guarantee the quality of the indexed data.

Relevance of the data

The parsers determine which fields and which information will be stored in the index. So, to ensure the quality of the data, data providers should establish a list of the fields to be indexed, with their names and descriptions, and how they are represented in the source files. This, together with a detailed description of the data format, will help in writing a parser and defining proper names for the fields. These names will be available in the Advanced Search, so they have to be meaningful to the user.

When using the EB-eye XML format, the names of the fields and their content must be clearly defined with the search application in mind. The additional-fields section can prove really important for improving the quality of the search. A dump with only IDs and names will never return any results when searching for common biological terms; if description or full-text additional fields are included, the search engine will provide much better results.

Another aspect to consider when selecting data to export to the EB-eye is cross-references. Providing as many cross-references to as wide a range of databases as possible will benefit users: by following cross-references they will be able to navigate easily between, and explore, the different domains within the EB-eye.

However, cross-references have to be clearly identified, otherwise they might not be properly recognised by the system. Ideally, cross-references should use a correct database name and the corresponding ID (and not an accession number), but obviously this is not always possible. If cross-references cannot be provided in the canonical way, please provide the information necessary to allow the EB-eye team to update the EB-eye configuration files with the new aliases and further cross-reference information.

Content of the results

EB-eye has only two different result pages, which can be slightly modified to improve the user experience.

The default layout displays for each entry the ID, name and description, followed by the entry links and cross-reference links. Correctly defining this information ensures coherence of the result display; therefore name and description should be stored in the index. If no obvious name or ID exists, data providers should define a meaningful name and ID. Data without an ID and name will only be indexed if the data provider can conclusively argue why they cannot be provided.

Links pointing to data have to be carefully defined. An important link is the ID field link, which redirects to the corresponding data provider's web site. The entry links should be checked and reviewed by the data providers to make sure they are correct. Obviously, all links should resolve to a valid web page. EB-eye does not check whether the link behind an ID is valid; however, it does check for every cross-reference whether the site behind it exists. Obviously, EB-eye cannot check whether the content is valid.

Sometimes the default layout is not appropriate for displaying the results. In such cases, data providers should contact the EB-eye team to discuss a possible custom layout. For every layout, default or custom, simplicity should be one of the main objectives: only the information really needed to let users decide whether to visit the original site should be included.

Have a use case that's not covered?

Please open an issue in the tracker. We'll update this living pattern library with your feedback.