The full list of supported file formats is: .doc, .docx, .htm, .html, .ods, .pdf, .tar.bz2, .tar.gz, .tei, .tgz, .tmx, .txt, .vert, .xlf, .xliff, .xls, .xlsx, .xml, .zip.
XML files can also be used if you upload them as plain text, but they should contain only text with simple structural mark-up (such as document or paragraph boundaries and document metadata). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
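As an illustration only, text with simple structural mark-up of the kind described above might look like the sample embedded in the sketch below. The tag names (doc, p) and the metadata attributes (title, author, year) are assumptions chosen for the example, not names confirmed by this documentation. The small helper uses Python's standard library to report which tags a file uses and how deeply they are nested, which can help you judge whether your XML is as simple as this before uploading it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag and attribute names are assumptions for illustration.
SAMPLE = """<doc title="Example document" author="Unknown" year="2020">
<p>This is the first paragraph of the document.</p>
<p>This is the second paragraph.</p>
</doc>"""


def structure_summary(xml_text):
    """Return (set of tag names, maximum nesting depth) for an XML string."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    tags = set()

    def walk(elem, depth):
        # Record this element's tag and return the deepest level below it.
        tags.add(elem.tag)
        return max([depth] + [walk(child, depth + 1) for child in elem])

    max_depth = walk(root, 1)
    return tags, max_depth


tags, depth = structure_summary(SAMPLE)
print(sorted(tags), depth)
```

A file that reports only a handful of tags and a shallow nesting depth is closer to the simple structure described here; many distinct tags or deep nesting suggests more complex XML.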
With regard to PDF files, please bear in mind that they are first converted into plain text in order to create a corpus. This conversion is still an unsolved problem in computer science (across various fields). In particular, PDF files that contain multiple columns, headers/footers, or words split at the end of lines may not be processed correctly.
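To show why words split at the end of lines are hard to handle, here is a minimal sketch of a naive repair step applied to already extracted plain text. It is not the conversion used by this tool; the function name rejoin_split_words and the sample text are made up for the example.

```python
import re


def rejoin_split_words(text):
    """Naively rejoin words that were hyphenated across a line break."""
    # A word character, a hyphen, a newline, then another word character:
    # drop the hyphen and the newline, keep the two characters.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


extracted = "The conver-\nsion of a well-\nknown format"
print(rejoin_split_words(extracted))
# prints: The conversion of a wellknown format
```

The rule repairs "conversion" but also turns "well-known" into "wellknown", because the extracted text alone does not say whether a hyphen at the end of a line belongs to the word. Ambiguities of this kind are one reason the conversion cannot be made fully reliable.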