The supported file formats are:
.doc, .docx, .htm, .html, .tei, .tmx, .txt, .vert, .xml,
.pdf (scanned images must be OCRed before uploading)
.xls, .xlsx, .csv, .tmx, .xlf/.xliff, .ods (for parallel corpora only)
.zip, .tar.gz (to upload a large number of files at once)
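To upload many files at once, they can be bundled into a single archive first. The following is a minimal sketch in Python, not an official tool: the function name `bundle_for_upload` and the `SUPPORTED` set are assumptions made for illustration, with the extensions copied from the list above (the formats for parallel corpora are left out).

```python
import zipfile
from pathlib import Path

# Extensions copied from the list above; the set itself is an assumption
# of this sketch and should be adjusted to the corpus being built.
SUPPORTED = {".doc", ".docx", ".htm", ".html", ".tei", ".tmx",
             ".txt", ".vert", ".xml", ".pdf"}


def bundle_for_upload(source_dir, archive_path):
    """Zip every supported file under source_dir and return the archived names."""
    source = Path(source_dir)
    added = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(source.rglob("*")):
            # Skip directories and files whose extension is not in the list.
            if path.is_file() and path.suffix.lower() in SUPPORTED:
                name = path.relative_to(source).as_posix()
                archive.write(path, arcname=name)
                added.append(name)
    return added
```

Calling `bundle_for_upload("my_texts", "corpus.zip")` would write `corpus.zip` and return the relative paths that were included, so skipped files are easy to spot.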
An XML file can also be uploaded as plain text, but it should only contain text with simple structural mark-up (such as document or paragraph boundaries, document metadata, etc.). More complex XML will not be processed correctly. Here is a sample of XML text that would be processed correctly:
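As a rough illustration of the simple structural mark-up described above, the Python sketch below embeds a small hypothetical XML sample and checks that it uses only structural tags. The `<doc>` and `<p>` tag names, their attributes, and the `ALLOWED_TAGS` set are assumptions chosen for the example, not a specification of what the system accepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: tag names and attributes are illustrative only.
SAMPLE = """<doc title="Annual report" year="2020">
<p>First paragraph of the document.</p>
<p>Second paragraph of the document.</p>
</doc>"""

# Assumed set of simple structural tags (document and paragraph boundaries).
ALLOWED_TAGS = {"doc", "p"}


def unexpected_tags(xml_text, allowed=ALLOWED_TAGS):
    """Return the sorted tag names that fall outside the allowed structural set."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is not well-formed
    return sorted({element.tag for element in root.iter()} - set(allowed))


# An empty list means only the assumed structural tags were found.
print(unexpected_tags(SAMPLE))
```

A file for which `unexpected_tags` returns a non-empty list probably contains more complex mark-up and may need simplifying before upload.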
With regard to PDF files, please bear in mind that PDF files are first converted into plain text in order to create a corpus. This conversion is still an unsolved problem in computer science, especially for PDF files containing multiple columns, headers/footers, or words split at the end of lines, which may not be processed correctly.
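If you convert a PDF to plain text yourself before uploading, words split at the end of lines can be partly repaired with a simple heuristic. The sketch below is one naive approach in Python, not a description of how the conversion is done internally. The function name `rejoin_hyphenated` is hypothetical, and the heuristic will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re

# A hyphen between two word characters, followed by a line break
# (with optional spaces or tabs around the break).
_SPLIT_WORD = re.compile(r"(?<=\w)-[ \t]*\n[ \t]*(?=\w)")


def rejoin_hyphenated(text):
    """Join words that were split by a hyphen at the end of a line."""
    return _SPLIT_WORD.sub("", text)
```

For example, `rejoin_hyphenated("a cor-\npus of text")` yields the string `a corpus of text`. Multi-column layouts and repeated headers or footers need separate handling and are not addressed by this sketch.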